
chore(tools): add sse-timeout-probe for UI-01 / UI-02 empirical trace #318

Closed
jamesbroadhead wants to merge 4 commits into main from jb/ui-01-sse-timeout-repro

Conversation

jamesbroadhead (Contributor) commented Apr 27, 2026

Summary

Adds tools/sse-timeout-probe/ — a deterministic SSE / streaming-response probe for Databricks Apps. Built to characterize the cap reported in ES-1742245 / UI-01 / UI-02; it lands as a reusable diagnostic and as the regression test for the platform fix.

What we found

SSE streams through Databricks Apps (Container product type) were terminated at 301.55s ± 18ms regardless of heartbeat traffic — variance below 100ms across four hold values (300s, 400s, 500s, 600s) confirms an absolute timer, not an idle one.

Source: apps/oauth2-proxy/proxy.cfg upstream_timeout = "5m", mapped to Go's transport.ResponseHeaderTimeout at apps/oauth2-proxy/pkg/upstream/http.go:143-144. The stdlib documents that timer as header-receipt-only, but on HTTP/1.1 chunked-encoding upstream connections it also fires mid-body — a known practical issue acknowledged in apps/runtime/pkg/proxy/local_proxy.go:217-221.

Spaces-product-type apps front oauth2-proxy with TLS (HTTP/2) so the body-affecting variant of the quirk doesn't trigger; verified clean to 600s.

Other suspects ruled out:

  • AWS NLB connection-idle-timeout=300s — TCP idle is reset by traffic; 5s heartbeats would have prevented any idle cut.
  • Apps Envoy route databricks-apps-cluster timeout: 60s — observed lifetimes were 300s, so this 60s wasn't the binding cap.
  • Lakegate (Apps Gateway) request_timeout=60s — verified empirically not body-cutting on Spaces apps; the hyper API contract holds (timeout wraps the headers-receipt future, not the body).

Concurrency check: 20 parallel 120s SSE streams completed cleanly; avg TTFB matched single-stream baseline. No HTTP/2 multiplexing penalty observed at this concurrency. The original UI-02 multiplexing-pressure hypothesis is not reproducing at this N — higher-N tests are listed as a follow-up.

Universe fix

https://github.com/databricks-eng/universe/pull/1867246 bumps upstream_timeout to 30m on Container's oauth2-proxy. Container is the dominant product type and remains so; the fix needs to land there.

Why this PR is still worth landing

The probe code is the regression test for the universe fix and a reusable diagnostic for any future SSE / streaming issue on Apps. Once the universe PR is broadly deployed, re-running the probe with --durations 1500000 --heartbeat 5000 against any Container app should complete cleanly at ~25 minutes; a regression manifests as outcome: server-close at some lower value.

What's in this PR

  • probe.ts — opens one SSE connection per duration in a configurable ladder; records lifetime, bytes, end-cause (completed / server-close / network-error / client-hard-timeout / auth-redirect / wrong-content-type). A minimal sketch of the loop follows this list.
  • server.ts — companion server speaking the /sse-probe?hold-ms=…&heartbeat-ms=… contract, for app images that include npx / tsx.
  • server.py — Python stdlib equivalent of server.ts for app images that don't (e.g. Spaces).
  • README.md — usage, the result-pattern interpretation table, and the recipe for verifying the universe fix.
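
For orientation, a minimal sketch of the probe loop (illustrative only: helper names and the safety-timer margin below are assumptions; the shipped probe.ts adds the redirect/content-type checks, richer error classification, and CLI parsing):

```ts
// Illustrative sketch only; the real probe.ts also classifies auth-redirect /
// wrong-content-type and surfaces the underlying socket error.
const PROBE_HARD_TIMEOUT = Symbol('probe-hard-timeout');

async function probeOnce(baseUrl: string, holdMs: number, heartbeatMs: number) {
  const controller = new AbortController();
  // Safety timer well past the target (margin is an assumption) so a wedged
  // stream can't stall the rest of the ladder.
  const timer = setTimeout(() => controller.abort(PROBE_HARD_TIMEOUT), holdMs + 30_000);
  const started = Date.now();
  let bytes = 0;
  try {
    const res = await fetch(
      `${baseUrl}/sse-probe?hold-ms=${holdMs}&heartbeat-ms=${heartbeatMs}`,
      { signal: controller.signal },
    );
    if (!res.body) throw new Error('no response body');
    for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
      bytes += chunk.length;
    }
    const lifetimeMs = Date.now() - started;
    // Lifetime within tolerance of the requested hold: the server held the full target.
    const outcome = Math.abs(lifetimeMs - holdMs) <= 500 ? 'completed' : 'server-close';
    return { holdMs, lifetimeMs, bytes, outcome };
  } catch {
    const hardTimeout =
      controller.signal.aborted && controller.signal.reason === PROBE_HARD_TIMEOUT;
    return {
      holdMs,
      lifetimeMs: Date.now() - started,
      bytes,
      outcome: hardTimeout ? 'client-hard-timeout' : bytes > 0 ? 'server-close' : 'network-error',
    };
  } finally {
    clearTimeout(timer);
  }
}

// One sequential connection per hold value in the ladder.
for (const holdMs of [300_000, 400_000, 500_000, 600_000]) {
  console.log(await probeOnce(process.argv[2] ?? 'http://localhost:8080', holdMs, 5_000));
}
```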

Test plan

  • Empirically reproduce the cap on a dogfood Container app — done; 301.5s ± 18ms confirmed.
  • Verify Spaces apps don't experience the cap — done; clean to 600s.
  • Locate the source of the cap in universe — done; apps/oauth2-proxy/proxy.cfg:34.
  • Once https://github.com/databricks-eng/universe/pull/1867246 is broadly deployed, re-run --durations 1500000 --heartbeat 5000 against a Container app and confirm clean completion at ~25 minutes.

Follow-ups (out of scope for this PR)

In appkit's lane:

  • WebSocket variant of the probe + wire both probes into apps/dev-playground (single piece of work).
  • Promote AppKit's existing Last-Event-ID reconnection pattern (StreamManager + ring buffer + abort signals) as the recommended shape for long-running SSE apps. Defense in depth. An illustrative sketch follows this list.
  • Higher-N concurrency probe (N=100, N=500) to actually stress the HTTP/2 multiplexing hypothesis from the original UI-02 doc.
  • CI regression test once the universe fix is broadly deployed.
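
AppKit's actual StreamManager API is not reproduced here, but as a rough illustration of the Last-Event-ID pattern the second item refers to (all names below are hypothetical), a resuming client might look like:

```ts
// Illustrative only; not AppKit's StreamManager API. The pattern: remember the last
// event id seen, and on any mid-stream cut reconnect with a Last-Event-ID header so a
// server that keeps a short ring buffer of recent events can replay the gap.
async function streamWithResume(url: string, onEvent: (data: string, id?: string) => void) {
  let lastEventId: string | undefined;

  for (;;) {
    try {
      const res = await fetch(url, {
        headers: lastEventId ? { 'Last-Event-ID': lastEventId } : {},
      });
      if (!res.ok || !res.body) throw new Error(`bad response: ${res.status}`);

      let buffer = '';
      const text = res.body.pipeThrough(new TextDecoderStream()) as unknown as AsyncIterable<string>;
      for await (const chunk of text) {
        buffer += chunk;
        // Naive SSE framing: events are separated by a blank line.
        let idx: number;
        while ((idx = buffer.indexOf('\n\n')) !== -1) {
          const raw = buffer.slice(0, idx);
          buffer = buffer.slice(idx + 2);
          const id = /^id: ?(.*)$/m.exec(raw)?.[1];
          const data = /^data: ?(.*)$/m.exec(raw)?.[1];
          if (id) lastEventId = id;
          if (data !== undefined) onEvent(data, id);
        }
      }
    } catch {
      // Proxy cut or network error: fall through and reconnect from lastEventId.
    }
    await new Promise((r) => setTimeout(r, 1_000)); // fixed backoff, for the sketch only
  }
}
```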

Out of appkit, tracked elsewhere:

  • Architectural fix to bring Container's oauth2-proxy hop to TLS / HTTP/2 (Apps platform / networking work).
  • Durable docs of the diagnosis under apps/tech-docs/ (universe doc PR after #1867246 lands).

This pull request and its description were written by Claude (claude.ai).

Adds a tiny TS reproducer for the SSE idle-timeout gap reported in ES-1742245
(field-facing "AI Value Roadmap" app dropping ~75% of SSE connections
through the Apps reverse proxy).

Three files:
- probe.ts     — opens one SSE connection per duration in a configurable
                 ladder; records lifetime, bytes, and how the connection
                 ended (completed / server-close / network-error).
- server.ts    — companion server that responds on /sse-probe, holding the
                 connection open for the requested duration with an optional
                 heartbeat comment. Deploy as an app entrypoint to measure
                 the Databricks-hosted ceiling vs an EKS / localhost control.
                 (A minimal handler sketch follows this file list.)
- README.md    — usage, what to look for (sharp cliff at 60s/90s/120s/180s
                 maps back to apps/gateway vs oauth2-proxy vs DP ApiProxy
                 envoy), and how heartbeat behavior distinguishes idle
                 timeouts from absolute request timeouts.
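
A minimal sketch of a handler for that /sse-probe contract (illustrative only; query-parameter validation and half-closed-connection handling are elided here and addressed in later commits on this PR):

```ts
// sse-probe server sketch; heartbeats are SSE comments, i.e. traffic that resets
// idle timers but not an absolute request timer.
import { createServer } from 'node:http';

createServer((req, res) => {
  const url = new URL(req.url ?? '/', 'http://localhost');
  if (url.pathname !== '/sse-probe') {
    res.writeHead(404).end();
    return;
  }
  const holdMs = Number(url.searchParams.get('hold-ms') ?? 60_000);
  const heartbeatMs = Number(url.searchParams.get('heartbeat-ms') ?? 5_000);

  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
  });
  res.write(`data: probe-start hold=${holdMs}\n\n`);

  const heartbeat = setInterval(() => res.write(': heartbeat\n\n'), heartbeatMs);
  const done = setTimeout(() => {
    clearInterval(heartbeat);
    res.write('data: probe-end\n\n');
    res.end();
  }, holdMs);

  res.on('close', () => {
    clearInterval(heartbeat);
    clearTimeout(done);
  });
}).listen(8080);
```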

Why this is a separate PR: UI-01's source doc and ES-1742245 disagree on
whether the drop is timeout-driven or buffering-driven. Running this probe
against a dogfood app answers that question empirically and tells us which
fix to pursue (per-route request_timeout raise, heartbeat middleware, or
buffering / HTTP/2 hardening). Draft because the fix itself depends on
those results.

Co-authored-by: Isaac
- probe.ts: distinguish `completed` (server held full target) from
  `server-close` (server closed early) by comparing lifetime against
  the target with a 500ms tolerance — `completed` was previously
  unreachable.
- probe.ts: rename outcome `timeout-header` -> `client-hard-timeout`
  to reflect what actually happened (the client's safety timer fired,
  not an HTTP timeout response).
- probe.ts: drop the placeholder header-parse loop; the actual parser
  is the second loop.
- server.ts: drop the redundant `Connection: keep-alive` header
  (managed by Node's HTTP layer; ignored on HTTP/2).
- server.ts: guard heartbeat and probe-end writes with try/catch so a
  half-closed connection mid-interval doesn't crash the server.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
- probe.ts: `client-hard-timeout` outcome was unreachable. Node 22's
  fetch rejects with the AbortController.abort(reason) value directly
  (not as an AbortError with a cause), so the previous
  `e.name === 'AbortError'` check never matched. Switch the abort
  reason to a Symbol sentinel and detect via
  `signal.aborted && signal.reason === PROBE_HARD_TIMEOUT`.
- probe.ts: extract classifyFetchError() so the outcome-classification
  logic is unit-testable without networking (a hedged sketch follows
  this list).
- probe.ts: distinguish 'server-close' (proxy/upstream reset mid-stream)
  from 'network-error' (failure before any bytes arrived) using a
  streamStarted flag. Surface the underlying socket message via
  `error.cause.message` instead of an opaque 'fetch failed'.
- probe.ts: add new outcomes `auth-redirect` and `wrong-content-type`
  with `redirect: 'manual'` + an explicit content-type check, so an
  oauth2-proxy login page no longer masquerades as a short-lived stream.
- probe.ts: fail fast (exit 2) when --durations resolves to an empty
  list, instead of silently exiting 0 with no probes run.
- server.ts: validate hold-ms / heartbeat-ms with a parseDurationParam
  helper that rejects NaN/negative values and clamps to safe maxima.
  Math.max(0, NaN) was returning NaN, which collapsed setTimeout to 1ms.
  (A sketch of the helper's shape follows the test notes below.)
- server.ts: drop the dead try/catch around res.write — Node returns a
  boolean for backpressure rather than throwing synchronously. Add a
  proper res.on('error', cleanup) handler for the actual async failure
  path.
- vitest.config.ts: register a 'tools' project so the new
  tools/sse-timeout-probe/probe.test.ts runs under `pnpm test`.
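
A hedged sketch of the extracted classifier (function shape and the detail field are assumptions; the outcome names and the sentinel check follow the bullets above):

```ts
// Hypothetical shape of classifyFetchError(); unit-testable without networking.
export const PROBE_HARD_TIMEOUT = Symbol('probe-hard-timeout');

export type FetchFailureOutcome = 'client-hard-timeout' | 'server-close' | 'network-error';

export function classifyFetchError(
  err: unknown,
  signal: AbortSignal,
  streamStarted: boolean,
): { outcome: FetchFailureOutcome; detail: string } {
  // The probe's own safety timer fired: Node 22's fetch rejects with the abort
  // reason itself, so match on the sentinel rather than on e.name === 'AbortError'.
  if (signal.aborted && signal.reason === PROBE_HARD_TIMEOUT) {
    return { outcome: 'client-hard-timeout', detail: 'probe safety timer fired' };
  }
  // Surface the underlying socket error instead of undici's opaque 'fetch failed'.
  const cause = (err as { cause?: { message?: string } } | null)?.cause;
  const detail = cause?.message ?? (err instanceof Error ? err.message : String(err));
  // Bytes already flowed: proxy/upstream reset mid-stream. Nothing arrived: network error.
  return { outcome: streamStarted ? 'server-close' : 'network-error', detail };
}
```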

Tests cover: hard-timeout vs server-close vs network-error
classification, the unrelated-abort-reason guard rail, the cause
fallback, parseDurationParam edge cases, and an in-process
integration smoke test of the completed/wrong-content-type paths.
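
And a sketch of the validation helper's shape (falling back to a default on invalid input is an assumption; the NaN pitfall is the one described above, since Node clamps a NaN setTimeout delay to 1ms):

```ts
// Hypothetical shape of parseDurationParam(); the essential part is rejecting
// NaN/negative input explicitly, because Math.max(0, NaN) is still NaN.
function parseDurationParam(raw: string | null, fallbackMs: number, maxMs: number): number {
  if (raw === null || raw === '') return fallbackMs;
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0) return fallbackMs;
  return Math.min(value, maxMs); // clamp to a safe maximum
}
```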

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
jamesbroadhead requested a review from pkosiec on April 27, 2026 at 20:28
pkosiec (Member) commented Apr 28, 2026

Hey James, thanks for putting this together!

A few thoughts:

  1. This seems like a one-time diagnostic tool — once the probe runs and identifies the root cause, the code becomes obsolete. I'm not sure merging a temporary tool into AppKit's main branch is the right path, as it adds 660 lines + a new vitest test project for something that won't be needed afterward.
  2. This isn't really AppKit-scoped — the probe targets Databricks Apps proxy/gateway timeout behavior, not the AppKit SDK itself.
  3. What if we skip the merge and go straight to the investigation? The probe could be run from a branch or standalone repo just as well. Once the root cause is clear, the real value would be contributing the fix directly to the relevant Apps components (apps/gateway, oauth2-proxy, etc.). If you'd like to drive this, that would be great — otherwise I can pick it up in a week or so once the bigger roadmap items are wrapped up.

Thanks!


Replaces the speculative diagnostic framing in the README with the actual
finding: oauth2-proxy upstream_timeout = 5m on the Container path,
manifesting as Go's ResponseHeaderTimeout body-cut quirk on HTTP/1.1
chunked-encoding upstream connections.

Adds server.py — Python stdlib equivalent of server.ts — so the probe can
also be deployed against app images that don't include npx/tsx (e.g.
Spaces apps).

Updates the Follow-ups section to a combined WS-variant + dev-playground
wiring item, and points at the universe fix (databricks-eng/universe#1867246).

Refs: ES-1742245
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
jamesbroadhead (Contributor, Author)

Closing in favor of https://github.com/databricks-eng/universe/pull/1869944, which adds the same probe at apps/tools/sse-probe/ in universe.

The probe characterizes the Apps platform proxy chain (NLB, Envoy, Lakegate, oauth2-proxy), not appkit itself, so it belongs alongside the platform it tests, with the fix landing alongside it (https://github.com/databricks-eng/universe/pull/1867246). The universe PR carries the empirical findings, the result-interpretation table, the migration note, and the recipe for verifying the upstream_timeout fix once it's broadly deployed.

No code changes are lost — probe.ts, probe.test.ts, server.ts, plus the server.py added in 9b74f3e, all moved over verbatim with path/name references updated.

Follow-ups (WS variant, dev-playground integration, AppKit-side reconnection-pattern docs) tracked in the universe PR's README.
