
chore(tools): add sse-timeout-probe for UI-01 / UI-02 empirical trace #318

Closed
jamesbroadhead wants to merge 4 commits into main from jb/ui-01-sse-timeout-repro

Conversation

jamesbroadhead (Contributor) commented Apr 27, 2026

Summary

Adds tools/sse-timeout-probe/ — a deterministic SSE / streaming-response probe for Databricks Apps. Built to characterize the cap reported in ES-1742245 / UI-01 / UI-02; it lands as a reusable diagnostic and as the regression test for the platform fix.

What we found

SSE streams through Databricks Apps (Container product type) were terminated at 301.55s ± 18ms regardless of heartbeat traffic — variance below 100ms across four hold values (300s, 400s, 500s, 600s) confirms an absolute timer, not an idle one.

Source: apps/oauth2-proxy/proxy.cfg upstream_timeout = "5m", mapped to Go's transport.ResponseHeaderTimeout at apps/oauth2-proxy/pkg/upstream/http.go:143-144. The stdlib documents that timer as header-receipt-only, but on HTTP/1.1 chunked-encoding upstream connections it also fires mid-body — a known practical issue acknowledged in apps/runtime/pkg/proxy/local_proxy.go:217-221.

Spaces-product-type apps front oauth2-proxy with TLS (HTTP/2) so the body-affecting variant of the quirk doesn't trigger; verified clean to 600s.

Other suspects ruled out:

  • AWS NLB connection-idle-timeout=300s — TCP idle is reset by traffic; 5s heartbeats would have prevented any idle cut.
  • Apps Envoy route databricks-apps-cluster timeout: 60s — observed lifetimes were 300s, so this 60s wasn't the binding cap.
  • Lakegate (Apps Gateway) request_timeout=60s — verified empirically not body-cutting on Spaces apps; the hyper API contract holds (timeout wraps the headers-receipt future, not the body).

Concurrency check: 20 parallel 120s SSE streams completed cleanly; avg TTFB matched single-stream baseline. No HTTP/2 multiplexing penalty observed at this concurrency. The original UI-02 multiplexing-pressure hypothesis is not reproducing at this N — higher-N tests are listed as a follow-up.

Universe fix

https://github.com/databricks-eng/universe/pull/1867246 bumps upstream_timeout to 30m on Container's oauth2-proxy. Container is the dominant product type and remains so; the fix needs to land there.

Why this PR is still worth landing

The probe code is the regression test for the universe fix and a reusable diagnostic for any future SSE / streaming issue on Apps. Once the universe PR is broadly deployed, re-running the probe with --durations 1500000 --heartbeat 5000 against any Container app should complete cleanly at ~25 minutes; a regression manifests as outcome: server-close at some lower value.

What's in this PR

  • probe.ts — opens one SSE connection per duration in a configurable ladder; records lifetime, bytes, end-cause (completed / server-close / network-error / client-hard-timeout / auth-redirect / wrong-content-type). A minimal sketch of the loop follows this list.
  • server.ts — companion server speaking the /sse-probe?hold-ms=…&heartbeat-ms=… contract, for app images that include npx / tsx.
  • server.py — Python stdlib equivalent of server.ts for app images that don't (e.g. Spaces).
  • README.md — usage, the result-pattern interpretation table, and the recipe for verifying the universe fix.
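
For orientation, a minimal sketch of the probe loop (illustrative only: helper names and the safety-timer margin below are assumptions; the shipped probe.ts adds the redirect/content-type checks, richer error classification, and CLI parsing):

```ts
// Illustrative sketch only; the real probe.ts also classifies auth-redirect /
// wrong-content-type and surfaces the underlying socket error.
const PROBE_HARD_TIMEOUT = Symbol('probe-hard-timeout');

async function probeOnce(baseUrl: string, holdMs: number, heartbeatMs: number) {
  const controller = new AbortController();
  // Safety timer well past the target (margin is an assumption) so a wedged
  // stream can't stall the rest of the ladder.
  const timer = setTimeout(() => controller.abort(PROBE_HARD_TIMEOUT), holdMs + 30_000);
  const started = Date.now();
  let bytes = 0;
  try {
    const res = await fetch(
      `${baseUrl}/sse-probe?hold-ms=${holdMs}&heartbeat-ms=${heartbeatMs}`,
      { signal: controller.signal },
    );
    if (!res.body) throw new Error('no response body');
    for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
      bytes += chunk.length;
    }
    const lifetimeMs = Date.now() - started;
    // Lifetime within tolerance of the requested hold: the server held the full target.
    const outcome = Math.abs(lifetimeMs - holdMs) <= 500 ? 'completed' : 'server-close';
    return { holdMs, lifetimeMs, bytes, outcome };
  } catch {
    const hardTimeout =
      controller.signal.aborted && controller.signal.reason === PROBE_HARD_TIMEOUT;
    return {
      holdMs,
      lifetimeMs: Date.now() - started,
      bytes,
      outcome: hardTimeout ? 'client-hard-timeout' : bytes > 0 ? 'server-close' : 'network-error',
    };
  } finally {
    clearTimeout(timer);
  }
}

// One sequential connection per hold value in the ladder.
for (const holdMs of [300_000, 400_000, 500_000, 600_000]) {
  console.log(await probeOnce(process.argv[2] ?? 'http://localhost:8080', holdMs, 5_000));
}
```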

Test plan

  • Empirically reproduce the cap on a dogfood Container app — done; 301.5s ± 18ms confirmed.
  • Verify Spaces apps don't experience the cap — done; clean to 600s.
  • Locate the source of the cap in universe — done; apps/oauth2-proxy/proxy.cfg:34.
  • Once https://github.com/databricks-eng/universe/pull/1867246 is broadly deployed, re-run --durations 1500000 --heartbeat 5000 against a Container app and confirm clean completion at ~25 minutes.

Follow-ups (out of scope for this PR)

In appkit's lane:

  • WebSocket variant of the probe + wire both probes into apps/dev-playground (single piece of work).
  • Promote AppKit's existing Last-Event-ID reconnection pattern (StreamManager + ring buffer + abort signals) as the recommended shape for long-running SSE apps. Defense in depth. An illustrative sketch follows this list.
  • Higher-N concurrency probe (N=100, N=500) to actually stress the HTTP/2 multiplexing hypothesis from the original UI-02 doc.
  • CI regression test once the universe fix is broadly deployed.
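
AppKit's actual StreamManager API is not reproduced here, but as a rough illustration of the Last-Event-ID pattern the second item refers to (all names below are hypothetical), a resuming client might look like:

```ts
// Illustrative only; not AppKit's StreamManager API. The pattern: remember the last
// event id seen, and on any mid-stream cut reconnect with a Last-Event-ID header so a
// server that keeps a short ring buffer of recent events can replay the gap.
async function streamWithResume(url: string, onEvent: (data: string, id?: string) => void) {
  let lastEventId: string | undefined;

  for (;;) {
    try {
      const res = await fetch(url, {
        headers: lastEventId ? { 'Last-Event-ID': lastEventId } : {},
      });
      if (!res.ok || !res.body) throw new Error(`bad response: ${res.status}`);

      let buffer = '';
      const text = res.body.pipeThrough(new TextDecoderStream()) as unknown as AsyncIterable<string>;
      for await (const chunk of text) {
        buffer += chunk;
        // Naive SSE framing: events are separated by a blank line.
        let idx: number;
        while ((idx = buffer.indexOf('\n\n')) !== -1) {
          const raw = buffer.slice(0, idx);
          buffer = buffer.slice(idx + 2);
          const id = /^id: ?(.*)$/m.exec(raw)?.[1];
          const data = /^data: ?(.*)$/m.exec(raw)?.[1];
          if (id) lastEventId = id;
          if (data !== undefined) onEvent(data, id);
        }
      }
    } catch {
      // Proxy cut or network error: fall through and reconnect from lastEventId.
    }
    await new Promise((r) => setTimeout(r, 1_000)); // fixed backoff, for the sketch only
  }
}
```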

Out of appkit, tracked elsewhere:

  • Architectural fix to bring Container's oauth2-proxy hop to TLS / HTTP/2 (Apps platform / networking work).
  • Durable docs of the diagnosis under apps/tech-docs/ (universe doc PR after #1867246 lands).

This pull request and its description were written by Claude (claude.ai).

Adds a tiny TS reproducer for the SSE idle-timeout gap reported in ES-1742245
(field-facing "AI Value Roadmap" app dropping ~75% of SSE connections
through the Apps reverse proxy).

Three files:
- probe.ts     — opens one SSE connection per duration in a configurable
                 ladder; records lifetime, bytes, and how the connection
                 ended (completed / server-close / network-error).
- server.ts    — companion server that responds on /sse-probe, holding the
                 connection open for the requested duration with an optional
                 heartbeat comment. Deploy as an app entrypoint to measure
                 the Databricks-hosted ceiling vs an EKS / localhost control.
                 (A minimal handler sketch follows this file list.)
- README.md    — usage, what to look for (sharp cliff at 60s/90s/120s/180s
                 maps back to apps/gateway vs oauth2-proxy vs DP ApiProxy
                 envoy), and how heartbeat behavior distinguishes idle
                 timeouts from absolute request timeouts.
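
A minimal sketch of a handler for that /sse-probe contract (illustrative only; query-parameter validation and half-closed-connection handling are elided here and addressed in later commits on this PR):

```ts
// sse-probe server sketch; heartbeats are SSE comments, i.e. traffic that resets
// idle timers but not an absolute request timer.
import { createServer } from 'node:http';

createServer((req, res) => {
  const url = new URL(req.url ?? '/', 'http://localhost');
  if (url.pathname !== '/sse-probe') {
    res.writeHead(404).end();
    return;
  }
  const holdMs = Number(url.searchParams.get('hold-ms') ?? 60_000);
  const heartbeatMs = Number(url.searchParams.get('heartbeat-ms') ?? 5_000);

  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
  });
  res.write(`data: probe-start hold=${holdMs}\n\n`);

  const heartbeat = setInterval(() => res.write(': heartbeat\n\n'), heartbeatMs);
  const done = setTimeout(() => {
    clearInterval(heartbeat);
    res.write('data: probe-end\n\n');
    res.end();
  }, holdMs);

  res.on('close', () => {
    clearInterval(heartbeat);
    clearTimeout(done);
  });
}).listen(8080);
```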

Why this is a separate PR: UI-01's source doc and ES-1742245 disagree on
whether the drop is timeout-driven or buffering-driven. Running this probe
against a dogfood app answers that question empirically and tells us which
fix to pursue (per-route request_timeout raise, heartbeat middleware, or
buffering / HTTP/2 hardening). Draft because the fix itself depends on
those results.

Co-authored-by: Isaac
- probe.ts: distinguish `completed` (server held full target) from
  `server-close` (server closed early) by comparing lifetime against
  the target with a 500ms tolerance — `completed` was previously
  unreachable.
- probe.ts: rename outcome `timeout-header` -> `client-hard-timeout`
  to reflect what actually happened (the client's safety timer fired,
  not an HTTP timeout response).
- probe.ts: drop the placeholder header-parse loop; the actual parser
  is the second loop.
- server.ts: drop the redundant `Connection: keep-alive` header
  (managed by Node's HTTP layer; ignored on HTTP/2).
- server.ts: guard heartbeat and probe-end writes with try/catch so a
  half-closed connection mid-interval doesn't crash the server.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
- probe.ts: `client-hard-timeout` outcome was unreachable. Node 22's
  fetch rejects with the AbortController.abort(reason) value directly
  (not as an AbortError with a cause), so the previous
  `e.name === 'AbortError'` check never matched. Switch the abort
  reason to a Symbol sentinel and detect via
  `signal.aborted && signal.reason === PROBE_HARD_TIMEOUT`.
- probe.ts: extract classifyFetchError() so the outcome-classification
  logic is unit-testable without networking (a hedged sketch follows
  this list).
- probe.ts: distinguish 'server-close' (proxy/upstream reset mid-stream)
  from 'network-error' (failure before any bytes arrived) using a
  streamStarted flag. Surface the underlying socket message via
  `error.cause.message` instead of an opaque 'fetch failed'.
- probe.ts: add new outcomes `auth-redirect` and `wrong-content-type`
  with `redirect: 'manual'` + an explicit content-type check, so an
  oauth2-proxy login page no longer masquerades as a short-lived stream.
- probe.ts: fail fast (exit 2) when --durations resolves to an empty
  list, instead of silently exiting 0 with no probes run.
- server.ts: validate hold-ms / heartbeat-ms with a parseDurationParam
  helper that rejects NaN/negative values and clamps to safe maxima.
  Math.max(0, NaN) was returning NaN, which collapsed setTimeout to 1ms.
  (A sketch of the helper's shape follows the test notes below.)
- server.ts: drop the dead try/catch around res.write — Node returns a
  boolean for backpressure rather than throwing synchronously. Add a
  proper res.on('error', cleanup) handler for the actual async failure
  path.
- vitest.config.ts: register a 'tools' project so the new
  tools/sse-timeout-probe/probe.test.ts runs under `pnpm test`.
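
A hedged sketch of the extracted classifier (function shape and the detail field are assumptions; the outcome names and the sentinel check follow the bullets above):

```ts
// Hypothetical shape of classifyFetchError(); unit-testable without networking.
export const PROBE_HARD_TIMEOUT = Symbol('probe-hard-timeout');

export type FetchFailureOutcome = 'client-hard-timeout' | 'server-close' | 'network-error';

export function classifyFetchError(
  err: unknown,
  signal: AbortSignal,
  streamStarted: boolean,
): { outcome: FetchFailureOutcome; detail: string } {
  // The probe's own safety timer fired: Node 22's fetch rejects with the abort
  // reason itself, so match on the sentinel rather than on e.name === 'AbortError'.
  if (signal.aborted && signal.reason === PROBE_HARD_TIMEOUT) {
    return { outcome: 'client-hard-timeout', detail: 'probe safety timer fired' };
  }
  // Surface the underlying socket error instead of undici's opaque 'fetch failed'.
  const cause = (err as { cause?: { message?: string } } | null)?.cause;
  const detail = cause?.message ?? (err instanceof Error ? err.message : String(err));
  // Bytes already flowed: proxy/upstream reset mid-stream. Nothing arrived: network error.
  return { outcome: streamStarted ? 'server-close' : 'network-error', detail };
}
```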

Tests cover: hard-timeout vs server-close vs network-error
classification, the unrelated-abort-reason guard rail, the cause
fallback, parseDurationParam edge cases, and an in-process
integration smoke test of the completed/wrong-content-type paths.
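
And a sketch of the validation helper's shape (falling back to a default on invalid input is an assumption; the NaN pitfall is the one described above, since Node clamps a NaN setTimeout delay to 1ms):

```ts
// Hypothetical shape of parseDurationParam(); the essential part is rejecting
// NaN/negative input explicitly, because Math.max(0, NaN) is still NaN.
function parseDurationParam(raw: string | null, fallbackMs: number, maxMs: number): number {
  if (raw === null || raw === '') return fallbackMs;
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0) return fallbackMs;
  return Math.min(value, maxMs); // clamp to a safe maximum
}
```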

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
jamesbroadhead requested a review from pkosiec on April 27, 2026 at 20:28
pkosiec (Member) commented Apr 28, 2026

Hey James, thanks for putting this together!

A few thoughts:

  1. This seems like a one-time diagnostic tool — once the probe runs and identifies the root cause, the code becomes obsolete. I'm not sure merging a temporary tool into AppKit's main branch is the right path, as it adds 660 lines + a new vitest test project for something that won't be needed afterward.
  2. This isn't really AppKit-scoped — the probe targets Databricks Apps proxy/gateway timeout behavior, not the AppKit SDK itself.
  3. What if we skip the merge and go straight to the investigation? The probe could be run from a branch or standalone repo just as well. Once the root cause is clear, the real value would be contributing the fix directly to the relevant Apps components (apps/gateway, oauth2-proxy, etc.). If you'd like to drive this, that would be great — otherwise I can pick it up in a week or so once the bigger roadmap items are wrapped up.

Thanks!


Replaces the speculative diagnostic framing in the README with the actual
finding: oauth2-proxy upstream_timeout = 5m on the Container path,
manifesting as Go's ResponseHeaderTimeout body-cut quirk on HTTP/1.1
chunked-encoding upstream connections.

Adds server.py — Python stdlib equivalent of server.ts — so the probe can
also be deployed against app images that don't include npx/tsx (e.g.
Spaces apps).

Updates the Follow-ups section to a combined WS-variant + dev-playground
wiring item, and points at the universe fix (databricks-eng/universe#1867246).

Refs: ES-1742245
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
jamesbroadhead (Contributor, Author)

Closing in favor of https://github.com/databricks-eng/universe/pull/1869944, which adds the same probe at apps/tools/sse-probe/ in universe.

The probe characterizes the Apps platform proxy chain (NLB, Envoy, Lakegate, oauth2-proxy), not appkit itself, so it belongs alongside the platform it tests, with the fix landing alongside it (https://github.com/databricks-eng/universe/pull/1867246). The universe PR carries the empirical findings, the result-interpretation table, the migration note, and the recipe for verifying the upstream_timeout fix once it's broadly deployed.

No code changes are lost — probe.ts, probe.test.ts, server.ts, plus the server.py added in 9b74f3e, all moved over verbatim with path/name references updated.

Follow-ups (WS variant, dev-playground integration, AppKit-side reconnection-pattern docs) tracked in the universe PR's README.
