chore(tools): add sse-timeout-probe for UI-01 / UI-02 empirical trace #318
jamesbroadhead wants to merge 4 commits into main
Conversation
Adds a tiny TS reproducer for the SSE idle-timeout gap reported in ES-1742245
(field-facing "AI Value Roadmap" app dropping ~75% of SSE connections
through the Apps reverse proxy).
Three files:
- probe.ts — opens one SSE connection per duration in a configurable
ladder; records lifetime, bytes, and how the connection
ended (completed / server-close / network-error); a sketch of the loop follows this list.
- server.ts — companion server that responds on /sse-probe, holding the
connection open for the requested duration with an optional
heartbeat comment. Deploy as an app entrypoint to measure
the Databricks-hosted ceiling vs an EKS / localhost control.
- README.md — usage, what to look for (sharp cliff at 60s/90s/120s/180s
maps back to apps/gateway vs oauth2-proxy vs DP ApiProxy
envoy), and how heartbeat behavior distinguishes idle
timeouts from absolute request timeouts.
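For concreteness, here is a minimal sketch of the probe loop described above, assuming Node 18+ global fetch and the /sse-probe?hold-ms=… endpoint from server.ts; the names, result fields, and defaults are illustrative rather than the exact probe.ts API:

```ts
// Illustrative sketch only; the real probe.ts records more outcomes and flags.
type ProbeResult = {
  targetMs: number;   // how long the server was asked to hold the stream open
  lifetimeMs: number; // how long the connection actually lived
  bytes: number;      // bytes received before the stream ended
  endedBy: 'stream-end' | 'error';
};

async function probeOnce(baseUrl: string, targetMs: number): Promise<ProbeResult> {
  const start = Date.now();
  let bytes = 0;
  try {
    const res = await fetch(`${baseUrl}/sse-probe?hold-ms=${targetMs}`, {
      headers: { accept: 'text/event-stream' },
    });
    const reader = res.body!.getReader();
    // Count bytes until the server (or a proxy in between) closes the stream.
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      bytes += value.byteLength;
    }
    return { targetMs, lifetimeMs: Date.now() - start, bytes, endedBy: 'stream-end' };
  } catch {
    return { targetMs, lifetimeMs: Date.now() - start, bytes, endedBy: 'error' };
  }
}

// Run the ladder sequentially so one long stream doesn't skew the next measurement.
async function runLadder(baseUrl: string, durationsMs: number[]): Promise<ProbeResult[]> {
  const results: ProbeResult[] = [];
  for (const d of durationsMs) {
    results.push(await probeOnce(baseUrl, d));
  }
  return results;
}
```

Whether an early stream-end counts as a genuine completion or a server-side close is a separate classification step, refined in later commits on this PR.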
Why this is a separate PR: UI-01's source doc and ES-1742245 disagree on
whether the drop is timeout-driven or buffering-driven. Running this probe
against a dogfood app answers that question empirically and tells us which
fix to pursue (per-route request_timeout raise, heartbeat middleware, or
buffering / HTTP/2 hardening). Draft because the fix itself depends on
those results.
Co-authored-by: Isaac
- probe.ts: distinguish `completed` (server held full target) from `server-close` (server closed early) by comparing lifetime against the target with a 500ms tolerance — `completed` was previously unreachable (sketched below).
- probe.ts: rename outcome `timeout-header` -> `client-hard-timeout` to reflect what actually happened (the client's safety timer fired, not an HTTP timeout response).
- probe.ts: drop the placeholder header-parse loop; the actual parser is the second loop.
- server.ts: drop the redundant `Connection: keep-alive` header (managed by Node's HTTP layer; ignored on HTTP/2).
- server.ts: guard heartbeat and probe-end writes with try/catch so a half-closed connection mid-interval doesn't crash the server.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
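For illustration, a minimal sketch of that completed / server-close check under the 500ms tolerance described above; classifyEndOfStream is an illustrative name, not necessarily what probe.ts exports:

```ts
// Illustrative sketch of the completed vs. server-close check; the real probe.ts
// may differ in naming and in how the lifetime is measured.
const COMPLETION_TOLERANCE_MS = 500;

function classifyEndOfStream(targetMs: number, lifetimeMs: number): 'completed' | 'server-close' {
  // Allow a small tolerance for scheduling jitter and final-flush timing:
  // only a stream that fell meaningfully short of the target counts as an early close.
  return targetMs - lifetimeMs <= COMPLETION_TOLERANCE_MS ? 'completed' : 'server-close';
}
```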
- probe.ts: `client-hard-timeout` outcome was unreachable. Node 22's
fetch throws AbortController.abort(reason) DIRECTLY (not as an
AbortError-with-cause), so the previous `e.name === 'AbortError'`
check never matched. Switch the abort reason to a Symbol sentinel and
detect via `signal.aborted && signal.reason === PROBE_HARD_TIMEOUT`.
- probe.ts: extract classifyFetchError() so the outcome-classification
logic is unit-testable without networking.
- probe.ts: distinguish 'server-close' (proxy/upstream reset mid-stream)
from 'network-error' (failure before any bytes arrived) using a
streamStarted flag. Surface the underlying socket message via
`error.cause.message` instead of an opaque 'fetch failed'.
- probe.ts: add new outcomes `auth-redirect` and `wrong-content-type`
with `redirect: 'manual'` + an explicit content-type check, so an
oauth2-proxy login page no longer masquerades as a short-lived stream.
- probe.ts: fail fast (exit 2) when --durations resolves to an empty
list, instead of silently exiting 0 with no probes run.
- server.ts: validate hold-ms / heartbeat-ms with a parseDurationParam
helper that rejects NaN/negative values and clamps to safe maxima.
Math.max(0, NaN) was returning NaN, which collapsed setTimeout to 1ms.
- server.ts: drop the dead try/catch around res.write — Node returns a
boolean for backpressure rather than throwing synchronously. Add a
proper res.on('error', cleanup) handler for the actual async failure
path.
- vitest.config.ts: register a 'tools' project so the new
tools/sse-timeout-probe/probe.test.ts runs under `pnpm test`.
Tests cover: hard-timeout vs server-close vs network-error
classification, the unrelated-abort-reason guard rail, the cause
fallback, parseDurationParam edge cases, and an in-process
integration smoke test of the completed/wrong-content-type paths.
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
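To make the abort-detection change concrete, here is a sketch of the Symbol-sentinel approach, assuming (as the commit message states) that Node 22's fetch rejects with the abort reason itself; the real classifyFetchError signature in probe.ts may differ:

```ts
// Illustrative sketch; mirrors the commit message above, not necessarily probe.ts verbatim.
const PROBE_HARD_TIMEOUT = Symbol('probe-hard-timeout');

type FailureOutcome = 'client-hard-timeout' | 'server-close' | 'network-error';

function classifyFetchError(
  err: unknown,
  signal: AbortSignal,
  streamStarted: boolean,
): { outcome: FailureOutcome; detail: string } {
  // Node 22's fetch rejects with the abort *reason* directly, so inspect the signal
  // rather than matching e.name === 'AbortError'. An unrelated abort reason must not
  // be misreported as our hard timeout.
  if (signal.aborted && signal.reason === PROBE_HARD_TIMEOUT) {
    return { outcome: 'client-hard-timeout', detail: 'probe safety timer fired' };
  }
  // Prefer the underlying socket error over Node's opaque "fetch failed" wrapper.
  const cause = (err as { cause?: { message?: string } })?.cause;
  const detail = cause?.message ?? (err instanceof Error ? err.message : String(err));
  // Bytes already received means the proxy/upstream reset the stream mid-flight;
  // otherwise the failure happened before the stream ever started.
  return { outcome: streamStarted ? 'server-close' : 'network-error', detail };
}

// Arming the safety timer: abort with the sentinel so it can be recognized later.
const controller = new AbortController();
const hardTimer = setTimeout(() => controller.abort(PROBE_HARD_TIMEOUT), 10 * 60_000);
// ... fetch(url, { signal: controller.signal }) ...
// on rejection: classifyFetchError(err, controller.signal, streamStarted); finally clearTimeout(hardTimer).
```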
Hey James, thanks for putting this together! A few thoughts:
Thanks!
…ython server

Replaces the speculative diagnostic framing in the README with the actual finding: oauth2-proxy upstream_timeout = 5m on the Container path, manifesting as Go's ResponseHeaderTimeout body-cut quirk on HTTP/1.1 chunked-encoding upstream connections.

Adds server.py — Python stdlib equivalent of server.ts — so the probe can also be deployed against app images that don't include npx/tsx (e.g. Spaces apps).

Updates the Follow-ups section to cover the combined WS-variant + dev-playground wiring, and points at the universe fix (databricks-eng/universe#1867246).

Refs: ES-1742245
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Closing in favor of https://github.com/databricks-eng/universe/pull/1869944, which adds the same probe there. The probe characterizes the Apps platform proxy chain (NLB, Envoy, Lakegate, oauth2-proxy), not appkit itself, so it belongs alongside the platform it tests, next to the fix landing there (https://github.com/databricks-eng/universe/pull/1867246). The universe PR carries the empirical findings, the result-interpretation table, the migration note, and the recipe for verifying the upstream_timeout fix once it's broadly deployed. No code changes are lost; the follow-ups (WS variant, dev-playground integration, AppKit-side reconnection-pattern docs) are tracked in the universe PR's README.
Summary
Adds `tools/sse-timeout-probe/` — a deterministic SSE / streaming-response probe for Databricks Apps. Built to characterize the cap reported in ES-1742245 / UI-01 / UI-02; landing as a reusable diagnostic and the regression test for the platform fix.

What we found
SSE streams through Databricks Apps Container were terminated at exactly 301.55s ± 18ms regardless of heartbeat traffic — variance below 100ms across four hold values (300s, 400s, 500s, 600s) confirms an absolute, not idle, timer.
Source: `upstream_timeout = "5m"` in `apps/oauth2-proxy/proxy.cfg`, mapped to Go's `transport.ResponseHeaderTimeout` at `apps/oauth2-proxy/pkg/upstream/http.go:143-144`. The stdlib documents that timer as header-receipt-only, but on HTTP/1.1 chunked-encoding upstream connections it also fires mid-body — a known practical issue acknowledged in `apps/runtime/pkg/proxy/local_proxy.go:217-221`.

Spaces-product-type apps front oauth2-proxy with TLS (HTTP/2), so the body-affecting variant of the quirk doesn't trigger; verified clean to 600s.
Other suspects ruled out:
- `connection-idle-timeout=300s` — TCP idle is reset by traffic; 5s heartbeats would have prevented any idle cut.
- `databricks-apps-cluster timeout: 60s` — observed lifetimes were 300s, so this 60s wasn't the binding cap.
- `request_timeout=60s` — verified empirically not body-cutting on Spaces apps; the hyper API contract holds (the timeout wraps the headers-receipt future, not the body).

Concurrency check: 20 parallel 120s SSE streams completed cleanly; avg TTFB matched the single-stream baseline. No HTTP/2 multiplexing penalty observed at this concurrency. The original UI-02 multiplexing-pressure hypothesis is not reproducing at this N — higher-N tests are listed as a follow-up.
Universe fix
https://github.com/databricks-eng/universe/pull/1867246 bumps `upstream_timeout` to `30m` on Container's oauth2-proxy. Container is the dominant product type and remains so; the fix needs to land there.

Why this PR is still worth landing
The probe code is the regression test for the universe fix and a reusable diagnostic for any future SSE / streaming issue on Apps. Once the universe PR is broadly deployed, re-running the probe with `--durations 1500000 --heartbeat 5000` against any Container app should complete cleanly at ~25 minutes; a regression manifests as `outcome: server-close` at some lower value.

What's in this PR
- `probe.ts` — opens one SSE connection per duration in a configurable ladder; records lifetime, bytes, end-cause (completed / server-close / network-error / client-hard-timeout / auth-redirect / wrong-content-type).
- `server.ts` — companion server speaking the `/sse-probe?hold-ms=…&heartbeat-ms=…` contract (sketched below), for app images that include npx/tsx.
- `server.py` — Python stdlib equivalent of `server.ts` for app images that don't (e.g. Spaces).
- `README.md` — usage, the result-pattern interpretation table, and the recipe for verifying the universe fix.
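For reference, a minimal sketch of the `/sse-probe?hold-ms=…&heartbeat-ms=…` contract that server.ts and server.py implement, using Node's stdlib http module; the defaults, maxima, and helper shape here are assumptions rather than the actual server.ts code:

```ts
// Minimal sketch of the /sse-probe contract. Defaults and bounds are illustrative;
// the real server.ts validates parameters more strictly.
import { createServer } from 'node:http';

// Reject NaN/negative values and clamp to a maximum, in the spirit of the
// parseDurationParam fix above (Math.max(0, NaN) is NaN, which setTimeout treats as ~1ms).
function parseDurationParam(raw: string | null, fallbackMs: number, maxMs: number): number {
  const n = Number(raw);
  if (raw === null || Number.isNaN(n) || n < 0) return fallbackMs;
  return Math.min(n, maxMs);
}

createServer((req, res) => {
  const url = new URL(req.url ?? '/', 'http://localhost');
  if (url.pathname !== '/sse-probe') {
    res.writeHead(404).end();
    return;
  }
  const holdMs = parseDurationParam(url.searchParams.get('hold-ms'), 60_000, 3_600_000);
  const heartbeatMs = parseDurationParam(url.searchParams.get('heartbeat-ms'), 0, 60_000);

  res.writeHead(200, {
    'content-type': 'text/event-stream',
    'cache-control': 'no-cache',
  });
  res.write(`: probe-start hold-ms=${holdMs}\n\n`);

  // Optional heartbeat: SSE comment lines keep bytes flowing without emitting events,
  // which is what separates idle timeouts from absolute timeouts in the results.
  const heartbeat = heartbeatMs > 0
    ? setInterval(() => res.write(`: heartbeat ${Date.now()}\n\n`), heartbeatMs)
    : undefined;

  const done = setTimeout(() => {
    if (heartbeat) clearInterval(heartbeat);
    res.write('event: probe-end\ndata: {}\n\n');
    res.end();
  }, holdMs);

  // If the client or a proxy drops the connection first, stop the timers.
  res.on('close', () => {
    if (heartbeat) clearInterval(heartbeat);
    clearTimeout(done);
  });
}).listen(Number(process.env.PORT ?? 8000));
```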
Test plan

- `apps/oauth2-proxy/proxy.cfg:34`.
- Run the probe with `--durations 1500000 --heartbeat 5000` against a Container app and confirm clean completion at ~25 minutes.

Follow-ups (out of scope for this PR)
In appkit's lane:
- WS-variant + dev-playground wiring into `apps/dev-playground` (single piece of work).
- Document the `Last-Event-ID` reconnection pattern (`StreamManager` + ring buffer + abort signals) as the recommended shape for long-running SSE apps. Defense in depth.

Out of appkit, tracked elsewhere:
- `apps/tech-docs/` (universe doc PR after #1867246 lands).

This pull request and its description were written by Claude (claude.ai).