Skip to content

feat: default streaming encode for sequential renders#579

Open
miguel-heygen wants to merge 3 commits intomainfrom
feat/streaming-render-clone
Open

feat: default streaming encode for sequential renders#579
miguel-heygen wants to merge 3 commits intomainfrom
feat/streaming-render-clone

Conversation

@miguel-heygen
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen commented Apr 30, 2026

Problem

Streaming encode is available, but resolveConfig() still defaulted it off, so sequential renders kept writing every captured frame to disk before encoding.

Production has also seen streaming encode fail on longer videos because the ffmpeg streaming process can hit its timeout. We need the same guard used in GSAP rendering: stream only for videos up to four minutes by default, make that cutoff operator-tunable, and use normal frame-dir encoding for longer videos.

The standalone sub-composition render extraction also mutated body before the final render subtree was fully assembled, which made the logic more fragile than necessary.

What this fixes

  • Defaults enableStreamingEncode to true in engine config while preserving PRODUCER_ENABLE_STREAMING_ENCODE=false as the opt-out.
  • Adds streamingEncodeMaxDurationSeconds to EngineConfig, defaulting to 240, with PRODUCER_STREAMING_ENCODE_MAX_DURATION_SECONDS as the env override.
  • Logs the structured streaming-encode gate decision with config flag, format, worker count, duration seconds, max duration seconds, and result.
  • Falls back to disk-frame encoding if spawning the streaming encoder fails before capture starts.
  • Uses streaming encode for sequential video renders up to and including the configured max duration.
  • Keeps renders over the configured max duration on normal encoding to avoid the long-running ffmpeg streaming timeout failure mode.
  • Keeps auto-parallel and png-sequence renders on the existing disk-frame path.
  • Builds the standalone-entry render clone off-DOM first, then swaps the body contents in one place.
  • Adds unit coverage for the default, env opt-out, env cutoff, streaming-path gating, duration cutoff, and render-clone extraction behavior.

Root cause

Streaming encode was configured as an opt-in experimental path, so default renders could not benefit from it. A straight global flip is not safe without guards:

  • the current ordered streaming writer blocks later parallel workers behind earlier frame ranges, which can erase auto-parallel capture throughput
  • long videos can keep the streaming ffmpeg process open long enough to trip FFMPEG_STREAMING_TIMEOUT_MS
  • a failed streaming encoder spawn should not fail the whole render when the existing disk-frame path can still run

Verification

Local checks

  • bunx vitest run packages/engine/src/config.test.ts packages/producer/src/services/renderOrchestrator.test.ts
  • bun run --filter @hyperframes/engine typecheck
  • bun run --filter @hyperframes/producer typecheck
  • bunx oxfmt --check packages/engine/src/config.ts packages/engine/src/config.test.ts packages/producer/src/services/renderOrchestrator.ts packages/producer/src/services/renderOrchestrator.test.ts
  • bunx oxlint packages/engine/src/config.ts packages/engine/src/config.test.ts packages/producer/src/services/renderOrchestrator.ts packages/producer/src/services/renderOrchestrator.test.ts
  • bun run --filter /producer build
  • git diff --check

Render / Perf

  • Opt-out benchmark: PRODUCER_ENABLE_STREAMING_ENCODE=false bun run --filter @hyperframes/producer benchmark -- --only css-spinner-render-compat --runs 1
    • 5719ms total, 803ms encode, 201MiB peak RSS
  • Naive global streaming benchmark before the parallel guard:
    • 8768ms total, 7321ms encode, 185MiB peak RSS
  • Guarded default benchmark after the fix:
    • 5509ms total, 761ms encode, 180MiB peak RSS
  • Default streaming render:
    • bun packages/cli/src/cli.ts init /tmp/hf-streaming-rollout-proof --example blank --non-interactive --skip-skills
    • bun packages/cli/src/cli.ts render /tmp/hf-streaming-rollout-proof --workers 1 --fps 24 --quality draft --output /tmp/hf-streaming-rollout-proof-artifacts/output.mp4
    • Log included streaming-encode gate {"enabled":true,"configFlag":true,"outputFormat":"mp4","workerCount":1,"durationSeconds":10,"maxDurationSeconds":240}
    • CLI displayed Streaming frame 1/240 through Streaming frame 240/240
    • ffprobe confirmed H.264, 1920x1080, 24/1, 10.000000s
  • Env cutoff disk-frame render:
    • PRODUCER_STREAMING_ENCODE_MAX_DURATION_SECONDS=1 bun packages/cli/src/cli.ts render /tmp/hf-streaming-rollout-proof --workers 1 --fps 24 --quality draft --output /tmp/hf-streaming-rollout-proof-artifacts/output-env-cutoff.mp4
    • Log included streaming-encode gate {"enabled":false,"configFlag":true,"outputFormat":"mp4","workerCount":1,"durationSeconds":10,"maxDurationSeconds":1}
    • CLI displayed Capturing frame 1/240 through Capturing frame 240/240, then Encoding video
    • ffprobe confirmed H.264, 1920x1080, 24/1, 10.000000s

Browser Verification

No browser UI changed. The render proofs exercised the headless browser capture path used by producer renders.

Notes

The enableStreamingEncode gate now runs after auto-worker calibration, so it sees the final post-calibration worker count. The standalone render-clone extraction and RenderConfig["format"] type cleanup are readability/refactor pieces bundled with this branch; future changes like those should split out when they are not needed by the behavior change.

This intentionally does not enable streaming for auto-parallel renders yet. The next safe step would be a non-blocking ordered writer that lets parallel workers continue capturing while a single drain loop writes frames to ffmpeg in sequence.

@jrusso1020
Copy link
Copy Markdown
Collaborator

@miguel-heygen one issue we've run into in production with sreaming encoding is it will fail on longer videos because there's an ffmpeg timeout (I think 5 minutes?)

So in our gsap rendering if it's 4 minutes or under we do streaming encoding, if it's over 4 minutes we do normal encoding

Copy link
Copy Markdown
Collaborator Author

@miguel-heygen one issue we've run into in production with sreaming encoding is it will fail on longer videos because there's an ffmpeg timeout (I think 5 minutes?)

So in our gsap rendering if it's 4 minutes or under we do streaming encoding, if it's over 4 minutes we do normal encoding

let's do this

@miguel-heygen
Copy link
Copy Markdown
Collaborator Author

@miguel-heygen one issue we've run into in production with sreaming encoding is it will fail on longer videos because there's an ffmpeg timeout (I think 5 minutes?)

So in our gsap rendering if it's 4 minutes or under we do streaming encoding, if it's over 4 minutes we do normal encoding

addressed with the same pattern

@miguel-heygen miguel-heygen requested review from jrusso1020 and vanceingalls and removed request for jrusso1020 April 30, 2026 20:21
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Staff-engineer review. The change is small, the unit tests are well-targeted, and the gating function is the right shape. My concerns are mostly about (a) the magic 4-minute constant, (b) lack of observability around a default-behavior flip, and (c) bundling unrelated refactors.

Things to fix before merge

1. The 4-minute cutoff is a magic number with no clear relationship to its underlying cause

packages/producer/src/services/renderOrchestrator.ts

const STREAMING_ENCODE_MAX_DURATION_SECONDS = 4 * 60;

The PR description says this mirrors a guard used in GSAP rendering and the failure mode is "ffmpeg streaming process can hit its timeout." But:

  • The actual root cause is wall-time of the encode pipe, not composition duration. A 10-second composition with heavy 4K frames could take longer to encode than a 4-minute trivial one. Tying the gate to composition seconds is an approximation that will get noisier as fixtures diversify.
  • The relevant tunable, ffmpegStreamingTimeout / FFMPEG_STREAMING_TIMEOUT_MS, lives in EngineConfig. This new constant doesn't. So an operator who bumps the streaming timeout in prod has to know there's also a hidden 4-minute hard-cap in the orchestrator that ignores their tuning. That's a footgun.

Two minimal fixes:

  • Move the constant into EngineConfig (e.g., streamingEncodeMaxDurationSeconds, default 240), env-var addressable. That's consistent with how every other knob in this file works.
  • Add a // Mirrors GSAP rendering's 4-minute streaming guard. Production has seen ffmpeg's streaming pipe hit FFMPEG_STREAMING_TIMEOUT_MS on longer videos. comment so the next person can tie the magic number back to the failure mode.

2. No observability on the gating decision

Right now you cannot tell from logs whether a given render took the streaming path or disk-frame path, nor why. After this default flip, the question "what % of production renders are now using streaming, and how often is the duration gate firing?" has no answer without instrumenting after the fact.

Cheap fix: a single info-level log at the gate decision, with the four inputs and the result:

log.info("streaming-encode gate", {
  enabled: enableStreamingEncode,
  configFlag: cfg.enableStreamingEncode,
  outputFormat,
  workerCount,
  durationSeconds: job.duration,
});

Without this, the next time streaming starts misbehaving in prod, the on-call has to bisect between "did the gate fire?" and "did the ffmpeg pipe time out?" with no signal.

3. This is a default-flip; it deserves a more defensive rollout than "merge and watch"

The PR is enabling streaming by default for sequential renders, and the description itself says streaming has previously failed on long videos in prod. The duration guard mitigates the known failure, but the historical reason this lived behind an opt-in flag was uncertainty.

Two options that are both low-cost and would make this much safer:

  • Phased default: keep DEFAULT_CONFIG.enableStreamingEncode = false, and flip it on in the helm chart (or wherever PRODUCER_ENABLE_STREAMING_ENCODE is set per environment). That way dev/canary gets the new default first; if a regression shows up, prod helm value reverts in seconds without a code revert.
  • Or: leave the code default flipped, but add a "fallback to disk-frame on streaming spawn failure" branch around spawnStreamingEncoder (line ~3351) so a transient ffmpeg failure doesn't fail the whole render. The 4-min static gate is preventive; a runtime fallback is corrective. Together they would be belt-and-suspenders.

I'd take either; I'd push back on "merge as-is and roll forward."

4. Bundled-but-unrelated refactors should be split

Three changes in this PR are independent of the streaming-default feature:

  • createStandaloneEntryRenderClone / replaceBodyWithRenderClone extraction in extractStandaloneEntryFromIndex — pure readability win, no behavior change.
  • outputFormat cast tightening: "mp4" | "webm" | "mov" | "png-sequence"NonNullable<RenderConfig["format"]>. Nice cleanup, separate concern.
  • The hoist of the enableStreamingEncode decision from line ~1756 down into executeRenderJob past calibration. This is a behavior change (the gate now sees the post-calibration workerCount, which is the right thing — see good-things below), but it's worth calling out explicitly in the PR body.

Not blocking, but if the streaming default has to be reverted, the refactors revert with it. Future preference: separate PR.

Subtle correctness concerns worth a comment

5. workerCount reassignment after streaming encoder is spawned

At line 2509 you compute workerCount, then at line 2519 you may force it to 1 if calibration switched to screenshot mode. The new gate at line 2534 runs after both, so it sees the final value — good.

But further down, around line 3522 (workerCount = lastAttempt.workers;), the fallback retry loop can change workerCount after enableStreamingEncode has already been computed and a streaming encoder has already been spawned. The streaming-encoder spawn at line 3351 is gated on enableStreamingEncode, which is now stale relative to workerCount. In practice this only matters if a 1-worker streaming render falls back into a multi-worker retry — which I don't think happens (retries usually scale workers down, not up) — but the invariant is undocumented and one refactor away from breaking. Worth a one-line comment that says "streaming encode is locked in at this point; retries can shrink workerCount but can't grow it past 1 once streaming has started."

6. job.duration semantics

The whole gate hinges on job.duration being in seconds. The constant name (STREAMING_ENCODE_MAX_DURATION_SECONDS) implies seconds, and the test (240 for boundary) confirms the intent. Worth a one-line type or comment at the function signature so a future reader doesn't have to chase the convention through the call site.

What's good

  • shouldUseStreamingEncode taking Pick<EngineConfig, "enableStreamingEncode"> rather than the full config — clean, testable, minimal coupling.
  • Boundary tested at 240 and 240.001 — that's the right way to test a strict-greater-than cutoff.
  • Sensible default (Number.isFinite(durationSeconds) || durationSeconds <= 0) for malformed durations.
  • png-sequence and workerCount > 1 short-circuits documented in the gate function and the comment block.
  • The extractStandaloneEntryFromIndex refactor is a real readability win (build clone off-DOM, swap body once).
  • Default-on plus explicit env opt-out is the right shape if the rollout concern is addressed.
  • Benchmark numbers actually showing the guarded default beating the disk-frame path it replaces (5509ms vs 5719ms) — that's the right shape of evidence.

Summary

Approve once (1) constant lives in EngineConfig with a comment and (2) at least one structured log on the gate decision are added. (3) Rollout strategy is the one I'd push back on — even if you keep the code default flipped, please pair it with a runtime fallback on streaming spawn failure, because "this used to fail in prod and we have a heuristic we hope catches it" is a stance that needs a safety net.

@miguel-heygen
Copy link
Copy Markdown
Collaborator Author

@jrusso1020 addressed the review feedback in 4f52a37:

  • moved the streaming duration cutoff into EngineConfig as streamingEncodeMaxDurationSeconds with PRODUCER_STREAMING_ENCODE_MAX_DURATION_SECONDS
  • added structured streaming-encode gate logging with inputs/result
  • added a spawn-time fallback from streaming encode to disk-frame encode
  • documented the seconds semantics and streaming/worker-count invariant
  • updated the PR body with the worker-count hoist note and fresh verification

Validation run in this pass:

  • bunx vitest run packages/engine/src/config.test.ts packages/producer/src/services/renderOrchestrator.test.ts
  • bun run --filter @hyperframes/engine typecheck
  • bun run --filter @hyperframes/producer typecheck
  • bun run --filter @hyperframes/producer build
  • bunx oxfmt --check ...changed files
  • bunx oxlint ...changed files
  • git diff --check
  • CLI render default path showed streaming-encode gate enabled and Streaming frame 1/240 through 240/240; ffprobe confirmed H.264 1920x1080 24fps 10s
  • CLI render with PRODUCER_STREAMING_ENCODE_MAX_DURATION_SECONDS=1 showed gate disabled and disk-frame capture + encode; ffprobe confirmed H.264 1920x1080 24fps 10s

CI is running on the new head now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants