feat: default streaming encode for sequential renders#579
feat: default streaming encode for sequential renders#579miguel-heygen wants to merge 3 commits intomainfrom
Conversation
|
@miguel-heygen one issue we've run into in production with sreaming encoding is it will fail on longer videos because there's an ffmpeg timeout (I think 5 minutes?) So in our gsap rendering if it's 4 minutes or under we do streaming encoding, if it's over 4 minutes we do normal encoding |
let's do this |
addressed with the same pattern |
jrusso1020
left a comment
There was a problem hiding this comment.
Staff-engineer review. The change is small, the unit tests are well-targeted, and the gating function is the right shape. My concerns are mostly about (a) the magic 4-minute constant, (b) lack of observability around a default-behavior flip, and (c) bundling unrelated refactors.
Things to fix before merge
1. The 4-minute cutoff is a magic number with no clear relationship to its underlying cause
packages/producer/src/services/renderOrchestrator.ts
const STREAMING_ENCODE_MAX_DURATION_SECONDS = 4 * 60;The PR description says this mirrors a guard used in GSAP rendering and the failure mode is "ffmpeg streaming process can hit its timeout." But:
- The actual root cause is wall-time of the encode pipe, not composition duration. A 10-second composition with heavy 4K frames could take longer to encode than a 4-minute trivial one. Tying the gate to composition seconds is an approximation that will get noisier as fixtures diversify.
- The relevant tunable,
ffmpegStreamingTimeout/FFMPEG_STREAMING_TIMEOUT_MS, lives inEngineConfig. This new constant doesn't. So an operator who bumps the streaming timeout in prod has to know there's also a hidden 4-minute hard-cap in the orchestrator that ignores their tuning. That's a footgun.
Two minimal fixes:
- Move the constant into
EngineConfig(e.g.,streamingEncodeMaxDurationSeconds, default240), env-var addressable. That's consistent with how every other knob in this file works. - Add a
// Mirrors GSAP rendering's 4-minute streaming guard. Production has seen ffmpeg's streaming pipe hit FFMPEG_STREAMING_TIMEOUT_MS on longer videos.comment so the next person can tie the magic number back to the failure mode.
2. No observability on the gating decision
Right now you cannot tell from logs whether a given render took the streaming path or disk-frame path, nor why. After this default flip, the question "what % of production renders are now using streaming, and how often is the duration gate firing?" has no answer without instrumenting after the fact.
Cheap fix: a single info-level log at the gate decision, with the four inputs and the result:
log.info("streaming-encode gate", {
enabled: enableStreamingEncode,
configFlag: cfg.enableStreamingEncode,
outputFormat,
workerCount,
durationSeconds: job.duration,
});Without this, the next time streaming starts misbehaving in prod, the on-call has to bisect between "did the gate fire?" and "did the ffmpeg pipe time out?" with no signal.
3. This is a default-flip; it deserves a more defensive rollout than "merge and watch"
The PR is enabling streaming by default for sequential renders, and the description itself says streaming has previously failed on long videos in prod. The duration guard mitigates the known failure, but the historical reason this lived behind an opt-in flag was uncertainty.
Two options that are both low-cost and would make this much safer:
- Phased default: keep
DEFAULT_CONFIG.enableStreamingEncode = false, and flip it on in the helm chart (or whereverPRODUCER_ENABLE_STREAMING_ENCODEis set per environment). That way dev/canary gets the new default first; if a regression shows up, prod helm value reverts in seconds without a code revert. - Or: leave the code default flipped, but add a "fallback to disk-frame on streaming spawn failure" branch around
spawnStreamingEncoder(line ~3351) so a transient ffmpeg failure doesn't fail the whole render. The 4-min static gate is preventive; a runtime fallback is corrective. Together they would be belt-and-suspenders.
I'd take either; I'd push back on "merge as-is and roll forward."
4. Bundled-but-unrelated refactors should be split
Three changes in this PR are independent of the streaming-default feature:
createStandaloneEntryRenderClone/replaceBodyWithRenderCloneextraction inextractStandaloneEntryFromIndex— pure readability win, no behavior change.outputFormatcast tightening:"mp4" | "webm" | "mov" | "png-sequence"→NonNullable<RenderConfig["format"]>. Nice cleanup, separate concern.- The hoist of the
enableStreamingEncodedecision from line ~1756 down intoexecuteRenderJobpast calibration. This is a behavior change (the gate now sees the post-calibrationworkerCount, which is the right thing — see good-things below), but it's worth calling out explicitly in the PR body.
Not blocking, but if the streaming default has to be reverted, the refactors revert with it. Future preference: separate PR.
Subtle correctness concerns worth a comment
5. workerCount reassignment after streaming encoder is spawned
At line 2509 you compute workerCount, then at line 2519 you may force it to 1 if calibration switched to screenshot mode. The new gate at line 2534 runs after both, so it sees the final value — good.
But further down, around line 3522 (workerCount = lastAttempt.workers;), the fallback retry loop can change workerCount after enableStreamingEncode has already been computed and a streaming encoder has already been spawned. The streaming-encoder spawn at line 3351 is gated on enableStreamingEncode, which is now stale relative to workerCount. In practice this only matters if a 1-worker streaming render falls back into a multi-worker retry — which I don't think happens (retries usually scale workers down, not up) — but the invariant is undocumented and one refactor away from breaking. Worth a one-line comment that says "streaming encode is locked in at this point; retries can shrink workerCount but can't grow it past 1 once streaming has started."
6. job.duration semantics
The whole gate hinges on job.duration being in seconds. The constant name (STREAMING_ENCODE_MAX_DURATION_SECONDS) implies seconds, and the test (240 for boundary) confirms the intent. Worth a one-line type or comment at the function signature so a future reader doesn't have to chase the convention through the call site.
What's good
shouldUseStreamingEncodetakingPick<EngineConfig, "enableStreamingEncode">rather than the full config — clean, testable, minimal coupling.- Boundary tested at
240and240.001— that's the right way to test a strict-greater-than cutoff. - Sensible default (
Number.isFinite(durationSeconds) || durationSeconds <= 0) for malformed durations. png-sequenceandworkerCount > 1short-circuits documented in the gate function and the comment block.- The
extractStandaloneEntryFromIndexrefactor is a real readability win (build clone off-DOM, swap body once). - Default-on plus explicit env opt-out is the right shape if the rollout concern is addressed.
- Benchmark numbers actually showing the guarded default beating the disk-frame path it replaces (5509ms vs 5719ms) — that's the right shape of evidence.
Summary
Approve once (1) constant lives in EngineConfig with a comment and (2) at least one structured log on the gate decision are added. (3) Rollout strategy is the one I'd push back on — even if you keep the code default flipped, please pair it with a runtime fallback on streaming spawn failure, because "this used to fail in prod and we have a heuristic we hope catches it" is a stance that needs a safety net.
|
@jrusso1020 addressed the review feedback in 4f52a37:
Validation run in this pass:
CI is running on the new head now. |
Problem
Streaming encode is available, but
resolveConfig()still defaulted it off, so sequential renders kept writing every captured frame to disk before encoding.Production has also seen streaming encode fail on longer videos because the ffmpeg streaming process can hit its timeout. We need the same guard used in GSAP rendering: stream only for videos up to four minutes by default, make that cutoff operator-tunable, and use normal frame-dir encoding for longer videos.
The standalone sub-composition render extraction also mutated
bodybefore the final render subtree was fully assembled, which made the logic more fragile than necessary.What this fixes
enableStreamingEncodetotruein engine config while preservingPRODUCER_ENABLE_STREAMING_ENCODE=falseas the opt-out.streamingEncodeMaxDurationSecondstoEngineConfig, defaulting to240, withPRODUCER_STREAMING_ENCODE_MAX_DURATION_SECONDSas the env override.streaming-encode gatedecision with config flag, format, worker count, duration seconds, max duration seconds, and result.png-sequencerenders on the existing disk-frame path.Root cause
Streaming encode was configured as an opt-in experimental path, so default renders could not benefit from it. A straight global flip is not safe without guards:
FFMPEG_STREAMING_TIMEOUT_MSVerification
Local checks
bunx vitest run packages/engine/src/config.test.ts packages/producer/src/services/renderOrchestrator.test.tsbun run --filter @hyperframes/engine typecheckbun run --filter @hyperframes/producer typecheckbunx oxfmt --check packages/engine/src/config.ts packages/engine/src/config.test.ts packages/producer/src/services/renderOrchestrator.ts packages/producer/src/services/renderOrchestrator.test.tsbunx oxlint packages/engine/src/config.ts packages/engine/src/config.test.ts packages/producer/src/services/renderOrchestrator.ts packages/producer/src/services/renderOrchestrator.test.tsbun run --filter /producer buildgit diff --checkRender / Perf
PRODUCER_ENABLE_STREAMING_ENCODE=false bun run --filter @hyperframes/producer benchmark -- --only css-spinner-render-compat --runs 15719ms total,803ms encode,201MiBpeak RSS8768ms total,7321ms encode,185MiBpeak RSS5509ms total,761ms encode,180MiBpeak RSSbun packages/cli/src/cli.ts init /tmp/hf-streaming-rollout-proof --example blank --non-interactive --skip-skillsbun packages/cli/src/cli.ts render /tmp/hf-streaming-rollout-proof --workers 1 --fps 24 --quality draft --output /tmp/hf-streaming-rollout-proof-artifacts/output.mp4streaming-encode gate {"enabled":true,"configFlag":true,"outputFormat":"mp4","workerCount":1,"durationSeconds":10,"maxDurationSeconds":240}Streaming frame 1/240throughStreaming frame 240/240ffprobeconfirmed H.264,1920x1080,24/1,10.000000sPRODUCER_STREAMING_ENCODE_MAX_DURATION_SECONDS=1 bun packages/cli/src/cli.ts render /tmp/hf-streaming-rollout-proof --workers 1 --fps 24 --quality draft --output /tmp/hf-streaming-rollout-proof-artifacts/output-env-cutoff.mp4streaming-encode gate {"enabled":false,"configFlag":true,"outputFormat":"mp4","workerCount":1,"durationSeconds":10,"maxDurationSeconds":1}Capturing frame 1/240throughCapturing frame 240/240, thenEncoding videoffprobeconfirmed H.264,1920x1080,24/1,10.000000sBrowser Verification
No browser UI changed. The render proofs exercised the headless browser capture path used by producer renders.
Notes
The
enableStreamingEncodegate now runs after auto-worker calibration, so it sees the final post-calibration worker count. The standalone render-clone extraction andRenderConfig["format"]type cleanup are readability/refactor pieces bundled with this branch; future changes like those should split out when they are not needed by the behavior change.This intentionally does not enable streaming for auto-parallel renders yet. The next safe step would be a non-blocking ordered writer that lets parallel workers continue capturing while a single drain loop writes frames to ffmpeg in sequence.