Phase 28: deploy-runtime layer (Dockerfiles, deploy workflows, deploy IAM, /healthz, lifespan, backfill roster) #96
Merged
…serving apps Unauthenticated /healthz on earnings + weather serving (exempt from auth + ratelimit + the weather global ceiling — the Cloud Run probe idiom). Wire the cross-project earnings-streaming SegmentSubscriber into the earnings app lifespan (opt-in via EARNINGS_STREAMING_SUBSCRIPTION; no-op default, H2 single-instance) via a registry-backed bus adapter so /stream can find the bus.
Audio-free earnings-serving image (parquet extra only, no whisper/av/ffmpeg, firewall a); shared weather ingest image (satellite CLI); capture/stt/rolefact CUDA/CPU images + thin python -m services.earnings.jobs.* entrypoints that drive the shipped engine libraries (lazy audio imports; audio dies on ephemeral disk).
…--incremental Committed 66-station Kalshi∪Polymarket roster (D-28.8, drift-checked vs the live markets catalogs). backfill --roster resolves + shards by BATCH_TASK_INDEX (one array-task per station); --incremental yesterday scopes to the current year with resume; --progress-bucket accepted (GCS marker wiring TODO'd, 28-21 C4). The shipped infra/batch.tf container args now run.
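The one-array-task-per-station sharding described above can be sketched as follows. This is a minimal illustration, not the shipped CLI: `station_for_task` is a hypothetical helper name, and sorting the roster to get a stable order is an assumption. `BATCH_TASK_INDEX` is the 0-based index Cloud Batch sets on each task.

```python
import os
from typing import Mapping, Optional, Sequence


def station_for_task(
    roster: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> str:
    """Return the single station owned by this array task (hypothetical helper)."""
    env = os.environ if env is None else env
    # Cloud Batch exposes the 0-based task index as BATCH_TASK_INDEX.
    index = int(env.get("BATCH_TASK_INDEX", "0"))
    # Assumption: sort so every task derives the same station order.
    ordered = sorted(roster)
    if not 0 <= index < len(ordered):
        raise IndexError(f"task index {index} outside roster of {len(ordered)}")
    return ordered[index]
```

With this shape, the job's task count has to equal the roster length, otherwise some stations are never assigned to a task.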
Per-service workflow_dispatch deploys (serving, capture, stt, rolefact, weather ingest); run-weather-backfill gates the full 66-shard ~28 TB fleet behind an explicit cost sign-off (1-station pilot default). deploy_iam.tf adds the Codex-flagged deploy-time grants: public run.invoker on the serving services (GATE #2), deploy-SA run.developer + act-as on runtime SAs + artifactregistry.writer.
…e workflows
Codex + Python Architect findings (P1/P2):
- P1: STT deploys as a Cloud Run SERVICE but ran a one-shot CLI (never bound $PORT). Add services.earnings.jobs.stt_server (uvicorn /healthz + /transcribe wrapping the shipped transcriber); repoint stt.Dockerfile to serve it (one-shot CLI kept for the GCE MIG fallback).
- P2: run-weather-backfill pilot passed only --stations (explicit CLI mode needs satellites/products/year-window/out); pass the full explicit arg set.
- P2: script-injection: image_tag/pilot_station now flow via env: and the Batch JSON is built with jq --arg (no shell/JSON interpolation) across all new workflows.
- P2: rolefact skips the R2 upload when a zero-mention call writes no fact parquet.
- P2: satellite CLI warns loudly for roster stations outside the GOES footprint.
- P3: /healthz exemption is now trailing-slash tolerant across all 5 middlewares.
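A trailing-slash tolerant exemption check of the kind the P3 item describes might look like this. It is only a sketch: `is_health_probe` is a hypothetical name, and how each middleware actually calls such a check is not shown in this PR text.

```python
def is_health_probe(path: str) -> bool:
    """True for /healthz with or without trailing slashes (hypothetical helper)."""
    # Strip trailing slashes so "/healthz/" matches; "/" itself becomes "".
    return path.rstrip("/") == "/healthz"
```

Each middleware (auth, rate limit, global ceiling) would consult the same predicate before applying its own policy, so a probe is not rejected by one layer after another layer exempted it.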
…nv contract
Codex round-2 (gpt-5.5) P1/P2: the deployed capture/STT/rolefact containers must match infra/cloud_run.tf's env contract:
- capture: consume CAPTURE_JOBS_SUBSCRIPTION (pull the per-call spec) + upload the transient audio to AUDIO_HANDOFF_BUCKET (private GCS handoff, never R2); direct CAPTURE_* env kept as an operator override.
- stt: transcribe_call/main/POST-/transcribe accept a gs:// handoff object in AUDIO_HANDOFF_BUCKET (capture + STT don't share a disk): download → transcribe → cleanup; local paths still work; /healthz stays model-free.
- rolefact: read R2_BUCKET (infra name); Dockerfile drops [earnings] (no faster-whisper/av in the post-audio CPU image; fact_builder imports clean without it).
- stt.Dockerfile: bootstrap pip for deadsnakes 3.12 (ensurepip) with a build-time guard; add google-cloud-storage.
- capture.Dockerfile: add pubsub + storage clients.
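To illustrate how a gs:// handoff reference can be told apart from a local path, here is a small sketch. `split_handoff_ref` is a hypothetical helper, and rejecting objects outside the configured bucket is an assumption about the intended behaviour rather than something the commit message states.

```python
from typing import Optional, Tuple


def split_handoff_ref(
    audio_path: str, handoff_bucket: str
) -> Optional[Tuple[str, str]]:
    """Return (bucket, object_name) for a gs:// ref, or None for a local path."""
    if not audio_path.startswith("gs://"):
        return None  # local path: transcribe in place, nothing to download
    bucket, _, object_name = audio_path[len("gs://"):].partition("/")
    # Assumption: only objects in the configured handoff bucket are accepted.
    if bucket != handoff_bucket or not object_name:
        raise ValueError(f"not an object in the handoff bucket: {audio_path!r}")
    return bucket, object_name
```

A caller would download the object when a tuple comes back and fall through to the existing local-path flow when the result is `None`.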
…progress store
Codex round-3 (gpt-5.5) P1/P2:
- P1: HKO (Hong Kong Observatory pseudo-station) has no satellite StationInfo, so its shard resolved to zero partitions (silent data hole). Exclude it via _roster._NON_SATELLITE_STATIONS -> the satellite roster is 65 (66 union minus HKO); task_count 66->65 in run-weather-backfill.yml + infra/batch.tf; a new test asserts EVERY roster station resolves (no empty shards).
- P1: STT now DELETES the source handoff object in AUDIO_HANDOFF_BUCKET after a successful transcription (kept on failure for retry), so raw audio no longer accumulates in GCS (D-27.9).
- P2: wire --progress-bucket to a durable GcsProgressStore (shard-disjoint marker URI) so preempted Spot slices rehydrate markers from GCS instead of reprocessing.
…scription
Codex round-4 (gpt-5.5) 2 P1:
- STT deletes the handoff audio object ONLY after the transcript is durably written (TranscriptLedger.append) + any live publish; the delete moved out of the download context manager into _delete_handoff_source. A ledger-write failure now KEEPS the source audio for retry instead of stranding the call. New tests assert the download->transcribe->ledger->delete order and the skip-on-failure.
- earnings-serving lifespan parses the INGEST project + bare id from the FULL subscription resource path the infra sets (projects/<ingest>/subscriptions/<n>); the serving instance's GOOGLE_CLOUD_PROJECT is the wrong project for the cross-project earnings-streaming subscription. New parse tests cover both forms.
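Parsing both subscription forms mentioned above (full resource path and bare id) can be sketched like this. The function name and the choice to fall back to a caller-supplied project for a bare id are assumptions for illustration.

```python
from typing import Tuple


def parse_subscription(value: str, default_project: str) -> Tuple[str, str]:
    """Split 'projects/<p>/subscriptions/<n>' or a bare '<n>' into (project, id)."""
    parts = value.split("/")
    if (
        len(parts) == 4
        and parts[0] == "projects"
        and parts[2] == "subscriptions"
        and parts[1]
        and parts[3]
    ):
        # Full resource path: the project comes from the path, not the runtime env.
        return parts[1], parts[3]
    if not value or "/" in value:
        raise ValueError(f"unrecognised subscription reference: {value!r}")
    # Bare id: assume the caller-supplied project.
    return default_project, value
```

Taking the project from the path is what keeps a cross-project subscription from being resolved against the serving instance's own project.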
Codex round-5 (gpt-5.5) P1/P2:
- P1: run-weather-backfill Batch secretVariables pointed at the satellite project, but the r2-*/eumetsat-* secrets live in the backend secrets project (var.secrets_project), so the submitted job would 404 on the secrets. Reference the backend project (AR_PROJECT) for the secret resource paths; drop the now-unused project-number var.
- P2: the live SSE path publishes from the STT SA (28-GCE-ARCHITECTURE §3), but pubsub.tf granted pubsub.publisher only to the rolefact SA, so live publish would 403. Grant the STT runtime SA publisher on the earnings-streaming topic (deploy_iam.tf).
…2s auth
Codex round-6 (gpt-5.5) 2 P1 + 1 P2 in the earnings live-ingest path:
- P1: capture now TRIGGERS STT after the handoff upload (POST the gs:// ref to the private STT service /transcribe) and acks the capture message ONLY after a 2xx, so there are no more captured-but-never-transcribed orphans; unset STT_SERVICE_URL fails loud. Adds STT_SERVICE_URL env (cloud_run.tf) + capture SA run.invoker on STT + a metadata-server ID token (audience=STT_SERVICE_URL) for the private-service call.
- P1: capture holds the Pub/Sub lease across 60-90min captures (background modify_ack_deadline loop) so a long call is not redelivered/duplicated.
- P2: STT image adds google-cloud-pubsub (the live-publish path imports pubsub_v1).
Note: the capture->STT->rolefact live orchestration is now wired end-to-end but its live validation stays operator-gated (28-10/11/13 Task 3/4), per the plans.
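The background lease-extension loop can be sketched generically as below. This is not the shipped capture code: `LeaseKeeper` is a hypothetical class, and the `extend` callable stands in for whatever call renews the Pub/Sub ack deadline.

```python
import threading
from typing import Callable


class LeaseKeeper:
    """Calls `extend` every `interval` seconds until the block exits (sketch)."""

    def __init__(self, extend: Callable[[], None], interval: float) -> None:
        self._extend = extend
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout (extend again) and True once stopped.
        while not self._stop.wait(self._interval):
            self._extend()

    def __enter__(self) -> "LeaseKeeper":
        self._thread.start()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._stop.set()
        self._thread.join()
```

Wrapping both the capture and the downstream trigger in one `with LeaseKeeper(...)` block keeps the message leased until the ack decision is made. Pub/Sub caps a single ack deadline at 600 seconds, so the interval would need to be comfortably shorter than that.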
📝 Docs-required check: REMINDER. API-surface change without docs and no opt-out. This check is advisory; it never blocks the merge.
Parity ticket gate: PASSED
…gger timeout/lease
Two P1s from the final Codex gate on the deployed split-container topology:
1. Transcript durability across containers. STT, role/fact, and serving run as SEPARATE Cloud Run resources with isolated ephemeral disks, so STT's local TranscriptLedger write was invisible to the role/fact Job (it would fail "no persisted transcript"). STT now publishes the audio-free transcript parquet to the R2 data plane (the architecture's durable text store); role/fact rehydrates it from R2 on a local-cache miss. Opt-in on R2_BUCKET (a co-located operator run with a shared disk is unchanged). Adds the STT R2-write IAM/env + boto3 in the STT image; only TEXT crosses to R2, never audio (D-27.9). New _r2_sink.download.
2. capture->STT trigger timeout + lease. The synchronous /transcribe transcribes the whole call before responding, but capture gave the POST only 60s, so every real call timed out mid-transcription -> NACK -> duplicate recapture. The timeout now covers a full transcription (STT_TRIGGER_TIMEOUT_SECONDS, default 3600s = the Cloud Run max the STT service is now pinned to), and the Pub/Sub lease is held ACROSS the trigger (it was released before it).
Serving reading transcripts/facts from R2 (Codex also noted serving) remains a separate services-layer follow-up: the earnings serving read path uses the local ledger and would need an R2 list+get wrapper across its routes. That is out of scope for this deploy-runtime PR and tracked separately.
Tests: STT R2 publish (+no-op when unset), role/fact R2 download-on-miss (+miss still fails loud), capture long/overridable timeout + lease-held-during-trigger.
…er writes, capture timeout
Three P1s from the round-2 Codex gate, all making the deployed capture->STT->rolefact pipeline actually runnable and correct:
1. Provision the audio handoff bucket + IAM. The infra referenced earnings-audio-handoff-<projnum> by env but never CREATED it, and granted no storage access, so the first real capture upload / STT download would 403 and the pipeline could never run end to end. Adds the private, in-firewall GCS bucket (co-located with STT, uniform access, public-access-prevention, 1-day orphan reaper) + capture objectAdmin (write/overwrite) and STT objectAdmin (get + post-ledger delete). Corrects the now-inaccurate "read-only" STT SA description.
2. Idempotent ledger writes. TranscriptLedger/FactLedger.append concatenates, so a retried STT/rolefact run (redelivery, R2-upload error, capture timeout) DOUBLED the rows and made role/fact double-count mentions, corrupting settlement data. Adds an atomic TranscriptLedger/FactLedger.replace (overwrite under the same FileLock, same audio-free + fail-closed-Kalshi guards); STT and rolefact now REPLACE their complete-per-call artifact instead of appending.
3. Capture job timeout. Round-7 made capture wait synchronously for STT, but the capture Cloud Run Job timeout was still 5400s (capture only), so a real capture + transcription could exceed it and be killed before ack -> redelivery -> duplicate recapture. Bumped to 9000s to cover capture + the synchronous STT wait (the decoupled trigger remains the real fix).
Tests: ledger replace idempotency (overwrite / shrink / empty-removes / fail-closed-still-runs); STT + rolefact re-run-not-doubled end to end.
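The difference between appending and replacing a per-call artifact can be shown with a deliberately simplified sketch. It stores rows as JSON rather than parquet and omits the FileLock and content guards the commit mentions; all function names here are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List

Row = Dict[str, object]


def read_rows(path: Path) -> List[Row]:
    return json.loads(path.read_text()) if path.exists() else []


def append_rows(path: Path, rows: List[Row]) -> None:
    # Concatenates: a retried run adds the same rows a second time.
    path.write_text(json.dumps(read_rows(path) + rows))


def replace_rows(path: Path, rows: List[Row]) -> None:
    # Overwrites: a retried run converges on the same final state.
    if not rows:
        path.unlink(missing_ok=True)  # an empty artifact removes the partition
        return
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(rows, handle)
    os.replace(tmp, path)  # atomic rename within one filesystem
```

Running `append_rows` twice with the same rows doubles them, while running `replace_rows` twice leaves exactly one copy, which is the property a redelivered job needs.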
…ear, stale-R2-fact tombstone
Three findings from the round-3 Codex gate:
1. [P1] Roster backfill silently under-covered non-GOES stations. The full fleet
(--roster kalshi,polymarket, no --satellites) sent every shard through GOES-16/18,
whose footprint is only the Americas / E-Pacific — so ~half the 65-station roster
(EDDM, RJTT, FACT, ...) fetched nothing and then marked the empty slice complete,
leaving the advertised backfill unpopulated and wasting Spot. Roster mode now
EXCLUDES (and loudly logs) stations outside the GOES footprint under the default
satellites; an all-non-GOES shard cleanly no-ops. Global native-ring coverage
(Himawari/Meteosat/VIIRS) stays the 28-26 follow-up, reached via --satellites
(which bypasses the filter). batch.tf / the workflow docs updated to match.
2. [P2] --incremental yesterday keyed on TODAY's year, so on Jan 1 (UTC) it never
refreshed the prior-year Dec 31 partition ("yesterday"). Now keys on yesterday's
year.
3. [P2] A zero-fact role/fact rerun cleared the LOCAL partition (idempotent replace)
but left the prior nonzero run's earnings/facts/<t>/<c>.parquet in R2 — serving
reads R2 as the durable store, so it kept serving stale facts. The zero-row path
now tombstones the R2 object (new _r2_sink.delete, idempotent).
Tests: roster GOES-footprint filter (kept/dropped/all-non-GOES-noop), shard-routing
tests isolated with explicit --satellites, yesterday-year across boundary, zero-fact
R2 delete.
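The year-boundary fix in finding 2 comes down to taking the year of yesterday's date instead of today's. A minimal sketch, with a hypothetical function name:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def incremental_year(now: Optional[datetime] = None) -> int:
    """Year of the partition that holds 'yesterday' (UTC)."""
    now = now or datetime.now(timezone.utc)
    # Subtract a day first, then read the year, so Jan 1 maps to the prior year.
    return (now - timedelta(days=1)).year
```

For a run at 00:30 UTC on 1 January 2025, yesterday is 31 December 2024, so the function returns 2024 where a today-keyed version would return 2025.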
…from R2
Two findings from the round-4 Codex gate, completing the hosted earnings data plane:
1. [P1] Serving read the container-local ledger, not R2. STT/role-fact publish the transcript + fact parquet to R2, but the earnings serving app built ServingState against its ephemeral local disk, so a fresh Cloud Run instance returned EMPTY /transcripts and /facts even after ingest succeeded. Adds services/earnings/r2_read.py (EarningsR2Reader + R2LedgerSource) mirroring the weather serving R2 read path: READ-ONLY token, list+get, settlement-safe NoSuchKey→[] vs real-error propagation, audio-free. ServingState.build now reads from R2 when the read token is present and no explicit ledger_root is given (local ledger for tests / on-device). The earnings_serving infra ALREADY injects the R2 read token + R2_BUCKET (no infra change); the serving image gains boto3.
2. [P2] A failed zero-fact R2 tombstone was swallowed. role/fact's zero-row branch caught+ignored delete() errors, so a real auth/network failure exited "success" while the stale fact object kept being served. Since delete_object is idempotent (missing key succeeds), any exception is a real failure and is now propagated so the job fails and retries.
Tests: EarningsR2Reader parse / miss-is-empty / real-error-propagates / list tickers+call_ids / unsafe-segment; ServingState R2-vs-local gating; /transcripts + /facts serve R2 rows end to end.
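The "miss is empty, real error propagates" rule can be sketched without any storage client. `ObjectMissing` and `read_rows_or_empty` are hypothetical stand-ins; with boto3 a miss typically surfaces as a `ClientError` whose error code is `NoSuchKey`, which a real reader would map onto the same branch.

```python
from typing import Callable, Dict, List


class ObjectMissing(Exception):
    """Hypothetical stand-in for the object store's NoSuchKey error."""


def read_rows_or_empty(
    fetch: Callable[[str], List[Dict[str, object]]], key: str
) -> List[Dict[str, object]]:
    try:
        return fetch(key)
    except ObjectMissing:
        # A missing object means "no rows for this call yet".
        return []
    # Anything else (auth, network) is deliberately not caught, so the caller
    # sees a failure instead of a silently empty result.
```

Keeping the two cases apart matters for settlement data: an empty list is a valid answer, while a swallowed permission error would look identical to "no facts".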
… handoff bucket
[P1] The capture job pulled a real Pub/Sub message from CAPTURE_JOBS_SUBSCRIPTION but, when AUDIO_HANDOFF_BUCKET was unset (deploy misconfig), took the bare-local branch: printed the ephemeral path and ACKED the message. The job then exited and deleted the only audio copy without ever uploading it or triggering STT, and Pub/Sub, already acked, never redelivered → silent settlement-audio loss. Now fail loud EARLY (before the expensive capture), leaving the message un-acked so it redelivers once the env is fixed. The no-handoff-bucket path stays legitimate ONLY for the bare-local operator override (_NoopHandle: no Pub/Sub message, ack is a no-op).
Test: subscription pull + AUDIO_HANDOFF_BUCKET unset raises + does not ack.
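The early guard described here can be sketched as a small precondition check that runs before any capture work. The function name and the boolean flag are hypothetical; only the AUDIO_HANDOFF_BUCKET variable comes from the commit message.

```python
from typing import Mapping


def require_handoff_bucket(env: Mapping[str, str], has_pubsub_message: bool) -> str:
    """Return the handoff bucket, or raise before capture when it is required."""
    bucket = env.get("AUDIO_HANDOFF_BUCKET", "")
    if has_pubsub_message and not bucket:
        # Raising here leaves the message un-acked so Pub/Sub redelivers it.
        raise RuntimeError(
            "AUDIO_HANDOFF_BUCKET is unset; refusing to capture a queued call"
        )
    # Empty is allowed only for the bare-local operator override (no message).
    return bucket
```

Because the check raises before the ack, a misconfigured deploy costs a retry rather than the only copy of the audio.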
… from env
[P2] The deployed capture->STT trigger posts only {audio_path, ticker, call_id},
so the STT Cloud Run SERVICE (/transcribe) kept publish_live=False and never called
_maybe_publish_live — yet the serving app ALWAYS starts the earnings-streaming
subscriber (EARNINGS_STREAMING_SUBSCRIPTION is set unconditionally). So /stream had
nothing to fan out for hosted calls unless every request manually supplied the
streaming fields.
stt_server now derives publish_live / streaming_project / streaming_topic from the
service env (mirroring the one-shot jobs/stt.py main()), and the STT service infra
sets EARNINGS_STREAMING_ENABLED=1 / _PROJECT / _TOPIC (the STT SA already holds
pubsub.publisher on the topic). An explicit request field still overrides the env.
Tests: env-derived publish (on), no-env (off), request-override.
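The precedence described above (explicit request field first, service env second) can be sketched as follows. The function name is hypothetical, and treating only the literal value "1" as enabled is an assumption based on the EARNINGS_STREAMING_ENABLED=1 setting named in the message.

```python
from typing import Mapping, Optional


def resolve_publish_live(requested: Optional[bool], env: Mapping[str, str]) -> bool:
    """Decide whether to publish live segments for one /transcribe request."""
    if requested is not None:
        # An explicit request field always wins over the service env.
        return requested
    # Assumption: "1" means enabled; anything else (or unset) means off.
    return env.get("EARNINGS_STREAMING_ENABLED") == "1"
```

The three cases line up with the listed tests: env set turns publishing on, no env leaves it off, and a request value overrides either.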
…stration seam Codex R7-6 flagged that STT does not auto-trigger the role/fact stage, so hosted /facts stays empty until role/fact runs. Documented as an intentional deferral in services/earnings/jobs/stt.py: role/fact needs the per-market TERM SPECS (ROLEFACT_TERMS) which STT does not have — they are market-specific and must be threaded capture->STT->role/fact (or fetched from the markets catalog), and WHO triggers role/fact is the same operator-gated orchestration decision the capture->STT entrypoint already documents. Tracked as a follow-up; the data plane (capture->STT->R2->role/fact->R2->serving) is fully wired, only the auto-trigger orchestration remains operator/scheduler-driven.
…e-layer
# Conflicts:
#	infra/batch.tf
Phase 28: deploy-runtime layer

Phase 28 (hosted GCE data platform) landed the serving apps, Terraform, and the working weather-serving deploy on main (PR #92). This PR adds everything that turns "code + terraform" into a deployable platform: the container images the Terraform expects, the entrypoints they run, per-service manual-only deploy workflows, deploy-time IAM, /healthz, the SegmentSubscriber lifespan, and the settlement-station roster the batch.tf already referenced but the CLI couldn't yet resolve.

What's here
6 Dockerfiles + entrypoints (7 images), all non-published (COPY services/, never a PyPI dist):

| Dockerfile | Entrypoint |
|---|---|
| deploy/earnings/serving.Dockerfile | uvicorn services.earnings.app:app (audio-free: [parquet] extra only) |
| deploy/earnings/capture.Dockerfile | python -m services.earnings.jobs.capture |
| deploy/earnings/stt.Dockerfile (CUDA L4) | uvicorn services.earnings.jobs.stt_server:app (HTTP service) |
| deploy/earnings/rolefact.Dockerfile | python -m services.earnings.jobs.rolefact (CPU, no audio toolchain) |
| deploy/weather/serving.Dockerfile (existing) | uvicorn services.weather.app:app |
| deploy/weather/ingest.Dockerfile | python -m mostlyright.weather.satellite backfill … |

/healthz on both serving apps: unauthenticated, exempt from auth + rate-limit + the weather global ceiling (trailing-slash tolerant), so a Cloud Run probe is never 401/429'd.

SegmentSubscriber lifespan on earnings-serving: opt-in (EARNINGS_STREAMING_SUBSCRIPTION); parses the ingest project from the full subscription resource path; no-op by default (H2 single-instance preserved).

Settlement-station backfill roster: the committed 65-station Kalshi∪Polymarket satellite roster (66 union minus the non-satellite HKO), drift-checked against the live markets catalogs; a test asserts every roster station resolves to a satellite StationInfo (no empty shards). The satellite CLI gains --roster / shard-by-BATCH_TASK_INDEX / --incremental / --progress-bucket (durable GcsProgressStore for crash-safe Spot resume), so the shipped batch.tf container args actually run.

7 manual-only deploy workflows (workflow_dispatch, WIF-keyless). run-weather-backfill.yml encodes the rollout: 1-station pilot by default, and the full 65-shard (~28 TB) fleet is blocked unless the operator sets mode=full + confirm_cost_signoff=true (the H5 cost sign-off). Batch job JSON is built injection-safely via env: passthrough + jq --arg.

Deploy-time IAM (infra/deploy_iam.tf), covering the Codex-flagged gaps: public run.invoker on the serving services (GATE #2, behind fail-closed API-key auth), deploy-SA run.developer + iam.serviceAccountUser (act-as, scoped per runtime SA) + artifactregistry.writer, batch.jobsEditor, and a pubsub.publisher grant for the STT SA (the live-SSE publisher).

Rollout (as requested)
Cheap path first (serving + daily incremental), then the 1-station backfill pilot, then the full 28 TB fleet gated on the cost sign-off input.
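The cost gate on the full fleet can be expressed as a small decision function. This is a sketch of the rule only, not the workflow's actual shell logic: the function name and the "pilot" mode label are assumptions, while mode=full and confirm_cost_signoff=true come from the description above. Workflow inputs arrive as strings, so the sign-off is compared as a string.

```python
def backfill_task_count(
    mode: str, confirm_cost_signoff: str, fleet_size: int = 65
) -> int:
    """Number of Batch array tasks to submit for a backfill run (sketch)."""
    if mode == "pilot":  # assumed name for the 1-station default
        return 1
    if mode == "full":
        if confirm_cost_signoff != "true":
            raise PermissionError("full fleet requires confirm_cost_signoff=true")
        return fleet_size
    raise ValueError(f"unknown mode: {mode!r}")
```

Defaulting to the pilot and requiring two explicit inputs for the full run means an accidental dispatch can submit at most one shard.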
Review
Two-reviewer loop per REVIEW-DISCIPLINE: Python Architect (no P1; P2s fixed) + Codex gpt-5.5 medium across 6 rounds. Every P1/P2 is fixed and covered by tests (STT HTTP entrypoint, ingest env-contract reconciliation to cloud_run.tf, HKO roster hole, STT delete-after-ledger + handoff GCS wiring, cross-project subscription parsing, Batch secrets project, capture→STT triggering + Pub/Sub lease extension, and more). All new + existing serving/jobs/satellite tests green; ruff clean.

The earnings live ingest pipeline (capture → STT → rolefact) is now wired end-to-end, but per the plans (28-10/11/13 Task 3/4) its live validation stays operator-gated; it is not on the cheap-path rollout. Weather + serving are the deploy-first paths.
tofu validate on the merged Phase 28 infra fails on two pre-existing (PR #92) issues, independent of this branch, spun off as their own tasks:

- infra/batch.tf declares google_batch_job, which is not a real provider resource (Cloud Batch is submit-only). This PR's run-weather-backfill.yml deliberately submits via gcloud batch jobs submit (the correct API path), so the deploy path here is unaffected.
- infra/cloud_run.tf places max_instance_count in the top-level scaling {} block (belongs in template { scaling {} }).

Both must be fixed for the platform to tofu apply.

🤖 Generated with Claude Code