Phase 28: deploy-runtime layer (Dockerfiles, deploy workflows, deploy IAM, /healthz, lifespan, backfill roster)#96

Merged
helloiamvu merged 18 commits into main from phase28/deploy-runtime-layer on Jul 3, 2026
@helloiamvu

Phase 28 — deploy-runtime layer

Phase 28 (hosted GCE data platform) landed the serving apps, Terraform, and the working weather-serving deploy on main (PR #92). This PR adds everything that turns "code + terraform" into a deployable platform: the container images the Terraform expects, the entrypoints they run, per-service manual-only deploy workflows, deploy-time IAM, /healthz, the SegmentSubscriber lifespan, and the settlement-station roster the batch.tf already referenced but the CLI couldn't yet resolve.

What's here

6 Dockerfiles + entrypoints (7 images) — all non-published (COPY services/, never a PyPI dist):

| Image | Dockerfile | Entrypoint |
|---|---|---|
| earnings-serving | deploy/earnings/serving.Dockerfile | uvicorn services.earnings.app:app (audio-free: [parquet] extra only) |
| earnings-capture | deploy/earnings/capture.Dockerfile | python -m services.earnings.jobs.capture |
| earnings-stt | deploy/earnings/stt.Dockerfile (CUDA L4) | uvicorn services.earnings.jobs.stt_server:app (HTTP service) |
| earnings-rolefact | deploy/earnings/rolefact.Dockerfile | python -m services.earnings.jobs.rolefact (CPU, no audio toolchain) |
| weather-serving | deploy/weather/serving.Dockerfile (existing) | uvicorn services.weather.app:app |
| weather-backfill + weather-incremental | deploy/weather/ingest.Dockerfile | python -m mostlyright.weather.satellite backfill … |

/healthz on both serving apps — unauthenticated, exempt from auth + rate-limit + the weather global ceiling (trailing-slash tolerant), so a Cloud Run probe is never 401/429'd.
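A minimal sketch of the trailing-slash-tolerant exemption check, assuming each middleware consults a shared predicate before applying auth or rate limiting. The function name `is_health_probe` is hypothetical and not taken from the PR.

```python
def is_health_probe(path: str) -> bool:
    """Return True for the probe path, with or without a trailing slash."""
    # rstrip("/") maps both "/healthz" and "/healthz/" to "/healthz";
    # "/healthz/extra" and "/" are left as non-matching values.
    return path.rstrip("/") == "/healthz"
```

A middleware would call this first and pass the request straight through when it returns True, so the probe never reaches the 401/429 branches.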

SegmentSubscriber lifespan on earnings-serving — opt-in (EARNINGS_STREAMING_SUBSCRIPTION); parses the ingest project from the full subscription resource path; no-op by default (H2 single-instance preserved).
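The subscription parsing can be sketched roughly as follows. This assumes the setting is either a full `projects/<project>/subscriptions/<id>` path or a bare id; the function name and the fallback to a caller-supplied default project for the bare form are assumptions, not the PR's actual code.

```python
from typing import Tuple


def parse_subscription(value: str, default_project: str) -> Tuple[str, str]:
    """Split a subscription setting into (project, subscription_id)."""
    parts = value.split("/")
    # Full resource path: projects/<project>/subscriptions/<id>
    if len(parts) == 4 and parts[0] == "projects" and parts[2] == "subscriptions":
        project, sub_id = parts[1], parts[3]
        if project and sub_id:
            return project, sub_id
    # Bare id: no slashes at all. Using default_project here is an assumption.
    if value and "/" not in value:
        return default_project, value
    raise ValueError(f"unrecognised subscription value: {value!r}")
```

Taking the project from the path rather than from the serving instance's own project is what makes the cross-project subscription resolve correctly.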

Settlement-station backfill roster — the committed 65-station Kalshi∪Polymarket satellite roster (66 union minus the non-satellite HKO), drift-checked against the live markets catalogs; a test asserts every roster station resolves to a satellite StationInfo (no empty shards). The satellite CLI gains --roster / shard-by-BATCH_TASK_INDEX / --incremental / --progress-bucket (durable GcsProgressStore for crash-safe Spot resume), so the shipped batch.tf container args actually run.
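As an illustration of one-station-per-array-task sharding, the sketch below picks a station from the roster using `BATCH_TASK_INDEX`, which Cloud Batch sets per task. The function name, the roster being an ordered list, and the bounds check are assumptions; the station codes in the usage line are only illustrative.

```python
import os
from typing import List, Mapping, Optional


def station_for_task(roster: List[str], env: Optional[Mapping[str, str]] = None) -> str:
    """Return the roster station assigned to this Batch array task."""
    env = os.environ if env is None else env
    index = int(env["BATCH_TASK_INDEX"])
    # Fail instead of wrapping around: a task count larger than the
    # roster would otherwise silently reprocess a station.
    if not 0 <= index < len(roster):
        raise IndexError(f"task index {index} outside roster of {len(roster)}")
    return roster[index]
```

For example, `station_for_task(["EDDM", "FACT", "RJTT"], {"BATCH_TASK_INDEX": "1"})` selects the second entry. The mapping is only stable if every task sees the roster in the same order.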

7 manual-only deploy workflows (workflow_dispatch, WIF-keyless). run-weather-backfill.yml encodes the rollout: 1-station pilot by default, and the full 65-shard (~28 TB) fleet is blocked unless the operator sets mode=full + confirm_cost_signoff=true (the H5 cost sign-off). Batch job JSON is built injection-safely via env: passthrough + jq --arg.
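The cost sign-off gate amounts to a small decision rule, sketched here in Python even though the real gate lives in the workflow YAML. The function name and the `"pilot"` mode label are assumptions; `mode=full` and `confirm_cost_signoff` are the inputs named above.

```python
def resolve_task_count(mode: str, confirm_cost_signoff: bool, full_count: int = 65) -> int:
    """Return how many Batch tasks to submit, refusing an unsigned full run."""
    if mode == "pilot":
        return 1  # single-station pilot, the default path
    if mode == "full":
        if not confirm_cost_signoff:
            raise PermissionError("full fleet requires confirm_cost_signoff=true")
        return full_count
    raise ValueError(f"unknown mode: {mode!r}")
```

Raising on an unknown mode, rather than defaulting to the pilot, keeps a typo in the input from being mistaken for a deliberate choice.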

Deploy-time IAM (infra/deploy_iam.tf) — the Codex-flagged gaps: public run.invoker on the serving services (GATE #2, behind fail-closed API-key auth), deploy-SA run.developer + iam.serviceAccountUser (act-as, scoped per runtime SA) + artifactregistry.writer, batch.jobsEditor, and a pubsub.publisher grant for the STT SA (the live-SSE publisher).

Rollout (as requested)

Cheap path first (serving + daily incremental), then the 1-station backfill pilot, then the full 28 TB fleet gated on the cost sign-off input.

Review

Two-reviewer loop per REVIEW-DISCIPLINE: Python Architect (no P1; P2s fixed) + Codex gpt-5.5 medium across 6 rounds — every P1/P2 fixed and covered by tests (STT HTTP entrypoint, ingest env-contract reconciliation to cloud_run.tf, HKO roster hole, STT delete-after-ledger + handoff GCS wiring, cross-project subscription parsing, Batch secrets project, capture→STT triggering + Pub/Sub lease extension, and more). All new + existing serving/jobs/satellite tests green; ruff clean.

The earnings live ingest pipeline (capture → STT → rolefact) is now wired end-to-end, but per the plans (28-10/11/13 Task 3/4) its live validation stays operator-gated — it is not on the cheap-path rollout. Weather + serving are the deploy-first paths.

⚠️ Pre-existing infra bugs surfaced (out of scope, tracked separately)

tofu validate on the merged Phase 28 infra fails on two pre-existing (PR #92) issues, independent of this branch — spun off as their own tasks:

  • infra/batch.tf declares google_batch_job, which is not a real provider resource (Cloud Batch is submit-only). This PR's run-weather-backfill.yml deliberately submits via gcloud batch jobs submit (the correct API path), so the deploy path here is unaffected.
  • infra/cloud_run.tf places max_instance_count in the top-level scaling {} block (belongs in template { scaling {} }).

Both must be fixed for the platform to tofu apply.

🤖 Generated with Claude Code

helloiamvu added 10 commits July 3, 2026 19:15
…serving apps

Unauthenticated /healthz on earnings + weather serving (exempt from auth +
ratelimit + the weather global ceiling — the Cloud Run probe idiom). Wire the
cross-project earnings-streaming SegmentSubscriber into the earnings app
lifespan (opt-in via EARNINGS_STREAMING_SUBSCRIPTION; no-op default, H2
single-instance) via a registry-backed bus adapter so /stream can find the bus.
Audio-free earnings-serving image (parquet extra only, no whisper/av/ffmpeg,
firewall a); shared weather ingest image (satellite CLI); capture/stt/rolefact
CUDA/CPU images + thin python -m services.earnings.jobs.* entrypoints that drive
the shipped engine libraries (lazy audio imports; audio dies on ephemeral disk).
…--incremental

Committed 66-station Kalshi∪Polymarket roster (D-28.8, drift-checked vs the live
markets catalogs). backfill --roster resolves + shards by BATCH_TASK_INDEX (one
array-task per station); --incremental yesterday scopes to the current year with
resume; --progress-bucket accepted (GCS marker wiring TODO'd, 28-21 C4). The
shipped infra/batch.tf container args now run.
Per-service workflow_dispatch deploys (serving, capture, stt, rolefact, weather
ingest); run-weather-backfill gates the full 66-shard ~28 TB fleet behind an
explicit cost sign-off (1-station pilot default). deploy_iam.tf adds the
Codex-flagged deploy-time grants: public run.invoker on the serving services
(GATE #2), deploy-SA run.developer + act-as on runtime SAs + artifactregistry.writer.
…e workflows

Codex + Python Architect findings (P1/P2):
- P1: STT deploys as a Cloud Run SERVICE but ran a one-shot CLI (never bound $PORT).
  Add services.earnings.jobs.stt_server (uvicorn /healthz + /transcribe wrapping the
  shipped transcriber); repoint stt.Dockerfile to serve it (one-shot CLI kept for the
  GCE MIG fallback).
- P2: run-weather-backfill pilot passed only --stations (explicit CLI mode needs
  satellites/products/year-window/out) — pass the full explicit arg set.
- P2: script-injection — image_tag/pilot_station now flow via env: and the Batch JSON
  is built with jq --arg (no shell/JSON interpolation) across all new workflows.
- P2: rolefact skips the R2 upload when a zero-mention call writes no fact parquet.
- P2: satellite CLI warns loudly for roster stations outside the GOES footprint.
- P3: /healthz exemption is now trailing-slash tolerant across all 5 middlewares.
…nv contract

Codex round-2 (gpt-5.5) P1/P2 — the deployed capture/STT/rolefact containers must
match infra/cloud_run.tf's env contract:
- capture: consume CAPTURE_JOBS_SUBSCRIPTION (pull the per-call spec) + upload the
  transient audio to AUDIO_HANDOFF_BUCKET (private GCS handoff, never R2); direct
  CAPTURE_* env kept as an operator override.
- stt: transcribe_call/main/POST-/transcribe accept a gs:// handoff object in
  AUDIO_HANDOFF_BUCKET (capture + STT don't share a disk) — download → transcribe →
  cleanup; local paths still work; /healthz stays model-free.
- rolefact: read R2_BUCKET (infra name); Dockerfile drops [earnings] (no
  faster-whisper/av in the post-audio CPU image — fact_builder imports clean without it).
- stt.Dockerfile: bootstrap pip for deadsnakes 3.12 (ensurepip) with a build-time
  guard; add google-cloud-storage. capture.Dockerfile: add pubsub + storage clients.
…progress store

Codex round-3 (gpt-5.5) P1/P2:
- P1: HKO (Hong Kong Observatory pseudo-station) has no satellite StationInfo, so
  its shard resolved to zero partitions (silent data hole). Exclude it via
  _roster._NON_SATELLITE_STATIONS -> the satellite roster is 65 (66 union minus
  HKO); task_count 66->65 in run-weather-backfill.yml + infra/batch.tf; a new test
  asserts EVERY roster station resolves (no empty shards).
- P1: STT now DELETES the source handoff object in AUDIO_HANDOFF_BUCKET after a
  successful transcription (kept on failure for retry) — raw audio no longer
  accumulates in GCS (D-27.9).
- P2: wire --progress-bucket to a durable GcsProgressStore (shard-disjoint marker
  URI) so preempted Spot slices rehydrate markers from GCS instead of reprocessing.
…scription

Codex round-4 (gpt-5.5) 2 P1:
- STT deletes the handoff audio object ONLY after the transcript is durably
  written (TranscriptLedger.append) + any live publish — moved out of the
  download context manager into _delete_handoff_source. A ledger-write failure
  now KEEPS the source audio for retry instead of stranding the call. New tests
  assert the download->transcribe->ledger->delete order and the skip-on-failure.
- earnings-serving lifespan parses the INGEST project + bare id from the FULL
  subscription resource path the infra sets (projects/<ingest>/subscriptions/<n>);
  the serving instance's GOOGLE_CLOUD_PROJECT is the wrong project for the
  cross-project earnings-streaming subscription. New parse tests cover both forms.
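The download, transcribe, ledger, delete ordering from the round-4 fix can be sketched with injected callables. All names here are hypothetical stand-ins for the real STT helpers; the point is only that the delete is unreachable when the ledger write raises.

```python
from typing import Callable


def transcribe_then_delete(
    download: Callable[[], bytes],
    transcribe: Callable[[bytes], str],
    append_ledger: Callable[[str], None],
    delete_source: Callable[[], None],
) -> str:
    """Delete the handoff audio only after the transcript is durably written."""
    audio = download()
    transcript = transcribe(audio)
    # If this raises, the exception propagates and delete_source never
    # runs, so the source audio stays available for a retry.
    append_ledger(transcript)
    delete_source()
    return transcript
```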
Codex round-5 (gpt-5.5) P1/P2:
- P1: run-weather-backfill Batch secretVariables pointed at the satellite project,
  but the r2-*/eumetsat-* secrets live in the backend secrets project (var.secrets_project)
  — the submitted job would 404 on the secrets. Reference the backend project
  (AR_PROJECT) for the secret resource paths; drop the now-unused project-number var.
- P2: the live SSE path publishes from the STT SA (28-GCE-ARCHITECTURE §3), but
  pubsub.tf granted pubsub.publisher only to the rolefact SA — live publish would
  403. Grant the STT runtime SA publisher on the earnings-streaming topic (deploy_iam.tf).
…2s auth

Codex round-6 (gpt-5.5) 2 P1 + 1 P2 in the earnings live-ingest path:
- P1: capture now TRIGGERS STT after the handoff upload (POST the gs:// ref to the
  private STT service /transcribe) and acks the capture message ONLY after a 2xx —
  no more captured-but-never-transcribed orphans; unset STT_SERVICE_URL fails loud.
  Adds STT_SERVICE_URL env (cloud_run.tf) + capture SA run.invoker on STT + a
  metadata-server ID token (audience=STT_SERVICE_URL) for the private-service call.
- P1: capture holds the Pub/Sub lease across 60-90min captures (background
  modify_ack_deadline loop) so a long call is not redelivered/duplicated.
- P2: STT image adds google-cloud-pubsub (the live-publish path imports pubsub_v1).

Note: the capture->STT->rolefact live orchestration is now wired end-to-end but its
live validation stays operator-gated (28-10/11/13 Task 3/4), per the plans.
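A background lease-extension loop of the kind described for capture might look like the sketch below. `LeaseKeeper` and its `extend` callable are hypothetical; in the real job `extend` would be whatever call resets the Pub/Sub ack deadline for the held message.

```python
import threading
from typing import Callable


class LeaseKeeper:
    """Call `extend` every `interval_seconds` until the context exits."""

    def __init__(self, extend: Callable[[], None], interval_seconds: float) -> None:
        self._extend = extend
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout (extend again) and True
        # once stop has been requested (leave the loop).
        while not self._stop.wait(self._interval):
            self._extend()

    def __enter__(self) -> "LeaseKeeper":
        self._thread.start()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._stop.set()
        self._thread.join()
```

Wrapping both the capture and the STT trigger in the same `with` block is one way to keep the lease held across the trigger, since the loop only stops when the block exits.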
@helloiamvu helloiamvu requested a review from Tarabcak July 3, 2026 20:08
@github-actions

github-actions Bot commented Jul 3, 2026

📝 Docs-required check: REMINDER

API-surface change without docs and no opt-out — surfacing reminder.

API-surface files changed:

packages/weather/src/mostlyright/weather/earnings/ledger.py
packages/weather/src/mostlyright/weather/satellite/__main__.py
packages/weather/src/mostlyright/weather/satellite/_r2_sink.py
packages/weather/src/mostlyright/weather/satellite/_roster.py

Docs files changed:

(none)

Docs surfaces to consider

  • CHANGELOG.md — every behavior change goes here. Auto-synced to the docs site on every release.
  • docs/ — hand-authored prose. Lifted into the landing repo as MDX on every release.
  • Per-package READMEs (packages/*/README.md, packages-ts/*/README.md) — front-door copy on PyPI / npm.
  • Docstrings (Python """ ... """ / TypeScript JSDoc /** ... */) — propagate to the auto-generated API reference via Sphinx / TypeDoc.

How to silence this reminder

  • Add a docs change to this PR (any *.md / *.mdx under docs/, or CHANGELOG.md, or any README.md).
  • Apply the docs-not-required label.
  • Add a line to the PR body: docs-not-required: <one-sentence reason>.

This check is advisory — it never blocks the merge.

@github-actions

github-actions Bot commented Jul 3, 2026

Parity ticket gate: PASSED

parity-ticket-check: PR does not touch parity-trigger surface; gate skipped.

See CROSS-SDK-SYNC.md §2 for the workflow.

…gger timeout/lease

Two P1s from the final Codex gate on the deployed split-container topology:

1. Transcript durability across containers. STT, role/fact, and serving run as
   SEPARATE Cloud Run resources with isolated ephemeral disks, so STT's local
   TranscriptLedger write was invisible to the role/fact Job (it would fail "no
   persisted transcript"). STT now publishes the audio-free transcript parquet to
   the R2 data plane (the architecture's durable text store); role/fact rehydrates
   it from R2 on a local-cache miss. Opt-in on R2_BUCKET (a co-located operator run
   with a shared disk is unchanged). Adds the STT R2-write IAM/env + boto3 in the
   STT image; only TEXT crosses to R2, never audio (D-27.9). New _r2_sink.download.

2. capture->STT trigger timeout + lease. The synchronous /transcribe transcribes
   the whole call before responding, but capture gave the POST only 60s — every
   real call timed out mid-transcription -> NACK -> duplicate recapture. The
   timeout now covers a full transcription (STT_TRIGGER_TIMEOUT_SECONDS, default
   3600s = the Cloud Run max the STT service is now pinned to), and the Pub/Sub
   lease is held ACROSS the trigger (it was released before it).

Serving reading transcripts/facts from R2 (Codex also noted serving) remains a
separate services-layer follow-up: the earnings serving read path uses the local
ledger and would need an R2 list+get wrapper across its routes — out of scope for
this deploy-runtime PR and tracked separately.

Tests: STT R2 publish (+no-op when unset), role/fact R2 download-on-miss (+miss
still fails loud), capture long/overridable timeout + lease-held-during-trigger.
…er writes, capture timeout

Three P1s from the round-2 Codex gate, all making the deployed capture->STT->rolefact
pipeline actually runnable and correct:

1. Provision the audio handoff bucket + IAM. The infra referenced
   earnings-audio-handoff-<projnum> by env but never CREATED it, and granted no
   storage access — the first real capture upload / STT download would 403 and the
   pipeline could never run end to end. Adds the private, in-firewall GCS bucket
   (co-located with STT, uniform access, public-access-prevention, 1-day orphan
   reaper) + capture objectAdmin (write/overwrite) and STT objectAdmin (get +
   post-ledger delete). Corrects the now-inaccurate "read-only" STT SA description.

2. Idempotent ledger writes. TranscriptLedger/FactLedger.append concatenates, so a
   retried STT/rolefact run (redelivery, R2-upload error, capture timeout) DOUBLED
   the rows and made role/fact double-count mentions — corrupting settlement data.
   Adds an atomic TranscriptLedger/FactLedger.replace (overwrite under the same
   FileLock, same audio-free + fail-closed-Kalshi guards); STT and rolefact now
   REPLACE their complete-per-call artifact instead of appending.

3. Capture job timeout. Round-7 made capture wait synchronously for STT, but the
   capture Cloud Run Job timeout was still 5400s (capture only) — a real
   capture + transcription could exceed it and be killed before ack -> redelivery
   -> duplicate recapture. Bumped to 9000s to cover capture + the synchronous STT
   wait (the decoupled trigger remains the real fix).

Tests: ledger replace idempotency (overwrite / shrink / empty-removes /
fail-closed-still-runs); STT + rolefact re-run-not-doubled end to end.
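To show why replace is idempotent where append is not, here is a minimal sketch of an atomic per-call overwrite. It uses a JSON file and `os.replace` in place of the real parquet ledger, and it omits the FileLock and the audio-free and fail-closed guards; `replace_rows` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import List


def replace_rows(path: Path, rows: List[dict]) -> None:
    """Overwrite the per-call artifact so a retried run never doubles rows."""
    if not rows:
        # An empty result removes the artifact rather than leaving stale rows.
        path.unlink(missing_ok=True)
        return
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(rows, handle)
        # os.replace swaps the file in one step, so readers see either
        # the old artifact or the new one, never a partial write.
        os.replace(tmp_name, path)
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)
        raise
```

Running the same call twice leaves exactly one copy of the rows, which is the property the retried STT and rolefact runs rely on.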
…ear, stale-R2-fact tombstone

Three findings from the round-3 Codex gate:

1. [P1] Roster backfill silently under-covered non-GOES stations. The full fleet
   (--roster kalshi,polymarket, no --satellites) sent every shard through GOES-16/18,
   whose footprint is only the Americas / E-Pacific — so ~half the 65-station roster
   (EDDM, RJTT, FACT, ...) fetched nothing and then marked the empty slice complete,
   leaving the advertised backfill unpopulated and wasting Spot. Roster mode now
   EXCLUDES (and loudly logs) stations outside the GOES footprint under the default
   satellites; an all-non-GOES shard cleanly no-ops. Global native-ring coverage
   (Himawari/Meteosat/VIIRS) stays the 28-26 follow-up, reached via --satellites
   (which bypasses the filter). batch.tf / the workflow docs updated to match.

2. [P2] --incremental yesterday keyed on TODAY's year, so on Jan 1 (UTC) it never
   refreshed the prior-year Dec 31 partition ("yesterday"). Now keys on yesterday's
   year.

3. [P2] A zero-fact role/fact rerun cleared the LOCAL partition (idempotent replace)
   but left the prior nonzero run's earnings/facts/<t>/<c>.parquet in R2 — serving
   reads R2 as the durable store, so it kept serving stale facts. The zero-row path
   now tombstones the R2 object (new _r2_sink.delete, idempotent).

Tests: roster GOES-footprint filter (kept/dropped/all-non-GOES-noop), shard-routing
tests isolated with explicit --satellites, yesterday-year across boundary, zero-fact
R2 delete.
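The year-boundary fix in finding 2 reduces to deriving the partition year from yesterday's date rather than today's. A minimal sketch, with a hypothetical function name:

```python
from datetime import date, timedelta


def incremental_year(today: date) -> int:
    """Return the year of the partition that holds yesterday's data."""
    # On Jan 1 this yields the prior year, so the Dec 31 partition is
    # refreshed instead of an empty current-year one.
    return (today - timedelta(days=1)).year
```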
…from R2

Two findings from the round-4 Codex gate, completing the hosted earnings data plane:

1. [P1] Serving read the container-local ledger, not R2. STT/role-fact publish the
   transcript + fact parquet to R2, but the earnings serving app built ServingState
   against its ephemeral local disk — so a fresh Cloud Run instance returned EMPTY
   /transcripts and /facts even after ingest succeeded. Adds services/earnings/
   r2_read.py (EarningsR2Reader + R2LedgerSource) mirroring the weather serving R2
   read path: READ-ONLY token, list+get, settlement-safe NoSuchKey→[] vs real-error
   propagation, audio-free. ServingState.build now reads from R2 when the read token
   is present and no explicit ledger_root is given (local ledger for tests /
   on-device). The earnings_serving infra ALREADY injects the R2 read token +
   R2_BUCKET (no infra change); the serving image gains boto3.

2. [P2] A failed zero-fact R2 tombstone was swallowed. role/fact's zero-row branch
   caught+ignored delete() errors, so a real auth/network failure exited "success"
   while the stale fact object kept being served. Since delete_object is idempotent
   (missing key succeeds), any exception is a real failure — now propagated so the
   job fails and retries.

Tests: EarningsR2Reader parse / miss-is-empty / real-error-propagates / list
tickers+call_ids / unsafe-segment; ServingState R2-vs-local gating; /transcripts +
/facts serve R2 rows end to end.
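The settlement-safe read rule (a missing key is an empty result, anything else is a real failure) can be sketched as below. The `NoSuchKey` class is a local stand-in for the object store's missing-key error and `read_rows` is a hypothetical name; the real reader works against R2 through boto3.

```python
from typing import Callable, List


class NoSuchKey(Exception):
    """Stand-in for the object store's missing-key error (hypothetical)."""


def read_rows(get_object: Callable[[str], List[dict]], key: str) -> List[dict]:
    """Treat a missing object as no rows; let every other error surface."""
    try:
        return get_object(key)
    except NoSuchKey:
        return []
    # Auth and network errors are deliberately not caught: returning []
    # for them would make a failed read look like "no data".
```

The same reasoning applies to the tombstone in finding 2: because a delete of a missing key succeeds, any exception it raises is a genuine failure and should not be swallowed.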
… handoff bucket

[P1] The capture job pulled a real Pub/Sub message from CAPTURE_JOBS_SUBSCRIPTION
but, when AUDIO_HANDOFF_BUCKET was unset (deploy misconfig), took the bare-local
branch: printed the ephemeral path and ACKED the message. The job then exited and
deleted the only audio copy without ever uploading it or triggering STT, and
Pub/Sub — already acked — never redelivered → silent settlement-audio loss.

Now fail loud EARLY (before the expensive capture), leaving the message un-acked so
it redelivers once the env is fixed. The no-handoff-bucket path stays legitimate
ONLY for the bare-local operator override (_NoopHandle: no Pub/Sub message, ack is
a no-op).

Test: subscription pull + AUDIO_HANDOFF_BUCKET unset raises + does not ack.
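A sketch of the early configuration check, assuming it runs before the capture starts and before any ack. The function name and the `from_subscription` flag are hypothetical; `AUDIO_HANDOFF_BUCKET` is the variable named above.

```python
from typing import Mapping


def check_handoff_config(env: Mapping[str, str], from_subscription: bool) -> None:
    """Refuse to start a subscription-driven capture without a handoff bucket."""
    if from_subscription and not env.get("AUDIO_HANDOFF_BUCKET"):
        # Raising here leaves the Pub/Sub message un-acked, so it is
        # redelivered once the deployment is fixed.
        raise RuntimeError("AUDIO_HANDOFF_BUCKET must be set for subscription-driven capture")
```

The bare-local operator override passes `from_subscription=False` and is allowed through, since there is no message to lose.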
… from env

[P2] The deployed capture->STT trigger posts only {audio_path, ticker, call_id},
so the STT Cloud Run SERVICE (/transcribe) kept publish_live=False and never called
_maybe_publish_live — yet the serving app ALWAYS starts the earnings-streaming
subscriber (EARNINGS_STREAMING_SUBSCRIPTION is set unconditionally). So /stream had
nothing to fan out for hosted calls unless every request manually supplied the
streaming fields.

stt_server now derives publish_live / streaming_project / streaming_topic from the
service env (mirroring the one-shot jobs/stt.py main()), and the STT service infra
sets EARNINGS_STREAMING_ENABLED=1 / _PROJECT / _TOPIC (the STT SA already holds
pubsub.publisher on the topic). An explicit request field still overrides the env.

Tests: env-derived publish (on), no-env (off), request-override.
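The precedence between the request field and the service environment might be expressed like this. The function name is hypothetical, and treating only the exact value `"1"` as enabled is an assumption based on the `EARNINGS_STREAMING_ENABLED=1` setting mentioned above.

```python
from typing import Mapping, Optional


def resolve_publish_live(request_value: Optional[bool], env: Mapping[str, str]) -> bool:
    """An explicit request field wins; otherwise fall back to the service env."""
    if request_value is not None:
        return request_value
    return env.get("EARNINGS_STREAMING_ENABLED") == "1"
```

Using `None` for "field not supplied" keeps an explicit `False` in the request distinguishable from an absent field, which the override case depends on.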
…stration seam

Codex R7-6 flagged that STT does not auto-trigger the role/fact stage, so hosted
/facts stays empty until role/fact runs. Documented as an intentional deferral in
services/earnings/jobs/stt.py: role/fact needs the per-market TERM SPECS
(ROLEFACT_TERMS) which STT does not have — they are market-specific and must be
threaded capture->STT->role/fact (or fetched from the markets catalog), and WHO
triggers role/fact is the same operator-gated orchestration decision the
capture->STT entrypoint already documents. Tracked as a follow-up; the data plane
(capture->STT->R2->role/fact->R2->serving) is fully wired, only the auto-trigger
orchestration remains operator/scheduler-driven.
@helloiamvu helloiamvu merged commit d87501b into main Jul 3, 2026
5 of 11 checks passed