Phase 28: Hosted GCE Data Platform — serving apps + IaC + SDK hosted surface (platform layer) #92

Merged
helloiamvu merged 18 commits into main from phase28/hosted-gce-platform
Jul 3, 2026

@helloiamvu helloiamvu commented Jul 3, 2026

Phase 28 — Hosted GCE Data Platform (platform + IaC + SDK/serving surface)

Stands up the opt-in hosted path on top of the local-first SDK, per GATE #1 + #2 (both signed 2026-07-02). The SDK default path stays hosted-call-free — hosted is reached only via delivery="hosted" / WEATHER_HOSTED_URL / EARNINGS_HOSTED_URL + MOSTLYRIGHT_API_KEY.

This PR is built on the earlier phase28/hosted-gce-platform foundation; it reconciles that work, fixes everything an adversarial review surfaced, and adds the hosted-call grep-gate. It deliberately scopes out the deploy-runtime containerization layer (see Deferred below).

What's in this PR

Serving apps (non-published — never in any wheel)

  • services/weather/: GET /satellite + /capabilities, R2 read-only, rows byte-identical to local delivery="live" modulo the delivery channel (D-28.2).
  • services/earnings/ — cross-project Pub/Sub bridge (pubsub_bridge.py) joining the ingest→serving SSE transport, with a structural audio firewall (assert_message_audio_free on both publish and receive; closed MESSAGE_KINDS with no audio kind; lazy GCP import).
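
A minimal sketch of what a structural audio firewall of this kind could look like. The kind names, the audio-key list, and the `AudioFirewallError` type are hypothetical and chosen only for illustration; the source confirms only that `assert_message_audio_free` runs on publish and receive and that `MESSAGE_KINDS` is closed with no audio kind.

```python
from typing import Any, Mapping

# Hypothetical closed set of kinds; deliberately contains no audio kind.
MESSAGE_KINDS = frozenset({"segment", "fact", "heartbeat"})

# Hypothetical payload keys that would suggest audio is leaking through.
_AUDIO_KEYS = frozenset({"audio", "audio_bytes", "pcm", "wav"})


class AudioFirewallError(ValueError):
    """Raised when a message could carry audio across the bridge."""


def assert_message_audio_free(message: Mapping[str, Any]) -> None:
    """Reject unknown kinds, audio-named keys, and raw binary payload values."""
    kind = message.get("kind")
    if kind not in MESSAGE_KINDS:
        raise AudioFirewallError(f"message kind not allowed: {kind!r}")
    for key, value in message.get("payload", {}).items():
        if key.lower() in _AUDIO_KEYS:
            raise AudioFirewallError(f"audio-like payload key: {key!r}")
        if isinstance(value, (bytes, bytearray, memoryview)):
            raise AudioFirewallError(f"binary payload value under {key!r}")
```

Calling the same assertion on both sides of the bridge means a misbehaving publisher and a misbehaving subscriber are each caught independently.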

SDK hosted surface (published)

  • satellite/_hosted_client.py — the delivery="hosted" seam (opt-in; env-gated).
  • satellite/_progress.py — durable, upload-gated crash-safe progress store (a Spot kill between local write and R2 upload leaves the partition unmarked → retried).
  • packages-ts/weather/src/{hosted,earnings} — browser/MV3 hosted shim (fetch + satellite + earnings SSE stream). No Node APIs.
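
The upload-gated progress idea can be sketched as follows. This is an illustration under assumptions, not the contents of `_progress.py`: the `ProgressStore` class, its JSON file layout, and `process_partition` are hypothetical names. The point it shows is the ordering: a partition is marked done only after the upload returns, so a kill between the local write and the upload leaves it unmarked and it is retried.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


class ProgressStore:
    """Hypothetical durable set of completed partitions, persisted as JSON."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._done = set(json.loads(path.read_text())) if path.exists() else set()

    def is_done(self, partition: str) -> bool:
        return partition in self._done

    def mark_done(self, partition: str) -> None:
        self._done.add(partition)
        # Write to a temp file, fsync, then atomically swap it into place.
        fd, tmp = tempfile.mkstemp(dir=self._path.parent)
        with os.fdopen(fd, "w") as fh:
            json.dump(sorted(self._done), fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, self._path)


def process_partition(
    store: ProgressStore,
    partition: str,
    write_local: Callable[[str], str],
    upload: Callable[[str], None],
) -> bool:
    """Return True if work was done, False if the partition was already marked."""
    if store.is_done(partition):
        return False
    local_path = write_local(partition)
    upload(local_path)          # a crash or exception here leaves the partition unmarked
    store.mark_done(partition)  # reached only after a successful upload
    return True
```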

Infrastructure (flat infra/ OpenTofu)

  • Cloud Run services/jobs + Cloud Batch + Pub/Sub (with DLQ) + Secret Manager data-sources + budgets + monitoring + per-workload runtime SAs. Firewalls encoded as data: serving SAs bind R2-read + api-key only; ingest/backfill SAs bind R2-write + EUMETSAT; STT GPU L4 pinned to us-central1 (not offered in europe-west3).

Firewall-A enforcement

  • scripts/check_no_hosted_calls.py — the amended default-path grep-gate (green over 202 published modules; trips on a synthetic hosted call; opt-in seam identifiers quarantined to 4 allowlisted files). Wired into test.yml.
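
For readers unfamiliar with grep-gates, the quarantine rule might be approximated like this. It is a sketch only: the regex, the `find_violations` name, and the allowlist format are assumptions, and the real script also checks for hardcoded hosted hosts, which is omitted here.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed pattern for the opt-in seam identifiers named in the PR:
# delivery="hosted" and any *_HOSTED_URL environment variable name.
SEAM_PATTERN = re.compile(r"""delivery\s*=\s*["']hosted["']|\b\w+_HOSTED_URL\b""")


def find_violations(root: Path, allowlist: frozenset[str]) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for seam identifiers outside the allowlist."""
    violations = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root).as_posix()
        if rel in allowlist:
            continue  # allowlisted seam files may mention the identifiers
        for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if SEAM_PATTERN.search(line):
                violations.append((rel, lineno, line.strip()))
    return violations
```

A CI step would then exit non-zero whenever the returned list is non-empty.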

Review + fixes

An adversarial review (29 agents, 7 dimensions × the firewalls, refute-by-default verification) confirmed the app code is sound (serving suite 137 green, TS 12 green) and surfaced 11 findings. All fixed + regression-tested:

| # | Sev | Fix |
|---|---|---|
| 1 | CRIT | Backfill Batch job now injects the R2-write + EUMETSAT secret_variables it was missing (would otherwise upload zero derived parquet). |
| 3 | HIGH | R2 write jobs inject R2_WRITE_ACCESS_KEY_ID/R2_WRITE_SECRET_ACCESS_KEY (the names _r2_sink reads); a generic R2_ACCESS_KEY_ID left the variables that the sink's _require_env checks unset → ValueError on upload. |
| 4 | HIGH | /satellite bounds the window (_MAX_WINDOW_MONTHS) before any R2 I/O; an unbounded far-future end fanned one request out to ~120k object reads (amplification DoS). |
| 5 | HIGH | earnings + weather auth compare the key as UTF-8 bytes: hmac.compare_digest raises TypeError on a non-ASCII header, turning a 401 into a 500 (unauthenticated 500-trigger). |
| 6 | HIGH | TS /stream client mints a signed, single-scope ?token= locally from the public key (Web Crypto HMAC, byte-identical to Python mint_stream_token) instead of ?apiKey=; the server only accepts a signed token, so every hosted stream would have 401'd. |
| 7 | HIGH | /stream honors a ?lastEventId= query fallback for the explicit cross-cut resume (a fresh EventSource can't set the header). |
| 2 | HIGH | The firewall-A grep-gate (check_no_hosted_calls.py) was missing entirely; created + CI-wired. |
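
To illustrate the auth fix in finding 5: `hmac.compare_digest` accepts `str` arguments only when they are ASCII, so a non-ASCII header value raises `TypeError` before any comparison happens. Encoding both sides to UTF-8 bytes first avoids that. The `api_key_matches` helper below is a hypothetical name for a sketch of the pattern, not the project's function.

```python
from __future__ import annotations

import hmac


def api_key_matches(presented: str | None, expected: str) -> bool:
    """Constant-time key comparison that tolerates non-ASCII input."""
    if presented is None:
        return False
    # Comparing bytes sidesteps the ASCII-only restriction on str arguments.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

With this shape a malformed key simply fails the comparison, and the route can answer 401 as it would for any other wrong key.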

Regression tests added: over-wide window → 422; non-ASCII key header → 401; TS token verify/tamper. Full fast suite is green across 3.11/3.12/3.13 + pandas-3 + polars + coverage.
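
The over-wide-window regression corresponds to a check along these lines. This is a sketch under assumptions: the cap value of 24 and the function names are invented for illustration (the source names only `_MAX_WINDOW_MONTHS`), and the mapping of the raised error to a 422 is left to the route.

```python
from datetime import date

_MAX_WINDOW_MONTHS = 24  # hypothetical cap; the real value is not stated in the PR


def months_in_window(start: date, end: date) -> int:
    """Number of calendar months touched by [start, end], inclusive."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


def validate_window(start: date, end: date, max_months: int = _MAX_WINDOW_MONTHS) -> int:
    """Validate the window before any object-store I/O; return the month count."""
    if end < start:
        raise ValueError("end precedes start")
    months = months_in_window(start, end)
    if months > max_months:
        raise ValueError(f"window spans {months} months; limit is {max_months}")
    return months
```

Because the check runs before any read is issued, a far-future `end` costs one comparison rather than one object read per month per station.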

Two-reviewer discipline complete. A second, independent Codex 5.5-high pass then found a further 5 HIGH findings, all now fixed:

  • /satellite was swallowing all read errors (incl. a wrong/typoed R2 bucket) into 200 []. Now only a missing object (NoSuchKey) reads as empty; every other failure is a loud 502 (settlement safety).
  • The earnings serving app now gets EARNINGS_API_KEY from the unified secret (so it boots, and token verification is aligned with the client).
  • The Cloud Batch backfill runs as its dedicated least-privilege SA.
  • The Python + TS hosted clients fetch a station list one request per station (single ICAO) with a required satellite, instead of a comma-joined param the server would 422.

Codex then verified that the fixes hold and that all five firewalls hold (A hosted-call-free, B services/ not in wheels, C audio never reaches serving/PubSub/R2, D R2 read-vs-write split, E role_parser/climate untouched).
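
The /satellite error-handling rule (a missing object is an empty month, anything else is a loud 502) could be sketched as below. To stay self-contained the example uses a stand-in `ObjectStoreError` that mimics the `response["Error"]["Code"]` shape of botocore's `ClientError`; that class, `read_month`, and `serve_window` are hypothetical names, not the project's code.

```python
from typing import Callable, Iterable


class ObjectStoreError(Exception):
    """Stand-in for botocore's ClientError: exposes response['Error']['Code']."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


def read_month(get_object: Callable[[str], list], key: str) -> list:
    """Map only a missing OBJECT to FileNotFoundError; let everything else propagate."""
    try:
        return get_object(key)
    except ObjectStoreError as exc:
        if exc.response["Error"]["Code"] == "NoSuchKey":
            raise FileNotFoundError(key) from exc
        raise  # NoSuchBucket, AccessDenied, outages: never treated as "no data"


def serve_window(get_object: Callable[[str], list], keys: Iterable[str]) -> tuple[int, list]:
    """Return (status, rows): skip empty months, fail loudly on any real error."""
    rows: list = []
    for key in keys:
        try:
            rows.extend(read_month(get_object, key))
        except FileNotFoundError:
            continue  # genuinely empty month
        except Exception:
            return 502, []
    return 200, rows
```

Keeping the not-found mapping this narrow is what prevents a mistyped bucket name from looking like an empty result.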

Deferred to an operator-gated follow-up (not in this PR)

The other session built the platform; the deploy-runtime containerization layer is intentionally staged for gate-time, because it can't be CI-validated or run yet:

  • Live earnings pipeline (28-10/11/13): capture (Chromium+ffmpeg) + STT (GPU L4) + rolefact Dockerfiles/entrypoints + capture VPC/NAT/static-IP + audio-handoff bucket TF. Operator-gated — needs a scheduled live earnings webcast to validate end-to-end.
  • Backfill/incremental entrypoints + settlement-station roster (28-21/22): the kalshi∪polymarket station roster is settlement-adjacent (wrong stations → wrong contract coverage) and cost-gated (tied to the pilot-station + 28TB sign-off).
  • Per-service deploy workflows + deploy.yml build step (the two committed deploy workflows are workflow_dispatch-only stubs — they do not auto-fire), and the api.mostlyright.md Cloudflare auto-DNS (token is provisioned; not needed for v1 — serving ships on *.run.app).
  • Deploy-time IAM (flagged by the Codex review, deferred as part of this layer): public roles/run.invoker on the serving services (Cloud Run's IAM gate is in front of the app's API-key auth); the deploy SAs' artifactregistry.writer + run.developer + iam.serviceAccountUser (push/deploy/act-as); and the /capabilities uptime check needs an auth path (currently would 401).
  • Pub/Sub-bridge lifespan wiring into the earnings serving app (/stream intentionally 404s live calls until the capture→STT→rolefact→pubsub pipeline lands). The batch /facts /transcripts /capabilities routes are independent and unaffected.

These all land together in the deploy follow-up (built + reviewed as one coherent unit), at which point the cheap-path deploy (serving + daily incremental) runs and the 1-station backfill pilot precedes the cost gate.

🤖 Generated with Claude Code

helloiamvu and others added 14 commits July 2, 2026 22:21
… carve-out

- Amend 'All API calls direct from SDK' rule: default path stays hosted-call-free,
  wheel grep-gate narrowed (not removed), hosted reached only via opt-in env seams
- Amend 'no hosted infra in v0.1' tech-stack constraint with the opt-in carve-out
- Amend the 'No FastAPI, no Docker' decision row for the opt-in serving API
- services/ deploy deps stay NON-published (never enter any PyPI dist)
- Records OPERATOR GATE #1 sign-off (2026-07-02)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… gate

- Add services/earnings/tests/test_deploy_precondition.py
- Assert [earnings] engine entrypoint modules import cleanly (no ImportError)
- Assert create_app registers /transcripts /facts /capabilities /stream routers
- Assert composed serving surface is audio-free (assert_no_audio_surface, D-27.9)
- Pure importability + surface test; no GPU/network/GCP; collected under 'not live'

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Add _r2_sink.py: boto3 S3-compat write-token client (R2 endpoint,
  region_name=auto, adaptive retries max_attempts=5). Write-token creds
  read from env by NAME (R2_ACCOUNT_ID/R2_WRITE_ACCESS_KEY_ID/
  R2_WRITE_SECRET_ACCESS_KEY) — never inline. upload() delegates to
  s3.upload_file.
- Wire opt-in sink into backfill_goes_satellite via r2_target: AFTER the
  atomic write_satellite_cache local write, upload the derived partition
  under the weather/satellite/ key prefix. No r2_target -> local-only,
  byte-identical to pre-28-20.
- Thread r2_target through bulk_backfill + _SliceItem + _run_slice
  (picklable for the process-pool path).
- test_backfill_upload_sink.py: 7 tests (mock boto3) covering opt-in
  upload, empty-partition no-upload, local-only unchanged, R2 client
  ctor, missing-env raise, no secret value in module.
- Lift _assert_goes_only -> _assert_backfill_supported: the bulk backfill
  now accepts the WHOLE native ring. GOES/Himawari/VIIRS route through the
  anonymous-NODD transports (--mirror aws|gcp); eumetsat_meteosat routes to
  a NEW KEYED path. Unknown satellites still rejected by the downstream
  enum validator. Retargeted the orphaned 'Phase 27' comment.
- Add _eumetsat.py: keyed EUMETSAT Data Store fetch (fetch_meteosat_month)
  wrapping the shipped _eumetsat_store OAuth2 transport, bounded by a
  DISTINCT fleet-wide BoundedSemaphore (_METEOSAT_MAX_CONNS=10) — separate
  from the anon-NODD max_workers fan-out (EUMETSAT 30 req/s, 10 conns,
  5 TB/day, single shared key). Meteosat is NOT a --mirror NODD source.
- Source-aware transport dispatch in backfill_goes_satellite: GOES keeps
  the bare monkeypatchable names; Himawari/VIIRS via _anon_* resolvers;
  Meteosat delegates the whole month to the keyed semaphore-bounded fetch.
- Add --r2-target/--r2-bucket CLI flags (opt-in R2 sink; default local-only).
- eumdac pinned >=3.1,<4.0 in [satellite] extra — publisher verified as
  EUMETSAT on pypi.org 2026-07-02 (Task 1 legitimacy checkpoint).
- test_eumetsat_source.py: 13 tests (gate lift, routing, distinct
  semaphore). Replaced TestNonGoesBackfillRejected with
  TestMultiFamilyBackfillAccepted; added CLI r2-flag tests.
…les, no-org amendment

- infra/providers.tf: google + google-beta v6.x pins, quota-project routing
- infra/backend.tf: GCS remote state in mostlyright-backend (bucket bootstrapped manually)
- infra/variables.tf: billing_account, github_repo, serving_region, gpu_region (us-central1), artifact_registry
- infra/README.md: records no-org -> no-folder architecture amendment (cites 28-GCE-ARCHITECTURE §1)
- infra/terraform.tfvars.example: placeholder tfvars (real terraform.tfvars gitignored)
- .gitignore: ignore terraform.tfvars, state, .terraform/, plan outputs

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…bindings

- infra/projects.tf: 3 flat billing-linked google_project (mr-earnings-ingest, mr-serving, mr-staging), no folder_id/org_id; per-project API enablement (serving R2-read-only; ingest adds compute/batch/pubsub/scheduler; staging mirrors serving)
- infra/wif.tf: WIF pool + OIDC provider (repo-pinned attribute_condition assertion.repository == github_repo), one deploy SA per project + workloadIdentityUser binding
- infra/artifact_registry.tf: cross-project artifactregistry.reader on existing europe-west3 repo (reuse, no create; mostlyright-backend untouched)
- infra/outputs.tf: resolved project IDs/numbers, WIF provider name, deploy SA emails for downstream plans + deploy.yml
- infra/.terraform.lock.hcl: provider version lock (reproducibility)

tofu validate exits 0; tofu init -backend=false resolves google/google-beta v6.x

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- permissions: id-token: write + contents: read (WIF OIDC token minting)
- google-github-actions/auth@v2 with workload_identity_provider (no SA key file / no credentials_json)
- google-github-actions/setup-gcloud@v2 + auth smoke test
- workflow_dispatch trigger with target-project choice input
- image build/push + gcloud run deploy stubbed for W1/W2 to fill against existing Artifact Registry

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…, billing-cap gate

Discovered at live apply (2026-07-02):
- mr-staging project ID globally taken -> per-project ID override vars; staging = mr-staging-mostlyright (ingest/serving keep bare IDs)
- GCP auto-parented projects under domain org node 673021848874 (a Cloud Identity org DOES exist for vu@mostlyright.md); ignore_changes=[org_id] to accept it (amendment intent holds: config authors no org/folder hierarchy) + deletion_policy=DELETE
- billing account hit its default 5-project link cap -> mr-staging billing-link quota-rejected; gate staging + its billing-dependent resources behind enable_staging (default false) until an operator billing-quota increase; ingest+serving fully provisioned (billing-live, APIs, WIF, deploy SAs, cross-project AR reader bindings)

tofu apply: 25 added, 0 changed, 0 destroyed; mostlyright-backend untouched except additive AR reader members

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…billing 5-project cap + staging unblock steps)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cloud Run services/jobs + Cloud Batch + Pub/Sub (DLQ) + Secret Manager
data-sources + budgets + monitoring + per-workload runtime SAs, on the flat
infra/ layout. Firewalls encoded as data: serving SAs bind R2-read+api-key
only; ingest/backfill SAs bind R2-write+EUMETSAT; STT GPU L4 pinned to
us-central1 (not offered in europe-west3).

Review fixes folded in:
- R2 write jobs (rolefact, incremental) inject R2_WRITE_ACCESS_KEY_ID /
  R2_WRITE_SECRET_ACCESS_KEY (the names _r2_sink reads) — a generic
  R2_ACCESS_KEY_ID left the sink's _require_env unset -> ValueError on upload.
- Backfill Batch job now injects the R2-write + EUMETSAT secret_variables it
  was missing (it would otherwise upload zero derived parquet).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…seam

- services/earnings: cross-project Pub/Sub bridge (SegmentPublisher/Subscriber)
  with a structural audio firewall (assert_message_audio_free on publish AND
  receive; closed MESSAGE_KINDS with no audio kind; lazy GCP import).
- services/weather: /satellite + /capabilities app (R2 read-only), byte-identical
  to local live modulo the delivery channel.
- satellite/_hosted_client.py (delivery="hosted" seam) + _progress.py (durable,
  upload-gated crash-safe progress store).

Review fixes folded in:
- /satellite bounds the query window (_MAX_WINDOW_MONTHS) before any R2 I/O — an
  unbounded far-future end fanned one request out to ~120k object reads (DoS).
- earnings + weather auth compare the key as UTF-8 bytes: hmac.compare_digest
  raises TypeError on a non-ASCII header, turning a 401 into a 500.
- /stream honors a ?lastEventId= query fallback for an explicit cross-cut resume
  (a fresh EventSource cannot set the Last-Event-ID header).
- regression tests: over-wide window -> 422; non-ASCII key header -> 401.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Browser/MV3-safe hosted seam: hosted/{fetch,satellite}, earnings/hostedStream
(EventSource + deterministic Last-Event-ID reconnect). No Node APIs.

Review fix folded in: the earnings /stream client now mints a signed, single-scope
?token= locally from the public MOSTLYRIGHT_API_KEY (Web Crypto HMAC-SHA256,
byte-identical to the Python mint_stream_token) instead of sending ?apiKey= — the
server only accepts a signed token, so every hosted stream would have 401'd. The
resume cursor rides ?lastEventId= (now honored server-side). streamToken.test.ts
proves the token verifies + is url-safe + rejects a wrong-key tamper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
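
For orientation, a signed single-scope token of the kind this commit describes can be sketched in Python as follows. The wire format here (url-safe base64 JSON claims, a dot, then an HMAC-SHA256 signature), the claim names, and the `mint_token`/`verify_token` functions are assumptions made for illustration; the actual `mint_stream_token` format is not shown in this PR.

```python
import base64
import hashlib
import hmac
import json


def _b64url(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_token(api_key: str, scope: str, expires_at: int) -> str:
    """Sign compact JSON claims with HMAC-SHA256 keyed by the API key."""
    claims = json.dumps({"exp": expires_at, "scope": scope}, separators=(",", ":"), sort_keys=True)
    body = _b64url(claims.encode("utf-8"))
    sig = hmac.new(api_key.encode("utf-8"), body.encode("ascii"), hashlib.sha256).digest()
    return f"{body}.{_b64url(sig)}"


def verify_token(api_key: str, token: str, scope: str, now: int) -> bool:
    """Check the signature first, then the scope and expiry claims."""
    parts = token.split(".")
    if len(parts) != 2:
        return False
    body, sig_text = parts
    try:
        expected = hmac.new(api_key.encode("utf-8"), body.encode("ascii"), hashlib.sha256).digest()
        if not hmac.compare_digest(_b64url_decode(sig_text), expected):
            return False
        claims = json.loads(_b64url_decode(body))
    except ValueError:  # bad base64, non-ASCII input, or invalid JSON
        return False
    if not isinstance(claims, dict):
        return False
    return claims.get("scope") == scope and claims.get("exp", 0) > now
```

The client sends only the token in the query string, so the raw key never appears in a URL.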
scripts/check_no_hosted_calls.py enforces D-28.2: the published SDK default path
makes NO hosted call. Two rules over packages/{core,weather,markets}/src — no
hardcoded hosted host, and the opt-in seam identifiers (delivery="hosted",
*_HOSTED_URL) quarantined to 4 allowlisted seam files. Green on the tree (202
modules), trips on a synthetic default-path hosted call. Wired into test.yml.
Adds docs/hosted-api.md + deploy-runbook.md + earnings-synced-deeplink-design.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Iterate Uint8Array with for-of (element is number, not number|undefined) and
assert the token.split('.') tuple in the tests. tsc --noEmit clean; 12 tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions Bot commented Jul 3, 2026

Docs-required check: PASS

API-surface change includes docs updates — no reminder needed.

API-surface files changed:

packages-ts/weather/src/earnings/hostedStream.ts
packages-ts/weather/src/earnings/streamToken.ts
packages-ts/weather/src/hosted/fetch.ts
packages-ts/weather/src/hosted/index.ts
packages-ts/weather/src/hosted/satellite.ts
packages-ts/weather/src/index.ts
packages/weather/src/mostlyright/weather/satellite/__init__.py
packages/weather/src/mostlyright/weather/satellite/__main__.py
packages/weather/src/mostlyright/weather/satellite/_backfill.py
packages/weather/src/mostlyright/weather/satellite/_eumetsat.py
packages/weather/src/mostlyright/weather/satellite/_hosted_client.py
packages/weather/src/mostlyright/weather/satellite/_progress.py
packages/weather/src/mostlyright/weather/satellite/_r2_sink.py

Docs files changed:

CLAUDE.md
docs/deploy-runbook.md
docs/earnings-synced-deeplink-design.md
docs/hosted-api.md
infra/README.md


github-actions Bot commented Jul 3, 2026

Parity ticket gate: PASSED

parity-ticket-check: PR does not touch parity-trigger surface; gate skipped.

See CROSS-SDK-SYNC.md §2 for the workflow.

minereda and others added 4 commits July 3, 2026 14:09
The satellite delivery="hosted" seam (satellite/__init__.py dispatch +
_hosted_client.py) is now a cross-SDK public API paired with the TS hosted shim
(packages-ts/weather/src/hosted). Registering it as a Python parity trigger makes
the parity-ticket gate recognize this PR as a paired-language change AND enforces
TS parity on future changes to the Python hosted surface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_r2_object_key_matches_backfill_sink_layout imports satellite/_backfill
(which pulls _goes_s3 -> boto3 at module scope). The base CI fast-suite runs
WITHOUT the [satellite] extra, so the import raised ModuleNotFoundError and failed
fast-suite (3.11/3.12/3.13) + pandas-3 + polars + coverage-gate. A function-level
pytest.importorskip('boto3') skips it cleanly there; the satellite-coverage lane
(which installs the extra) still runs it. Reproduced with: uv sync --all-packages
(no extra) then pytest.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- #3 SETTLEMENT SAFETY: /satellite no longer swallows every read exception into
  200 []. r2_read maps a genuinely-missing object (NoSuchKey/404) to
  FileNotFoundError; the route skips only that (empty month) and fails LOUD (502)
  on any real error (bad creds, R2 outage, corrupt parquet) — a silent [] could be
  misread as 'no data' and corrupt a settlement/training result. +regression test.
- #1: earnings serving injects EARNINGS_API_KEY (the name the app reads) from the
  unified mostlyright-api-key secret, so it boots AND its expected key == the
  value the TS/extension client mints its signed ?token= with.
- #4: the Cloud Batch backfill job now runs as the dedicated weather_backfill SA
  (least-privilege / firewall D) instead of the default compute SA.
- #6: Python satellite(delivery='hosted') fetches a station LIST one request per
  station (single ICAO each) and concatenates — the server 422s on a joined
  'KNYC,KLGA'. +multi-station test.
- #5: the TS hosted shim requires satellite (the server does not auto-route) and
  loops one request per station instead of repeating station= (which the server
  drops). +rewritten multi-station test.

Deferred (Codex #2): wiring the SegmentSubscriber into the earnings app lifespan
is part of the operator-gated live pipeline (capture->STT->rolefact->pubsub);
/stream stays 404 until that lands. Documented in the PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codex round-3: mapping NoSuchBucket / any 404 to FileNotFoundError (-> 200 [])
meant a typoed/deleted/wrong R2_BUCKET would make EVERY satellite query look like
'no data' — a catastrophic silent failure. Restrict the not-found mapping to a
missing OBJECT (NoSuchKey); a missing BUCKET (or any other error) now propagates
to the route's loud 502.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2 participants