Phase 28: Hosted GCE Data Platform — serving apps + IaC + SDK hosted surface (platform layer) #92
Merged
… carve-out
- Amend 'All API calls direct from SDK' rule: default path stays hosted-call-free, wheel grep-gate narrowed (not removed), hosted reached only via opt-in env seams
- Amend 'no hosted infra in v0.1' tech-stack constraint with the opt-in carve-out
- Amend the 'No FastAPI, no Docker' decision row for the opt-in serving API
- services/ deploy deps stay NON-published (never enter any PyPI dist)
- Records OPERATOR GATE #1 sign-off (2026-07-02)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… gate
- Add services/earnings/tests/test_deploy_precondition.py
- Assert [earnings] engine entrypoint modules import cleanly (no ImportError)
- Assert create_app registers /transcripts /facts /capabilities /stream routers
- Assert composed serving surface is audio-free (assert_no_audio_surface, D-27.9)
- Pure importability + surface test; no GPU/network/GCP; collected under 'not live'
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Add _r2_sink.py: boto3 S3-compat write-token client (R2 endpoint, region_name=auto, adaptive retries max_attempts=5). Write-token creds read from env by NAME (R2_ACCOUNT_ID/R2_WRITE_ACCESS_KEY_ID/R2_WRITE_SECRET_ACCESS_KEY) — never inline. upload() delegates to s3.upload_file.
- Wire opt-in sink into backfill_goes_satellite via r2_target: AFTER the atomic write_satellite_cache local write, upload the derived partition under the weather/satellite/ key prefix. No r2_target -> local-only, byte-identical to pre-28-20.
- Thread r2_target through bulk_backfill + _SliceItem + _run_slice (picklable for the process-pool path).
- test_backfill_upload_sink.py: 7 tests (mock boto3) covering opt-in upload, empty-partition no-upload, local-only unchanged, R2 client ctor, missing-env raise, no secret value in module.
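The by-name credential handling described above can be sketched as follows. This is a minimal stdlib-only illustration, not the contents of _r2_sink.py: the helper name r2_client_settings and its return shape are assumptions, while the env var names, region_name=auto, and the retry settings come from the commit message. The endpoint follows Cloudflare's documented account-scoped S3 endpoint pattern.

```python
import os

# Env var NAMES from the commit message; secret values never appear in code.
_REQUIRED = ("R2_ACCOUNT_ID", "R2_WRITE_ACCESS_KEY_ID", "R2_WRITE_SECRET_ACCESS_KEY")


def _require_env(name: str) -> str:
    """Return the value of env var `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise ValueError(f"missing required environment variable: {name}")
    return value


def r2_client_settings() -> tuple:
    """Hypothetical helper: return (client_kwargs, retries) for an S3-compatible client.

    With boto3 installed, these would be used roughly as
    boto3.client("s3", config=botocore.config.Config(retries=retries), **client_kwargs).
    """
    account_id, key_id, secret = (_require_env(n) for n in _REQUIRED)
    client_kwargs = {
        "endpoint_url": f"https://{account_id}.r2.cloudflarestorage.com",
        "region_name": "auto",
        "aws_access_key_id": key_id,
        "aws_secret_access_key": secret,
    }
    retries = {"max_attempts": 5, "mode": "adaptive"}
    return client_kwargs, retries
```

Reading credentials lazily inside the function, rather than at import time, keeps the module importable in environments that never opt in to the sink.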
- Lift _assert_goes_only -> _assert_backfill_supported: the bulk backfill now accepts the WHOLE native ring. GOES/Himawari/VIIRS route through the anonymous-NODD transports (--mirror aws|gcp); eumetsat_meteosat routes to a NEW KEYED path. Unknown satellites still rejected by the downstream enum validator. Retargeted the orphaned 'Phase 27' comment.
- Add _eumetsat.py: keyed EUMETSAT Data Store fetch (fetch_meteosat_month) wrapping the shipped _eumetsat_store OAuth2 transport, bounded by a DISTINCT fleet-wide BoundedSemaphore (_METEOSAT_MAX_CONNS=10) — separate from the anon-NODD max_workers fan-out (EUMETSAT 30 req/s, 10 conns, 5 TB/day, single shared key). Meteosat is NOT a --mirror NODD source.
- Source-aware transport dispatch in backfill_goes_satellite: GOES keeps the bare monkeypatchable names; Himawari/VIIRS via _anon_* resolvers; Meteosat delegates the whole month to the keyed semaphore-bounded fetch.
- Add --r2-target/--r2-bucket CLI flags (opt-in R2 sink; default local-only).
- eumdac pinned >=3.1,<4.0 in [satellite] extra — publisher verified as EUMETSAT on pypi.org 2026-07-02 (Task 1 legitimacy checkpoint).
- test_eumetsat_source.py: 13 tests (gate lift, routing, distinct semaphore). Replaced TestNonGoesBackfillRejected with TestMultiFamilyBackfillAccepted; added CLI r2-flag tests.
…les, no-org amendment
- infra/providers.tf: google + google-beta v6.x pins, quota-project routing
- infra/backend.tf: GCS remote state in mostlyright-backend (bucket bootstrapped manually)
- infra/variables.tf: billing_account, github_repo, serving_region, gpu_region (us-central1), artifact_registry
- infra/README.md: records no-org -> no-folder architecture amendment (cites 28-GCE-ARCHITECTURE §1)
- infra/terraform.tfvars.example: placeholder tfvars (real terraform.tfvars gitignored)
- .gitignore: ignore terraform.tfvars, state, .terraform/, plan outputs
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…bindings
- infra/projects.tf: 3 flat billing-linked google_project (mr-earnings-ingest, mr-serving, mr-staging), no folder_id/org_id; per-project API enablement (serving R2-read-only; ingest adds compute/batch/pubsub/scheduler; staging mirrors serving)
- infra/wif.tf: WIF pool + OIDC provider (repo-pinned attribute_condition assertion.repository == github_repo), one deploy SA per project + workloadIdentityUser binding
- infra/artifact_registry.tf: cross-project artifactregistry.reader on existing europe-west3 repo (reuse, no create; mostlyright-backend untouched)
- infra/outputs.tf: resolved project IDs/numbers, WIF provider name, deploy SA emails for downstream plans + deploy.yml
- infra/.terraform.lock.hcl: provider version lock (reproducibility)
tofu validate exits 0; tofu init -backend=false resolves google/google-beta v6.x
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- permissions: id-token: write + contents: read (WIF OIDC token minting)
- google-github-actions/auth@v2 with workload_identity_provider (no SA key file / no credentials_json)
- google-github-actions/setup-gcloud@v2 + auth smoke test
- workflow_dispatch trigger with target-project choice input
- image build/push + gcloud run deploy stubbed for W1/W2 to fill against existing Artifact Registry
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…, billing-cap gate
Discovered at live apply (2026-07-02):
- mr-staging project ID globally taken -> per-project ID override vars; staging = mr-staging-mostlyright (ingest/serving keep bare IDs)
- GCP auto-parented projects under domain org node 673021848874 (a Cloud Identity org DOES exist for vu@mostlyright.md); ignore_changes=[org_id] to accept it (amendment intent holds: config authors no org/folder hierarchy) + deletion_policy=DELETE
- billing account hit its default 5-project link cap -> mr-staging billing-link quota-rejected; gate staging + its billing-dependent resources behind enable_staging (default false) until an operator billing-quota increase; ingest+serving fully provisioned (billing-live, APIs, WIF, deploy SAs, cross-project AR reader bindings)
tofu apply: 25 added, 0 changed, 0 destroyed; mostlyright-backend untouched except additive AR reader members
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…billing 5-project cap + staging unblock steps) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cloud Run services/jobs + Cloud Batch + Pub/Sub (DLQ) + Secret Manager data-sources + budgets + monitoring + per-workload runtime SAs, on the flat infra/ layout. Firewalls encoded as data: serving SAs bind R2-read+api-key only; ingest/backfill SAs bind R2-write+EUMETSAT; STT GPU L4 pinned to us-central1 (not offered in europe-west3).
Review fixes folded in:
- R2 write jobs (rolefact, incremental) inject R2_WRITE_ACCESS_KEY_ID / R2_WRITE_SECRET_ACCESS_KEY (the names _r2_sink reads) — a generic R2_ACCESS_KEY_ID left the sink's _require_env unset -> ValueError on upload.
- Backfill Batch job now injects the R2-write + EUMETSAT secret_variables it was missing (it would otherwise upload zero derived parquet).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…seam
- services/earnings: cross-project Pub/Sub bridge (SegmentPublisher/Subscriber) with a structural audio firewall (assert_message_audio_free on publish AND receive; closed MESSAGE_KINDS with no audio kind; lazy GCP import).
- services/weather: /satellite + /capabilities app (R2 read-only), byte-identical to local live modulo the delivery channel.
- satellite/_hosted_client.py (delivery="hosted" seam) + _progress.py (durable, upload-gated crash-safe progress store).
Review fixes folded in:
- /satellite bounds the query window (_MAX_WINDOW_MONTHS) before any R2 I/O — an unbounded far-future end fanned one request out to ~120k object reads (DoS).
- earnings + weather auth compare the key as UTF-8 bytes: hmac.compare_digest raises TypeError on a non-ASCII header, turning a 401 into a 500.
- /stream honors a ?lastEventId= query fallback for an explicit cross-cut resume (a fresh EventSource cannot set the Last-Event-ID header).
- regression tests: over-wide window -> 422; non-ASCII key header -> 401.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
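The window-bounding fix above amounts to counting the months a request would touch and rejecting it before any object read. A minimal sketch follows; the cap value of 24, the InvalidWindow exception, and both function names are assumptions for illustration (the PR names only _MAX_WINDOW_MONTHS and the 422 outcome), and treating end-before-start as invalid is likewise an assumption.

```python
from datetime import date

_MAX_WINDOW_MONTHS = 24  # hypothetical cap; the real value is not stated in the PR


class InvalidWindow(ValueError):
    """Raised before any storage I/O; a route could map this to HTTP 422."""


def months_in_window(start: date, end: date) -> int:
    """Count the calendar months touched by [start, end], inclusive."""
    if end < start:
        raise InvalidWindow("end precedes start")
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


def validate_window(start: date, end: date) -> int:
    """Return the month count, or raise if the window exceeds the cap."""
    n = months_in_window(start, end)
    if n > _MAX_WINDOW_MONTHS:
        raise InvalidWindow(f"window spans {n} months; max is {_MAX_WINDOW_MONTHS}")
    return n
```

Because the check is pure arithmetic on the two dates, it costs nothing to run ahead of the storage layer, which is what removes the amplification.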
Browser/MV3-safe hosted seam: hosted/{fetch,satellite}, earnings/hostedStream
(EventSource + deterministic Last-Event-ID reconnect). No Node APIs.
Review fix folded in: the earnings /stream client now mints a signed, single-scope
?token= locally from the public MOSTLYRIGHT_API_KEY (Web Crypto HMAC-SHA256,
byte-identical to the Python mint_stream_token) instead of sending ?apiKey= — the
server only accepts a signed token, so every hosted stream would have 401'd. The
resume cursor rides ?lastEventId= (now honored server-side). streamToken.test.ts
proves the token verifies + is url-safe + rejects a wrong-key tamper.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
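A signed, single-scope token of the kind described above can be sketched in Python with the stdlib. The token layout here (base64url JSON payload, a dot, base64url HMAC-SHA256 signature), the claim names, and the functions mint_token and verify_token are all assumptions for illustration; the real mint_stream_token format is not shown in this PR.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64url(text: str) -> bytes:
    """Inverse of _b64url; restores the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_token(api_key: str, scope: str, ttl_s: int = 60,
               now: Optional[float] = None) -> str:
    """Sign a {scope, exp} payload with HMAC-SHA256 keyed by the API key."""
    issued = time.time() if now is None else now
    payload = json.dumps({"exp": int(issued) + ttl_s, "scope": scope},
                         separators=(",", ":"), sort_keys=True).encode("utf-8")
    sig = hmac.new(api_key.encode("utf-8"), payload, hashlib.sha256).digest()
    return f"{_b64url(payload)}.{_b64url(sig)}"


def verify_token(api_key: str, token: str, scope: str,
                 now: Optional[float] = None) -> bool:
    """Check signature first, then scope and expiry; never raises on bad input."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload, sig = _unb64url(payload_b64), _unb64url(sig_b64)
    except ValueError:  # wrong part count or invalid base64
        return False
    expected = hmac.new(api_key.encode("utf-8"), payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(payload)
    current = time.time() if now is None else now
    return claims.get("scope") == scope and claims.get("exp", 0) >= current
```

Verifying the signature before parsing the payload means untrusted JSON is never decoded unless it was produced by a holder of the key.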
scripts/check_no_hosted_calls.py enforces D-28.2: the published SDK default path
makes NO hosted call. Two rules over packages/{core,weather,markets}/src — no
hardcoded hosted host, and the opt-in seam identifiers (delivery="hosted",
*_HOSTED_URL) quarantined to 4 allowlisted seam files. Green on the tree (202
modules), trips on a synthetic default-path hosted call. Wired into test.yml.
Adds docs/hosted-api.md + deploy-runbook.md + earnings-synced-deeplink-design.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
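The two rules of the grep-gate can be sketched as a pure function over file contents. This is an illustration under stated assumptions, not scripts/check_no_hosted_calls.py itself: the host pattern uses a placeholder domain, the allowlist holds a single example path rather than the 4 real seam files, and find_violations is a hypothetical name.

```python
import re

# Placeholder host pattern; the real hosted hostnames are not listed in the PR.
HOSTED_HOST = re.compile(r"https?://[\w.-]*example-hosted\.invalid")
# Opt-in seam identifiers named in the commit message.
SEAM_IDENT = re.compile(r"""delivery\s*=\s*["']hosted["']|\b\w+_HOSTED_URL\b""")
# Hypothetical allowlist entry standing in for the allowlisted seam files.
ALLOWLIST = frozenset({"satellite/_hosted_client.py"})


def find_violations(files: dict) -> list:
    """Scan {path: source} and return one message per offending line."""
    violations = []
    for path in sorted(files):
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            if HOSTED_HOST.search(line):
                # Rule 1: no hardcoded hosted host anywhere.
                violations.append(f"{path}:{lineno}: hardcoded hosted host")
            elif path not in ALLOWLIST and SEAM_IDENT.search(line):
                # Rule 2: seam identifiers only inside allowlisted files.
                violations.append(f"{path}:{lineno}: seam identifier outside allowlist")
    return violations
```

A CI wrapper would read the published source trees into the dict and exit non-zero when the returned list is non-empty.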
Iterate Uint8Array with for-of (element is number, not number|undefined) and
assert the token.split('.') tuple in the tests. tsc --noEmit clean; 12 tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
✅ Docs-required check: PASS. API-surface change includes docs updates — no reminder needed.
Parity ticket gate: PASSED
The satellite delivery="hosted" seam (satellite/__init__.py dispatch + _hosted_client.py) is now a cross-SDK public API paired with the TS hosted shim (packages-ts/weather/src/hosted). Registering it as a Python parity trigger makes the parity-ticket gate recognize this PR as a paired-language change AND enforces TS parity on future changes to the Python hosted surface. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_r2_object_key_matches_backfill_sink_layout imports satellite/_backfill
(which pulls _goes_s3 -> boto3 at module scope). The base CI fast-suite runs
WITHOUT the [satellite] extra, so the import raised ModuleNotFoundError and failed
fast-suite (3.11/3.12/3.13) + pandas-3 + polars + coverage-gate. A function-level
pytest.importorskip('boto3') skips it cleanly there; the satellite-coverage lane
(which installs the extra) still runs it. Reproduced with: uv sync --all-packages
(no extra) then pytest.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- #3 SETTLEMENT SAFETY: /satellite no longer swallows every read exception into 200 []. r2_read maps a genuinely-missing object (NoSuchKey/404) to FileNotFoundError; the route skips only that (empty month) and fails LOUD (502) on any real error (bad creds, R2 outage, corrupt parquet) — a silent [] could be misread as 'no data' and corrupt a settlement/training result. +regression test.
- #1: earnings serving injects EARNINGS_API_KEY (the name the app reads) from the unified mostlyright-api-key secret, so it boots AND its expected key == the value the TS/extension client mints its signed ?token= with.
- #4: the Cloud Batch backfill job now runs as the dedicated weather_backfill SA (least-privilege / firewall D) instead of the default compute SA.
- #6: Python satellite(delivery='hosted') fetches a station LIST one request per station (single ICAO each) and concatenates — the server 422s on a joined 'KNYC,KLGA'. +multi-station test.
- #5: the TS hosted shim requires satellite (the server does not auto-route) and loops one request per station instead of repeating station= (which the server drops). +rewritten multi-station test.
Deferred (Codex #2): wiring the SegmentSubscriber into the earnings app lifespan is part of the operator-gated live pipeline (capture->STT->rolefact->pubsub); /stream stays 404 until that lands. Documented in the PR.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
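The one-request-per-station behavior in fix #6 can be sketched with an injected fetch function. The names fetch_stations and fetch_one, and the row shape, are hypothetical; only the rule (a single ICAO per request, results concatenated, never a comma-joined parameter) comes from the commit message.

```python
from typing import Callable, Iterable, List, Union


def fetch_stations(
    stations: Union[str, Iterable[str]],
    satellite: str,
    fetch_one: Callable[[str, str], List[dict]],
) -> List[dict]:
    """Issue one request per station and concatenate the rows in input order."""
    if isinstance(stations, str):
        stations = [stations]  # a bare string is a single station, not a list of chars
    rows: List[dict] = []
    for station in stations:
        if "," in station:
            # A joined value such as 'KNYC,KLGA' is what the server rejects.
            raise ValueError(f"expected a single ICAO code, got {station!r}")
        rows.extend(fetch_one(station, satellite))
    return rows
```

Injecting fetch_one keeps the looping logic testable without a network, in the same spirit as the multi-station test the commit mentions.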
Codex round-3: mapping NoSuchBucket / any 404 to FileNotFoundError (-> 200 []) meant a typoed/deleted/wrong R2_BUCKET would make EVERY satellite query look like 'no data' — a catastrophic silent failure. Restrict the not-found mapping to a missing OBJECT (NoSuchKey); a missing BUCKET (or any other error) now propagates to the route's loud 502. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
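The narrowed not-found mapping can be sketched as follows. StorageError is a stand-in that mimics the response["Error"]["Code"] shape of botocore's ClientError so the example runs without boto3; read_object and read_months are hypothetical names, not the actual r2_read or route code.

```python
class StorageError(Exception):
    """Stand-in for botocore.exceptions.ClientError (same .response shape)."""

    def __init__(self, code: str):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


def read_object(get_object, key: str):
    """Map ONLY a missing object to FileNotFoundError; re-raise everything else."""
    try:
        return get_object(key)
    except StorageError as exc:
        if exc.response["Error"]["Code"] == "NoSuchKey":
            raise FileNotFoundError(key) from exc
        raise  # NoSuchBucket, auth failures, outages: stay loud


def read_months(get_object, keys):
    """Route-level policy: skip missing months, let real errors propagate."""
    out = []
    for key in keys:
        try:
            out.append(read_object(get_object, key))
        except FileNotFoundError:
            continue  # an empty month is a legitimate 'no data'
    return out
```

In a route, the propagated error is what would become the 502, so a wrong bucket name can no longer masquerade as an empty result.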
This was referenced Jul 3, 2026
Phase 28 — Hosted GCE Data Platform (platform + IaC + SDK/serving surface)
Stands up the opt-in hosted path on top of the local-first SDK, per GATE #1 + #2 (both signed 2026-07-02). The SDK default path stays hosted-call-free — hosted is reached only via delivery="hosted" / WEATHER_HOSTED_URL / EARNINGS_HOSTED_URL + MOSTLYRIGHT_API_KEY.
This PR is built on the earlier phase28/hosted-gce-platform foundation; it reconciles that work, fixes everything an adversarial review surfaced, and adds the hosted-call grep-gate. It deliberately scopes out the deploy-runtime containerization layer (see Deferred below).
What's in this PR
Serving apps (non-published — never in any wheel)
- services/weather/ — GET /satellite + /capabilities, R2 read-only, rows byte-identical to local delivery="live" modulo the delivery channel (D-28.2).
- services/earnings/ — cross-project Pub/Sub bridge (pubsub_bridge.py) joining the ingest→serving SSE transport, with a structural audio firewall (assert_message_audio_free on both publish and receive; closed MESSAGE_KINDS with no audio kind; lazy GCP import).
SDK hosted surface (published)
- satellite/_hosted_client.py — the delivery="hosted" seam (opt-in; env-gated).
- satellite/_progress.py — durable, upload-gated crash-safe progress store (a Spot kill between local write and R2 upload leaves the partition unmarked → retried).
- packages-ts/weather/src/{hosted,earnings} — browser/MV3 hosted shim (fetch + satellite + earnings SSE stream). No Node APIs.
Infrastructure (flat infra/ OpenTofu)
- … us-central1 (not offered in europe-west3).
Firewall-A enforcement
- scripts/check_no_hosted_calls.py — the amended default-path grep-gate (green over 202 published modules; trips on a synthetic hosted call; opt-in seam identifiers quarantined to 4 allowlisted files). Wired into test.yml.
Review + fixes
An adversarial review (29 agents, 7 dimensions × the firewalls, refute-by-default verification) confirmed the app code is sound (serving suite 137 green, TS 12 green) and surfaced 11 findings. All fixed + regression-tested:
- … secret_variables it was missing (would otherwise upload zero derived parquet).
- … R2_WRITE_ACCESS_KEY_ID / R2_WRITE_SECRET_ACCESS_KEY (the names _r2_sink reads) — a generic R2_ACCESS_KEY_ID left the sink's _require_env unset → ValueError on upload.
- /satellite bounds the window (_MAX_WINDOW_MONTHS) before any R2 I/O — an unbounded far-future end fanned one request out to ~120k object reads (amplification DoS).
- … hmac.compare_digest raises TypeError on a non-ASCII header, turning a 401 into a 500 (unauthenticated 500-trigger).
- … /stream client mints a signed, single-scope ?token= locally from the public key (Web Crypto HMAC, byte-identical to Python mint_stream_token) instead of ?apiKey= — the server only accepts a signed token, so every hosted stream would have 401'd.
- /stream honors a ?lastEventId= query fallback for the explicit cross-cut resume (a fresh EventSource can't set the header).
- … check_no_hosted_calls.py) was missing entirely — created + CI-wired.
Regression tests added: over-wide window → 422; non-ASCII key header → 401; TS token verify/tamper. Full fast suite is green across 3.11/3.12/3.13 + pandas-3 + polars + coverage.
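The non-ASCII key-header fix listed above relies on a documented property of hmac.compare_digest: it accepts ASCII-only str or bytes-like arguments and raises TypeError for a str containing non-ASCII characters. A minimal sketch of comparing as UTF-8 bytes follows; keys_match and status_for are hypothetical names, not the services' actual auth helpers.

```python
import hmac
from typing import Optional


def keys_match(presented: str, expected: str) -> bool:
    """Constant-time comparison on UTF-8 bytes, safe for any header text."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def status_for(presented: Optional[str], expected: str) -> int:
    """Return the HTTP status an auth check would produce for this header."""
    if presented is None or not keys_match(presented, expected):
        return 401
    return 200
```

Encoding both sides first keeps the comparison constant-time while ensuring a malformed header is just a failed match, so the handler answers 401 rather than raising.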
Two-reviewer discipline complete. A second, independent Codex 5.5-high pass then found a further 5 HIGH findings, which I fixed: /satellite was swallowing all read errors (incl. a wrong/typoed R2 bucket) into 200 [] — now only a missing object (NoSuchKey) is empty, every other failure is a loud 502 (settlement safety); the earnings serving now gets EARNINGS_API_KEY from the unified secret (booted + token-verify aligned); the Cloud Batch backfill runs as its dedicated least-privilege SA; and the Python + TS hosted clients fetch a station list one request per station (single ICAO) with a required satellite, instead of a comma-joined param the server would 422. Codex then verified the fixes hold and all five firewalls hold (A hosted-call-free, B services/ not in wheels, C audio never reaches serving/PubSub/R2, D R2 read-vs-write split, E role_parser/climate untouched).
Deferred to an operator-gated follow-up (not in this PR)
The other session built the platform; the deploy-runtime containerization layer is intentionally staged for gate-time, because it can't be CI-validated or run yet:
- … deploy.yml build step (the two committed deploy workflows are workflow_dispatch-only stubs — they do not auto-fire), and the api.mostlyright.md Cloudflare auto-DNS (token is provisioned; not needed for v1 — serving ships on *.run.app).
- … roles/run.invoker on the serving services (Cloud Run's IAM gate is in front of the app's API-key auth); the deploy SAs' artifactregistry.writer + run.developer + iam.serviceAccountUser (push/deploy/act-as); and the /capabilities uptime check needs an auth path (currently would 401).
- … /stream intentionally 404s live calls until the capture→STT→rolefact→pubsub pipeline lands). The batch /facts /transcripts /capabilities routes are independent and unaffected.
These all land together in the deploy follow-up (built + reviewed as one coherent unit), at which point the cheap-path deploy (serving + daily incremental) runs and the 1-station backfill pilot precedes the cost gate.
🤖 Generated with Claude Code