
feat(readiness-probe-improvements): AIPLAT-916#185

Open
subpath wants to merge 6 commits into main from feat-readiness-probe-improvements-AIPLAT-916

Conversation

@subpath subpath commented Jun 22, 2026

Collaborator

Jira ticket: AIPLAT-916
The companion infra PR should be merged first: https://github.com/mozilla/dataservices-infra/pull/2090

What's New

/health/readiness now answers "can this pod serve a request right now?" via the HTTP status code, instead of always returning 200.

Kubernetes routes traffic based on the readiness probe's status code, so a pod whose Postgres pool is down or whose app_attest schema is stale is now removed from rotation instead of serving 500s.

Three checks run concurrently (asyncio.gather), each bounded by READINESS_CHECK_TIMEOUT_S (2s). All pass -> 200; any fail -> 503 with the failing check named in the body:

  1. litellm pool: real SELECT 1 (replaces the old _closed-flag read, which only knew if .close() was called).
  2. app_attest pool + migration head: reads alembic_version and asserts it equals the Alembic head(s) the running code ships. The read doubles as the pool's liveness check.
  3. LiteLLM readiness: requires HTTP 200 and db == "connected" (the old code parsed the body but never checked the status, so a 503-with-body read as healthy).
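
For reviewers who want the shape of the logic without opening the diff, here is a minimal sketch of the pattern described above: per-check pass/fail rules, each check bounded by a timeout, all run through asyncio.gather, and the results folded into 200 or 503. It is not the code in this PR. The function names (`migration_head_ok`, `litellm_ready`, `readiness`) and the response body keys (`failed`, `not_ready`) are hypothetical, and the checks are passed in as plain coroutines so the sketch runs without a database.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Set, Tuple

READINESS_CHECK_TIMEOUT_S = 2.0  # 2s, as stated in the PR description

# A check returns None on success, or a short failure reason (hypothetical contract).
Check = Callable[[], Awaitable[Optional[str]]]


def migration_head_ok(db_versions: Set[str], code_heads: Set[str]) -> bool:
    """Check 2 rule: alembic_version rows must equal the head(s) the code ships."""
    return db_versions == code_heads


def litellm_ready(status_code: int, body: dict) -> bool:
    """Check 3 rule: require HTTP 200 *and* db == "connected"."""
    return status_code == 200 and body.get("db") == "connected"


async def _bounded(name: str, check: Check, timeout_s: float) -> Tuple[str, Optional[str]]:
    """Run one check under a timeout; turn timeouts and errors into a reason string."""
    try:
        return name, await asyncio.wait_for(check(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return name, f"timed out after {timeout_s}s"
    except Exception as exc:  # a failing dependency must not crash the probe
        return name, f"{type(exc).__name__}: {exc}"


async def readiness(
    checks: Dict[str, Check], timeout_s: float = READINESS_CHECK_TIMEOUT_S
) -> Tuple[int, dict]:
    """Run all checks concurrently; 200 if all pass, else 503 naming the failures."""
    results = await asyncio.gather(
        *(_bounded(name, check, timeout_s) for name, check in checks.items())
    )
    failed = {name: reason for name, reason in results if reason is not None}
    if failed:
        return 503, {"status": "not_ready", "failed": failed}
    return 200, {"status": "ready"}
```

Because every check is wrapped before it reaches asyncio.gather, one slow or raising dependency cannot delay or mask the others: the probe's worst-case latency stays close to the single timeout rather than the sum of all three.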

Liveness stays a constant {"status": "alive"}, so a dependency blip drains the pod without ever restarting it. This change introduces no crash-loop path.
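
The split between the two probes can be illustrated with a small sketch. It assumes liveness answers with HTTP 200 (the description only states the body), and the function names are hypothetical rather than taken from this PR.

```python
from typing import List, Tuple


def liveness() -> Tuple[int, dict]:
    """Never consults dependencies, so a dependency outage cannot trigger a restart."""
    return 200, {"status": "alive"}  # 200 is an assumption; the body is from the PR


def readiness_status(failed_checks: List[str]) -> int:
    """Readiness is the only probe that reflects dependency state."""
    return 503 if failed_checks else 200
```

With a failing check, `readiness_status(["litellm_pool"])` yields 503 while `liveness()` still reports alive, which is the combination that removes the pod from rotation without restarting it.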

QA:

All new unit tests pass ✅

@subpath subpath requested a review from a team as a code owner June 22, 2026 11:35