Server-layer observability (uvicorn/granian/hypercorn) + live admin Observability dashboard #146

Merged: ancongui merged 8 commits into main from feat/server-observability on Jun 17, 2026

@ancongui

Contributor

Summary

Adds observability for the ASGI server layer — until now pyfly only observed the application layer (the http_server_requests_seconds filter, tracing/correlation, process metrics). This surfaces metrics about the server itself across Uvicorn, Granian, and Hypercorn, with correct multi-worker aggregation, and a live admin Observability dashboard section.

Targets release v26.06.113.

How it works (3 cooperating mechanisms)

All three mechanisms write to the Prometheus registry and are exposed automatically at /actuator/prometheus. Everything is gated on pyfly.server.observability.enabled (on under the web/core starters) and degrades to a no-op when prometheus_client is not installed.

  1. ServerMetricsASGIMiddleware (web/adapters/starlette/asgi_server_metrics.py) — the primary, uniform source, installed outermost so it runs in every worker for every server/worker-count. Emits server_active_connections, server_in_flight_requests, server_requests_total.
  2. ServerMetricsBinder (observability/server_metrics.py) — bound from the in-worker ASGI lifespan (beside register_process_metrics / ManagementServer). Emits server_workers, server_uptime_seconds, server_started_total/server_stopped_total, and optional server_native_connections.
  3. ServerStatsPort (server/ports/server_stats.py) — best-effort per-adapter native stats; uvicorn surfaces true socket counts on the serve_async path, granian/hypercorn report workers+uptime only.
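
As a rough illustration of mechanism 1, the sketch below shows how a pure-ASGI middleware can track in-flight and total requests from inside each worker. It is not the actual ServerMetricsASGIMiddleware: the class name `ServerMetricsSketch` and the plain integer attributes are assumptions for illustration, standing in for the real Prometheus gauges and counters.

```python
import asyncio


class ServerMetricsSketch:
    """Hypothetical pure-ASGI middleware; plain ints stand in for Prometheus meters."""

    def __init__(self, app):
        self.app = app
        self.in_flight = 0       # stand-in for server_in_flight_requests
        self.requests_total = 0  # stand-in for server_requests_total

    async def __call__(self, scope, receive, send):
        # Only HTTP requests are counted; lifespan and websocket scopes pass through.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        self.in_flight += 1
        self.requests_total += 1
        try:
            await self.app(scope, receive, send)
        finally:
            # Decrement even when the wrapped app raises.
            self.in_flight -= 1


async def demo():
    """Drive one fake HTTP request through the middleware and report what was seen."""
    seen_in_flight = []

    async def app(scope, receive, send):
        seen_in_flight.append(middleware.in_flight)

    async def receive():
        return {"type": "http.request"}

    async def send(message):
        pass

    middleware = ServerMetricsSketch(app)
    await middleware({"type": "http"}, receive, send)
    return seen_in_flight, middleware.in_flight, middleware.requests_total
```

Because the middleware wraps the whole application, it sees every request regardless of which server or how many workers are running, which is why the PR installs it outermost.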

Why not just read native server stats? On the production pyfly run path, uvicorn.run(workers=N) forks workers that each build their own server, so server_state is unreachable cross-process. The ASGI middleware (in-worker) is the uniform source; native stats are enrichment.

Multi-worker aggregation

pyfly run enables prometheus_client multiprocess mode (PROMETHEUS_MULTIPROC_DIR set before forking) for workers > 1, so one scrape aggregates across all workers via MultiProcessCollector. This also fixes the prior per-worker gap for http_server_requests_*.
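
The aggregation described above can be sketched with prometheus_client's multiprocess support. This is a minimal sketch, not pyfly's pyfly.observability.multiprocess module: the helper names `init_multiprocess_dir` and `render_exposition` are hypothetical, and in a real launcher the environment variable would need to be set before any worker that records metrics first imports prometheus_client.

```python
import os
import tempfile
from typing import Optional

try:
    from prometheus_client import REGISTRY, CollectorRegistry, generate_latest
    from prometheus_client import multiprocess
    HAVE_PROM = True
except ImportError:  # degrade to a no-op when prometheus_client is absent
    HAVE_PROM = False

ENV_VAR = "PROMETHEUS_MULTIPROC_DIR"


def init_multiprocess_dir(workers: int) -> Optional[str]:
    """Hypothetical helper: create the shared metrics dir before workers start."""
    if workers <= 1:
        return None  # a single worker needs no cross-process aggregation
    path = tempfile.mkdtemp(prefix="prom-multiproc-")
    os.environ[ENV_VAR] = path
    return path


def render_exposition() -> bytes:
    """Hypothetical helper: build the scrape body, aggregating if a dir is configured."""
    if not HAVE_PROM:
        return b""
    path = os.environ.get(ENV_VAR)
    if path and os.path.isdir(path):
        # A fresh registry whose only collector merges every worker's metric files.
        registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(registry)
        return generate_latest(registry)
    # Fallback when the dir is unset or missing: serve this process's own registry.
    return generate_latest(REGISTRY)
```

The fallback branch mirrors the "graceful scrape fallback when the dir is missing" behavior mentioned in the review hardening, so a scrape still returns something rather than raising.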

Admin dashboard

New live Observability view (Monitoring group): stat cards (workers, uptime, active connections, in-flight, requests/sec), rolling charts, a per-worker breakdown table, lifecycle, and links to Metrics/Traces. Backed by GET /admin/api/observability + the observability SSE stream.
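
A requests/sec figure is typically derived from successive samples of a monotonically increasing counter such as server_requests_total. The sketch below shows one way to do that with per-consumer state, so that a REST call and an SSE stream do not overwrite each other's previous sample. The class name `RateTracker` is hypothetical and this is not the ObservabilityProvider's actual code.

```python
import time
from typing import Optional


class RateTracker:
    """Hypothetical per-stream rate state: one instance per REST client or SSE stream."""

    def __init__(self) -> None:
        self._last_count: Optional[float] = None
        self._last_time: Optional[float] = None

    def update(self, count: float, now: Optional[float] = None) -> float:
        """Record a counter sample and return requests/sec since the previous sample."""
        now = time.monotonic() if now is None else now
        rate = 0.0
        if self._last_count is not None and now > self._last_time:
            delta = count - self._last_count
            # A negative delta means the counter was reset (e.g. a restart);
            # report 0.0 for that interval instead of a negative rate.
            if delta >= 0:
                rate = delta / (now - self._last_time)
        self._last_count, self._last_time = count, now
        return rate
```

If two consumers shared one tracker, each call would shorten the other's measurement window and distort the rate, which matches the corruption the review fix describes.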

Config

pyfly.server.observability.{enabled,sample-interval-seconds,access-log}. Local Prometheus+Grafana stack added to docker-compose.yml (loopback-bound; ops/prometheus/prometheus.yml).

Scope

Gunicorn is intentionally not added (stack stays async-only ASGI), but the ServerStatsPort + multiprocess design is gunicorn-ready.

Quality

  • TDD throughout; 4890 tests pass, ruff clean, mypy --strict clean (683 files).
  • A real-server E2E test (tests/server/test_server_observability_e2e.py) boots uvicorn via serve_async, fires HTTP traffic, and asserts the server_* meters move and are served.
  • Hardened against an adversarial multi-agent review (binder exception-safety + graceful-shutdown resilience, off-thread sampling, multiprocess dir cleanup + graceful scrape fallback, per-stream requests/sec, auto server-label resolution) and a security review (loopback-bound, no hardcoded Grafana admin password, anonymous read-only).
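
The binder hardening (exception-safe sampling, off-thread sample(), and a stop() that always records the stop) can be illustrated with a small asyncio sketch. The class `SamplingBinderSketch` and its attributes are assumptions for illustration, not the real ServerMetricsBinder; a plain integer stands in for server_stopped_total.

```python
import asyncio
import contextlib


class SamplingBinderSketch:
    """Hypothetical binder: samples periodically and shuts down without re-raising."""

    def __init__(self, sample, interval: float = 1.0) -> None:
        self._sample = sample      # blocking callable returning the latest stats
        self._interval = interval
        self._task = None
        self.last_value = None
        self.stopped_total = 0     # stand-in for server_stopped_total

    async def _run(self) -> None:
        while True:
            try:
                # Run the blocking sample in a worker thread, off the event loop.
                self.last_value = await asyncio.to_thread(self._sample)
            except Exception:
                pass  # a failing sample must not end the sampling loop
            await asyncio.sleep(self._interval)

    def start(self) -> None:
        self._task = asyncio.create_task(self._run())

    async def stop(self) -> None:
        task, self._task = self._task, None
        try:
            if task is not None:
                task.cancel()
                # Swallow cancellation and any error from an already-dead task.
                with contextlib.suppress(asyncio.CancelledError, Exception):
                    await task
        finally:
            self.stopped_total += 1  # always recorded, even if the task had died


async def demo() -> int:
    """Start a binder whose sample always fails, then stop it cleanly."""
    def flaky_sample():
        raise RuntimeError("native stats unavailable")

    binder = SamplingBinderSketch(flaky_sample, interval=0.001)
    binder.start()
    await asyncio.sleep(0.02)
    await binder.stop()
    return binder.stopped_total
```

The design point is that graceful shutdown should never depend on the health of the sampling task: stop() treats the task's outcome as irrelevant and records the lifecycle event in a finally block.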

Docs updated: observability/server/admin module docs, README, ROADMAP, and the observability book chapter (EN + ES).

🤖 Generated with Claude Code

ancongui added 8 commits June 17, 2026 16:42
Add server-observability backend: a ServerStatsPort outbound port (best-effort
native stats on the serve_async path), per-adapter sample() implementations, a
pure-ASGI server-metrics middleware (the uniform primary source for
connections/in-flight/requests across all servers and worker counts), and a
ServerMetricsBinder that emits worker/uptime/lifecycle meters from the in-worker
ASGI lifespan. Gated by pyfly.server.observability.* (on under web/core starters).
Wired into both the Starlette and FastAPI create_app lifespans.

Add pyfly.observability.multiprocess (init dir before workers fork, build an
aggregating MultiProcessCollector registry). /actuator/prometheus aggregates
across workers when PROMETHEUS_MULTIPROC_DIR is set; cli run enables it for
workers>1. Fixes the pre-existing per-worker gap for http_server_requests +
server_* meters.

Add an ObservabilityProvider (reads server_* meters, multiprocess-aware, with a
per-worker breakdown), REST + SSE routes (/admin/api/observability[,/sse]), and a
live observability.js view (stat cards, rolling charts, per-worker table, links to
Metrics/Traces) registered in the SPA router + sidebar.
…stack

End-to-end test boots a real uvicorn server via serve_async, fires HTTP traffic,
and asserts the server_* meters move and are served at the exposition. Add a
prometheus + grafana docker-compose stack (ops/prometheus/prometheus.yml) scraping
/actuator/prometheus.
…ity view

Update observability/server/admin module docs, README + ROADMAP, and the
observability book chapter (EN + ES) with the server_* metric catalog, the
pyfly.server.observability.* config, multi-worker aggregation, and the live
admin Observability dashboard section.
…urity review

- Binder: guard all gauge writes + run sample() off-thread; _run never dies
  silently; stop() always records server_stopped_total and cleans up even if the
  sampling task died (was: stop() re-raised a dead task's exception, breaking
  graceful shutdown). Mark workers dead in multiprocess mode on graceful stop.
- Resolve the concrete server type ('auto' -> uvicorn/granian/hypercorn) so the
  server_* metric label is meaningful; binder falls back off the 'auto' sentinel.
- Admin provider: honor pyfly.server.observability.enabled (disabled -> dashboard
  empty-state); fix falsy-zero native_connections; move requests/sec to per-stream
  state (was corrupted by sharing one provider across REST + SSE + tabs).
- Multiprocess: graceful scrape fallback when the dir is missing; atexit cleanup +
  stale-dir sweep so mmap dirs don't accumulate across restarts.
- ASGI exclusion matches /api/sse/ as a substring (custom admin paths too).
- docker-compose: bind prometheus/grafana to loopback, drop the hardcoded admin
  password, downgrade anonymous Grafana to read-only Viewer (security review).

Bump version 26.06.112 -> 26.06.113 and add the CHANGELOG entry for the
server-layer observability feature (metrics across uvicorn/granian/hypercorn,
multi-worker aggregation, live admin Observability dashboard).

CI runs 'ruff format --check'; format the 4 new files to satisfy it.
@ancongui ancongui merged commit 332776e into main Jun 17, 2026
6 checks passed
@ancongui ancongui deleted the feat/server-observability branch June 17, 2026 16:05