eugeneliukindev/fastapi-observability-docker-stack

Observability Docker Stack

A full observability stack for a Gunicorn/FastAPI application, running entirely in Docker Compose. Covers all four pillars — metrics · logs · traces · profiles — with alerting out of the box.

Grafana dashboard: FastAPI Full Observability


Dashboard Preview

Overview

Open http://localhost:3000 after starting the stack. Use the Project dropdown to filter by Docker Compose project name.


Architecture

How each signal travels

Metrics

  1. Backend exposes /metrics in Prometheus format (multiprocess-safe via the Prometheus client's multiprocess mode)
  2. Alloy pulls it every 15s — opt-in via metrics.scrape: "true" docker label on any service
  3. Alloy remote-writes to VictoriaMetrics
  4. vmalert queries VictoriaMetrics every 15s → fires to Alertmanager → Telegram / email

Logs

  1. Backend writes structured logfmt to stdout (includes trace_id on every line)
  2. Alloy reads container stdout via Docker API — no log driver config needed
  3. Docker Compose labels (project, service, container) are attached as Loki stream labels
  4. Logs are queryable in Grafana with LogQL

Traces

  1. Backend's OpenTelemetry SDK sends spans via OTLP gRPC → Alloy (:4317) → Tempo
  2. Tempo generates span metrics (RPS, latency, errors per operation) and pushes them to VictoriaMetrics
  3. trace_id in log lines creates a live link from any log entry to its full trace

Profiles

  1. Pyroscope SDK pushes CPU flame graphs via HTTP → Alloy (:4040) → Pyroscope storage
  2. Grafana links profiles to traces via Tempo's tracesToProfiles integration

How the Backend Works

The backend/ directory is a minimal FastAPI + Gunicorn application wired with all four observability signals. It is intentionally kept simple — the goal is to show the instrumentation, not the business logic.

Middleware stack

Every request passes through two middlewares, in order, before it reaches the router:

RequestAccessMiddleware → MetricsMiddleware → FastAPI router

| Middleware | What it does |
| --- | --- |
| RequestAccessMiddleware | Generates request_id, writes a structured logfmt access log line with method, path, status, duration |
| MetricsMiddleware | Records requests_total, responses_total, request_duration_seconds, requests_in_progress, exceptions_total |
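
As a rough illustration of what the two middlewares do together, the sketch below is a single pure-ASGI middleware built on the standard library only. The class name `TimingMiddleware`, the `stats` dict, and `demo_app` are hypothetical stand-ins; the repository's own middlewares are separate classes and record real Prometheus metrics.

```python
import asyncio
import time
import uuid


class TimingMiddleware:
    """Hypothetical ASGI middleware: request ID, access log line, simple counters."""

    def __init__(self, app, stats):
        self.app = app
        self.stats = stats  # plain dict standing in for Prometheus metrics

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        request_id = uuid.uuid4().hex
        status = {"code": 500}  # assume failure until a response starts
        started = time.perf_counter()
        self.stats["in_progress"] = self.stats.get("in_progress", 0) + 1

        async def send_wrapper(message):
            # Capture the status code and add the X-Request-Id header.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
                headers = list(message.get("headers", []))
                headers.append((b"x-request-id", request_id.encode()))
                message = {**message, "headers": headers}
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        except Exception as exc:
            key = f"exceptions:{type(exc).__name__}"
            self.stats[key] = self.stats.get(key, 0) + 1
            raise
        finally:
            self.stats["in_progress"] -= 1
            duration = time.perf_counter() - started
            key = f"responses:{status['code']}"
            self.stats[key] = self.stats.get(key, 0) + 1
            print(
                f"request_id={request_id} method={scope['method']} "
                f"path={scope['path']} status={status['code']} "
                f"duration={duration:.4f}"
            )


async def demo_app(scope, receive, send):
    """Tiny ASGI app used only for the demonstration."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def run_demo():
    stats, sent = {}, []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    app = TimingMiddleware(demo_app, stats)
    await app({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return stats, sent


stats, sent = asyncio.run(run_demo())
```

Wrapping `send` is what lets the middleware see the status code and attach the response header without buffering the body.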

Multiprocess metrics

Gunicorn forks multiple worker processes, and each one keeps its own in-memory metrics, so a scrape answered by a single worker would report only that worker's numbers. The backend therefore runs the Prometheus client in multiprocess mode: each worker writes its metrics to files in a shared directory (/tmp/prometheus_multiproc), and the /metrics endpoint aggregates all of those files before responding.

Worker lifecycle hooks in gunicorn.conf.py ensure per-worker gauges are cleaned up on exit.
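
The idea can be sketched with a toy model that uses only the standard library. This is not the Prometheus client's on-disk format, and the function and file names here are made up for illustration. In the real library the shared directory is set through the `PROMETHEUS_MULTIPROC_DIR` environment variable, aggregation is done by `prometheus_client.multiprocess.MultiProcessCollector`, and worker cleanup by `multiprocess.mark_process_dead(pid)`.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def write_worker_metrics(directory, pid, requests_total, in_progress):
    """Each worker persists its own values to per-PID files."""
    base = Path(directory)
    (base / f"counter_{pid}.json").write_text(
        json.dumps({"requests_total": requests_total})
    )
    (base / f"gauge_live_{pid}.json").write_text(
        json.dumps({"requests_in_progress": in_progress})
    )


def aggregate(directory):
    """What a /metrics handler would do: sum the files of all workers."""
    total = Counter()
    for path in Path(directory).glob("*.json"):
        total.update(json.loads(path.read_text()))
    return dict(total)


def on_worker_exit(directory, pid):
    """Drop the dead worker's live gauge; its counters stay in the totals."""
    (Path(directory) / f"gauge_live_{pid}.json").unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as shared_dir:
    write_worker_metrics(shared_dir, 101, requests_total=7, in_progress=2)
    write_worker_metrics(shared_dir, 102, requests_total=5, in_progress=1)
    before = aggregate(shared_dir)
    on_worker_exit(shared_dir, 101)  # simulate worker 101 exiting
    after = aggregate(shared_dir)
```

Counters from an exited worker are kept so totals never go backwards, while its live gauge is removed so in-flight numbers do not include a process that no longer exists.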

OpenTelemetry

Traces are sent via OTLP gRPC to Alloy (:4317) using BatchSpanProcessor. FastAPIInstrumentor automatically creates a root span for every request. PyroscopeSpanProcessor links each root span to a Pyroscope profile — enabling the Profiles button in Tempo.

Structured logging

Every log line is emitted in logfmt format and includes:

  • level, timestamp, message
  • request_id — unique per request, also returned as X-Request-Id response header
  • trace_id, span_id — injected from the active OpenTelemetry span, enabling Logs → Traces navigation in Grafana
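
A minimal sketch of such a formatter, using only the standard `logging` module, might look like this. `LogfmtFormatter`, `logfmt_value`, and `format_ids` are hypothetical names, and the IDs are made up; in the real backend they would come from the active OpenTelemetry span, where trace and span IDs are integers rendered as 32 and 16 hex characters.

```python
import io
import logging
from datetime import datetime, timezone


def logfmt_value(value):
    """Quote a value when it contains spaces, quotes or '='."""
    text = str(value)
    if text == "" or any(ch in text for ch in ' "='):
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return text


class LogfmtFormatter(logging.Formatter):
    EXTRA_FIELDS = ("request_id", "trace_id", "span_id")

    def format(self, record):
        fields = {
            "level": record.levelname,
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "message": record.getMessage(),
        }
        # Correlation fields are optional: only emit the ones that are set.
        for name in self.EXTRA_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                fields[name] = value
        return " ".join(f"{k}={logfmt_value(v)}" for k, v in fields.items())


def format_ids(trace_id, span_id):
    """128-bit trace ID -> 32 hex chars, 64-bit span ID -> 16 hex chars."""
    return format(trace_id, "032x"), format(span_id, "016x")


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(LogfmtFormatter())
logger = logging.getLogger("logfmt-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

trace_id, span_id = format_ids(0x1234, 0xAB)  # made-up IDs for the demo
logger.info(
    "order created",
    extra={"request_id": "req-1", "trace_id": trace_id, "span_id": span_id},
)
line = stream.getvalue().strip()
```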

Grafana Dashboard

Overview — Apdex · Error Rate · Total Requests · Workers

Overview

  • Apdex — user satisfaction score based on latency thresholds
  • Error Rate — percentage of 5xx responses
  • Total Requests — cumulative request count
  • Workers — live Gunicorn worker count (step graph, drops to 0 on crash)
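
For reference, the standard Apdex formula counts requests at or below a threshold T as satisfied, requests between T and 4T as tolerating (worth half), and everything slower as frustrated. The sketch below applies it to a list of durations; the threshold of 0.5 s is an assumption for the example, not necessarily the value the dashboard uses.

```python
def apdex(durations, threshold):
    """Apdex = (satisfied + tolerating / 2) / total."""
    if not durations:
        return None
    satisfied = sum(1 for d in durations if d <= threshold)
    tolerating = sum(1 for d in durations if threshold < d <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(durations)


# Three fast requests, one tolerable (0.9 s), one frustrating (3.0 s).
score = apdex([0.1, 0.2, 0.4, 0.9, 3.0], threshold=0.5)
```

Here the score is (3 + 0.5) / 5 = 0.7.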

Throughput (RPS) · P50 Latency

RPS

  • RPS total, broken down by path and by status code
  • P50 latency — total and per path

Latency — P95 · P99

Latency

P95 and P99 — total and broken down by path.
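
Assuming request_duration_seconds is a Prometheus histogram (the usual choice, though not stated here), percentiles like these are estimated from cumulative bucket counts rather than from raw samples. The sketch below mimics the linear interpolation that `histogram_quantile` performs; the bucket bounds and counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p50 = histogram_quantile(0.50, buckets)
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
```

Because the result is interpolated inside a bucket, its accuracy depends on how finely the buckets are spaced around the latencies you care about.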

Slowest Endpoints

  • Top 10 Slowest Endpoints (P95 bar gauge, color-coded green → red)
  • Average Duration by endpoint

In-flight Requests

Requests In Progress

Requests currently being processed — total and by path. Useful for detecting request pile-ups.


Errors

  • 4xx Rate by path — client errors
  • 5xx Rate by path — server errors
  • Exceptions by Type — unhandled Python exceptions with rate

Traces & Span Metrics

Traces

  • Service Map — visual request flow graph with avg latency and RPS per node
  • Recent Traces — clickable list, opens full trace in Tempo
  • Span RPS / P99 Latency / Error Rate — broken down per operation

Logs

  • Log Volume by Level — histogram showing INFO / ERROR / WARNING over time
  • Log Stream — live log view; each line includes request_id, trace_id, span_id

Profiling

  • CPU Time Consumed — total CPU usage over time across all workers
  • Flame Graph — aggregated call stack for the selected time range; shows hottest functions by self/total CPU time

Cross-signal Navigation

Every log line contains a trace_id linking it to a distributed trace.

Logs → Traces:

  1. Click any line in the Log Stream to expand it
  2. In the Links section, click "Open in Tempo"
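
The link works because the trace ID can be pulled out of the log line itself. The sketch below shows one way to do that extraction in plain Python; the parser, the function names, and the sample line are illustrative only, and Grafana's own matching is configured in its data source settings rather than in code like this.

```python
import re

# key=value pairs where the value is either a quoted string or a bare token.
LOGFMT_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Parse a logfmt line into a dict of string fields."""
    fields = {}
    for key, raw in LOGFMT_PAIR.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1].replace('\\"', '"').replace("\\\\", "\\")
        fields[key] = raw
    return fields


def extract_trace_id(line):
    """Return the trace ID if the line carries a well-formed one."""
    trace_id = parse_logfmt(line).get("trace_id", "")
    return trace_id if re.fullmatch(r"[0-9a-f]{32}", trace_id) else None


sample = (
    'level=INFO message="order created" request_id=req-1 '
    "trace_id=0af7651916cd43dd8448eb211c80319c span_id=b7ad6b7169203331"
)
fields = parse_logfmt(sample)
found = extract_trace_id(sample)
```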

Log to Trace

From a Trace span you can jump to:

  • Logs for this span — correlated log lines in Loki
  • Span metrics — RPS / latency / error rate for that operation
  • Profile — CPU flame graph for that request

Generate Load

To populate the dashboards with real traffic, run the included load script:

./load.sh           # ~20 req/s against http://localhost:8000
./load.sh 50        # custom rate
./load.sh 10 http://localhost:8001  # custom rate + URL

The script cycles through all endpoints — normal requests, 4xx, 5xx, slow — so every dashboard panel gets data within seconds.


Quick Start

1. Configure alerting

Edit observability/alertmanager/config.yaml and fill in the placeholders:

global:
  smtp_smarthost: '<SMTP_HOST>:<SMTP_PORT>'   # e.g. smtp.gmail.com:587
  smtp_from: '<SMTP_FROM>'
  smtp_auth_username: '<SMTP_USERNAME>'
  smtp_auth_password: '<SMTP_PASSWORD>'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '<ALERT_EMAIL>'

    telegram_configs:
      - bot_token: '<TELEGRAM_BOT_TOKEN>'     # from @BotFather → /newbot
        chat_id: <TELEGRAM_CHAT_ID>           # from @userinfobot

If you only need Telegram, remove email_configs; if you only need email, remove telegram_configs.
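
Before starting the stack, it can help to confirm that no placeholder was left behind. The following is an optional helper, not part of the repository: it scans config text for leftover `<UPPER_CASE>` tokens and ignores trailing comments. The sample text is a shortened, partly filled-in variant of the config above.

```python
import re

PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*>")


def unfilled_placeholders(config_text):
    """Return (line_number, placeholder) pairs still present in the config."""
    found = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        # Naive comment stripping: good enough for a check, not a YAML parser.
        code = line.split("#", 1)[0]
        for match in PLACEHOLDER.findall(code):
            found.append((number, match))
    return found


sample = """\
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '<SMTP_FROM>'
receivers:
  - name: 'default-receiver'
    telegram_configs:
      - bot_token: '<TELEGRAM_BOT_TOKEN>'     # from @BotFather
        chat_id: 123456
"""
missing = unfilled_placeholders(sample)
```

To check the real file, read observability/alertmanager/config.yaml and pass its contents to `unfilled_placeholders`.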

2. Start

docker compose up -d

3. Open Grafana

http://localhost:3000   →  admin / admin

Stack

| Component | Role | Version |
| --- | --- | --- |
| Grafana Alloy | Collector: scrapes metrics, collects logs, receives traces & profiles | v1.12.0 |
| VictoriaMetrics | Metrics storage (Prometheus-compatible) | v1.131.0 |
| vmalert | Evaluates PromQL alert rules against VictoriaMetrics | v1.131.0 |
| Alertmanager | Alert routing, deduplication & notifications | v0.29.0 |
| Loki | Log storage & querying | v3.5.9 |
| Tempo | Distributed trace storage + span metrics generation | v2.8.0 |
| Pyroscope | Continuous profiling storage | v1.17.0 |
| Grafana | Dashboards, Explore, cross-signal navigation | v12.4.0 |
| Backend | Example FastAPI app (Gunicorn + Uvicorn workers) | |

Alerting

Rules live in observability/vmalert/rules/fastapi.yaml and are evaluated every 15s.

| Alert | Fires when | Severity | Delay |
| --- | --- | --- | --- |
| BackendDown | No metrics received from backend | critical | 1m |
| HighErrorRate | 5xx responses > 5% of total | warning | 2m |
| HighLatencyP99 | p99 latency > 1s | warning | 2m |

  • Critical alerts suppress warnings with the same alertname via inhibit rules
  • All alerts route to default-receiver → Telegram + email
  • To adjust thresholds, edit expr in observability/vmalert/rules/fastapi.yaml
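
The Delay column means a condition has to hold continuously for that long before the alert fires. The simulation below sketches that behaviour in the Prometheus style, using the 15s evaluation interval and the HighErrorRate threshold and delay from the table. The function and the sample ratios are illustrative, and vmalert's implementation may differ in details.

```python
def alert_states(samples, threshold, for_seconds, interval=15):
    """Return the state after each evaluation: inactive, pending or firing."""
    states = []
    breached_for = None  # seconds since the condition first became true
    for value in samples:
        if value > threshold:
            breached_for = 0 if breached_for is None else breached_for + interval
            states.append("firing" if breached_for >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer.
            breached_for = None
            states.append("inactive")
    return states


# One healthy evaluation, nine with 8% errors, then recovery.
ratios = [0.01] + [0.08] * 9 + [0.02]
states = alert_states(ratios, threshold=0.05, for_seconds=120)
```

With these inputs the alert stays pending for eight evaluations and fires on the ninth, once the 5% threshold has been exceeded for a full two minutes.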

Adding Your Own Service

To opt a service into metrics scraping, add these docker labels:

services:
  my-service:
    labels:
      metrics.scrape: "true"           # required
      metrics.path: "/custom/metrics"  # optional, defaults to /metrics

Alloy auto-discovers the container and attaches project, service, container labels. No Alloy config changes needed.

Logs are collected from all running containers automatically — no labels required.
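
The opt-in logic described above can be pictured as a filter over container labels. The sketch below is a plain-Python model of it, not Alloy configuration: the container dicts, addresses, and function name are invented, while `com.docker.compose.project` and `com.docker.compose.service` are the labels Docker Compose sets on its containers.

```python
def scrape_targets(containers, default_path="/metrics"):
    """Keep only containers that opted in and derive their scrape settings."""
    targets = []
    for container in containers:
        labels = container.get("labels", {})
        if labels.get("metrics.scrape") != "true":
            continue  # metrics are opt-in
        targets.append(
            {
                "address": container["address"],
                "metrics_path": labels.get("metrics.path", default_path),
                "project": labels.get("com.docker.compose.project", ""),
                "service": labels.get("com.docker.compose.service", ""),
                "container": container["name"],
            }
        )
    return targets


containers = [
    {
        "name": "dev-my-service-1",
        "address": "172.18.0.5:8000",
        "labels": {
            "metrics.scrape": "true",
            "metrics.path": "/custom/metrics",
            "com.docker.compose.project": "dev",
            "com.docker.compose.service": "my-service",
        },
    },
    {
        "name": "dev-redis-1",
        "address": "172.18.0.6:6379",
        "labels": {"com.docker.compose.project": "dev"},
    },
]
targets = scrape_targets(containers)
```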


Multiple Environments

Run dev and staging side by side:

docker compose -p dev     up -d
docker compose -p staging up -d

Each project gets its own project label on all metrics and logs. Switch between them in Grafana using the Project dropdown at the top of the dashboard.


Ports

| Service | Port | Purpose |
| --- | --- | --- |
| Grafana | 3000 | Dashboards (http://localhost:3000) |
| Backend API | 8000 | FastAPI docs (http://localhost:8000/docs) |
| Alloy | 12345 | Pipeline debug UI (http://localhost:12345) |
