feat: add OpenTelemetry metrics for the hosted MCP server #50

Merged
ChiragAgg5k merged 6 commits into main from feat/opentelemetry-metrics on Jun 29, 2026
Conversation

@ChiragAgg5k

Member

What & why

The MCP server emitted no telemetry. The other Appwrite services use utopia-php/telemetry to export OpenTelemetry metrics over OTLP/HTTP to the shared stack (OTel Collector → Prometheus/Mimir → Grafana at telemetry.appwrite.systems). This wires the hosted HTTP MCP server into that same stack so we can see:

  1. Request rate per endpoint — MCP methods and per Appwrite service/action.
  2. Which agents connect — MCP clientInfo (Claude, Cursor, …) + OAuth client_id.
  3. Logins / active users — initialize handshakes and distinct authenticated users.
  4. Errors, latency, write confirmations, search/docs usage, uploads, auth rejections.

Approach

  • New telemetry.py owns the MeterProvider, all instruments, in-process active user/client TTL sets, and exception-safe record_* helpers. No-op unless transport is http and an OTLP endpoint is configured — self-hosted stdio never phones home; an unconfigured hosted server stays silent (mirrors the PHP None/NoTelemetry adapter).
  • Instrumented at the operator/handler/auth boundaries (server.py, operator.py, auth.py, http_app.py, docs_search.py).
  • Cardinality discipline: user ids (sub) are never labels — distinct counts are derived in-process and exposed only via the aggregate gauges mcp.users.active / mcp.clients.active.
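The gating and exception-safety rules above can be illustrated with a minimal sketch. It is not the code in telemetry.py: `telemetry_enabled`, `Recorder`, and the `sink` callable are hypothetical names, and the sketch checks only `OTEL_EXPORTER_OTLP_ENDPOINT`, standing in for whatever endpoint check the real module performs.

```python
import logging
from typing import Callable, Mapping, Optional

log = logging.getLogger("telemetry_sketch")


def telemetry_enabled(transport: str, env: Mapping[str, str]) -> bool:
    """Return True only for the HTTP transport with a non-empty OTLP endpoint."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    return transport == "http" and bool(endpoint)


class Recorder:
    """Hypothetical wrapper around the instruments.

    Every call is a no-op when telemetry is disabled, and a failing
    instrument never propagates into the request path.
    """

    def __init__(
        self,
        enabled: bool,
        sink: Optional[Callable[[str, dict], None]] = None,
    ) -> None:
        self._enabled = enabled
        self._sink = sink  # stand-in for a real counter/histogram call

    def record(self, name: str, attributes: dict) -> None:
        if not self._enabled or self._sink is None:
            return
        try:
            self._sink(name, attributes)
        except Exception:
            # Swallow and log: telemetry must not break a request.
            log.debug("telemetry record failed", exc_info=True)
```

With this shape, a stdio server or an HTTP server without an endpoint constructs `Recorder(False)` and every `record` call returns immediately.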

Metrics

All metrics are prefixed mcp.: mcp.requests, mcp.request.duration, mcp.tool.calls, mcp.appwrite.calls / .call.duration / .errors, mcp.write.confirmations, mcp.initializations, mcp.users.active, mcp.clients.active, mcp.auth.validations / .duration, mcp.search_tools.*, mcp.search_docs.*, mcp.context.requests, mcp.results.stored, mcp.resources.reads, mcp.uploads / .upload.bytes / .upload.errors, mcp.server.info, mcp.startup.validation.

Config

Standard OTEL_EXPORTER_OTLP_ENDPOINT / OTEL_EXPORTER_OTLP_HEADERS / OTEL_RESOURCE_ATTRIBUTES, plus the cloud-style _APP_TELEMETRY_OTLP_ENDPOINT / _APP_TELEMETRY_OTLP_HEADERS aliases. Documented in AGENTS.md and compose.yaml. Runtime values are set in the declarative deploy repo (assets-applications <env>/mcp/default.yaml), not in CI.
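As a rough illustration of how the standard variables and the cloud-style aliases described here could be resolved, the sketch below prefers the standard name and falls back to the alias. The precedence order and the helper names `resolve_setting` and `parse_otlp_headers` are assumptions, not taken from the PR. The header format (comma-separated `key=value` pairs) follows the OpenTelemetry convention for `OTEL_EXPORTER_OTLP_HEADERS`; URL-decoding of values is omitted.

```python
from typing import Dict, Mapping, Optional


def resolve_setting(env: Mapping[str, str], standard: str, alias: str) -> Optional[str]:
    """Return the first non-empty value, checking the standard name first (assumed order)."""
    for key in (standard, alias):
        value = env.get(key, "").strip()
        if value:
            return value
    return None


def parse_otlp_headers(raw: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict, skipping malformed pairs."""
    headers: Dict[str, str] = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            headers[key.strip()] = value.strip()
    return headers
```

For example, `resolve_setting(os.environ, "OTEL_EXPORTER_OTLP_ENDPOINT", "_APP_TELEMETRY_OTLP_ENDPOINT")` would yield the endpoint or `None`, and `None` maps onto the no-op path.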

Testing

  • New tests/unit/test_telemetry.py (in-memory metric reader; no collector needed): label correctness, write-block, auth reasons, session dedupe, and the no-op path.
  • All 86 unit tests pass; ruff + black clean; Docker image builds.
  • Verified end-to-end against a real otel/opentelemetry-collector — metrics land with correct names/attributes and the deployment.environment.name resource attribute flows through.

Dashboards

Companion Grafana dashboards (MCP/overview.json, MCP/adoption.json) are in a separate PR against the dashboards repo.

Instrument the hosted HTTP transport with OpenTelemetry metrics, exported
over OTLP/HTTP to the shared Appwrite observability stack (OTel Collector ->
Prometheus/Mimir -> Grafana), mirroring the utopia-php/telemetry pattern used
by the PHP services.

- New telemetry.py owns the MeterProvider, instruments, in-process active
  user/client TTL sets, and exception-safe record_* helpers. No-op unless the
  transport is http and an OTLP endpoint is configured, so self-hosted stdio
  never phones home and an unconfigured server stays silent.
- Wired at the operator/handler/auth boundaries: MCP request rate + latency,
  per-service/action Appwrite calls + errors, initializations (which agents
  connect), aggregated active users/clients, auth validation outcomes/reasons,
  write confirmations, search/docs usage, uploads, and build info.
- User ids (sub) are never used as labels; distinct counts are derived in
  process and exposed only as aggregate gauges.
- Config via standard OTEL_* env vars plus the cloud-style _APP_TELEMETRY_*
  aliases. Documented in AGENTS.md and compose.yaml.
- Unit tests use an in-memory metric reader (no collector required).
@greptile-apps

greptile-apps Bot commented Jun 29, 2026


Greptile Summary

This PR wires the hosted Appwrite MCP server into the shared OpenTelemetry stack by adding a new telemetry.py module and instrumenting the HTTP request, auth, operator, docs-search, and upload paths. Telemetry is a strict no-op on stdio or when no OTLP endpoint is configured, mirroring the PHP NoTelemetry adapter pattern.

  • telemetry.py: new module owning all OTel instruments, TTL-based active-user/client rolling-window sets, and exception-safe record_* helpers; initializes lazily in build_app only for the HTTP transport.
  • Instrumentation: server.py, operator.py, auth.py, docs_search.py, and http_app.py each get targeted telemetry calls with careful double-count avoidance (e.g., auth rejection counted once in _verify_sync, duration attached separately in verify_token).
  • Tests: new test_telemetry.py uses an in-memory metric reader to assert label values, write-block counts, session deduplication, and the disabled no-op path.
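The double-count avoidance mentioned above (outcome counted once in the inner verifier, duration attached by the outer wrapper) might look roughly like the following. This is a simplified sketch, not the code in auth.py: the `claims` dict, the `Counter` and list standing in for OTel instruments, and the function signatures are all assumptions.

```python
import time
from collections import Counter
from typing import List

# Stand-ins for an outcome counter and a duration histogram.
validations: Counter = Counter()
durations: List[float] = []


def _verify_sync(claims: dict, expected_issuer: str) -> bool:
    """Inner check: the only place an outcome is counted."""
    if claims.get("iss") != expected_issuer:
        validations[("rejected", "issuer_mismatch")] += 1
        return False
    validations[("ok", "")] += 1
    return True


def verify_token(claims: dict, expected_issuer: str) -> bool:
    """Outer wrapper: records duration only, never an outcome."""
    start = time.perf_counter()
    try:
        return _verify_sync(claims, expected_issuer)
    finally:
        durations.append(time.perf_counter() - start)
```

Keeping the counter in one function and the timer in the other means a rejected token produces exactly one counter increment and one duration sample.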

Confidence Score: 5/5

Safe to merge — telemetry is fully isolated behind the _enabled flag, all record_* helpers swallow their own exceptions, and the stdio path is untouched.

The change is additive and exception-safe throughout. The only findings are label-naming inconsistencies in the metrics (duplicate issuer_mismatch reason for two distinct auth failures, and a misleading description on mcp.clients.active) that have no runtime impact.

auth.py — two structurally different rejection paths share the same reason="issuer_mismatch" metric label, making them indistinguishable in dashboards.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/mcp_server_appwrite/telemetry.py | New telemetry module; instruments and helpers are well-structured and no-op when disabled; minor description mismatch on mcp.clients.active. |
| src/mcp_server_appwrite/auth.py | Auth telemetry correctly avoids double-counting; two distinct rejection paths share reason="issuer_mismatch", making them indistinguishable in metrics. |
| src/mcp_server_appwrite/server.py | Request-level and upload telemetry instrumented correctly; upload error reason codes (scheme, no_host, dns, etc.) are distinct and accurate. |
| src/mcp_server_appwrite/operator.py | Tool-call wrapper and write-confirmation telemetry added cleanly with proper try/finally pattern. |
| src/mcp_server_appwrite/docs_search.py | Race condition on _last_embedding_duration_s resolved by returning the duration as a tuple from _rank; telemetry call is in the happy path only. |
| src/mcp_server_appwrite/http_app.py | Telemetry init wired into build_app; missing-token case counted separately from verifier rejections to avoid double-counting. |
| tests/unit/test_telemetry.py | In-memory reader tests cover label correctness, write-block, session deduplication, and the disabled no-op path. |
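The docs_search.py fix noted in the table (returning the embedding duration from `_rank` rather than stashing it on the instance) can be sketched as below. Only the names `_rank` and `_last_embedding_duration_s` come from the PR; the class, the toy `_embed` function, and the `record` callback are hypothetical.

```python
import time
from typing import Callable, List, Tuple


class DocsSearch:
    """Toy search class; the embedding is a placeholder (text length)."""

    def _embed(self, text: str) -> float:
        return float(len(text))

    def _rank(self, query: str, docs: List[str]) -> Tuple[List[str], float]:
        # Time only the query embedding and hand the value back to the caller
        # instead of writing it to shared instance state.
        start = time.perf_counter()
        q = self._embed(query)
        embedding_s = time.perf_counter() - start
        ranked = sorted(docs, key=lambda d: abs(self._embed(d) - q))
        return ranked, embedding_s

    def search(
        self,
        query: str,
        docs: List[str],
        record: Callable[[float], None],
    ) -> List[str]:
        ranked, embedding_s = self._rank(query, docs)
        record(embedding_s)  # each request records its own duration
        return ranked
```

Because the duration travels with the return value, two concurrent requests cannot overwrite each other's measurement.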

Reviews (4): Last reviewed commit: "(docs): trim Telemetry section in AGENTS..."

Comment thread src/mcp_server_appwrite/server.py
Comment thread src/mcp_server_appwrite/docs_search.py Outdated
Comment thread src/mcp_server_appwrite/telemetry.py
Drop the speculative _APP_TELEMETRY_* aliases. When OTEL_EXPORTER_OTLP_HEADERS
is unset, build it from CF_ACCESS_CLIENT_ID + CF_ACCESS_CLIENT_SECRET so the
deployment reuses the shared telemetry-auth secret instead of a combined one.

The assets cluster runs Alloy in the telemetry namespace with an OTLP
receiver on :4318 that authenticates upstream and upserts the deployment.*
resource attributes. So the app just points at it: no CF-Access secret,
no header assembly, no OTEL_RESOURCE_ATTRIBUTES needed.

- upload SSRF guard: emit reason=no_host for missing-host (was conflated
  with reason=scheme)
- docs_search: return embedding duration from _rank instead of stashing it
  on the instance (removes a cross-request race on _last_embedding_duration_s)
- telemetry: only touch the active-user/client sets when enabled, so they
  can't grow unbounded when telemetry is off (pruning only runs via the
  gauge collection cycle, which is disabled then)
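The last fix concerns the in-process active-user/client sets. A minimal sketch of a TTL set that prunes only when the gauge is collected, and refuses writes when telemetry is off, could look like this. `ActiveSet`, its parameters, and the injectable clock are hypothetical, and the real module may differ.

```python
import threading
import time
from typing import Callable, Dict


class ActiveSet:
    """Distinct keys seen within the last ttl_s seconds (aggregate count only)."""

    def __init__(
        self,
        ttl_s: float,
        enabled: bool,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl_s = ttl_s
        self._enabled = enabled
        self._clock = clock
        self._last_seen: Dict[str, float] = {}
        self._lock = threading.Lock()

    def touch(self, key: str) -> None:
        # When disabled nothing ever calls count(), so nothing would prune;
        # skipping the write keeps the dict from growing without bound.
        if not self._enabled:
            return
        with self._lock:
            self._last_seen[key] = self._clock()

    def count(self) -> int:
        """Prune expired keys and return the live count (gauge callback)."""
        cutoff = self._clock() - self._ttl_s
        with self._lock:
            expired = [k for k, seen in self._last_seen.items() if seen < cutoff]
            for key in expired:
                del self._last_seen[key]
            return len(self._last_seen)
```

The key (for example a user id) stays inside the process; only the integer returned by `count()` would be exported, which matches the cardinality rule in the PR description.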
@ChiragAgg5k ChiragAgg5k merged commit 7e589fd into main Jun 29, 2026
7 checks passed