Add telemetry related calls (for pipeline monitoring) #648
nuclearcat merged 7 commits into kernelci:main
Conversation
Add TelemetryEvent model to the COLLECTIONS dict so the telemetry MongoDB collection is created at startup with TTL and compound indexes. Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
Add endpoint for pipeline services to submit telemetry events in bulk. Validates each event against the TelemetryEvent model and uses insert_many for efficient batch insertion. Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
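The bulk-insertion flow described in the commit above can be sketched roughly as follows. Note that `validate_batch` and `REQUIRED_FIELDS` are hypothetical names for illustration only; the actual endpoint validates each event against the full `TelemetryEvent` model from `kernelci.api.models`:

```python
# Hedged sketch of the bulk-submission flow: check each event dict
# for the required fields, then hand the whole batch to a single
# insert_many() call so insertion costs one round-trip, not N.
REQUIRED_FIELDS = {'kind', 'runtime'}  # assumed minimal schema


def validate_batch(payloads):
    """Return the validated events, or raise ValueError on the first
    event that is missing a required field."""
    for i, event in enumerate(payloads):
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"event {i} missing fields: {sorted(missing)}")
    return payloads

# Usage (db.insert_many is the new bulk method added by this PR):
# db.insert_many(validate_batch(events))
```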
Add paginated query endpoint for telemetry events with support for filtering by any field (kind, runtime, device_type, job_name, result, is_infra_error, etc.) plus time range via since/until parameters. Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
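A minimal sketch of how such a field-plus-time-range filter could be assembled (`build_filter` and the `timestamp` field name are assumptions for illustration, not code taken from the PR):

```python
from datetime import datetime


def build_filter(params, since=None, until=None):
    """Build a MongoDB query dict from optional field filters plus an
    ISO 8601 time range applied to an assumed 'timestamp' field."""
    # Keep only the filters the caller actually supplied.
    query = {key: value for key, value in params.items() if value is not None}
    ts_filter = {}
    if since:
        ts_filter['$gte'] = datetime.fromisoformat(since)
    if until:
        ts_filter['$lte'] = datetime.fromisoformat(until)
    if ts_filter:
        query['timestamp'] = ts_filter
    return query
```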
Add aggregation endpoint that groups telemetry events by configurable fields (runtime, device_type, job_name, tree, branch, arch, kind, error_type) and returns pass/fail/incomplete/skip/infra_error counts. Supports filtering by kind, runtime, device_type, job_name, tree, branch, arch, and time range (since/until) before aggregation. Also adds a generic db.aggregate() method for running MongoDB aggregation pipelines. Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
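The grouping described above maps naturally onto a MongoDB aggregation pipeline: an optional `$match` stage for the pre-aggregation filters, then a `$group` stage keyed on the requested fields with conditional counters. The sketch below is illustrative (`stats_pipeline` and the exact counter names are assumptions, not the PR's implementation):

```python
def stats_pipeline(group_by, match=None):
    """Build an aggregation pipeline that groups events by the given
    fields and counts results per group."""
    pipeline = []
    if match:
        pipeline.append({'$match': match})
    pipeline.append({'$group': {
        # Group key: one '$field' reference per requested field.
        '_id': {field: f'${field}' for field in group_by},
        'total': {'$sum': 1},
        # $cond emits 1 for matching documents, 0 otherwise, so
        # $sum yields a per-group count of each result value.
        'passed': {'$sum': {'$cond': [{'$eq': ['$result', 'pass']}, 1, 0]}},
        'failed': {'$sum': {'$cond': [{'$eq': ['$result', 'fail']}, 1, 0]}},
        'infra_errors': {'$sum': {'$cond': ['$is_infra_error', 1, 0]}},
    }})
    return pipeline

# The generic db.aggregate() method added by this PR would then run
# such a pipeline against the telemetry collection.
```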
Add on-demand anomaly detection endpoint that identifies: 1. Runtime+device_type combinations with high infra error or failure rates exceeding a configurable threshold 2. Runtimes with recurring submission/connectivity errors Parameters: window (1h-48h), threshold (0.0-1.0), min_total (minimum event count to avoid noise from small samples). Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
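The thresholded-rate rule described in the commit above might look roughly like this (`find_anomalies` and the group-document shape are assumptions for illustration; the real endpoint computes the per-group counts server-side via aggregation first):

```python
def find_anomalies(groups, threshold=0.3, min_total=10):
    """Flag runtime+device_type groups whose failure or infra-error
    rate over the time window exceeds the threshold; groups with
    fewer than min_total events are skipped to avoid noise from
    small samples."""
    anomalies = []
    for group in groups:
        total = group['total']
        if total < min_total:
            continue  # sample too small to trust the rate
        for rate_key, count_key in (('fail_rate', 'failed'),
                                    ('infra_error_rate', 'infra_errors')):
            rate = group[count_key] / total
            if rate > threshold:
                anomalies.append({**group['_id'], rate_key: round(rate, 3)})
    return anomalies
```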
Force-pushed from 6568cca to 7152a42.
…efactor get_telemetry_stats(). We add a new db.insert_many method instead. Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
Force-pushed from 13ffe2e to 31a5c15.
Pull request overview
This PR adds telemetry-related API endpoints for pipeline monitoring, as referenced in issue #1419. The changes introduce a separate telemetry collection to track pipeline execution events with higher volume and different query patterns than the existing EventHistory collection.
Changes:
- Added four new telemetry endpoints: POST /telemetry for bulk event insertion, GET /telemetry for querying events, GET /telemetry/stats for aggregated statistics, and GET /telemetry/anomalies for anomaly detection
- Added database support methods (insert_many, aggregate) to handle telemetry operations
- Integrated TelemetryEvent model from kernelci.api.models into the database collection mapping
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| api/main.py | Adds four telemetry endpoints with authentication, filtering, aggregation, and anomaly detection capabilities |
| api/db.py | Adds TelemetryEvent to collection mapping and implements insert_many and aggregate methods for bulk operations |
```python
    """Get aggregated telemetry statistics.

    This is rule-based anomaly detection using
    thresholded empirical rates computed over
    a sliding (rolling) time window.
    This is not a full anomaly detection system
    with baselines or machine learning, but at
    last something to start with.

    Query parameters:
    - group_by: Comma-separated fields to group by
      (runtime, device_type, job_name, tree, branch, arch,
      kind, error_type)
    - kind: Filter by event kind before aggregating
    - runtime: Filter by runtime name
    - since/until: Time range (ISO 8601)

    Returns grouped counts with pass/fail/incomplete/infra_error
    breakdowns for result-bearing events.
    """
```
Misleading or incorrect documentation. The docstring states "This is rule-based anomaly detection using thresholded empirical rates..." but this endpoint (get_telemetry_stats) only provides aggregated statistics without any anomaly detection logic. Anomaly detection is actually performed by the get_telemetry_anomalies endpoint. This docstring should describe the stats aggregation functionality, not anomaly detection.
```python
if since:
    ts_filter['$gte'] = datetime.fromisoformat(since)
if until:
    ts_filter['$lte'] = datetime.fromisoformat(until)
```
Missing error handling for datetime.fromisoformat(). If 'since' or 'until' parameters contain invalid ISO 8601 format strings, this will raise a ValueError that is not caught, resulting in an unhandled 500 error instead of a user-friendly 400 Bad Request. Wrap these calls in a try-except block and raise HTTPException with status 400.
Suggested change:

```python
try:
    if since:
        ts_filter['$gte'] = datetime.fromisoformat(since)
    if until:
        ts_filter['$lte'] = datetime.fromisoformat(until)
except ValueError as exc:
    raise HTTPException(
        status_code=status.HTTP_400_BAD_REQUEST,
        detail="Invalid 'since' or 'until' timestamp; "
               "expected ISO 8601 format.",
    ) from exc
```
Related to: kernelci/kernelci-pipeline#1419