Add telemetry related calls (for pipeline monitoring) #648

Merged
nuclearcat merged 7 commits into kernelci:main from nuclearcat:add-telemetry-models-api
Feb 19, 2026
Conversation

@nuclearcat nuclearcat commented Feb 13, 2026

Add TelemetryEvent model to the COLLECTIONS dict so the telemetry
MongoDB collection is created at startup with TTL and compound indexes.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
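
As an illustration of what "TTL and compound indexes" can look like, here is a minimal sketch using the `create_index` call shape from pymongo. The field names (`ts`, `kind`, `runtime`), the 14-day retention, and the helper `ensure_telemetry_indexes` are assumptions made for this example and are not taken from the PR. A recording stub stands in for the collection so the snippet runs without a database.

```python
from typing import Any, Dict, List, Tuple

ASCENDING, DESCENDING = 1, -1          # same numeric values pymongo uses
RETENTION_SECONDS = 14 * 24 * 3600     # assumed retention period, not from the PR


def ensure_telemetry_indexes(collection: Any) -> None:
    """Create a TTL index and one compound index on a telemetry collection."""
    # TTL index: MongoDB removes documents once `ts` is older than the limit.
    collection.create_index("ts", expireAfterSeconds=RETENTION_SECONDS)
    # Compound index for "filter by kind and runtime, newest first" queries.
    collection.create_index(
        [("kind", ASCENDING), ("runtime", ASCENDING), ("ts", DESCENDING)]
    )


class RecordingCollection:
    """Stub that records create_index calls instead of talking to MongoDB."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Any, Dict[str, Any]]] = []

    def create_index(self, keys: Any, **kwargs: Any) -> str:
        self.calls.append((keys, kwargs))
        return "stub-index"


stub = RecordingCollection()
ensure_telemetry_indexes(stub)
```

With a real pymongo collection the same two calls would be issued once at startup; creating an index that already exists with identical options is a no-op.
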
Add endpoint for pipeline services to submit telemetry events in bulk.
Validates each event against the TelemetryEvent model and uses
insert_many for efficient batch insertion.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
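
The validate-then-`insert_many` flow described above could be sketched as follows. This is a hedged illustration, not the PR's code: `TelemetryEventSketch` is a hypothetical dataclass standing in for the real `TelemetryEvent` model, and `FakeCollection`, `validate_batch`, and `submit_events` are names invented for the example.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass
class TelemetryEventSketch:
    """Hypothetical stand-in for TelemetryEvent (real model: kernelci.api.models)."""
    kind: str
    runtime: Optional[str] = None
    result: Optional[str] = None
    ts: Optional[datetime] = None

    def __post_init__(self) -> None:
        if not isinstance(self.kind, str) or not self.kind:
            raise ValueError("'kind' must be a non-empty string")
        if self.ts is None:
            self.ts = datetime.now(timezone.utc)


def validate_batch(raw_events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Validate every event first so a bad one rejects the whole batch."""
    docs = []
    for index, raw in enumerate(raw_events):
        try:
            docs.append(asdict(TelemetryEventSketch(**raw)))
        except (TypeError, ValueError) as exc:
            raise ValueError(f"event {index} is invalid: {exc}") from exc
    return docs


class FakeCollection:
    """In-memory stub exposing only insert_many."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, Any]] = []

    def insert_many(self, docs: List[Dict[str, Any]]) -> None:
        self.docs.extend(docs)


def submit_events(collection: Any, raw_events: List[Dict[str, Any]]) -> int:
    """Insert a validated batch with one call; return the number inserted."""
    docs = validate_batch(raw_events)
    if not docs:
        return 0  # pymongo rejects an empty document list, so skip the call
    collection.insert_many(docs)
    return len(docs)
```

Validating the full batch before touching the database keeps the insert all-or-nothing from the client's point of view; whether the actual endpoint behaves that way is not stated in the PR.
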
Add paginated query endpoint for telemetry events with support for
filtering by any field (kind, runtime, device_type, job_name, result,
is_infra_error, etc.) plus time range via since/until parameters.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
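
A filter of the kind described above (arbitrary field equality plus a since/until range) might be assembled like this. The sketch assumes the timestamp field is called `ts`; that name and the helper `build_telemetry_filter` are illustrative and not confirmed by the PR. The `$gte`/`$lte` handling mirrors the snippet quoted later in the review.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def build_telemetry_filter(
    params: Dict[str, Any],
    since: Optional[str] = None,
    until: Optional[str] = None,
    ts_field: str = "ts",  # assumed field name
) -> Dict[str, Any]:
    """Turn query parameters into a MongoDB-style filter document."""
    # Equality match on every parameter the caller actually supplied.
    query: Dict[str, Any] = {k: v for k, v in params.items() if v is not None}

    ts_filter: Dict[str, datetime] = {}
    try:
        if since:
            ts_filter["$gte"] = datetime.fromisoformat(since)
        if until:
            ts_filter["$lte"] = datetime.fromisoformat(until)
    except ValueError as exc:
        # An API layer would translate this into a 400 response.
        raise ValueError(
            "Invalid 'since' or 'until' timestamp; expected ISO 8601 format."
        ) from exc

    if ts_filter:
        query[ts_field] = ts_filter
    return query


example = build_telemetry_filter(
    {"kind": "job_result", "runtime": None},
    since="2026-02-13T00:00:00",
)
```

Pagination (limit and offset) would be applied to the cursor after this filter is built and is left out of the sketch.
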
Add aggregation endpoint that groups telemetry events by configurable
fields (runtime, device_type, job_name, tree, branch, arch, kind,
error_type) and returns pass/fail/incomplete/skip/infra_error counts.

Supports filtering by kind, runtime, device_type, job_name, tree,
branch, arch, and time range (since/until) before aggregation.

Also adds a generic db.aggregate() method for running MongoDB
aggregation pipelines.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
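
To make the grouping concrete, the following sketch builds a MongoDB aggregation pipeline with `$match`, `$group`, and `$sort` stages that counts results per group. It is an assumption-laden illustration: the helper `build_stats_pipeline`, the use of a `result` field with values pass/fail/incomplete/skip, and a boolean `is_infra_error` field are inferred from the commit messages, not copied from the implementation.

```python
from typing import Any, Dict, List, Optional

# Group-by fields listed in the commit message.
ALLOWED_GROUP_FIELDS = {
    "runtime", "device_type", "job_name", "tree",
    "branch", "arch", "kind", "error_type",
}
RESULTS = ("pass", "fail", "incomplete", "skip")


def build_stats_pipeline(
    group_by: str, match: Optional[Dict[str, Any]] = None
) -> List[Dict[str, Any]]:
    """Build a pipeline that counts events per group, broken down by result."""
    fields = [f.strip() for f in group_by.split(",") if f.strip()]
    unknown = [f for f in fields if f not in ALLOWED_GROUP_FIELDS]
    if not fields or unknown:
        raise ValueError(f"invalid group_by fields: {unknown or group_by!r}")

    group_stage: Dict[str, Any] = {
        "_id": {f: f"${f}" for f in fields},
        "total": {"$sum": 1},
    }
    # One conditional counter per result value.
    for result in RESULTS:
        group_stage[result] = {
            "$sum": {"$cond": [{"$eq": ["$result", result]}, 1, 0]}
        }
    group_stage["infra_error"] = {
        "$sum": {"$cond": [{"$eq": ["$is_infra_error", True]}, 1, 0]}
    }

    pipeline: List[Dict[str, Any]] = []
    if match:
        pipeline.append({"$match": match})  # filter before grouping
    pipeline.append({"$group": group_stage})
    pipeline.append({"$sort": {"total": -1}})
    return pipeline


pipeline = build_stats_pipeline("runtime, device_type", {"kind": "job_result"})
```

The resulting list is what a generic `db.aggregate()` wrapper would pass to the collection. Restricting `group_by` to a whitelist prevents callers from grouping on arbitrary document paths.
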
Add on-demand anomaly detection endpoint that identifies:
1. Runtime+device_type combinations with high infra error or failure
   rates exceeding a configurable threshold
2. Runtimes with recurring submission/connectivity errors

Parameters: window (1h-48h), threshold (0.0-1.0), min_total (minimum
event count to avoid noise from small samples).

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
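
The thresholded-rate rule for runtime+device_type combinations can be illustrated with a small pure-Python sketch. It operates on already-aggregated counts; the function name `find_rate_anomalies`, the input dictionary shape, and the default values are assumptions for the example, and the second rule (recurring submission/connectivity errors) is not covered here.

```python
from typing import Any, Dict, Iterable, List


def find_rate_anomalies(
    groups: Iterable[Dict[str, Any]],
    threshold: float = 0.5,   # assumed default
    min_total: int = 3,       # assumed default
) -> List[Dict[str, Any]]:
    """Flag groups whose infra-error or failure rate exceeds the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within 0.0-1.0")

    anomalies = []
    for group in groups:
        total = group.get("total", 0)
        # Skip small samples; also guards against division by zero.
        if total < max(min_total, 1):
            continue
        infra_rate = group.get("infra_error", 0) / total
        fail_rate = group.get("fail", 0) / total
        if infra_rate > threshold or fail_rate > threshold:
            anomalies.append({
                "runtime": group.get("runtime"),
                "device_type": group.get("device_type"),
                "total": total,
                "infra_rate": round(infra_rate, 3),
                "fail_rate": round(fail_rate, 3),
            })
    # Worst offenders first.
    anomalies.sort(
        key=lambda a: max(a["infra_rate"], a["fail_rate"]), reverse=True
    )
    return anomalies


sample = [
    {"runtime": "lava-a", "device_type": "qemu", "total": 10, "infra_error": 6, "fail": 1},
    {"runtime": "lava-b", "device_type": "rpi4", "total": 2, "infra_error": 2, "fail": 0},
    {"runtime": "lava-c", "device_type": "x86", "total": 20, "infra_error": 1, "fail": 2},
]
flagged = find_rate_anomalies(sample, threshold=0.5, min_total=3)
```

In this sample only the first group is flagged: the second has a 100% infra error rate but falls below `min_total`, which is the noise suppression the parameter is meant to provide.
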
Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
@nuclearcat nuclearcat force-pushed the add-telemetry-models-api branch from 6568cca to 7152a42 Compare February 13, 2026 08:15
…efactor get_telemetry_stats()

We add a new method, db.insert_many, instead.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
Copilot AI left a comment

Pull request overview

This PR adds telemetry-related API endpoints for pipeline monitoring, as referenced in issue #1419. The changes introduce a separate telemetry collection to track pipeline execution events with higher volume and different query patterns than the existing EventHistory collection.

Changes:

  • Added four new telemetry endpoints: POST /telemetry for bulk event insertion, GET /telemetry for querying events, GET /telemetry/stats for aggregated statistics, and GET /telemetry/anomalies for anomaly detection
  • Added database support methods (insert_many, aggregate) to handle telemetry operations
  • Integrated TelemetryEvent model from kernelci.api.models into the database collection mapping

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 16 comments.

| File | Description |
| --- | --- |
| api/main.py | Adds four telemetry endpoints with authentication, filtering, aggregation, and anomaly detection capabilities |
| api/db.py | Adds TelemetryEvent to collection mapping and implements insert_many and aggregate methods for bulk operations |

Comment on lines +1051 to +1070
"""Get aggregated telemetry statistics.

This is rule-based anomaly detection using
thresholded empirical rates computed over
a sliding (rolling) time window.
This is not a full anomaly detection system
with baselines or machine learning, but at
last something to start with.

Query parameters:
- group_by: Comma-separated fields to group by
(runtime, device_type, job_name, tree, branch, arch,
kind, error_type)
- kind: Filter by event kind before aggregating
- runtime: Filter by runtime name
- since/until: Time range (ISO 8601)

Returns grouped counts with pass/fail/incomplete/infra_error
breakdowns for result-bearing events.
"""
Copilot AI Feb 16, 2026

Misleading or incorrect documentation. The docstring states "This is rule-based anomaly detection using thresholded empirical rates..." but this endpoint (get_telemetry_stats) only provides aggregated statistics without any anomaly detection logic. Anomaly detection is actually performed by the get_telemetry_anomalies endpoint. This docstring should describe the stats aggregation functionality, not anomaly detection.

Comment on lines +1015 to +1018
if since:
    ts_filter['$gte'] = datetime.fromisoformat(since)
if until:
    ts_filter['$lte'] = datetime.fromisoformat(until)
Copilot AI Feb 16, 2026

Missing error handling for datetime.fromisoformat(). If 'since' or 'until' parameters contain invalid ISO 8601 format strings, this will raise a ValueError that is not caught, resulting in an unhandled 500 error instead of a user-friendly 400 Bad Request. Wrap these calls in a try-except block and raise HTTPException with status 400.

Suggested change
- if since:
-     ts_filter['$gte'] = datetime.fromisoformat(since)
- if until:
-     ts_filter['$lte'] = datetime.fromisoformat(until)
+ try:
+     if since:
+         ts_filter['$gte'] = datetime.fromisoformat(since)
+     if until:
+         ts_filter['$lte'] = datetime.fromisoformat(until)
+ except ValueError as exc:
+     raise HTTPException(
+         status_code=status.HTTP_400_BAD_REQUEST,
+         detail="Invalid 'since' or 'until' timestamp; expected ISO 8601 format.",
+     ) from exc

@nuclearcat (Member, Author) replied:

done

@nuclearcat nuclearcat added this pull request to the merge queue Feb 19, 2026
Merged via the queue into kernelci:main with commit ea6272f Feb 19, 2026
10 checks passed
@nuclearcat nuclearcat deleted the add-telemetry-models-api branch February 19, 2026 13:04