refactor(telemetry)!: move backend tracing onto plugin/hook pattern#1181
refactor(telemetry)!: move backend tracing onto plugin/hook pattern#1181ajbozarth wants to merge 1 commit into
Conversation
Phase 1 of the tracing epic: migrates backend span emission from inline calls scattered across five backends onto a BackendTracingPlugin that subscribes to the existing generation_* hooks for chat and new generation_batch_* hooks for raw. Spans stay live on the OTel context across the API call so nested instrumentation parents under them. - Renames MELLEA_TRACE_* env vars to plural MELLEA_TRACES_* and introduces MELLEA_TRACES_ENABLED umbrella flag, opt-in MELLEA_TRACES_OTLP, and signal-specific OTEL_EXPORTER_OTLP_TRACES_ENDPOINT. No deprecation shim. - Drops deprecated gen_ai.system attribute in favor of gen_ai.provider.name. - Prefixes application-span attributes with mellea. for consistency with backend spans and the existing logging/metrics pillars. - Eagerly initialises the tracer provider and registers the BackendTracingPlugin at module import when MELLEA_TRACES_ENABLED is truthy. Tests reset module state and call _setup_tracing() to re-init after env-var changes, removing the need for importlib.reload. - Adds generation_batch_pre_call/post_call/error hook types so raw-path emits one span per generate_from_raw call (matching OTel GenAI semconv) rather than one per MOT. - Deletes mellea/telemetry/backend_instrumentation.py; backends no longer import from mellea.telemetry (except .context). - Removes the _telemetry_span round-trip on mot._meta and the tracing-specific error block in core/base.py; error-path span closure lives in the plugin's generation_error hook. Closes generative-computing#1045, generative-computing#1046, generative-computing#1047 (Phase 1 of generative-computing#444). BREAKING CHANGE: Public telemetry API surface narrowed: - MELLEA_TRACE_* env vars renamed to plural MELLEA_TRACES_* with no deprecation shim. - The deprecated gen_ai.system span attribute is removed; consumers should read gen_ai.provider.name instead. - The split helpers is_application_tracing_enabled() and is_backend_tracing_enabled() are replaced by a single is_tracing_enabled(). - The application-tracing helpers add_span_event, start_backend_span, end_backend_span, and trace_backend are removed from mellea.telemetry. Backend spans are now emitted by the BackendTracingPlugin automatically; application spans use trace_application (unchanged). - The mellea.telemetry.backend_instrumentation module is deleted along with its exports (start_generate_span, instrument_generate_from_raw, record_token_usage, record_response_metadata, finalize_backend_span). Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
|
Follow-up work spawned by this PR (not part of the tracing epic):
|
|
I also opened #1180 which fixes a bug in metrics that also applies here. (ie that needs to merge first) |
jakelorocco
left a comment
There was a problem hiding this comment.
looks good! I just have a few questions and a few places I think extra comments would be helpful.
| try: | ||
| # Chat path is single-action; sequences[0] is the only sequence. | ||
| last_token = hf_output.sequences[0][-1].item() | ||
| eos = self._tokenizer.eos_token_id |
There was a problem hiding this comment.
Technically don't we allow the user to specify stop sequences as well through model options? Should we be grabbing those as well?
| self._start: datetime.datetime | None = None | ||
| self._first_chunk_received: bool = False | ||
| self._generate_log: GenerateLog | None = None | ||
| self._generation_id: str | None = None |
There was a problem hiding this comment.
Can you please add a comment here that this is different than the response_id in GenerationMetadata? I think they are similar enough that we should add commentary.
| generation_id: Correlation identifier matching the corresponding | ||
| pre_call payload's `generation_id` for the same request, or | ||
| `None` when the firing site did not generate one. |
There was a problem hiding this comment.
Could you also add an explanation about generation_id being the Mellea identifier (and not the provider identifier) here as well please?
| @hook("generation_pre_call") | ||
| async def on_pre_call( |
There was a problem hiding this comment.
Can you please add a note explaining why pre_call isn't fire and forget but the others are?
| finish_backend_span_success( | ||
| payload.generation_id, operation="chat", usage=gen.usage, mot=mot, gen=gen | ||
| ) |
There was a problem hiding this comment.
What happens if someone only registers the start_backend_span function and these finish span functions never run?
Pull Request
Issue
Fixes #1045, fixes #1046, fixes #1047. Phase 1 of #444. Also closes out item 2 of #909.
Description
Phase 1 of the tracing epic. Migrates backend span emission from inline calls scattered across five backends onto a
BackendTracingPluginthat subscribes to the existinggeneration_*hooks (chat path) and newgeneration_batch_*hooks (raw path). Spans stay live on the OTel context across the API call so nested instrumentation (httpx, langchain) parents under the backend span.Reviewer call-outs:
MELLEA_TRACE_*→MELLEA_TRACES_*(plural), aligned withOTEL_EXPORTER_OTLP_TRACES_ENDPOINTandMELLEA_METRICS_*. AddsMELLEA_TRACES_ENABLEDumbrella flag and opt-inMELLEA_TRACES_OTLP. No deprecation shim — tracing was still pre-1.0 surface.gen_ai.systemremoved in favor ofgen_ai.provider.name. Cleans up the dual emission introduced by feat(telemetry): close five OTel GenAI semantic convention emission gaps (#1035) #1036.generation_batch_pre_call/generation_batch_post_call/generation_batch_errorfor plugin authors. Raw path now emits one span pergenerate_from_rawcall (per OTel GenAI semconv), not one per MOT.ModelOutputThunk.cancel_generation(added upstream by feat(stdlib): add stream_with_chunking() with per-chunk validation (#901) #942 in9e8a9636) had its body rewritten to fireGENERATION_ERRORinstead of calling the removed_telemetry_spanAPI. Reviewers may notice as "extra surface", but it's rebase-driven, not refactor-driven.BREAKING CHANGE: Public telemetry API surface narrowed:
MELLEA_TRACE_*env vars renamed to pluralMELLEA_TRACES_*with no deprecation shim.gen_ai.systemspan attribute is removed; consumers should readgen_ai.provider.nameinstead.is_application_tracing_enabled()andis_backend_tracing_enabled()are replaced by a singleis_tracing_enabled().add_span_event,start_backend_span,end_backend_span, andtrace_backendare removed frommellea.telemetry. Backend spans are now emitted by theBackendTracingPluginautomatically; application spans usetrace_application(unchanged).mellea.telemetry.backend_instrumentationmodule is deleted along with its exports (start_generate_span,instrument_generate_from_raw,record_token_usage,record_response_metadata,finalize_backend_span).Testing
Attribution
Adding a new component, requirement, sampling strategy, or tool?
If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.
NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.