
Standardized metrics export - feature flagged #409

Open
chrishagglund-ship-it wants to merge 19 commits into main from gated-metrics-standardization


chrishagglund-ship-it commented May 4, 2026

Provides a cross-SDK standardized metrics output format, enabled by setting WORKER_CANONICAL_METRICS=true. See the standardized metrics catalog for the proposed set of standard metrics.

The motivation is to provide a consistent metrics output surface across all the Conductor SDKs.


codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 71.93277% with 167 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...nductor/client/telemetry/metrics_collector_base.py | 66.29% | 90 Missing ⚠️ |
| .../client/configuration/settings/metrics_settings.py | 30.30% | 23 Missing ⚠️ |
| ...rc/conductor/client/orkes/orkes_workflow_client.py | 28.00% | 18 Missing ⚠️ |
| ...or/client/telemetry/canonical_metrics_collector.py | 78.68% | 13 Missing ⚠️ |
| ...rc/conductor/client/automator/async_task_runner.py | 45.45% | 6 Missing ⚠️ |
| ...uctor/client/telemetry/legacy_metrics_collector.py | 94.39% | 6 Missing ⚠️ |
| src/conductor/client/automator/task_runner.py | 63.63% | 4 Missing ⚠️ |
| src/conductor/client/telemetry/metrics_factory.py | 91.66% | 3 Missing ⚠️ |
| src/conductor/client/automator/task_handler.py | 83.33% | 2 Missing ⚠️ |
| src/conductor/client/http/async_api_client.py | 0.00% | 1 Missing ⚠️ |

... and 1 more
| Files with missing lines | Coverage Δ |
|---|---|
| src/conductor/client/http/api_client.py | 97.27% <100.00%> (+<0.01%) ⬆️ |
| src/conductor/client/orkes/orkes_base_client.py | 100.00% <100.00%> (ø) |
| src/conductor/client/orkes_clients.py | 91.48% <100.00%> (+0.18%) ⬆️ |
| ...tor/client/telemetry/model/metric_documentation.py | 100.00% <100.00%> (ø) |
| ...c/conductor/client/telemetry/model/metric_label.py | 100.00% <100.00%> (ø) |
| ...rc/conductor/client/telemetry/model/metric_name.py | 100.00% <100.00%> (ø) |
| ...ctor/client/workflow/executor/workflow_executor.py | 42.30% <100.00%> (ø) |
| src/conductor/client/http/async_api_client.py | 13.65% <0.00%> (-0.04%) ⬇️ |
| ...rc/conductor/client/telemetry/metrics_collector.py | 85.71% <85.71%> (+21.33%) ⬆️ |
| src/conductor/client/automator/task_handler.py | 79.34% <83.33%> (+0.07%) ⬆️ |

... and 8 more


nthmost-orkes left a comment


Claude Sonnet code review — automated survey via Claude Code

PR #409 Review: Standardized Metrics Export (Feature-Flagged)

Author: chrishagglund-ship-it | Size: +2387/-1751 across 32 files

Overview

The PR introduces a dual-surface metrics system behind WORKER_CANONICAL_METRICS=true. Core changes:

  • metrics_collector.py (927 lines → 38-line compatibility shim) re-exports LegacyMetricsCollector as MetricsCollector — backward-compatible for all existing import paths
  • New MetricsCollectorBase ABC consolidates Prometheus infrastructure; LegacyMetricsCollector preserves old quantile-gauge shape; CanonicalMetricsCollector uses real Prometheus Histogram objects
  • metrics_factory.py reads the env var, stamps a _subdir on settings, creates the directory, and returns the right subclass
  • MetricsSettings gains clean_directory, clean_dead_pids, and a computed metrics_directory
  • task_runner.py / async_task_runner.py get record_task_update_time() instrumentation (new call sites)
  • api_client.py / async_api_client.py capture URI template before path-param substitution so canonical histograms record /tasks/{taskType} not /tasks/my-task-123
  • Fork ordering in harness/main.py is fixed (workers fork before governor thread — correct for prometheus_client mutex safety)
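
As a rough illustration of the flag handling summarized in the list above, the sketch below parses the environment variable and picks a collector class. The empty class bodies, the helper names `is_canonical_enabled` and `pick_collector_class`, and the exact truthy set (`true`, `1`, `yes`, compared case-insensitively) are assumptions made for illustration; the PR's factory may differ.

```python
import os

# Hypothetical stand-ins for the PR's collector classes.
class LegacyMetricsCollector:
    pass


class CanonicalMetricsCollector:
    pass


# Assumed truthy spellings; this review mentions true, 1, and yes.
_TRUTHY = {"true", "1", "yes"}


def is_canonical_enabled(environ=None):
    """Return True when WORKER_CANONICAL_METRICS holds a truthy value."""
    environ = os.environ if environ is None else environ
    raw = environ.get("WORKER_CANONICAL_METRICS", "")
    return raw.strip().lower() in _TRUTHY


def pick_collector_class(environ=None):
    """Select the collector subclass based on the feature flag."""
    if is_canonical_enabled(environ):
        return CanonicalMetricsCollector
    return LegacyMetricsCollector
```

Passing the environment as a parameter keeps the selection logic testable without patching `os.environ`.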

The overall design is clean. The base/subclass split is logical, the factory is minimal, and the shim pattern is well-understood.


Must-Fix Issues

1. _cleaned_directories module-level set — breaks test isolation and suppresses cleanup on second call

metrics_factory.py uses a module-level set to guard idempotent cleanup. tearDown in test_metrics_factory.py never clears it, so any test that runs after a test that already populated the set will silently skip cleanup. Also, in any long-running process that reuses a directory path after deletion and recreation, cleanup is never retried.

Fix: Clear _cleaned_directories in tearDown, or accept a force_cleanup: bool bypass parameter.

2. _clean_dead_pid_files catches OSError instead of ProcessLookupError

try:
    os.kill(pid, 0)
except OSError:   # catches PermissionError too
    os.remove(path)

PermissionError is a subclass of OSError and fires when the PID exists but is owned by another user. This would delete .db files from live processes owned by other users.

Fix: Catch ProcessLookupError specifically.
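
A minimal sketch of the narrower check, assuming a POSIX platform where `os.kill(pid, 0)` only probes for existence. The helper names `pid_is_dead` and `remove_if_dead` are hypothetical and not taken from the PR.

```python
import os


def pid_is_dead(pid):
    """Return True only when the OS reports that no such process exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return True
    except PermissionError:
        # The process exists but belongs to another user: treat it as alive.
        return False
    return False


def remove_if_dead(path, pid):
    """Delete the metrics file at path only if its owning process is gone."""
    if pid_is_dead(pid):
        os.remove(path)
        return True
    return False
```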

3. measure_workflow_input_payload_size re-serializes full workflow input on every start_workflow call

encoded = json.dumps(workflow_input or {}, default=str)
size_bytes = len(encoded.encode("utf-8"))

This is in the hot path for every workflow start in canonical mode. For large payloads it's expensive (double allocation) and inaccurate — default=str silently coerces non-serializable types, changing the apparent size vs. the actual wire payload.
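
One possible direction, sketched under the assumption that the client can reach the request body it is about to send: serialize once and measure those bytes. Both function names below are hypothetical. The first reproduces the measurement quoted above for comparison; the second encodes without `default=str`, so unsupported types raise instead of being silently coerced.

```python
import json


def size_with_default_str(payload):
    """Measurement as quoted above: unsupported types are coerced via str()."""
    return len(json.dumps(payload or {}, default=str).encode("utf-8"))


def encode_once(payload):
    """Serialize a payload once and return (body_bytes, size_in_bytes).

    The same bytes can be sent on the wire and reported to the collector,
    so the recorded size matches what is actually transmitted.
    """
    body = json.dumps(payload or {}).encode("utf-8")
    return body, len(body)
```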

4. Missing type annotations for metrics_collector parameters

OrkesBaseClient, OrkesClients, OrkesWorkflowClient, and WorkflowExecutor accept metrics_collector=None without type annotations. Should be Optional[MetricsCollectorBase] throughout.

5. Test gaps

  • yes is documented as a truthy value for WORKER_CANONICAL_METRICS but has no test (only true and 1 are tested)
  • clean_directory=True and clean_dead_pids=True behavior not directly tested
  • _clean_dead_pid_files PermissionError path not tested (a unittest.mock.patch('os.kill', side_effect=PermissionError) would catch the bug in item 2)
  • metric_uri pass-through in api_client.py/async_api_client.py has no unit test verifying the template URI reaches the collector
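
The last gap in the list above could be covered with a test along these lines. `call_api` is a hypothetical reduction of the client, and the `method` and `uri` keyword arguments are assumed, since the review elides the real signature of `record_api_request_time`.

```python
from unittest import mock


def call_api(resource_path, path_params, collector):
    """Hypothetical client call: keep the template, then substitute params."""
    metric_uri = resource_path  # captured before path-param substitution
    url = resource_path
    for name, value in path_params.items():
        url = url.replace("{" + name + "}", str(value))
    # The real request would be sent to `url` here.
    collector.record_api_request_time(method="GET", uri=url, metric_uri=metric_uri)
    return url


def test_template_uri_reaches_collector():
    collector = mock.Mock()
    url = call_api("/tasks/{taskType}", {"taskType": "my-task-123"}, collector)
    assert url == "/tasks/my-task-123"
    collector.record_api_request_time.assert_called_once_with(
        method="GET", uri="/tasks/my-task-123", metric_uri="/tasks/{taskType}"
    )
```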

Should-Fix Issues

6. metrics_directory is silently wrong if read before factory initialization

MetricsSettings.metrics_directory uses if self._subdir: (truthiness) but _subdir starts as None. If a user reads metrics_directory before calling create_metrics_collector, they get the base directory regardless of mode. Add an assertion or document the ordering constraint.
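
A sketch of the fail-fast option, using a cut-down `MetricsSettings` that keeps only the fields discussed here. The constructor signature, the error message, and the convention that an empty `_subdir` means "use the base directory" are assumptions.

```python
import os
from typing import Optional


class MetricsSettings:
    """Hypothetical reduction of the settings class to the relevant fields."""

    def __init__(self, directory: str):
        self.directory = directory
        self._subdir: Optional[str] = None  # resolved later by the factory

    @property
    def metrics_directory(self) -> str:
        if self._subdir is None:
            # Fail loudly instead of returning the base directory by accident.
            raise RuntimeError(
                "metrics_directory was read before the factory resolved the metrics mode"
            )
        if self._subdir == "":
            return self.directory
        return os.path.join(self.directory, self._subdir)
```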

7. MetricsSettings._subdir mutation is a design smell

The settings object looks like an immutable config value object but gets mutated by the factory. This creates an implicit ordering constraint and makes metrics_directory non-idempotent across the object's lifetime. Cleaner: compute the directory eagerly at factory time and pass it in as a parameter, or freeze it at construction.
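
The "freeze it at construction" alternative might look like the sketch below. The subdirectory names `canonical` and `legacy`, the public `subdir` field, and `resolve_settings` are illustrative assumptions rather than the PR's actual layout.

```python
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MetricsSettings:
    """Hypothetical immutable settings value object."""

    directory: str
    subdir: str = ""

    @property
    def metrics_directory(self) -> str:
        if self.subdir:
            return os.path.join(self.directory, self.subdir)
        return self.directory


def resolve_settings(base: MetricsSettings, canonical: bool) -> MetricsSettings:
    """Return a new settings object instead of mutating the caller's instance."""
    return replace(base, subdir="canonical" if canonical else "legacy")
```

With this shape the factory never writes to the caller's object, so `metrics_directory` returns the same value for the lifetime of each instance.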

8. HTTP metrics server binding to all interfaces undocumented

The server binds to '' (0.0.0.0). Library users who pass http_port without understanding this could unintentionally expose metrics externally. Document this in the configuration docs.
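
To make the exposure concrete, the standard-library sketch below shows what an empty host string resolves to for an IPv4 socket. It does not use the SDK or prometheus_client; `bound_address` is a hypothetical helper. prometheus_client's `start_http_server` accepts an `addr` argument, so defaulting to a loopback address is one option the maintainers could weigh.

```python
import socket


def bound_address(host):
    """Bind a throwaway TCP socket to (host, ephemeral port) and report its address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[0]


# bound_address("") binds to every IPv4 interface,
# while bound_address("127.0.0.1") stays on the loopback interface.
```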

9. record_api_request_time type annotation wrong throughout

def record_api_request_time(self, ..., metric_uri: str = None) -> None:

Should be Optional[str] = None in MetricsCollectorBase, LegacyMetricsCollector, and CanonicalMetricsCollector.
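
A sketch of the corrected annotation follows. The review elides the leading parameters, so `method`, `uri`, `status`, and `time_spent` below are placeholders, as is the `last_label` attribute used to make the fallback visible.

```python
from typing import Optional


class MetricsCollectorBase:
    """Hypothetical reduction of the base class to the one method discussed."""

    def record_api_request_time(
        self,
        method: str,
        uri: str,
        status: str,
        time_spent: float,
        metric_uri: Optional[str] = None,
    ) -> None:
        # Fall back to the concrete URI when no template was supplied.
        self.last_label = metric_uri if metric_uri is not None else uri
```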

10. NoPidCollector uses last-value semantics for non-quantile, non-histogram gauges

For multi-process task_poll_time / task_execute_time in all mode, data['values'][-1] is filesystem-ordering-dependent. Should be livemax or aggregated correctly. This is pre-existing behavior now more visible in the new class.
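
The ordering problem can be shown without any Prometheus machinery. In this sketch `values` stands for the per-process samples of one gauge in whatever order the files happened to be read; the function names are hypothetical.

```python
def last_value(values):
    """Current behaviour as described above: the result depends on read order."""
    return values[-1]


def max_value(values):
    """Order-independent alternative, in the spirit of a max-style aggregation."""
    return max(values)


samples = [3.0, 9.0, 5.0]              # one sample per worker process
reordered = list(reversed(samples))    # the same samples, read in another order
```

`last_value` returns different results for `samples` and `reordered`, whereas `max_value` returns the same number for both.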


Minor Notes

  • Performance hot path: LegacyMetricsCollector._record_quantiles sorts the full 1000-element deque on every metric write — O(n log n) inside a lock. sum(observations) also runs O(n) on each call. Pre-existing, but now canonically visible.
  • TOCTOU in resolve_metrics_type: No lock around the _subdir is None check. Two threads calling create_metrics_collector(same_settings) concurrently could both set _subdir. In practice they'd set it to the same value, so no correctness bug, but worth noting.
  • Harness YAML hardcodes WORKER_CANONICAL_METRICS=true: Fine for CI, but means the harness can't easily test legacy mode in a deployed environment.
  • WorkflowStatusProbe has no tests — acceptable for a harness helper but should be noted.
  • Python version: Module-level __getattr__ (PEP 562) requires Python 3.7+. Confirm the SDK's minimum version claim covers this.
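
For the first note in the list above, the O(n) `sum(observations)` can be avoided by maintaining a running total alongside the bounded deque. `RollingWindow` is a hypothetical class, not part of the SDK, and it leaves out both the quantile sort and the lock that a real collector would still need.

```python
from collections import deque


class RollingWindow:
    """Bounded window of observations with an O(1) running total."""

    def __init__(self, maxlen=1000):
        self._values = deque(maxlen=maxlen)
        self._total = 0.0

    def add(self, value):
        if len(self._values) == self._values.maxlen:
            # The oldest observation is about to be evicted by append().
            self._total -= self._values[0]
        self._values.append(value)
        self._total += value

    @property
    def total(self):
        return self._total

    @property
    def count(self):
        return len(self._values)
```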

Overall Assessment

The design is sound and the backward-compatibility story is solid. The shim pattern for metrics_collector.py is the right call. The main risks are the _cleaned_directories / test isolation issue (#1), the PermissionError bug in dead-PID cleanup (#2), and the hot-path JSON re-serialization (#3). Those three should be addressed before merge. The type annotation and test gap issues are polish but meaningful for maintainability.
