Standardized metrics export - feature flagged by chrishagglund-ship-it · Pull Request #409 · conductor-oss/python-sdk

chrishagglund-ship-it · 2026-05-04T17:34:48Z

Provides a cross-SDK standardized metrics output format, enabled by setting WORKER_CANONICAL_METRICS=true. See the standardized metrics catalog for the proposed standard metrics.

The motivation is to provide a consistent metrics output surface across all the Conductor SDKs.

codecov · 2026-05-04T17:37:10Z

Codecov Report

❌ Patch coverage is 71.93277% with 167 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...nductor/client/telemetry/metrics_collector_base.py	66.29%	90 Missing ⚠️
.../client/configuration/settings/metrics_settings.py	30.30%	23 Missing ⚠️
...rc/conductor/client/orkes/orkes_workflow_client.py	28.00%	18 Missing ⚠️
...or/client/telemetry/canonical_metrics_collector.py	78.68%	13 Missing ⚠️
...rc/conductor/client/automator/async_task_runner.py	45.45%	6 Missing ⚠️
...uctor/client/telemetry/legacy_metrics_collector.py	94.39%	6 Missing ⚠️
src/conductor/client/automator/task_runner.py	63.63%	4 Missing ⚠️
src/conductor/client/telemetry/metrics_factory.py	91.66%	3 Missing ⚠️
src/conductor/client/automator/task_handler.py	83.33%	2 Missing ⚠️
src/conductor/client/http/async_api_client.py	0.00%	1 Missing ⚠️
... and 1 more

Files with missing lines	Coverage Δ
src/conductor/client/http/api_client.py	`97.27% <100.00%> (+<0.01%)`	⬆️
src/conductor/client/orkes/orkes_base_client.py	`100.00% <100.00%> (ø)`
src/conductor/client/orkes_clients.py	`91.48% <100.00%> (+0.18%)`	⬆️
...tor/client/telemetry/model/metric_documentation.py	`100.00% <100.00%> (ø)`
...c/conductor/client/telemetry/model/metric_label.py	`100.00% <100.00%> (ø)`
...rc/conductor/client/telemetry/model/metric_name.py	`100.00% <100.00%> (ø)`
...ctor/client/workflow/executor/workflow_executor.py	`42.30% <100.00%> (ø)`
src/conductor/client/http/async_api_client.py	`13.65% <0.00%> (-0.04%)`	⬇️
...rc/conductor/client/telemetry/metrics_collector.py	`85.71% <85.71%> (+21.33%)`	⬆️
src/conductor/client/automator/task_handler.py	`79.34% <83.33%> (+0.07%)`	⬆️
... and 8 more

nthmost-orkes

Claude Sonnet code review — automated survey via Claude Code

PR #409 Review: Standardized Metrics Export (Feature-Flagged)

Author: chrishagglund-ship-it | Size: +2387/-1751 across 32 files

Overview

The PR introduces a dual-surface metrics system behind WORKER_CANONICAL_METRICS=true. Core changes:

metrics_collector.py (927 lines → 38-line compatibility shim) re-exports LegacyMetricsCollector as MetricsCollector — backward-compatible for all existing import paths
New MetricsCollectorBase ABC consolidates Prometheus infrastructure; LegacyMetricsCollector preserves old quantile-gauge shape; CanonicalMetricsCollector uses real Prometheus Histogram objects
metrics_factory.py reads the env var, stamps a _subdir on settings, creates the directory, and returns the right subclass
MetricsSettings gains clean_directory, clean_dead_pids, and a computed metrics_directory
task_runner.py / async_task_runner.py get record_task_update_time() instrumentation (new call sites)
api_client.py / async_api_client.py capture URI template before path-param substitution so canonical histograms record /tasks/{taskType} not /tasks/my-task-123
Fork ordering in harness/main.py is fixed (workers fork before governor thread — correct for prometheus_client mutex safety)

The overall design is clean. The base/subclass split is logical, the factory is minimal, and the shim pattern is well-understood.
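The URI-template capture mentioned in the overview can be illustrated with a minimal sketch. This is not the SDK's code: `build_request` and its return shape are hypothetical names chosen for illustration. Only the idea comes from the review: keep the template for the metric label and substitute path parameters only for the real request.

```python
from urllib.parse import quote


def build_request(resource_path: str, path_params: dict) -> tuple:
    """Return (request_url, metric_uri) for a templated path.

    Hypothetical helper: the name and signature are assumptions.
    """
    # Capture the template first so the metric label stays low-cardinality.
    metric_uri = resource_path
    url = resource_path
    for name, value in path_params.items():
        # Substitute each {name} placeholder with its URL-encoded value.
        url = url.replace("{" + name + "}", quote(str(value), safe=""))
    return url, metric_uri


url, metric_uri = build_request("/tasks/{taskType}", {"taskType": "my-task-123"})
print(url)         # /tasks/my-task-123
print(metric_uri)  # /tasks/{taskType}
```

Labeling by template keeps the number of distinct label values bounded by the number of endpoints rather than by the number of task names.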

Must-Fix Issues

1. _cleaned_directories module-level set — breaks test isolation and suppresses cleanup on second call

metrics_factory.py uses a module-level set to guard idempotent cleanup. tearDown in test_metrics_factory.py never clears it, so any test that runs after a test that already populated the set will silently skip cleanup. Also, in any long-running process that reuses a directory path after deletion and recreation, cleanup is never retried.

Fix: Clear _cleaned_directories in tearDown, or accept a force_cleanup: bool bypass parameter.

2. _clean_dead_pid_files catches OSError instead of ProcessLookupError

try:
    os.kill(pid, 0)
except OSError:   # catches PermissionError too
    os.remove(path)

PermissionError is a subclass of OSError and fires when the PID exists but is owned by another user. This would delete .db files from live processes owned by other users.

Fix: Catch ProcessLookupError specifically.
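A minimal sketch of the narrower exception handling, assuming POSIX semantics for `os.kill(pid, 0)`. The function names `pid_is_alive` and `remove_if_owner_dead` are hypothetical and do not come from the PR.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Best-effort liveness probe using signal 0 (POSIX semantics assumed)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # No such process: the PID is dead.
        return False
    except PermissionError:
        # The process exists but belongs to another user: treat it as alive.
        return True
    return True


def remove_if_owner_dead(path: str, pid: int) -> bool:
    """Delete `path` only when its owning PID is confirmed dead."""
    if pid_is_alive(pid):
        return False
    os.remove(path)
    return True


print(pid_is_alive(os.getpid()))  # True
```

Handling PermissionError explicitly makes the "exists but not ours" case visible in the code instead of relying on it propagating.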

3. measure_workflow_input_payload_size re-serializes full workflow input on every start_workflow call

encoded = json.dumps(workflow_input or {}, default=str)
size_bytes = len(encoded.encode("utf-8"))

This is in the hot path for every workflow start in canonical mode. For large payloads it's expensive (double allocation) and inaccurate — default=str silently coerces non-serializable types, changing the apparent size vs. the actual wire payload.
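The size discrepancy caused by `default=str` can be shown with a small, self-contained example. `payload_size_bytes` is a hypothetical helper, and the "number" encoding of a Decimal is an assumption about what a different wire encoder might emit, not a statement about the SDK's actual serializer.

```python
import json
from decimal import Decimal


def payload_size_bytes(obj, lenient: bool = True) -> int:
    """Size of the UTF-8 JSON encoding; lenient mode coerces unknown types via str()."""
    encoded = json.dumps(obj, default=str if lenient else None)
    return len(encoded.encode("utf-8"))


payload = {"amount": Decimal("1.5")}

# Lenient: the Decimal becomes the JSON string "1.5" -> {"amount": "1.5"}
print(payload_size_bytes(payload))            # 17

# The same value encoded as a JSON number -> {"amount": 1.5}
print(payload_size_bytes({"amount": 1.5}))    # 15

# Strict mode surfaces the non-serializable type instead of hiding it.
try:
    payload_size_bytes(payload, lenient=False)
except TypeError as exc:
    print("not serializable:", exc)
```

The two-byte difference is small here, but it shows that the measured size depends on the coercion rule rather than on the bytes actually sent.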

4. Missing type annotations for metrics_collector parameters

OrkesBaseClient, OrkesClients, OrkesWorkflowClient, and WorkflowExecutor accept metrics_collector=None without type annotations. Should be Optional[MetricsCollectorBase] throughout.
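The annotation pattern looks roughly like this. `MetricsCollectorBase` below is a stand-in with one illustrative method, and `ExampleClient` is a hypothetical class. Neither reflects the real constructors in the PR.

```python
from abc import ABC, abstractmethod
from typing import Optional


class MetricsCollectorBase(ABC):
    """Stand-in for the PR's base class; this single method is illustrative only."""

    @abstractmethod
    def increment(self, name: str) -> None:
        ...


class ExampleClient:
    """Hypothetical client; only the annotation pattern is the point."""

    def __init__(self, metrics_collector: Optional[MetricsCollectorBase] = None) -> None:
        self.metrics_collector = metrics_collector

    def do_work(self) -> None:
        # Metrics stay optional: callers that pass nothing pay no cost.
        if self.metrics_collector is not None:
            self.metrics_collector.increment("work_done")


class CountingCollector(MetricsCollectorBase):
    """Tiny concrete collector used to exercise the sketch."""

    def __init__(self) -> None:
        self.counts = {}

    def increment(self, name: str) -> None:
        self.counts[name] = self.counts.get(name, 0) + 1


collector = CountingCollector()
ExampleClient(collector).do_work()
ExampleClient().do_work()  # no collector, no error
print(collector.counts)    # {'work_done': 1}
```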

5. Test gaps

yes is documented as a truthy value for WORKER_CANONICAL_METRICS but has no test (only true and 1 are tested)
clean_directory=True and clean_dead_pids=True behavior not directly tested
_clean_dead_pid_files PermissionError path not tested (a unittest.mock.patch('os.kill', side_effect=PermissionError) would catch the bug in item 2)
metric_uri pass-through in api_client.py/async_api_client.py has no unit test verifying the template URI reaches the collector
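As one example of closing these gaps, a test for the `yes` value might look like the sketch below. `canonical_metrics_enabled` is a hypothetical parser written for this example. Whether the real implementation trims whitespace or ignores case is an assumption, and only the three truthy values (`true`, `1`, `yes`) come from the review.

```python
import os
import unittest
from unittest import mock


def canonical_metrics_enabled() -> bool:
    """Hypothetical flag parser; truthy values taken from the review."""
    value = os.environ.get("WORKER_CANONICAL_METRICS", "")
    return value.strip().lower() in {"true", "1", "yes"}


class CanonicalFlagTest(unittest.TestCase):
    def test_truthy_values(self):
        for raw in ("true", "1", "yes"):
            with self.subTest(raw=raw):
                # patch.dict restores the environment when the block exits.
                with mock.patch.dict(os.environ, {"WORKER_CANONICAL_METRICS": raw}):
                    self.assertTrue(canonical_metrics_enabled())

    def test_unset_is_false(self):
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertFalse(canonical_metrics_enabled())
```

Using `subTest` reports each truthy value separately, so a missing `yes` branch would fail on its own line instead of being hidden behind the first passing value.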

Should-Fix Issues

6. metrics_directory is silently wrong if read before factory initialization

MetricsSettings.metrics_directory uses if self._subdir: (truthiness) but _subdir starts as None. If a user reads metrics_directory before calling create_metrics_collector, they get the base directory regardless of mode. Add an assertion or document the ordering constraint.

7. MetricsSettings._subdir mutation is a design smell

The settings object looks like an immutable config value object but gets mutated by the factory. This creates an implicit ordering constraint and makes metrics_directory non-idempotent across the object's lifetime. Cleaner: compute the directory eagerly at factory time and pass it in as a parameter, or freeze it at construction.
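One way to realize the "freeze it at construction" option is a frozen dataclass. This is a sketch under stated assumptions: `FrozenMetricsSettings`, `resolve_settings`, the field names, and the subdirectory names `canonical` and `legacy` are all invented for illustration and are not taken from the PR.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrozenMetricsSettings:
    """Hypothetical immutable variant; field names are assumptions."""

    directory: str
    subdir: Optional[str] = None

    @property
    def metrics_directory(self) -> str:
        # Explicit None check instead of truthiness, so "" is not silently ignored.
        if self.subdir is None:
            return self.directory
        return os.path.join(self.directory, self.subdir)


def resolve_settings(directory: str, canonical: bool) -> FrozenMetricsSettings:
    """Decide the subdirectory once, at factory time (names are made up)."""
    return FrozenMetricsSettings(directory, "canonical" if canonical else "legacy")


settings = resolve_settings("/tmp/metrics", canonical=True)
print(settings.metrics_directory)

try:
    settings.subdir = "other"  # frozen dataclasses reject attribute assignment
except AttributeError as exc:
    print("immutable:", type(exc).__name__)
```

With this shape there is no window in which `metrics_directory` can be read before the mode has been decided, which also removes the ordering constraint from item 6.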

8. HTTP metrics server binding to all interfaces undocumented

The server binds to '' (0.0.0.0). Library users who pass http_port without understanding this could unintentionally expose metrics externally. Document this in the configuration docs.
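The difference between an empty bind address and a loopback address can be checked with plain sockets. This sketch uses only the standard library and does not start the SDK's metrics server; `bound_host` is a hypothetical helper and IPv4 is assumed.

```python
import socket


def bound_host(bind_addr: str) -> str:
    """Bind a TCP socket to `bind_addr` on an OS-chosen port and report the host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((bind_addr, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[0]


print(bound_host(""))           # 0.0.0.0 (every IPv4 interface)
print(bound_host("127.0.0.1"))  # 127.0.0.1 (loopback only)
```

For the Prometheus exporter itself, `prometheus_client.start_http_server` accepts an `addr` argument, so passing `addr="127.0.0.1"` is one way a caller could restrict exposure. Whether the SDK surfaces that option is not stated in this review.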

9. record_api_request_time type annotation wrong throughout

def record_api_request_time(self, ..., metric_uri: str = None) -> None:

Should be Optional[str] = None in MetricsCollectorBase, LegacyMetricsCollector, and CanonicalMetricsCollector.

10. NoPidCollector uses last-value semantics for non-quantile, non-histogram gauges

For multi-process task_poll_time / task_execute_time in all mode, data['values'][-1] is filesystem-ordering-dependent. Should be livemax or aggregated correctly. This is pre-existing behavior now more visible in the new class.

Minor Notes

Performance hot path: LegacyMetricsCollector._record_quantiles sorts the full 1000-element deque on every metric write — O(n log n) inside a lock. sum(observations) also runs O(n) on each call. Pre-existing, but now canonically visible.
TOCTOU in resolve_metrics_type: No lock around the _subdir is None check. Two threads calling create_metrics_collector(same_settings) concurrently could both set _subdir. In practice they'd set it to the same value, so no correctness bug, but worth noting.
Harness YAML hardcodes WORKER_CANONICAL_METRICS=true: Fine for CI, but means the harness can't easily test legacy mode in a deployed environment.
WorkflowStatusProbe has no tests — acceptable for a harness helper but should be noted.
Python version: Module-level __getattr__ (PEP 562) requires Python 3.7+. Confirm the SDK's minimum version claim covers this.
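For readers unfamiliar with PEP 562, the mechanism can be demonstrated without touching the SDK. In this sketch the module is built in memory with `types.ModuleType`. The module name, the stand-in class, and the idea that the shim resolves `MetricsCollector` through `__getattr__` are assumptions for illustration, since the review only states that a module-level `__getattr__` is used.

```python
import types


class LegacyMetricsCollector:
    """Stand-in for the real class."""


# Hypothetical in-memory module playing the role of a compatibility shim.
shim = types.ModuleType("metrics_collector_shim")


def _module_getattr(name: str):
    """Called only when normal attribute lookup on the module fails (PEP 562)."""
    if name == "MetricsCollector":
        return LegacyMetricsCollector
    raise AttributeError(f"module {shim.__name__!r} has no attribute {name!r}")


# Storing the function in the module's namespace enables the PEP 562 hook.
shim.__getattr__ = _module_getattr

print(shim.MetricsCollector is LegacyMetricsCollector)  # True

try:
    shim.DoesNotExist
except AttributeError as exc:
    print(exc)
```

On interpreters older than 3.7 the hook is never consulted, so the first lookup would raise AttributeError instead of returning the class.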

Overall Assessment

The design is sound and the backward-compatibility story is solid. The shim pattern for metrics_collector.py is the right call. The main risks are the _cleaned_directories / test isolation issue (#1), the PermissionError bug in dead-PID cleanup (#2), and the hot-path JSON re-serialization (#3). Those three should be addressed before merge. The type annotation and test gap issues are polish but meaningful for maintainability.

chrishagglund-ship-it added 4 commits April 29, 2026 10:24

implement canonical metrics, feature flagged to preserve legacy where…

b7cb1aa

… legacy ones were released

validate the multiprocessing start method choice

9d671ca

remove unnecessary sleep

98cf8ac

fixes for a few canonical metrics

6fed505

chrishagglund-ship-it added 3 commits May 6, 2026 09:36

cleaner reporting on which metrics implementation is used in the test…

63974df

… harness worker

update docs regarding metrics

9bca754

additional update of docs re: metrics

a6e721d

chrishagglund-ship-it requested a review from mp-orkes May 7, 2026 16:46

chrishagglund-ship-it added 12 commits May 7, 2026 11:14

add or update a changelog

98255f8

wip fix payload type to not break upgraders that might refer to it

cf49ea8

wip attempt to address directory and db clobbering concern

d2c5fa0

wip, attempting to not impose new behaviors on upgraders

ce03f71

directory handling for metrics db location was broken by recent change

9d118b1

wip, dealing with cardinality in url labels of metrics

c65cf83

getting template strings output in canonical metrics uris - wip

b2978cb

fix startup issue with db cleanup and child processes

39d59e0

addressing hygeine issues

ca92647

mostly doc fixes, minor tweak to sensible default

99c9457

restore default version label in one legacy metric

443e31d

remove cruft

2038fdc

chrishagglund-ship-it requested a review from manan164 May 29, 2026 15:46

nthmost-orkes reviewed May 29, 2026

chrishagglund-ship-it wants to merge 19 commits into main from gated-metrics-standardization

2 participants
