Standardized metrics export - feature flagged#409
Standardized metrics export - feature flagged#409chrishagglund-ship-it wants to merge 19 commits into
Conversation
nthmost-orkes
left a comment
There was a problem hiding this comment.
Claude Sonnet code review — automated survey via Claude Code
PR #409 Review: Standardized Metrics Export (Feature-Flagged)
Author: chrishagglund-ship-it | Size: +2387/-1751 across 32 files
Overview
The PR introduces a dual-surface metrics system behind WORKER_CANONICAL_METRICS=true. Core changes:
metrics_collector.py(927 lines → 38-line compatibility shim) re-exportsLegacyMetricsCollectorasMetricsCollector— backward-compatible for all existing import paths- New
MetricsCollectorBaseABC consolidates Prometheus infrastructure;LegacyMetricsCollectorpreserves old quantile-gauge shape;CanonicalMetricsCollectoruses real PrometheusHistogramobjects metrics_factory.pyreads the env var, stamps a_subdiron settings, creates the directory, and returns the right subclassMetricsSettingsgainsclean_directory,clean_dead_pids, and a computedmetrics_directorytask_runner.py/async_task_runner.pygetrecord_task_update_time()instrumentation (new call sites)api_client.py/async_api_client.pycapture URI template before path-param substitution so canonical histograms record/tasks/{taskType}not/tasks/my-task-123- Fork ordering in
harness/main.pyis fixed (workers fork before governor thread — correct for prometheus_client mutex safety)
The overall design is clean. The base/subclass split is logical, the factory is minimal, and the shim pattern is well-understood.
Must-Fix Issues
1. _cleaned_directories module-level set — breaks test isolation and suppresses cleanup on second call
metrics_factory.py uses a module-level set to guard idempotent cleanup. tearDown in test_metrics_factory.py never clears it, so any test that runs after a test that already populated the set will silently skip cleanup. Also, in any long-running process that reuses a directory path after deletion and recreation, cleanup is never retried.
Fix: Clear _cleaned_directories in tearDown, or accept a force_cleanup: bool bypass parameter.
2. _clean_dead_pid_files catches OSError instead of ProcessLookupError
try:
os.kill(pid, 0)
except OSError: # catches PermissionError too
os.remove(path)PermissionError is a subclass of OSError and fires when the PID exists but is owned by another user. This would delete .db files from live processes owned by other users.
Fix: Catch ProcessLookupError specifically.
3. measure_workflow_input_payload_size re-serializes full workflow input on every start_workflow call
encoded = json.dumps(workflow_input or {}, default=str)
size_bytes = len(encoded.encode("utf-8"))This is in the hot path for every workflow start in canonical mode. For large payloads it's expensive (double allocation) and inaccurate — default=str silently coerces non-serializable types, changing the apparent size vs. the actual wire payload.
4. Missing type annotations for metrics_collector parameters
OrkesBaseClient, OrkesClients, OrkesWorkflowClient, and WorkflowExecutor accept metrics_collector=None without type annotations. Should be Optional[MetricsCollectorBase] throughout.
5. Test gaps
yesis documented as a truthy value forWORKER_CANONICAL_METRICSbut has no test (onlytrueand1are tested)clean_directory=Trueandclean_dead_pids=Truebehavior not directly tested_clean_dead_pid_filesPermissionErrorpath not tested (aunittest.mock.patch('os.kill', side_effect=PermissionError)would catch the bug in item 2)metric_uripass-through inapi_client.py/async_api_client.pyhas no unit test verifying the template URI reaches the collector
Should-Fix Issues
6. metrics_directory is silently wrong if read before factory initialization
MetricsSettings.metrics_directory uses if self._subdir: (truthiness) but _subdir starts as None. If a user reads metrics_directory before calling create_metrics_collector, they get the base directory regardless of mode. Add an assertion or document the ordering constraint.
7. MetricsSettings._subdir mutation is a design smell
The settings object looks like an immutable config value object but gets mutated by the factory. This creates an implicit ordering constraint and makes metrics_directory non-idempotent across the object's lifetime. Cleaner: compute the directory eagerly at factory time and pass it in as a parameter, or freeze it at construction.
8. HTTP metrics server binding to all interfaces undocumented
The server binds to '' (0.0.0.0). Library users who pass http_port without understanding this could unintentionally expose metrics externally. Document this in the configuration docs.
9. record_api_request_time type annotation wrong throughout
def record_api_request_time(self, ..., metric_uri: str = None) -> None:Should be Optional[str] = None in MetricsCollectorBase, LegacyMetricsCollector, and CanonicalMetricsCollector.
10. NoPidCollector uses last-value semantics for non-quantile, non-histogram gauges
For multi-process task_poll_time / task_execute_time in all mode, data['values'][-1] is filesystem-ordering-dependent. Should be livemax or aggregated correctly. This is pre-existing behavior now more visible in the new class.
Minor Notes
- Performance hot path:
LegacyMetricsCollector._record_quantilessorts the full 1000-element deque on every metric write — O(n log n) inside a lock.sum(observations)also runs O(n) on each call. Pre-existing, but now canonically visible. - TOCTOU in
resolve_metrics_type: No lock around the_subdir is Nonecheck. Two threads callingcreate_metrics_collector(same_settings)concurrently could both set_subdir. In practice they'd set it to the same value, so no correctness bug, but worth noting. - Harness YAML hardcodes
WORKER_CANONICAL_METRICS=true: Fine for CI, but means the harness can't easily test legacy mode in a deployed environment. WorkflowStatusProbehas no tests — acceptable for a harness helper but should be noted.- Python version: Module-level
__getattr__(PEP 562) requires Python 3.7+. Confirm the SDK's minimum version claim covers this.
Overall Assessment
The design is sound and the backward-compatibility story is solid. The shim pattern for metrics_collector.py is the right call. The main risks are the _cleaned_directories / test isolation issue (#1), the PermissionError bug in dead-PID cleanup (#2), and the hot-path JSON re-serialization (#3). Those three should be addressed before merge. The type annotation and test gap issues are polish but meaningful for maintainability.
Provides cross-sdk standardized metrics output format, enabled by setting
WORKER_CANONICAL_METRICS=true, see standardized metrics catalog for the proposed standard metrics catalogThe motivation for this is to enable a consistent metrics output surface across all the conductor SDKs