Skip to content

Comments

Fix thread safety in MetricsCollector and parent process metrics registry#376

Merged
manan164 merged 7 commits intomainfrom
fix/thread-safety-and-metrics-registry
Feb 17, 2026
Merged

Fix thread safety in MetricsCollector and parent process metrics registry#376
manan164 merged 7 commits intomainfrom
fix/thread-safety-and-metrics-registry

Conversation

@manan164
Copy link
Contributor

Summary

This PR fixes two critical issues identified during review of PR #375:

1. Thread Safety in MetricsCollector ⚠️ HIGH SEVERITY

Problem: The MetricsCollector class had race conditions when accessed from multiple threads without proper synchronization.

Scenario:

  • Monitor thread calls increment_worker_restart()
  • Main thread calls other metric methods (e.g., increment_task_poll())
  • Both threads modify self.counters, self.gauges dictionaries simultaneously
  • Python dictionaries are NOT thread-safe for concurrent writes

Impact:

  • KeyError exceptions
  • Corrupted dictionary state
  • Missing metrics
  • Potential Python interpreter crashes

Solution:

  • Added threading.RLock() to protect all dictionary access
  • Wrapped all metric recording methods with lock protection
  • Added comprehensive thread safety tests
  • Updated docstring to document thread-safe behavior

2. Parent Process Metrics Registry Issue ⚠️ MEDIUM SEVERITY

Problem: TaskHandler parent process (coordinator) was instantiating a MetricsCollector and writing metrics to the same multiprocess directory as worker processes.

Impact:

  • Parent process PID metrics files persist after worker restarts
  • Confusion about which PID corresponds to which worker
  • Stale metrics from dead worker processes
  • worker_restart_total counter tracked in parent PID, not worker PIDs

Solution:

  • Removed MetricsCollector instantiation in TaskHandler parent process
  • Use prometheus_client.Counter directly in parent for restart metrics
  • Cleaner separation between parent (coordinator) and worker (executor) metrics

Changes

Modified Files:

  • src/conductor/client/telemetry/metrics_collector.py

    • Added import threading
    • Added self._lock = threading.RLock() in __init__
    • Wrapped all metric recording methods with with self._lock:
    • Updated docstring to document thread safety
  • src/conductor/client/automator/task_handler.py

    • Removed self._metrics_collector = MetricsCollector(...)
    • Added self._worker_restart_counter = Counter(...) using prometheus_client directly
    • Updated __inc_worker_restart_metric() to use counter directly
  • tests/unit/telemetry/test_metrics_collector_thread_safety.py (NEW)

    • Added 3 comprehensive thread safety tests
    • Tests concurrent counter increments
    • Tests mixed metric operations
    • Tests quantile recording under concurrent load

Testing

Run the new thread safety tests:

pytest tests/unit/telemetry/test_metrics_collector_thread_safety.py -v

All existing tests should continue to pass:

pytest tests/unit -v

Review Checklist

  • Thread safety added to all MetricsCollector dictionary access
  • Parent process no longer writes to worker metrics directory
  • Comprehensive tests added
  • No breaking changes to public API
  • Documentation updated (docstrings)

Related

This PR should be merged into PR #375 before #375 is merged to main.

v1r3n and others added 6 commits February 15, 2026 21:07
…stry

This commit addresses two critical issues identified in PR #375:

1. Thread Safety in MetricsCollector:
   - Added threading.RLock() to protect concurrent access to internal dictionaries
   - All metric recording methods now use locks to prevent race conditions
   - Added comprehensive thread safety tests
   - Updated docstring to document thread-safe behavior

   Race condition scenario: Monitor thread calls increment_worker_restart()
   while main thread calls other metric methods, both modifying the same
   dictionaries without synchronization.

2. Parent Process Metrics Registry Issue:
   - Removed MetricsCollector instantiation in TaskHandler parent process
   - Parent process now uses prometheus_client Counter directly
   - Avoids confusion between parent and worker process metrics
   - Prevents stale metrics from parent PID lingering after worker restarts

   Problem: Parent process (coordinator) was writing metrics to the same
   multiprocess directory as worker processes, causing confusion about
   which PID corresponds to which worker.

Testing:
- Added tests/unit/telemetry/test_metrics_collector_thread_safety.py
- Tests verify concurrent counter, gauge, and quantile operations
- All existing tests should continue to pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Base automatically changed from fix_worker_restart_onerror to main February 17, 2026 16:39
Resolve conflicts after PR #375 was merged to main. Keep thread safety
improvements (RLock in MetricsCollector, process lock in TaskHandler)
while adopting main's cleaner lazy-init metrics counter and simpler
process cleanup logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@manan164 manan164 merged commit 945e805 into main Feb 17, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants