Improve HTTP/2 resilience, worker supervision defaults, metrics, chaos tests, and FastAPI example #375

Merged
v1r3n merged 5 commits into main from fix_worker_restart_onerror
Feb 17, 2026

v1r3n (Contributor) commented Feb 16, 2026

Summary

  • Fix HTTP/2 connection drops by resetting the httpx client and retrying idempotent requests (src/conductor/client/http/rest.py, src/conductor/client/http/async_rest.py).
  • Enable worker-process monitoring and auto-restart by default; add health helpers and restart backoff/tuning (src/conductor/client/automator/task_handler.py).
  • Emit a worker_restart_total metric to Prometheus for observability (src/conductor/client/telemetry/metrics_collector.py, src/conductor/client/telemetry/model/*).
  • Add chaos tests to verify that workers survive connection termination and restart loops (tests/chaos/*, tests/unit/api_client/test_rest_client.py).
  • Document the HTTP/2 env toggle, supervision defaults, and metrics, and add a runnable FastAPI “workflow-as-API” example (README.md, docs/WORKER.md, METRICS.md, examples/fastapi_worker_service.py, examples/README.md, workers.md).
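
For readers who want a feel for the first bullet, here is a minimal sketch of the "reset the client, then retry only idempotent requests" pattern. It is not the code in rest.py: `ResettableSession`, `ConnectionDropped`, and `client_factory` are hypothetical names, and the transport is abstracted behind a factory so the sketch runs with the standard library alone. In the real client the factory would presumably build an httpx client and the caught exception would be whatever httpx raises when the connection is lost.

```python
from typing import Any, Callable

# Methods defined as idempotent by RFC 9110; POST and PATCH are excluded
# because replaying them could apply a side effect twice.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"})


class ConnectionDropped(Exception):
    """Hypothetical stand-in for the transport's connection-lost error."""


class ResettableSession:
    """Hypothetical wrapper that owns one client and can rebuild it on demand."""

    def __init__(self, client_factory: Callable[[], Any], max_retries: int = 1):
        self._client_factory = client_factory
        self._client = client_factory()
        self._max_retries = max_retries

    def _reset(self) -> None:
        # Best-effort close of the broken client, then build a fresh one.
        close = getattr(self._client, "close", None)
        if callable(close):
            try:
                close()
            except Exception:
                pass  # the old connection is already unusable
        self._client = self._client_factory()

    def request(self, method: str, url: str, **kwargs: Any) -> Any:
        attempts = 0
        while True:
            try:
                return self._client.request(method, url, **kwargs)
            except ConnectionDropped:
                # Always discard the broken client so later calls start clean.
                self._reset()
                retryable = method.upper() in IDEMPOTENT_METHODS
                if not retryable or attempts >= self._max_retries:
                    raise
                attempts += 1
```

In this sketch the client is reset even when the request is not retried, so a failed POST still leaves the session usable for the next call; whether the actual change behaves the same way is not stated in the summary.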

Testing

  • pytest -q tests/unit
  • RUN_CHAOS_TESTS=1 pytest -q tests/chaos
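
The second command suggests the chaos suite is opt-in through an environment variable. How tests/chaos actually checks it is not shown here, so the following is only a sketch of one common way to gate such tests: `chaos_enabled` and `ConnectionKillTest` are hypothetical names. It uses `unittest.skipUnless`, which pytest honours when it collects `unittest.TestCase` classes.

```python
import os
import unittest


def chaos_enabled(environ=os.environ):
    """Return True only when RUN_CHAOS_TESTS is set to exactly "1"."""
    return environ.get("RUN_CHAOS_TESTS") == "1"


@unittest.skipUnless(chaos_enabled(), "set RUN_CHAOS_TESTS=1 to run chaos tests")
class ConnectionKillTest(unittest.TestCase):
    """Hypothetical placeholder for a destructive test case."""

    def test_placeholder(self):
        # A real chaos test would terminate connections or worker
        # processes here and assert that the workers recover.
        self.assertTrue(True)
```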

manan164 added a commit that referenced this pull request Feb 16, 2026
…stry

This commit addresses two critical issues identified in PR #375:

1. Thread Safety in MetricsCollector:
   - Added threading.RLock() to protect concurrent access to internal dictionaries
   - All metric recording methods now use locks to prevent race conditions
   - Added comprehensive thread safety tests
   - Updated docstring to document thread-safe behavior

   Race condition scenario: Monitor thread calls increment_worker_restart()
   while main thread calls other metric methods, both modifying the same
   dictionaries without synchronization.

2. Parent Process Metrics Registry Issue:
   - Removed MetricsCollector instantiation in TaskHandler parent process
   - Parent process now uses prometheus_client Counter directly
   - Avoids confusion between parent and worker process metrics
   - Prevents stale metrics from parent PID lingering after worker restarts

   Problem: Parent process (coordinator) was writing metrics to the same
   multiprocess directory as worker processes, causing confusion about
   which PID corresponds to which worker.
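
   As an illustration of item 1, the sketch below shows the general shape of
   guarding shared dictionaries with threading.RLock(). It is not the
   MetricsCollector implementation: LockedCounters and its method bodies are
   hypothetical, and only increment_worker_restart() echoes a name from the
   commit message. A re-entrant lock matters when one locked method calls
   another locked method on the same thread.

```python
import threading
from collections import defaultdict


class LockedCounters:
    """Hypothetical, simplified stand-in for a metrics collector."""

    def __init__(self):
        self._lock = threading.RLock()
        self._counters = defaultdict(float)

    def increment(self, name, value=1.0):
        # The read-modify-write on the dict happens under the lock.
        with self._lock:
            self._counters[name] += value

    def increment_worker_restart(self, worker_name):
        # Holds the lock and then calls increment(), which acquires it
        # again; a plain Lock would deadlock here, an RLock does not.
        with self._lock:
            self.increment(f"worker_restart_total:{worker_name}")

    def snapshot(self):
        with self._lock:
            return dict(self._counters)


def hammer(collector, threads=8, per_thread=1000):
    """Increment one counter from several threads and return its final value."""
    def work():
        for _ in range(per_thread):
            collector.increment("hits")

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return collector.snapshot()["hits"]
```

   With the lock in place, hammer(LockedCounters()) returns exactly
   threads * per_thread, because every update is serialized.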

Testing:
- Added tests/unit/telemetry/test_metrics_collector_thread_safety.py
- Tests verify concurrent counter, gauge, and quantile operations
- All existing tests should continue to pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
v1r3n merged commit 8582146 into main Feb 17, 2026
1 check passed
v1r3n deleted the fix_worker_restart_onerror branch February 17, 2026 16:39
manan164 added a commit that referenced this pull request Feb 17, 2026
Resolve conflicts after PR #375 was merged to main. Keep thread safety
improvements (RLock in MetricsCollector, process lock in TaskHandler)
while adopting main's cleaner lazy-init metrics counter and simpler
process cleanup logic.
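
To make the "process lock" and restart behaviour concrete, here is a minimal sketch of a supervisor that restarts a dead worker under a lock and spaces restarts with exponential backoff. It is an assumption-laden illustration, not TaskHandler: `Supervisor`, `backoff_delay`, `process_factory`, and the default delays are all hypothetical. The factory only needs to return an object with `start()` and `is_alive()`, which `multiprocessing.Process` provides.

```python
import threading
import time


def backoff_delay(restart_count, base=1.0, cap=60.0):
    """Exponential backoff: base * 2**restart_count, capped at `cap` seconds."""
    return min(cap, base * (2 ** restart_count))


class Supervisor:
    """Hypothetical monitor for a single worker process."""

    def __init__(self, process_factory, clock=time.monotonic):
        self._factory = process_factory
        self._clock = clock
        self._lock = threading.Lock()  # guards _process and restart state
        self._process = process_factory()
        self._process.start()
        self.restart_count = 0
        self._next_restart_at = 0.0

    def is_healthy(self):
        with self._lock:
            return self._process.is_alive()

    def check_once(self):
        """Restart the worker if it is dead and the backoff window has
        elapsed. Returns True when a restart happened."""
        with self._lock:
            if self._process.is_alive():
                return False
            now = self._clock()
            if now < self._next_restart_at:
                return False  # still backing off after the last restart
            self._process = self._factory()
            self._process.start()
            self._next_restart_at = now + backoff_delay(self.restart_count)
            self.restart_count += 1  # a real collector would bump a counter here
            return True
```

A monitor thread would call `check_once()` periodically. Taking the lock in both `is_healthy()` and `check_once()` keeps a health probe from observing the process handle while it is being replaced.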

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>