fix(pyamber): use monotonic_ns to prevent negative elapsed times in statistics_manager#5791
Conversation
time.time_ns() (wall clock) can go backward on NTP adjustments or in VMs, making end_time - start_time negative in _switch_context and _process_dcm. Passing that negative delta to StatisticsManager raised ValueError, which was unhandled by the DataProcessor thread and left the worker in a deadlocked wait. Fix: replace all time.time_ns() in main_loop.py with time.monotonic_ns(), which is guaranteed non-decreasing. Defense-in-depth: update increase_data_processing_time and increase_control_processing_time to log a warning and return early instead of raising ValueError for negative values. This keeps the worker alive in the unlikely event of any remaining clock anomaly. Closes apache#3768
cf0da76 to
e81241f
Compare
There was a problem hiding this comment.
Pull request overview
Fixes intermittent worker hangs in the PyAmber runtime by switching elapsed-time measurements to a monotonic clock and making statistics collection tolerant to negative durations.
Changes:
- Replace
time.time_ns()withtime.monotonic_ns()inMainLoopwhen measuring elapsed processing time and total execution time. - Make
StatisticsManager.increase_*_processing_timeignore (and warn on) negative durations instead of raising. - Update the statistics manager unit test to match the new “ignore negative durations” behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
amber/src/main/python/core/runnables/main_loop.py |
Switches processing/execution timing from wall-clock to monotonic time to avoid negative elapsed durations. |
amber/src/main/python/core/architecture/managers/statistics_manager.py |
Adds defense-in-depth by warning and returning early on negative processing durations. |
amber/src/test/python/core/architecture/managers/test_statistics_manager.py |
Updates the negative-time test to validate the new behavior (no exception, accumulator unchanged). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @pytest.mark.parametrize( | ||
| "method", | ||
| ["increase_data_processing_time", "increase_control_processing_time"], | ||
| "method, attr", | ||
| [ | ||
| ("increase_data_processing_time", "_data_processing_time"), | ||
| ("increase_control_processing_time", "_control_processing_time"), | ||
| ], | ||
| ) |
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 402 | 0.245 | 22,699/37,616/37,616 us | 🔴 +7.2% / 🔴 +7.5% |
| 🔴 | bs=100 sw=10 sl=64 | 942 | 0.575 | 104,597/134,381/134,381 us | 🔴 +19.0% / 🟢 -6.8% |
| ⚪ | bs=1000 sw=10 sl=64 | 1,103 | 0.673 | 906,025/962,805/962,805 us | ⚪ within ±5% / 🟢 -6.9% |
Baseline details
Latest main 731d671 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 402 tuples/sec | 421 tuples/sec | 410.82 tuples/sec | -4.5% | -2.1% |
| bs=10 sw=10 sl=64 | MB/s | 0.245 MB/s | 0.257 MB/s | 0.251 MB/s | -4.7% | -2.3% |
| bs=10 sw=10 sl=64 | p50 | 22,699 us | 22,952 us | 23,785 us | -1.1% | -4.6% |
| bs=10 sw=10 sl=64 | p95 | 37,616 us | 35,089 us | 34,980 us | +7.2% | +7.5% |
| bs=10 sw=10 sl=64 | p99 | 37,616 us | 35,089 us | 34,980 us | +7.2% | +7.5% |
| bs=100 sw=10 sl=64 | throughput | 942 tuples/sec | 970 tuples/sec | 891.94 tuples/sec | -2.9% | +5.6% |
| bs=100 sw=10 sl=64 | MB/s | 0.575 MB/s | 0.592 MB/s | 0.544 MB/s | -2.9% | +5.6% |
| bs=100 sw=10 sl=64 | p50 | 104,597 us | 104,735 us | 112,277 us | -0.1% | -6.8% |
| bs=100 sw=10 sl=64 | p95 | 134,381 us | 112,921 us | 139,802 us | +19.0% | -3.9% |
| bs=100 sw=10 sl=64 | p99 | 134,381 us | 112,921 us | 139,802 us | +19.0% | -3.9% |
| bs=1000 sw=10 sl=64 | throughput | 1,103 tuples/sec | 1,110 tuples/sec | 1,041 tuples/sec | -0.6% | +6.0% |
| bs=1000 sw=10 sl=64 | MB/s | 0.673 MB/s | 0.677 MB/s | 0.635 MB/s | -0.6% | +5.9% |
| bs=1000 sw=10 sl=64 | p50 | 906,025 us | 897,776 us | 972,714 us | +0.9% | -6.9% |
| bs=1000 sw=10 sl=64 | p95 | 962,805 us | 945,364 us | 1,023,057 us | +1.8% | -5.9% |
| bs=1000 sw=10 sl=64 | p99 | 962,805 us | 945,364 us | 1,023,057 us | +1.8% | -5.9% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,497.78,200,128000,402,0.245,22699.28,37616.06,37616.06
1,100,10,64,20,2124.27,2000,1280000,942,0.575,104596.76,134380.67,134380.67
2,1000,10,64,20,18139.02,20000,12800000,1103,0.673,906024.97,962805.12,962805.12|
Please make sure CI tests pass first |
What changes were proposed in this PR?
Root cause:
time.time_ns()is a wall-clock call that can go backward during NTP adjustments or inside VMs/containers. InMainLoop._switch_contextand_process_dcm, elapsed time is computed asend_time - start_timewhere both values come fromtime.time_ns(). When the clock steps backward between the two calls, the difference is negative. Passing a negative value toStatisticsManager.increase_data_processing_timeraisedValueError: Time must be non-negative, which propagated into the DataProcessor thread and left the worker in a deadlocked wait.Before → After
Changes (3 files):
core/runnables/main_loop.pytime.time_ns()calls withtime.monotonic_ns()core/architecture/managers/statistics_manager.pyValueErrortest/python/core/architecture/managers/test_statistics_manager.pytest_negative_time_raises→test_negative_time_is_clamped_to_zeroto match new behaviorAny related issues, documentation, discussions?
Closes #3768
How was this PR tested?
(Requires
bin/python-proto-gen.shto have been run first, as in CI.)All existing
TestStatisticsManager*tests pass. The updated testtest_negative_time_is_clamped_to_zeroreplacestest_negative_time_raisesand covers the new clamp-and-warn behavior.Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6