Skip to content

fix(pyamber): use monotonic_ns to prevent negative elapsed times in statistics_manager#5791

Open
lalalastella wants to merge 1 commit into
apache:mainfrom
lalalastella:fix/monotonic-clock-negative-elapsed-time
Open

fix(pyamber): use monotonic_ns to prevent negative elapsed times in statistics_manager#5791
lalalastella wants to merge 1 commit into
apache:mainfrom
lalalastella:fix/monotonic-clock-negative-elapsed-time

Conversation

@lalalastella

Copy link
Copy Markdown

What changes were proposed in this PR?

Root cause: time.time_ns() is a wall-clock call that can go backward during NTP adjustments or inside VMs/containers. In MainLoop._switch_context and _process_dcm, elapsed time is computed as end_time - start_time where both values come from time.time_ns(). When the clock steps backward between the two calls, the difference is negative. Passing a negative value to StatisticsManager.increase_data_processing_time raised ValueError: Time must be non-negative, which propagated into the DataProcessor thread and left the worker in a deadlocked wait.

Before → After

Before:
  start_time = time.time_ns()   # wall clock — can go backward
  ...
  end_time   = time.time_ns()   # can be < start_time after NTP step
  statistics_manager.increase_data_processing_time(end_time - start_time)
  #  → ValueError: Time must be non-negative
  #  → worker thread hangs (unhandled exception)

After:
  start_time = time.monotonic_ns()   # guaranteed non-decreasing
  ...
  end_time   = time.monotonic_ns()   # always ≥ start_time
  statistics_manager.increase_data_processing_time(end_time - start_time)
  #  → always ≥ 0, no exception

Changes (3 files):

File Change
core/runnables/main_loop.py Replace all 6 time.time_ns() calls with time.monotonic_ns()
core/architecture/managers/statistics_manager.py Defense-in-depth: log a warning and return early for negative elapsed times instead of raising ValueError
test/python/core/architecture/managers/test_statistics_manager.py Update test_negative_time_raisestest_negative_time_is_clamped_to_zero to match new behavior

Any related issues, documentation, discussions?

Closes #3768

How was this PR tested?

cd amber && pytest src/test/python/core/architecture/managers/test_statistics_manager.py -v

(Requires bin/python-proto-gen.sh to have been run first, as in CI.)

All existing TestStatisticsManager* tests pass. The updated test test_negative_time_is_clamped_to_zero replaces test_negative_time_raises and covers the new clamp-and-warn behavior.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

time.time_ns() (wall clock) can go backward on NTP adjustments or in
VMs, making end_time - start_time negative in _switch_context and
_process_dcm.  Passing that negative delta to StatisticsManager raised
ValueError, which was unhandled by the DataProcessor thread and left
the worker in a deadlocked wait.

Fix: replace all time.time_ns() in main_loop.py with time.monotonic_ns(),
which is guaranteed non-decreasing.

Defense-in-depth: update increase_data_processing_time and
increase_control_processing_time to log a warning and return early
instead of raising ValueError for negative values.  This keeps the
worker alive in the unlikely event of any remaining clock anomaly.

Closes apache#3768
@lalalastella lalalastella force-pushed the fix/monotonic-clock-negative-elapsed-time branch from cf0da76 to e81241f Compare June 19, 2026 06:24
@github-actions github-actions Bot removed the common label Jun 19, 2026
@aglinxinyuan aglinxinyuan requested review from Copilot and mengw15 June 19, 2026 06:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes intermittent worker hangs in the PyAmber runtime by switching elapsed-time measurements to a monotonic clock and making statistics collection tolerant to negative durations.

Changes:

  • Replace time.time_ns() with time.monotonic_ns() in MainLoop when measuring elapsed processing time and total execution time.
  • Make StatisticsManager.increase_*_processing_time ignore (and warn on) negative durations instead of raising.
  • Update the statistics manager unit test to match the new “ignore negative durations” behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
amber/src/main/python/core/runnables/main_loop.py Switches processing/execution timing from wall-clock to monotonic time to avoid negative elapsed durations.
amber/src/main/python/core/architecture/managers/statistics_manager.py Adds defense-in-depth by warning and returning early on negative processing durations.
amber/src/test/python/core/architecture/managers/test_statistics_manager.py Updates the negative-time test to validate the new behavior (no exception, accumulator unchanged).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 103 to 109
@pytest.mark.parametrize(
"method",
["increase_data_processing_time", "increase_control_processing_time"],
"method, attr",
[
("increase_data_processing_time", "_data_processing_time"),
("increase_control_processing_time", "_control_processing_time"),
],
)
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 0 better · 🔴 4 worse · ⚪ 11 noise (<±5%) · 0 without baseline

Compared against main 731d671 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 402 0.245 22,699/37,616/37,616 us 🔴 +7.2% / 🔴 +7.5%
🔴 bs=100 sw=10 sl=64 942 0.575 104,597/134,381/134,381 us 🔴 +19.0% / 🟢 -6.8%
bs=1000 sw=10 sl=64 1,103 0.673 906,025/962,805/962,805 us ⚪ within ±5% / 🟢 -6.9%
Baseline details

Latest main 731d671 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 402 tuples/sec 421 tuples/sec 410.82 tuples/sec -4.5% -2.1%
bs=10 sw=10 sl=64 MB/s 0.245 MB/s 0.257 MB/s 0.251 MB/s -4.7% -2.3%
bs=10 sw=10 sl=64 p50 22,699 us 22,952 us 23,785 us -1.1% -4.6%
bs=10 sw=10 sl=64 p95 37,616 us 35,089 us 34,980 us +7.2% +7.5%
bs=10 sw=10 sl=64 p99 37,616 us 35,089 us 34,980 us +7.2% +7.5%
bs=100 sw=10 sl=64 throughput 942 tuples/sec 970 tuples/sec 891.94 tuples/sec -2.9% +5.6%
bs=100 sw=10 sl=64 MB/s 0.575 MB/s 0.592 MB/s 0.544 MB/s -2.9% +5.6%
bs=100 sw=10 sl=64 p50 104,597 us 104,735 us 112,277 us -0.1% -6.8%
bs=100 sw=10 sl=64 p95 134,381 us 112,921 us 139,802 us +19.0% -3.9%
bs=100 sw=10 sl=64 p99 134,381 us 112,921 us 139,802 us +19.0% -3.9%
bs=1000 sw=10 sl=64 throughput 1,103 tuples/sec 1,110 tuples/sec 1,041 tuples/sec -0.6% +6.0%
bs=1000 sw=10 sl=64 MB/s 0.673 MB/s 0.677 MB/s 0.635 MB/s -0.6% +5.9%
bs=1000 sw=10 sl=64 p50 906,025 us 897,776 us 972,714 us +0.9% -6.9%
bs=1000 sw=10 sl=64 p95 962,805 us 945,364 us 1,023,057 us +1.8% -5.9%
bs=1000 sw=10 sl=64 p99 962,805 us 945,364 us 1,023,057 us +1.8% -5.9%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,497.78,200,128000,402,0.245,22699.28,37616.06,37616.06
1,100,10,64,20,2124.27,2000,1280000,942,0.575,104596.76,134380.67,134380.67
2,1000,10,64,20,18139.02,20000,12800000,1103,0.673,906024.97,962805.12,962805.12

@mengw15

mengw15 commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Please make sure CI tests pass first

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Non-deterministic ValueError: Time must be non-negative in statistics_manager.increase_data_processing_time causes workflow to hang

3 participants