
fix(amber): avoid AtomicInteger.get_and_set deadlock#5010

Open
nathant27 wants to merge 2 commits into
apache:mainfrom
nathant27:fix/atomicinteger-get-and-set-deadlock

Conversation

@nathant27

@nathant27 nathant27 commented May 10, 2026

What changes were proposed in this PR?

Fixed a deadlock in AtomicInteger.get_and_set (amber/src/main/python/core/util/atomic.py).

Before the change, get_and_set acquired a non-reentrant lock and then read the value property, which tries to acquire the same lock, causing a deadlock.

After the change, get_and_set reads the object's self._value attribute directly inside the existing critical section, which avoids the nested lock acquisition.
- I chose this inline change over switching to a reentrant lock to avoid potentially unnecessary overhead, and because direct access seems more appropriate here: get_and_set already holds the lock.
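
To illustrate the description above, here is a minimal sketch of the pattern. AtomicIntegerSketch and nested_acquire_would_block are hypothetical names used only for illustration; this is not the actual contents of atomic.py, only the shape the description implies (a threading.Lock, a locked value property, and a get_and_set that reads self._value directly).

```python
import threading


class AtomicIntegerSketch:
    """Hypothetical stand-in for AtomicInteger, not the real amber class."""

    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()  # non-reentrant lock

    @property
    def value(self) -> int:
        # The property takes the lock on every read.
        with self._lock:
            return self._value

    def get_and_set(self, new_value: int) -> int:
        # Fixed shape: read self._value directly while holding the lock.
        # Reading self.value here instead would try to acquire the same
        # non-reentrant lock a second time and block forever.
        with self._lock:
            old_value = self._value
            self._value = new_value
            return old_value


def nested_acquire_would_block() -> bool:
    """Show, without hanging, that a held Lock cannot be re-acquired."""
    lock = threading.Lock()
    with lock:
        # Non-blocking probe from the same thread that holds the lock.
        acquired = lock.acquire(blocking=False)
        if acquired:
            lock.release()
        return not acquired
```

The non-blocking probe stands in for the pre-fix behavior: a blocking acquire at that point would never return, which is the deadlock described in the linked issue.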

Any related issues, documentation, or discussions?

Fixes #4794

How was this PR tested?

Tests are in src/test/python/core/util/test_atomic.py.
The bug-pinning test test_get_and_set_currently_deadlocks_on_non_reentrant_lock (now removed) fails after the fix on both of its asserts (lines 99 to 106):

        assert worker.is_alive(), (
            "worker thread exited unexpectedly — get_and_set neither deadlocked "
            "nor completed; the test no longer pins the documented bug."
        )
        assert not completed.is_set(), (
            "get_and_set unexpectedly returned — the deadlock bug appears fixed; "
            "delete this pinned test along with the xfail below."
        )

This is expected: with the deadlock fixed, the worker thread finishes (so it is no longer alive) and the completed event is set.

Two changes to test_atomic.py:

REPLACED
test_get_and_set_currently_deadlocks_on_non_reentrant_lock
WITH
test_get_and_set_does_not_deadlock_on_non_reentrant_lock

  • Mostly the same as the old test, but instead of checking that get_and_set deadlocks, it checks that it does not, by asserting "not worker.is_alive()" and "completed.is_set()" after joining the thread. In effect, the two asserts at the end are inverted.
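
A test of this shape might look roughly like the sketch below. The names worker and completed come from the asserts quoted earlier; AtomicIntegerSketch, run_get_and_set_in_worker, the 2-second join timeout, and the daemon flag are assumptions for illustration and may differ from the real test in test_atomic.py.

```python
import threading


class AtomicIntegerSketch:
    """Hypothetical stand-in for AtomicInteger with the fixed get_and_set."""

    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()

    def get_and_set(self, new_value: int) -> int:
        with self._lock:
            old_value = self._value
            self._value = new_value
            return old_value


def run_get_and_set_in_worker(counter, new_value, timeout=2.0):
    """Call get_and_set on a worker thread so a hang cannot block the test."""
    completed = threading.Event()
    result = {}

    def target():
        result["old"] = counter.get_and_set(new_value)
        completed.set()

    # daemon=True so a deadlocked worker would not keep the process alive.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    return worker, completed, result


def test_get_and_set_does_not_deadlock_sketch():
    counter = AtomicIntegerSketch(1)
    worker, completed, result = run_get_and_set_in_worker(counter, 2)
    # Inverse of the old pinned test: the worker must have finished.
    assert not worker.is_alive()
    assert completed.is_set()
    assert result["old"] == 1
```

Running the call on a separate thread with a bounded join keeps a regression from hanging CI: if the deadlock returned, the join would time out and the first assert would fail instead.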

UPDATED test_get_and_set_should_return_old_value_and_replace_state

  • REMOVED the xfail marker (strict=True), because the test now passes.

Was this PR authored or co-authored using generative AI tooling?

No

@Yicong-Huang
Contributor

Thanks @nathant27! Please do the following:

  1. please update the PR description to match our template.
  2. please add a test case to ensure the mentioned issue is fixed and guarded in CI.

This seems to be a bug on both main and 1.1.0-incubating. @bobbai00, shall we backport it?

@nathant27
Author

Thanks for the feedback! I changed the description to match the template format, and I'm looking into the testing (removing the bug-tracking test and adding a new one). I'm out at the moment, but it will be done later tonight or early tomorrow morning.

@nathant27
Author

Just updated my description for the updated tests @Yicong-Huang. Let me know if these updates to the test seem fine for now. I'm happy to add more if needed

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.64%. Comparing base (14f8be4) to head (bcaa234).

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #5010   +/-   ##
=========================================
  Coverage     42.64%   42.64%           
  Complexity     2188     2188           
=========================================
  Files          1045     1045           
  Lines         39880    39880           
  Branches       4205     4205           
=========================================
+ Hits          17006    17008    +2     
+ Misses        21814    21812    -2     
  Partials       1060     1060           
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| access-control-service | 39.88% <ø> (ø) | Carriedforward from 14f8be4 |
| agent-service | 33.72% <ø> (ø) | Carriedforward from 14f8be4 |
| amber | 43.29% <ø> (ø) | Carriedforward from 14f8be4 |
| computing-unit-managing-service | 0.00% <ø> (ø) | Carriedforward from 14f8be4 |
| config-service | 0.00% <ø> (ø) | Carriedforward from 14f8be4 |
| file-service | 32.10% <ø> (ø) | Carriedforward from 14f8be4 |
| frontend | 33.85% <ø> (ø) | Carriedforward from 14f8be4 |
| python | 88.95% <100.00%> (+0.05%) ⬆️ | |
| workflow-compiling-service | 47.72% <ø> (ø) | Carriedforward from 14f8be4 |

*This pull request uses carry forward flags.


@nathant27 nathant27 force-pushed the fix/atomicinteger-get-and-set-deadlock branch from bcaa234 to d1f2e41 Compare May 12, 2026 03:23

Development

Successfully merging this pull request may close these issues.

AtomicInteger.get_and_set deadlocks because of non-reentrant lock

3 participants