IGNITE-24963 Introduce fair wound wait deadlock prevention algorithm#7799
ascherbakoff wants to merge 61 commits into apache:main
Pull request overview
This PR implements a wound-wait deadlock prevention policy for transactions, replacing the previous wait-die approach. It also introduces transaction kill messaging, refactors lock manager internals, adds runInTransaction to TxManager, and restructures test hierarchies for lock management tests.
Changes:
- Replaces DeadlockPreventionPolicyImpl with specific policy classes (WoundWaitDeadlockPreventionPolicy, NoWaitDeadlockPreventionPolicy, TimeoutDeadlockPreventionPolicy, ReversedWaitDieDeadlockPreventionPolicy) and refactors HeapLockManager to use an allowWait callback with a "sealable" tx map
- Adds TxKillMessage for cross-node transaction kill signaling and integrates it into TxManagerImpl as the wound-wait fail action
- Moves runInTransaction from IgniteTransactions default methods into TxManager and refactors the retry logic in RunInTransactionInternalImpl
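For readers comparing the two policies named above, the textbook decision rules can be sketched as follows. This is a minimal illustration, assuming each transaction carries a distinct start timestamp where a smaller value means an older transaction; the class, enum, and method names are hypothetical and are not taken from the PR.

```java
// Hypothetical sketch: names are illustrative and do not mirror the PR's classes.
final class DeadlockPreventionSketch {
    /** What happens when a requester conflicts with a current lock holder. */
    enum Action { WAIT, ABORT_REQUESTER, WOUND_HOLDER }

    /** Wound-wait: an older requester wounds (aborts) the younger holder; a younger requester waits. */
    static Action woundWait(long requesterTs, long holderTs) {
        return requesterTs < holderTs ? Action.WOUND_HOLDER : Action.WAIT;
    }

    /** Wait-die: an older requester waits; a younger requester dies (aborts itself). */
    static Action waitDie(long requesterTs, long holderTs) {
        return requesterTs < holderTs ? Action.WAIT : Action.ABORT_REQUESTER;
    }

    public static void main(String[] args) {
        long older = 1L;
        long younger = 2L;

        System.out.println("wound-wait, older requests:   " + woundWait(older, younger));
        System.out.println("wound-wait, younger requests: " + woundWait(younger, older));
        System.out.println("wait-die, older requests:     " + waitDie(older, younger));
        System.out.println("wait-die, younger requests:   " + waitDie(younger, older));
    }
}
```

In both schemes waits only ever go in one direction of the age order, so a wait cycle cannot form. The difference is which side is aborted on a conflict: the younger holder under wound-wait, the younger requester under wait-die.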
Reviewed changes
Copilot reviewed 54 out of 54 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| DeadlockPreventionPolicy.java | Adds allowWait, failAction, reverse methods; removes usePriority |
| WoundWaitDeadlockPreventionPolicy.java | New wound-wait policy implementation |
| WaitDieDeadlockPreventionPolicy.java | Adds allowWait and reverse implementations |
| NoWaitDeadlockPreventionPolicy.java | New no-wait policy |
| TimeoutDeadlockPreventionPolicy.java | New timeout-only policy |
| ReversedWaitDieDeadlockPreventionPolicy.java | New reversed wait-die policy |
| DeadlockPreventionPolicyImpl.java | Deleted; replaced by specific policy classes |
| HeapLockManager.java | Major refactor: sealable tx map, tryAcquireInternal, findConflicts, callback-based conflict resolution |
| TxKillMessage.java | New network message for tx kill requests |
| TxMessageGroup.java | Registers new TX_KILL_MESSAGE type |
| TxMessageSender.java | Adds kill() method |
| TxManagerImpl.java | Switches to wound-wait policy, adds kill message handler, adds runInTransaction |
| ReadWriteTransactionImpl.java | Makes killed volatile, moves assignment, exposes enlistFailedException |
| InternalTransaction.java | Adds enlistFailedException default method |
| PublicApiThreadingTransaction.java | Delegates enlistFailedException |
| IgniteTransactions.java | Makes runInTransaction/runInTransactionAsync abstract |
| IgniteTransactionsImpl.java | Implements runInTransaction/runInTransactionAsync |
| ClientTransactions.java | Stub implementations throwing IllegalArgumentException |
| RestartProofIgniteTransactions.java | Delegates new methods |
| PublicApiThreadingIgniteTransactions.java | Delegates new methods |
| RunInTransactionInternalImpl.java | Refactored retry logic, made public |
| TransactionIds.java | Adds retryCnt field to tx ID encoding |
| InternalTxOptions.java | Adds retryId option |
| Lock.java | Adds equals/hashCode |
| LockKey.java | Improves toString for ByteBuffer keys |
| LockManager.java | Adds policy() method |
| TransactionKilledException.java | Adds simplified constructor |
| InternalTableImpl.java | Delegates to tx.enlistFailedException() |
| PartitionReplicaListener.java | Replaces TxCleanupReadyState with PartitionInflights |
| PartitionInflights.java | New inflight tracking for partition replicas |
| TraceableFuture.java | Debug utility (should be removed) |
| ThreadAssertingMvPartitionStorage.java | Disables thread assertion (should be reverted) |
| TpccBenchmarkNodeRunner.java | Developer-local benchmark runner |
| ItDataConsistencyTest.java | Test adjustments for new policy |
| AbstractLockingTest.java | Refactored base test class |
| AbstractLockManagerTest.java | Deleted; tests moved to HeapLockManagerTest |
| HeapLockManagerTest.java | Now extends AbstractLockingTest, contains moved tests |
| Various test files | Updated to use new policy classes and matchers |
| LockWaiterMatcher.java, LockFutureMatcher.java, LockConflictMatcher.java | New Hamcrest matchers for lock test assertions |
Pull request overview
This PR introduces a new deadlock-prevention framework centered around a fair wound-wait policy, including coordinator-driven transaction kill, lock-manager behavior changes, and broad test updates to run under both wound-wait and wait-die/reversed policies.
Changes:
- Implement wound-wait deadlock prevention (including tx “kill” messaging) and refactor deadlock-prevention policies into dedicated implementations.
- Rework HeapLockManager conflict handling and add new concurrency guards to address rollback/enlistment races.
- Update/parameterize a large suite of unit/integration tests to support multiple lock policies and new retry semantics.
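The retry semantics mentioned above can be pictured with a small loop that re-runs a closure when its transaction is killed. This is only a sketch under stated assumptions: the names RetrySketch, Tx, and TxKilledException are hypothetical, and the idea that a retried attempt keeps the priority of the first attempt (so that it eventually becomes the oldest transaction and cannot be wounded again) is an assumption about what "fair" means here, not something this page confirms.

```java
import java.util.function.Function;

// Hypothetical sketch: none of these types exist in the PR under these names.
final class RetrySketch {
    /** Stand-in for the error a wounded (killed) transaction surfaces to its caller. */
    static final class TxKilledException extends RuntimeException {
        TxKilledException(String message) {
            super(message);
        }
    }

    /** Stand-in transaction handle: a priority timestamp plus a retry counter. */
    static final class Tx {
        final long priorityTs;
        final int retryCnt;

        Tx(long priorityTs, int retryCnt) {
            this.priorityTs = priorityTs;
            this.retryCnt = retryCnt;
        }
    }

    /** Runs the closure, starting a new attempt with the same priority each time it is killed. */
    static <T> T runInTransaction(long priorityTs, int maxRetries, Function<Tx, T> body) {
        for (int attempt = 0; ; attempt++) {
            // Assumption: the retry reuses the original priority and only bumps the counter.
            Tx tx = new Tx(priorityTs, attempt);
            try {
                // In this sketch the closure covers the whole attempt, including its commit.
                return body.apply(tx);
            } catch (TxKilledException e) {
                if (attempt >= maxRetries) {
                    throw e; // Retry budget exhausted: surface the failure.
                }
            }
        }
    }

    public static void main(String[] args) {
        String result = runInTransaction(100L, 3, tx -> {
            if (tx.retryCnt < 2) {
                throw new TxKilledException("wounded by an older transaction");
            }
            return "committed on attempt " + tx.retryCnt + " with priority " + tx.priorityTs;
        });
        System.out.println(result);
    }
}
```

Keeping the commit inside the retried closure matches the file summary's note that commit is now part of the retriable path; everything else in the sketch is illustrative.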
Reviewed changes
Copilot reviewed 64 out of 64 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| modules/transactions/src/testFixtures/java/org/apache/ignite/internal/tx/test/LockWaiterMatcher.java | New matcher for “waiting” lock futures (used by updated tests). |
| modules/transactions/src/testFixtures/java/org/apache/ignite/internal/tx/test/LockFutureMatcher.java | New matcher for verifying granted lock futures. |
| modules/transactions/src/testFixtures/java/org/apache/ignite/internal/tx/test/LockConflictMatcher.java | New matcher for verifying lock-conflict completion. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/WoundWaitDeadlockPreventionRollbackFailActionTest.java | New tests validating wound-wait behavior when failAction rolls back. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/WoundWaitDeadlockPreventionNoOpFailActionTest.java | New tests validating wound-wait behavior when failAction is a no-op. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/WaitDieDeadlockPreventionTest.java | Adjust tests to new policy/matcher infrastructure and lockKey helpers. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/TimeoutDeadlockPreventionTest.java | Switch to new timeout policy class and updated waiting/conflict matchers. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/ReversedWaitDieDeadlockPreventionTest.java | Replace old reversed-policy wiring with a dedicated reversed wait-die policy. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/NoWaitDeadlockPreventionTest.java | Switch to dedicated no-wait policy class and lockKey helper. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/NoneDeadlockPreventionTest.java | Update tests to shared deadlock-prevention test base and wait matcher. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/LockManagerTxLabelTest.java | Switch to dedicated no-wait policy for label-in-message coverage. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/impl/OrphanDetectorTxLabelTest.java | Parameterize over wait-die and wound-wait policies. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/impl/OrphanDetectorTest.java | Parameterize over policies; construct lock manager per policy. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/HeapLockManagerEventsTest.java | Parameterize lock-manager events tests over policies. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/CoarseGrainedLockManagerTest.java | Adjust coarse-lock tests for new “who waits” semantics under policy changes. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/AbstractLockManagerTest.java | Remove large legacy lock-manager test base (superseded by new bases). |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/AbstractLockManagerEventsTest.java | Update to lockKey helper usage. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/AbstractLockingTest.java | Centralize lock manager setup, tx ordering, and lock acquisition helpers. |
| modules/transactions/src/test/java/org/apache/ignite/internal/tx/AbstractDeadlockPreventionTest.java | Refactor common scenarios to use new matchers and policy hook points. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/TransactionIds.java | Add hash(UUID, divisor) helper for striped hashing. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/message/TxMessageGroup.java | Add message type for tx kill. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/message/TxKillMessage.java | New fire-and-forget “kill tx” network message. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockManager.java | Expose the active deadlock prevention policy via policy(). |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/LockKey.java | Improve toString formatting for byte-buffer based keys. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/Lock.java | Add equals/hashCode to support matcher-based assertions. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/InternalTxOptions.java | Add retryId plumbing for retriable transaction creation. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/InternalTransaction.java | Add enlist-failure exception hook used by table operations. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/WoundWaitDeadlockPreventionPolicy.java | New wound-wait policy implementation. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/WaitDieDeadlockPreventionPolicy.java | Implement allowWait/reverse semantics explicitly for wait-die. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/TxMessageSender.java | Add kill send helper for coordinator-directed tx invalidation. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/TxManagerImpl.java | Default to wound-wait policy; add kill handling and retryId-based tx creation. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/TimeoutDeadlockPreventionPolicy.java | New “timeout wait” policy class. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/ReversedWaitDieDeadlockPreventionPolicy.java | New dedicated reversed wait-die policy class. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/ReadWriteTransactionImpl.java | Make kill flag volatile; expose enlist failure exception; adjust finish semantics. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/PublicApiThreadingTransaction.java | Delegate new enlist-failure exception method. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/NoWaitDeadlockPreventionPolicy.java | New no-wait policy class. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/HeapLockManager.java | Major lock acquisition/conflict resolution rewrite; sealing to prevent enlist races. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/impl/DeadlockPreventionPolicyImpl.java | Remove old generic policy implementation. |
| modules/transactions/src/main/java/org/apache/ignite/internal/tx/DeadlockPreventionPolicy.java | Extend policy API with allowWait/failAction/reverse hooks. |
| modules/transactions/src/integrationTest/java/org/apache/ignite/tx/distributed/ItTransactionRecoveryTest.java | Make recovery tests conditional on policy semantics (reverse vs non-reverse). |
| modules/transactions/src/integrationTest/java/org/apache/ignite/internal/tx/ItRunInTransactionTest.java | Adjust retry test flow to account for policy differences. |
| modules/transactions/src/integrationTest/java/org/apache/ignite/internal/disaster/ItDisasterRecoveryReconfigurationTest.java | Skip scenario not compatible with wound-wait until follow-up ticket. |
| modules/table/src/testFixtures/java/org/apache/ignite/internal/table/TxAbstractTest.java | Update table tx tests for new waiting semantics and less timing-based assertions. |
| modules/table/src/main/java/org/apache/ignite/internal/table/distributed/storage/InternalTableImpl.java | Change retry eligibility; delegate enlist failure to transaction-provided exception. |
| modules/table/src/main/java/org/apache/ignite/internal/table/distributed/replicator/PartitionReplicaListener.java | Add PartitionInflights + guards to prevent rollback/enlist races and lock leaks. |
| modules/table/src/main/java/org/apache/ignite/internal/table/distributed/replicator/PartitionInflights.java | New inflight tracking primitive used to coordinate cleanup barriers. |
| modules/table/src/integrationTest/java/org/apache/ignite/internal/table/ItOperationRetryTest.java | Make some retry scenarios policy-dependent (only reversed/wait-die). |
| modules/table/src/integrationTest/java/org/apache/ignite/distributed/ItTxStateLocalMapTest.java | Adjust unlock-only state assertions (commit + read gating). |
| modules/table/src/integrationTest/java/org/apache/ignite/distributed/ItTxAbstractDistributedTestSingleNode.java | Skip implicit-timeout scenario unless applicable to wait-die behavior. |
| modules/sql-engine/src/integrationTest/java/org/apache/ignite/internal/sql/engine/systemviews/ItLocksSystemViewTest.java | Make lock-view conflict test resilient to policy direction. |
| modules/sql-engine/src/integrationTest/java/org/apache/ignite/internal/sql/engine/ItDmlTest.java | Adjust transactional scan test to ensure a true waiter/owner ordering. |
| modules/runner/src/integrationTest/java/org/apache/ignite/internal/table/ItTableScanTest.java | Make scan/insert concurrency tests policy-aware. |
| modules/runner/src/integrationTest/java/org/apache/ignite/internal/table/ItDataConsistencyTest.java | Update consistency stress test for new retry/kill behavior and stronger progress checks. |
| modules/runner/src/integrationTest/java/org/apache/ignite/internal/benchmark/TpccBenchmarkNodeRunner.java | Add a standalone runner for TPC-C style benchmarking setup. |
| modules/replicator/src/main/java/org/apache/ignite/internal/replicator/exception/ReplicaUnavailableException.java | Mark replica-unavailable as retriable for transaction retries. |
| modules/replicator/src/main/java/org/apache/ignite/internal/replicator/exception/ReplicationException.java | Remove RetriableTransactionException marker from base replication exception. |
| modules/platforms/cpp/tests/odbc-test/transaction_test.cpp | Disable a failing ODBC transaction test (linked ticket). |
| modules/core/src/testFixtures/java/org/apache/ignite/internal/testframework/IgniteTestUtils.java | Add ensureFutureNotCompleted helper to avoid sleep-based test timing. |
| modules/core/src/main/java/org/apache/ignite/internal/lang/NodeStoppingException.java | Remove retriable markers from node-stopping exception. |
| modules/client/src/integrationTest/java/org/apache/ignite/internal/streamer/ItClientDataStreamerLoadTest.java | Reduce N+1 gets; add policy-aware tolerances and extra logging. |
| modules/client/src/integrationTest/java/org/apache/ignite/internal/client/ItThinClientTransactionsTest.java | Make thin-client conflict tests policy-aware; adjust expected exception types. |
| modules/api/src/test/java/org/apache/ignite/tx/RunInTransactionRetryTest.java | Update expected retry behavior to include commit retryability. |
| modules/api/src/main/java/org/apache/ignite/tx/RunInTransactionInternalImpl.java | Make commit part of retriable closure path; simplify async flow; adjust retriable detection. |
This PR introduces a fair wound-wait deadlock prevention algorithm.
List of major changes:
- ItDataConsistencyTest and ItClientDataStreamerLoadTest tests with WW enabled.
- RunInTransactionInternalImpl to ensure a killed transaction is properly retried.
Multiple follow-up tickets were created as a consequence of the introduced changes:
https://issues.apache.org/jira/browse/IGNITE-28447
https://issues.apache.org/jira/browse/IGNITE-28450
https://issues.apache.org/jira/browse/IGNITE-28365
https://issues.apache.org/jira/browse/IGNITE-28448
https://issues.apache.org/jira/browse/IGNITE-28458
https://issues.apache.org/jira/browse/IGNITE-28461
https://issues.apache.org/jira/browse/IGNITE-28464
https://issues.apache.org/jira/browse/IGNITE-28509
https://issues.apache.org/jira/browse/IGNITE-28506
https://issues.apache.org/jira/browse/IGNITE-28507
A quick benchmark using the ItDataConsistencyTest scenario shows a roughly 5x throughput increase:
wound-wait:
[ItDataConsistencyTest] After test ops=53340 restarts=5162 fails=0 readOps=0 readFails=0
wait-die:
[ItDataConsistencyTest] After test ops=10731 restarts=1638 fails=0 readOps=0 readFails=0
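The ratio behind the "5x" figure can be checked directly from the two log lines. The snippet below is just that arithmetic; the class and method names are illustrative, and the numbers are copied from the output above.

```java
// Illustrative arithmetic only; the counters are copied from the two log lines above.
final class BenchmarkRatioSketch {
    static double ratio(long numerator, long denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        long woundWaitOps = 53340;
        long woundWaitRestarts = 5162;
        long waitDieOps = 10731;
        long waitDieRestarts = 1638;

        // Throughput ratio between the two policies over the same test run.
        System.out.printf("throughput ratio: %.2f%n", ratio(woundWaitOps, waitDieOps));

        // Restarts per completed operation under each policy.
        System.out.printf("restarts/op wound-wait: %.3f%n", ratio(woundWaitRestarts, woundWaitOps));
        System.out.printf("restarts/op wait-die:   %.3f%n", ratio(waitDieRestarts, waitDieOps));
    }
}
```

The throughput ratio comes out at about 4.97. Wound-wait also restarts less often per completed operation in this run (about 0.097 versus 0.153), even though its absolute restart count is higher.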