NIFI-15883: Fix Wait/Notify race condition with CAS-aware replace and complete#11275
NIFI-15883: Fix Wait/Notify race condition with CAS-aware replace and complete#11275rakesh-rsky wants to merge 1 commit into
Conversation
| } | ||
| } else if (waitProgressed) { | ||
| protocol.replace(signal); | ||
| if (!protocol.replace(signal)) { |
There was a problem hiding this comment.
All though the fix works but the implementation of complete() and replace() are inconsistent. complete() throws the exception, but for replace the caller have to check.
It would be better to that replace() also throws exception when concurrent updates are detected.
sheelchand-bah
left a comment
There was a problem hiding this comment.
Suggestion: make replace() and complete() function consistent - either both return true/false or both throw the exception.
519d519 to
ddbe7f0
Compare
|
@sheelchand-bah Thank you for the review. Good point on API consistency. Both replace(Signal) and complete(Signal) now throw ConcurrentModificationException on concurrent modification — callers don't need to check return values. I also updated the retry loop in notify() to catch ConcurrentModificationException from replace() and continue retrying (up to MAX_REPLACE_RETRY_COUNT = 5), since that was the previous behavior when it returned false. |
turcsanyip
left a comment
There was a problem hiding this comment.
Thanks for working on the issue @rakesh-rsky!
The general concept looks good. Please find my comments below.
| final Object actualRevision = current.cachedEntry.getRevision().orElse(null); | ||
| if (expectedRevision != null && !expectedRevision.equals(actualRevision)) { | ||
| throw new ConcurrentModificationException(String.format( | ||
| "Failed to complete signal [%s]: signal was concurrently modified (expected revision %s, found %s). Will retry.", |
There was a problem hiding this comment.
It depends on the caller whether it retries the transaction or gives up, so I suggest removing the "Will retry." suffix.
| return cache.replace(signal.cachedEntry, stringSerializer, stringSerializer); | ||
| if (!cache.replace(signal.cachedEntry, stringSerializer, stringSerializer)) { | ||
| throw new ConcurrentModificationException(String.format( | ||
| "Failed to update signal [%s] in cache due to concurrent modification. Will retry.", signal.identifier)); |
| final Signal current = getSignal(signalId); | ||
| if (current == null) { | ||
| return; |
There was a problem hiding this comment.
I think this case should also be treated as a concurrent modification and should throw an exception.
Another process may have consumed and removed the signal after we initially read it, which means the value we based our decision on was no longer valid.
Possible scenario:
- signal counter = 1 initially
- 2 concurrent threads read the signal
- both decrease the counter to 0 and resume the waiting FlowFile
- both removes the signal (as the counter reached 0)
=> only 1 FF should have been resumed based on the counter
| } catch (final IOException | ConcurrentModificationException e) { | ||
| session.rollback(); | ||
| throw new ProcessException(String.format("Unable to communicate with cache while updating %s due to %s", signalId, e), e); | ||
| } |
There was a problem hiding this comment.
In the event of an optimistic lock failure, the Wait processor should retry the transaction silently rather than failing the FlowFile immediately. (similar to the for cycle and MAX_REPLACE_RETRY_COUNT in WaitNotifyProtocol.notify()).
I see the Wait flow is more complicated and a simple for cycle around it may not be possible.
Minimally, I would separate the exception handling:
- the
Unable to communicate with cacheerror message is not appropriate for the concurrent modification case - warning log may be enough instead of error
The preferred way would definitely be the internal retry with max retry count.
… complete - WaitNotifyProtocol.replace(Signal): Changed from returning boolean to void. Throws ConcurrentModificationException when the underlying cache.replace() returns false (CAS failure due to concurrent Notify), making the API consistent with complete(Signal) which also throws on concurrent modification. - WaitNotifyProtocol.notify(): Updated retry loop to catch ConcurrentModificationException from replace() instead of checking the boolean return value. - WaitNotifyProtocol.complete(Signal): Replaces raw cache.remove() with a version-aware approach that re-fetches the cache entry and compares revisions before removing. Throws ConcurrentModificationException if a concurrent Notify updated the signal between Wait's initial read and the remove call, preventing silent data loss. - Wait.java: Simplified call sites - protocol.replace(signal) and protocol.complete(signal) both now throw ConcurrentModificationException on concurrent modification. The catch block handles both IOException and ConcurrentModificationException with session rollback and ProcessException. - TestWaitNotifyProtocol: Add three tests for complete(Signal): happy-path removal, concurrent-modification detection, and no-op when the signal is already absent. - TestWait: Add two integration tests that verify session rollback and ProcessException wrapping when replace() throws on CAS failure (waitProgressed path) and when complete() detects a stale revision (waitCompleted path).
ddbe7f0 to
16d1446
Compare
NIFI-15883: Fix Wait/Notify race condition with CAS-aware replace and complete
WaitNotifyProtocol.replace(Signal): Changed from returning boolean to void.
Throws ConcurrentModificationException when the underlying cache.replace()
returns false (CAS failure due to concurrent Notify), making the API
consistent with complete(Signal) which also throws on concurrent modification.
WaitNotifyProtocol.notify(): Updated retry loop to catch
ConcurrentModificationException from replace() instead of checking
the boolean return value.
WaitNotifyProtocol.complete(Signal): Replaces raw cache.remove() with a
version-aware approach that re-fetches the cache entry and compares
revisions before removing. Throws ConcurrentModificationException if
a concurrent Notify updated the signal between Wait's initial read
and the remove call, preventing silent data loss.
Wait.java: Simplified call sites - protocol.replace(signal) and
protocol.complete(signal) both now throw ConcurrentModificationException
on concurrent modification. The catch block handles both IOException and
ConcurrentModificationException with session rollback and ProcessException.
TestWaitNotifyProtocol: Add three tests for complete(Signal):
happy-path removal, concurrent-modification detection, and no-op
when the signal is already absent.
TestWait: Add two integration tests that verify session rollback and
ProcessException wrapping when replace() throws on CAS failure
(waitProgressed path) and when complete() detects a stale revision
(waitCompleted path).
Summary
NIFI-15883
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
NIFI-00000NIFI-00000VerifiedstatusPull Request Formatting
mainbranchVerification
Please indicate the verification steps performed prior to pull request creation.
Build
./mvnw clean install -P contrib-checkLicensing
LICENSEandNOTICEfilesDocumentation