[server] Global RT DIV: Max Age + Size-Based Sync #2302
base: main
Conversation
Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.
…ded basic unit test for it. 😵
… server, during ingestion phase (before EOP) and after ingestion phase (after EOP, and consuming RT), and finally verifies that all ingested data can be successfully queried (no data loss). ❤️🩹
Force-pushed from f551961 to 1533fa0
```java
// must be greater than the interval in shouldSendGlobalRtDiv() to not interfere
final long syncBytesInterval = getSyncBytesInterval(pcs); // size-based sync condition
return syncBytesInterval > 0 && (pcs.getProcessedRecordSizeSinceLastSync() >= 2 * syncBytesInterval);
```
I think I have a question about this implementation if we use the running sum of the record sizes from pcs here. There seems to be a problem:
- `shouldSyncOffsetFromSnapshot` is called in the consumer thread.
- `pcs.processedRecordSizeSinceLastSync` is cleared later in the drainer thread when syncing the offset.
- We have a memory buffer between the consumer and the drainer.

So, I imagine what could happen is that once the size-based condition is triggered for one record in the consumer thread, it will keep firing continuously for every following record for quite some time, until the first record gets synced in the drainer thread. Because of that, we could unnecessarily trigger a lot more sync offset operations.
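A minimal sketch of one way to coordinate the check and the reset, assuming a hypothetical AtomicLong counter in place of the plain pcs field (class and method names here are illustrative, not Venice's actual API):

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative sketch only: a check-and-reset that stays consistent even
 * though the check runs in the consumer thread while the actual offset sync
 * happens later in the drainer thread.
 */
class SizeBasedSyncGate {
  private final AtomicLong processedBytesSinceLastSync = new AtomicLong();

  /** Called from the consumer thread for every processed record. */
  boolean shouldSync(long recordSize, long syncBytesInterval) {
    long total = processedBytesSinceLastSync.addAndGet(recordSize);
    if (syncBytesInterval > 0 && total >= 2 * syncBytesInterval) {
      // Claim the accumulated bytes atomically: only the thread that wins the
      // CAS triggers a sync, so the condition cannot keep firing for every
      // following record while the drainer catches up.
      return processedBytesSinceLastSync.compareAndSet(total, 0);
    }
    return false;
  }
}
```

With this shape, the counter is reset at the moment the condition fires rather than when the drainer completes the sync, which closes the window described above.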
```java
        DISABLED,
        producerStateMaxAgeMs);
    // Could be accessed from multiple threads since there are multiple worker threads.
    this.consumerDiv = new DataIntegrityValidator(kafkaVersionTopic, pubSubContext.getPubSubPositionDeserializer());
```
Do we need to pass in the `producerStateMaxAgeMs` value to the consumer DIV here?
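If the intent is for the consumer-side validator to expire producer state the same way, the construction might look like the sketch below. The three-argument overload is purely hypothetical, for illustration; the actual DataIntegrityValidator constructors should be checked:

```java
// Hypothetical overload, for illustration only: threads the same max-age
// bound through to the consumer-side DIV so that both validators expire
// stale producer state consistently.
this.consumerDiv = new DataIntegrityValidator(
    kafkaVersionTopic,
    pubSubContext.getPubSubPositionDeserializer(),
    producerStateMaxAgeMs); // same bound as the drainer-side validator
```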
```java
if (entry.getValue().getLastRecordTimestamp() >= earliestAllowableTimestamp) {
  destProducerTracker.setSegment(PartitionTracker.VERSION_TOPIC, entry.getKey(), new Segment(entry.getValue()));
} else {
  vtSegments.remove(entry.getKey()); // The state is eligible to be cleared.
```
Removing keys from a Map while iterating over it with a standard for loop is a problem and will generally result in a ConcurrentModificationException. Can you check PartitionTracker.clearExpiredStateAndUpdateOffsetRecord as an example implementation, or even consider calling it from here to avoid duplicate code, if possible?
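A minimal sketch of the iterator-based pattern, mirroring the diff above (the surrounding declarations of vtSegments, destProducerTracker, and earliestAllowableTimestamp are assumed from context):

```java
// Sketch only: removal goes through Iterator.remove(), which keeps the
// underlying map consistent; calling vtSegments.remove() inside a for-each
// loop is what triggers ConcurrentModificationException.
var it = vtSegments.entrySet().iterator();
while (it.hasNext()) {
  var entry = it.next();
  if (entry.getValue().getLastRecordTimestamp() >= earliestAllowableTimestamp) {
    destProducerTracker.setSegment(PartitionTracker.VERSION_TOPIC, entry.getKey(), new Segment(entry.getValue()));
  } else {
    it.remove(); // safe: the expired state is cleared via the iterator
  }
}
```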
```java
TopicType realTimeTopicType = TopicType.of(TopicType.REALTIME_TOPIC_TYPE, broker2Segment.getKey());
destProducerTracker.setSegment(realTimeTopicType, rtEntry.getKey(), new Segment(rtEntry.getValue()));
} else {
  rtEntries.remove(rtEntry.getKey()); // The state is eligible to be cleared.
```
Ditto.
Claude Code PR Review

Pull Request Review: #2302
Title: [server] Global RT DIV: Max Age + Size-Based Sync
Status: Open (pending author response to review feedback)

Summary
This PR adds two synchronization mechanisms to Venice's Global Real-Time Data Integrity Validator (DIV):
- Max Age: producer state that has not been updated within the configured maximum age becomes eligible to be cleared.
- Size-Based Sync: offset syncs are additionally triggered once the processed record size since the last sync crosses a threshold.
The goal is to prevent unbounded state growth and enable proper cleanup during server restarts.

Files Changed
Critical Issues Found by Reviewers 🚨
1. Race condition in the size-based sync check
Severity: High
Problem: shouldSyncOffsetFromSnapshot() is called in the consumer thread, while the size counter is reset in the drainer thread.
Impact: Time window where the size condition continuously triggers before the counter resets, causing excessive/redundant sync operations.
Fix Needed: Atomic counter operations or synchronized access to ensure the check and reset are coordinated between threads.
2. ConcurrentModificationException during map iteration
Severity: High
The code removes entries from a map while iterating over it.
Problem: This violates Java's iterator contract and will throw ConcurrentModificationException.
Fix Needed: Use Iterator.remove(), or collect the keys to remove first and delete them after iteration. The reviewer suggests looking at PartitionTracker.clearExpiredStateAndUpdateOffsetRecord as a reference.
3. Unclear configuration flow for producerStateMaxAgeMs
Severity: Medium
A question was raised about whether producerStateMaxAgeMs should be passed to the consumer DIV validator. This suggests incomplete dependency injection.
Fix Needed: Clarify the configuration flow and ensure all validators receive the necessary parameters.

What I Like ✓
Design Concerns
Testing Gaps
While unit/integration tests exist, they don't appear to catch:
Recommendation: Add multi-threaded stress tests that exercise concurrent reads/writes to shared state (a rough sketch follows this review).

Code Quality: 5/10
The concept is solid and addresses real operational concerns (unbounded state growth). However, the implementation has blocking concurrency bugs that need resolution before merge.

Verdict: ❌ Changes Requested

Must Fix Before Merge:
Recommendation: The author should respond to reviewer lluwm's feedback and provide an updated implementation addressing the threading issues.
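As a rough illustration of the recommended stress test, something along the following lines could surface the check-and-reset race. It reuses the hypothetical SizeBasedSyncGate sketched earlier in this thread; all names are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Many "consumer" threads hammer the size-based check concurrently. With a
// correctly coordinated counter, the number of triggered syncs stays near
// totalBytes / (2 * threshold); a plain, late-reset field would instead fire
// once per record for long stretches.
public class SizeBasedSyncStressTest {
  public static void main(String[] args) throws InterruptedException {
    final SizeBasedSyncGate gate = new SizeBasedSyncGate();
    final AtomicLong syncCount = new AtomicLong();
    final long threshold = 1_000;
    final int threads = 8, recordsPerThread = 100_000, recordSize = 10;

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(() -> {
        for (int i = 0; i < recordsPerThread; i++) {
          if (gate.shouldSync(recordSize, threshold)) {
            syncCount.incrementAndGet();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);

    long totalBytes = (long) threads * recordsPerThread * recordSize;
    System.out.printf("syncs=%d, upper bound ~%d%n", syncCount.get(), totalBytes / (2 * threshold));
  }
}
```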
Problem Statement
Solution
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
- Locking mechanisms (e.g., synchronized, RWLock) are used where needed.
- Appropriate concurrent data structures are used where applicable (e.g., ConcurrentHashMap, CopyOnWriteArrayList).

How was this PR tested?
Does this PR introduce any user-facing or breaking changes?