
Conversation

@KaiSernLim
Contributor

Problem Statement

Solution

Code changes

  • Added new code behind a config. If so, list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed whether logs need to be rate-limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@KaiSernLim KaiSernLim self-assigned this Nov 18, 2025
@KaiSernLim KaiSernLim requested a review from lluwm November 18, 2025 23:05
@github-actions

github-actions bot commented Jan 9, 2026

Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.

@github-actions github-actions bot added the stale label Jan 9, 2026
@KaiSernLim KaiSernLim force-pushed the global-rt-div-max-age branch from f551961 to 1533fa0 on January 9, 2026 19:40
@github-actions github-actions bot removed the stale label Jan 10, 2026
@KaiSernLim KaiSernLim marked this pull request as ready for review January 11, 2026 09:54

// must be greater than the interval in shouldSendGlobalRtDiv() to not interfere
final long syncBytesInterval = getSyncBytesInterval(pcs); // size-based sync condition
return syncBytesInterval > 0 && (pcs.getProcessedRecordSizeSinceLastSync() >= 2 * syncBytesInterval);
Contributor

@lluwm lluwm Jan 12, 2026


I have a question about this implementation, since it uses the running sum of the record sizes from pcs here. There seems to be a problem:

  1. shouldSyncOffsetFromSnapshot is called in the consumer thread.
  2. pcs.processedRecordSizeSinceLastSync is cleared later in the drainer thread when syncing the offset.
  3. There is a memory buffer between the consumer and the drainer.

So, I imagine what could happen is that once the size-based condition is triggered for one record in the consumer thread, it will keep firing for every following record for quite some time, until that first record gets synced in the drainer thread. Because of that, we could unnecessarily trigger many more sync offset operations.
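One way to address this, sketched below with hypothetical names (the real counter lives elsewhere in the ingestion path), is to pair the size counter with an atomic in-flight flag so the condition fires at most once per sync cycle, no matter how many records the consumer thread processes before the drainer catches up:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: the consumer thread accumulates bytes and checks the threshold;
// an AtomicBoolean "syncInFlight" guard ensures the size condition fires
// only once until the drainer thread completes the sync and resets state.
// All names here are illustrative, not the PR's actual fields.
public class SizeBasedSyncGuard {
    private final AtomicLong processedBytesSinceLastSync = new AtomicLong();
    private final AtomicBoolean syncInFlight = new AtomicBoolean(false);
    private final long syncBytesThreshold;

    public SizeBasedSyncGuard(long syncBytesThreshold) {
        this.syncBytesThreshold = syncBytesThreshold;
    }

    /** Called by the consumer thread per record; returns true at most once per cycle. */
    public boolean recordProcessedAndShouldSync(long recordSizeInBytes) {
        long total = processedBytesSinceLastSync.addAndGet(recordSizeInBytes);
        // compareAndSet lets exactly one check win until the drainer resets the flag.
        return total >= syncBytesThreshold && syncInFlight.compareAndSet(false, true);
    }

    /** Called by the drainer thread once the offset sync has completed. */
    public void onSyncCompleted() {
        processedBytesSinceLastSync.set(0);
        syncInFlight.set(false);
    }
}
```

The guard trades one extra atomic flag for eliminating the window in which every subsequent record re-triggers the sync condition.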

DISABLED,
producerStateMaxAgeMs);
// Could be accessed from multiple threads since there are multiple worker threads.
this.consumerDiv = new DataIntegrityValidator(kafkaVersionTopic, pubSubContext.getPubSubPositionDeserializer());
Contributor


Do we need to pass in the producerStateMaxAgeMs value to the consumer div here?

if (entry.getValue().getLastRecordTimestamp() >= earliestAllowableTimestamp) {
destProducerTracker.setSegment(PartitionTracker.VERSION_TOPIC, entry.getKey(), new Segment(entry.getValue()));
} else {
vtSegments.remove(entry.getKey()); // The state is eligible to be cleared.
Contributor

@lluwm lluwm Jan 12, 2026


Removing keys from a Map while iterating over it with a standard for loop is a problem and will generally result in a ConcurrentModificationException. Can you check PartitionTracker.clearExpiredStateAndUpdateOffsetRecord as an example implementation, or even consider calling it from here to avoid duplicate code, if possible?
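For illustration, here are two removal patterns that avoid the ConcurrentModificationException, using a plain Map<Integer, Long> of segment id to last-record timestamp as a stand-in for the actual segment state:

```java
import java.util.Iterator;
import java.util.Map;

// Sketch: two safe ways to prune expired entries from a map while traversing it.
public class ExpiredStatePruning {
    /** Option 1: remove through the iterator, the only safe in-loop removal. */
    public static void pruneWithIterator(Map<Integer, Long> segments, long earliestAllowableTimestamp) {
        Iterator<Map.Entry<Integer, Long>> it = segments.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> entry = it.next();
            if (entry.getValue() < earliestAllowableTimestamp) {
                it.remove(); // safe: goes through the iterator, not the map
            }
        }
    }

    /** Option 2: entrySet().removeIf does the same in one call (Java 8+). */
    public static void pruneWithRemoveIf(Map<Integer, Long> segments, long earliestAllowableTimestamp) {
        segments.entrySet().removeIf(entry -> entry.getValue() < earliestAllowableTimestamp);
    }
}
```

Either pattern (or delegating to the existing PartitionTracker helper, as suggested) removes entries without violating the iterator contract.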

TopicType realTimeTopicType = TopicType.of(TopicType.REALTIME_TOPIC_TYPE, broker2Segment.getKey());
destProducerTracker.setSegment(realTimeTopicType, rtEntry.getKey(), new Segment(rtEntry.getValue()));
} else {
rtEntries.remove(rtEntry.getKey()); // The state is eligible to be cleared.
Contributor


Ditto.

@manujose0
Contributor

Claude Code PR Review

Pull Request Review: #2302

Title: [server] Global RT DIV: Max Age + Size-Based Sync

Status: Open (pending author response to review feedback)


Summary

This PR adds two synchronization mechanisms to Venice's Global Real-Time Data Integrity Validator (DIV):

  1. Max age-based pruning for producer state data
  2. Size-based synchronization for offset snapshots

The goal is to prevent unbounded state growth and enable proper cleanup during server restarts.
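The two mechanisms can be sketched as standalone predicates. The method names below are illustrative stand-ins, not the PR's actual API; the size-based check mirrors the snippet quoted earlier in the review thread:

```java
// Sketch of the two synchronization conditions the PR introduces,
// under assumed names; the real code wires these into the ingestion task.
public class GlobalRtDivConditions {
    /** Max-age pruning: producer state older than maxAgeMs is eligible to be cleared. */
    public static boolean isExpired(long lastRecordTimestampMs, long nowMs, long maxAgeMs) {
        long earliestAllowableTimestamp = nowMs - maxAgeMs;
        return lastRecordTimestampMs < earliestAllowableTimestamp;
    }

    /** Size-based sync: fire once the bytes processed since the last sync
     *  reach twice the configured sync interval (and the interval is enabled). */
    public static boolean shouldSyncBySize(long processedBytesSinceLastSync, long syncBytesInterval) {
        return syncBytesInterval > 0 && processedBytesSinceLastSync >= 2 * syncBytesInterval;
    }
}
```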


Files Changed

  • LeaderFollowerStoreIngestionTask.java - Size-based sync logic
  • StoreIngestionTask.java - Producer state max age initialization
  • PartitionTracker.java - Segment cloning with max age enforcement
  • Multiple test files (unit + integration)

Critical Issues Found by Reviewers 🚨

  1. Race Condition - Size-Based Sync

Severity: High

Problem: shouldSyncOffsetFromSnapshot() is called in the consumer thread, but pcs.processedRecordSizeSinceLastSync is cleared in the drainer thread.

Impact: There is a time window in which the size condition keeps triggering before the counter resets, causing excessive, redundant sync operations.

Fix Needed: Atomic counter operations or synchronized access to ensure the check and reset are coordinated between threads.


  2. ConcurrentModificationException Risk

Severity: High

The code removes entries during map iteration:
vtSegments.remove(entry.getKey());
rtEntries.remove(rtEntry.getKey());

Problem: This violates Java's iterator contract and will throw ConcurrentModificationException.

Fix Needed: Use Iterator.remove() or collect keys to remove first, then delete after iteration. Reviewer suggests looking at PartitionTracker.clearExpiredStateAndUpdateOffsetRecord as reference.


  3. Missing Configuration Propagation

Severity: Medium

Question raised whether producerStateMaxAgeMs should be passed to the consumer DIV validator. This suggests incomplete dependency injection.

Fix Needed: Clarify the configuration flow and ensure all validators receive necessary parameters.


What I Like ✓

  1. Good Test Coverage: Parametrized integration tests covering restart scenarios (before/after EOP)
  2. Incremental Commits: 15 commits showing thoughtful, iterative development
  3. Concurrency Awareness: Comments acknowledge multi-threaded access patterns
  4. Configuration-Driven: Uses sensible defaults for max age parameters

Design Concerns

  1. Threading Model: The split between consumer thread (checking conditions) and drainer thread (clearing state) creates complexity and bug surface area
  2. State Management: The PR touches multiple stateful components (PartitionTracker, LeaderFollowerStoreIngestionTask) - ensure state transitions are atomic
  3. Synchronization Strategy: Size-based and age-based pruning operate independently - are there edge cases where they conflict?

Testing Gaps

While unit/integration tests exist, they don't appear to catch:

  • The race condition on size counter
  • The iterator modification issue
  • Concurrent access patterns under load

Recommendation: Add multi-threaded stress tests that exercise concurrent reads/writes to exposed state.
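A rough shape such a stress test could take, using only JDK concurrency utilities (class and method names are illustrative): many consumer threads hammer a fire-once size guard concurrently, and the test asserts the sync condition fires exactly once before any reset.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: stress a "fire at most once per sync cycle" guard from many threads.
public class SizeSyncStressTest {
    public static int runOnce(int threads, int recordsPerThread) {
        AtomicLong bytes = new AtomicLong();
        AtomicBoolean inFlight = new AtomicBoolean(false);
        AtomicInteger fires = new AtomicInteger();
        long threshold = 1_000L;
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                try {
                    start.await(); // line all threads up to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < recordsPerThread; i++) {
                    long total = bytes.addAndGet(100L);
                    // compareAndSet should let exactly one thread "win".
                    if (total >= threshold && inFlight.compareAndSet(false, true)) {
                        fires.incrementAndGet();
                    }
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fires.get();
    }
}
```

Since the in-flight flag is never reset during the run, a correct guard must report exactly one fire regardless of thread count or interleaving; a racy check-then-reset would occasionally report more.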


Code Quality: 5/10

The concept is solid and addresses real operational concerns (unbounded state growth). However, the implementation has blocking concurrency bugs that need resolution before merge.


Verdict: ❌ Changes Requested

Must Fix Before Merge:

  1. Resolve race condition on processedRecordSizeSinceLastSync
  2. Fix iterator modification exceptions
  3. Clarify configuration propagation to DIV validators
  4. Add explicit concurrency tests

Recommendation: Author should respond to reviewer lluwm's feedback and provide updated implementation addressing the threading issues.

@manujose0 manujose0 closed this Jan 15, 2026
@manujose0 manujose0 reopened this Jan 15, 2026