PHOENIX-7878 CDC perf improvement - skip redundant cell versions on data table scans by virajjasani · Pull Request #2493 · apache/phoenix

virajjasani · 2026-06-03T01:02:26Z

What changes were proposed in this pull request?

CDC perf improvement - skip redundant cell versions on data table scans

Why are the changes needed?

When a CDC query runs with pre, post, and/or change scopes, it scans the data table to reconstruct each change event (the change image plus the pre-image, and for the consumer path the full data-row state). Today that data table scan is a raw, all-versions scan, so for every data row we read back every version of every column - even though, for a given batch of changes, we only need two cells per column per change: the cell at the change timestamp, and the most recent cell just below it (the pre-image). On rows that are updated frequently this means we read, transfer, and process far more cells than the event reconstruction actually uses, which adds CPU, memory, and network overhead to CDC reads.

The purpose of this Jira is to add a new CDCVersionFilter, in addition to the SkipScanFilter, on the data table scans. For each row, the filter is given the set of change timestamps from the current batch and keeps only the cells that matter: the cell at each change timestamp, the first cell below each change timestamp (the pre-image), and all DeleteFamily markers (needed for deletion tracking). All other cells are skipped to avoid redundant data transfer.
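The selection rule described above can be sketched in plain Java. This is only an illustrative model, not the CDCVersionFilter code from this PR and not the HBase Filter API: the class name `CdcVersionSelectionSketch` and the method `selectVersions` are hypothetical, a column is modeled as a list of distinct cell timestamps ordered newest first, and DeleteFamily markers (which are always kept) are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of the version selection rule; not the actual Phoenix filter. */
final class CdcVersionSelectionSketch {

  /**
   * Returns the timestamps of the versions of one column that would be kept.
   *
   * @param versionsDesc distinct cell timestamps of a single column, newest first
   *                     (assumed ordering of versions within a column)
   * @param changeTimestamps change timestamps of the current batch for this row
   */
  static List<Long> selectVersions(List<Long> versionsDesc, TreeSet<Long> changeTimestamps) {
    List<Long> kept = new ArrayList<>();
    Long previous = null; // the next-newer version of this column, if any
    for (long ts : versionsDesc) {
      // Rule 1: the cell sits exactly at a change timestamp.
      boolean atChange = changeTimestamps.contains(ts);
      // Smallest change timestamp strictly above this cell, or null if none.
      Long changeAbove = changeTimestamps.higher(ts);
      // Rule 2: the cell is the first one below that change timestamp, i.e. no
      // newer version of the column also lies strictly below it.
      boolean preImage = changeAbove != null && (previous == null || previous >= changeAbove);
      if (atChange || preImage) {
        kept.add(ts);
      }
      previous = ts;
    }
    return kept;
  }
}
```

For example, with versions at timestamps 60, 50, 40, 30, 20, 10 and a single change timestamp of 50, the sketch keeps only the cells at 50 (the change) and 40 (the pre-image). The other four versions are skipped.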

Does this PR introduce any user-facing change?

This is a performance improvement.

How was this patch tested?

Unit tests (UT) and integration tests (IT).

Was this patch authored or co-authored using generative AI tooling?

Claude Opus 4.8


virajjasani · 2026-06-03T01:03:09Z

Build: https://ci-hadoop.apache.org/job/Phoenix/job/Phoenix-PreCommit-GitHub-PR/job/PR-2493/

palashc

LGTM +1, couple of nits

palashc · 2026-06-03T23:03:26Z

+    String cdcFullName = SchemaUtil.getTableName(schemaName, cdcName);
+    try (Connection conn = newConnection(tenantId)) {
+      // For debug: uncomment to see the exact results logged to console.
+      dumpCDCResults(conn, cdcName, new TreeMap<String, String>() {


Was this meant to be commented out?

Sure, removing

Do we need to dumpCDCResults in CI test runs? I thought you would comment that out.

It's just being used in all tests. It's fine either way: there are not many rows to print, but I can also remove it.

palashc · 2026-06-03T23:22:28Z

    }
-    return CDCUtil.setupScanForCDC(dataScan);
+    CDCUtil.setupScanForCDC(dataScan);
+    Map<ImmutableBytesPtr, long[]> timestampMap = buildDataRowTimestampMap(dataRowKeys);


Can we avoid building this timestamp map in every task and precompute it beforehand? Maybe it is okay, though, since the number of tasks will usually be small, based on the number of regions involved and the number of row keys.

It is possible, but it is a far bigger refactor and worth doing as a separate PR because this PR is already too big. However, dataRowKeys are already available, so generating the map would not be that time consuming.
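As a rough illustration of why building the map per task is expected to be cheap, the grouping can be done in a single pass over the batch. The sketch below is hypothetical and is not the `buildDataRowTimestampMap` from this PR: the input shape (a list of row key and change timestamp pairs) is an assumption, `String` keys stand in for `ImmutableBytesPtr`, and the class name `RowTimestampMapSketch` and method `group` are invented for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: group change timestamps by row key in one pass. */
final class RowTimestampMapSketch {

  /**
   * @param changes assumed input: one (rowKey, changeTimestamp) entry per change in the batch
   * @return for each row key, its distinct change timestamps in ascending order
   */
  static Map<String, long[]> group(List<Map.Entry<String, Long>> changes) {
    // TreeSet de-duplicates and sorts the timestamps for each row key.
    Map<String, TreeSet<Long>> grouped = new HashMap<>();
    for (Map.Entry<String, Long> change : changes) {
      grouped.computeIfAbsent(change.getKey(), k -> new TreeSet<>()).add(change.getValue());
    }
    // Convert each sorted set into a primitive array.
    Map<String, long[]> result = new HashMap<>();
    for (Map.Entry<String, TreeSet<Long>> e : grouped.entrySet()) {
      result.put(e.getKey(), e.getValue().stream().mapToLong(Long::longValue).toArray());
    }
    return result;
  }
}
```

The cost is proportional to the number of changes in the batch (plus sorting within each row), which is small relative to the scan itself under these assumptions.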

PHOENIX-7878 CDC perf improvement - skip redundant cell versions on d…

690fdd5

…ata table scans

palashc approved these changes Jun 3, 2026


removing debug comments

472dc5d

tkhurana approved these changes Jun 4, 2026


few nits

11b71a0

virajjasani wants to merge 3 commits into apache:master from virajjasani:PHOENIX-7878-master


3 participants
