
Core: Add streaming CloseableIterable accessors to SnapshotChanges #16390

Open
wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:iceberg-15659-snapshot-changes-streaming

@wombatu-kun
Contributor

Closes #15659

What

SnapshotChanges previously exposed only cached accessors that eagerly materialize all file changes into in-memory lists and return Iterable. Some callers (for example replaced-partition validation, see #13556) need to stream changes without loading every file into memory. This PR adds streaming CloseableIterable accessors and re-implements the cached accessors as thin wrappers over them, exactly as suggested in #15659.

Changes

  • Add addedDataFilesIterable(), removedDataFilesIterable(), addedDeleteFilesIterable() and removedDeleteFilesIterable(), which return lazily-evaluated CloseableIterables that the caller must close and that are not cached.
  • Re-implement the cached accessors (addedDataFiles(), removedDataFiles(), addedDeleteFiles(), removedDeleteFiles()) as materialize() wrappers over the streaming methods, preserving their existing caching identity contract.
  • Replace the per-type cache/read methods with a single generic manifest-reading pipeline. Manifests are still read single-threaded by default and in parallel with a bounded queue when an executor is configured via Builder.executeWith.
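The relationship between the streaming and cached accessors described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `CloseableSeq` and `ChangesSketch` are hypothetical stand-ins for Iceberg's `CloseableIterable` and `SnapshotChanges`, file paths stand in for `DataFile` objects, and `openStreams` exists only so the example can show that the stream was closed.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Iceberg's CloseableIterable; not the real interface.
interface CloseableSeq<T> extends Iterable<T>, Closeable {
  @Override
  void close(); // narrowed so the sketch needs no checked-exception handling
}

// Hypothetical stand-in for SnapshotChanges, using file paths instead of DataFile.
class ChangesSketch {
  private final List<String> manifestEntries;
  private List<String> cachedAdded; // filled on first cached access
  int openStreams = 0; // lets callers observe streams that were never closed

  ChangesSketch(List<String> manifestEntries) {
    this.manifestEntries = manifestEntries;
  }

  // Streaming accessor: each call builds a new, uncached view that must be closed.
  CloseableSeq<String> addedDataFilesIterable() {
    openStreams++;
    return new CloseableSeq<String>() {
      private boolean closed = false;

      @Override
      public Iterator<String> iterator() {
        return manifestEntries.iterator();
      }

      @Override
      public void close() {
        if (!closed) {
          closed = true;
          openStreams--;
        }
      }
    };
  }

  // Cached accessor: materializes the stream once, then returns the same list.
  List<String> addedDataFiles() {
    if (cachedAdded == null) {
      List<String> files = new ArrayList<>();
      try (CloseableSeq<String> stream = addedDataFilesIterable()) {
        for (String file : stream) {
          files.add(file);
        }
      }
      cachedAdded = files;
    }
    return cachedAdded;
  }
}
```

In this sketch a caller would consume `addedDataFilesIterable()` inside a try-with-resources block, while repeated calls to `addedDataFiles()` return the same list instance, which is one way to read the "caching identity contract" mentioned above.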

Trade-off

Reading both added and removed changes through the cached accessors now performs two manifest passes instead of one. This keeps the change minimal and strictly additive; a single-pass optimization can be added in a follow-up if it proves necessary.
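To make the trade-off concrete, here is a small hypothetical model of the pass count. It does not use Iceberg classes: `TwoPassSketch`, the `{status, path}` entry pairs and the `manifestPasses` counter are all assumptions made for illustration, with the status strings borrowed from the ADDED, EXISTING and DELETED manifest entry states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: each manifest entry is a {status, path} pair.
class TwoPassSketch {
  private final List<String[]> entries;
  private List<String> added;
  private List<String> removed;
  int manifestPasses = 0; // counts full scans over the entries

  TwoPassSketch(List<String[]> entries) {
    this.entries = entries;
  }

  // One full scan per requested status, standing in for the shared reading pipeline.
  private List<String> scan(String status) {
    manifestPasses++;
    List<String> out = new ArrayList<>();
    for (String[] entry : entries) {
      if (entry[0].equals(status)) {
        out.add(entry[1]);
      }
    }
    return out;
  }

  // Cached accessor: scans on first use only.
  List<String> addedDataFiles() {
    if (added == null) {
      added = scan("ADDED");
    }
    return added;
  }

  // Cached accessor: a second, independent scan on first use.
  List<String> removedDataFiles() {
    if (removed == null) {
      removed = scan("DELETED");
    }
    return removed;
  }
}
```

Calling both accessors once costs two scans in this model, and later calls are served from the cached lists; a single-pass variant would fill both lists during the first scan.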

Testing

Adds 10 focused tests in TestSnapshotChanges (the 3 existing tests are unchanged), each guarding a distinct code path: streaming added/removed data and delete files, equivalence with the cached results, non-caching semantics, statistics retention versus stripping (copy() vs copyWithoutStats()), EXISTING-entry exclusion, snapshot-id manifest filtering, and the parallel execution path. :iceberg-core:test, spotlessCheck and revapi all pass; the change is purely additive so no revapi exception is required.

🤖 Generated with Claude Code

SnapshotChanges previously exposed only cached accessors that eagerly materialize all file changes into in-memory lists and return Iterable. Some callers (for example replaced-partition validation, see apache#13556) need to stream changes without loading every file into memory. This adds streaming CloseableIterable accessors and re-implements the cached accessors as thin wrappers over them, as suggested in apache#15659.

Changes:

- Add addedDataFilesIterable(), removedDataFilesIterable(), addedDeleteFilesIterable() and removedDeleteFilesIterable(), which return lazily-evaluated CloseableIterables that the caller must close and that are not cached.
- Re-implement the cached accessors (addedDataFiles(), removedDataFiles(), addedDeleteFiles(), removedDeleteFiles()) as materialize() wrappers over the streaming methods, preserving their existing caching identity contract.
- Replace the per-type cache/read methods with a single generic manifest-reading pipeline. Manifests are still read single-threaded by default and in parallel with a bounded queue when an executor is configured via Builder.executeWith.

Reading both added and removed changes through the cached accessors now performs two manifest passes instead of one. This keeps the change minimal and additive; a single-pass optimization can be added later if it proves necessary.

Adds comprehensive tests in TestSnapshotChanges covering streaming added and removed data and delete files, equivalence with the cached results, non-caching semantics, statistics retention versus stripping, EXISTING-entry exclusion, snapshot-id manifest filtering, and the parallel execution path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the core label May 18, 2026

Development

Successfully merging this pull request may close these issues.

Snapshot Changes should have a Streaming Interface
