Skip to content

[python] Add StreamReadBuilder and AsyncStreamingTableScan#7350

Draft
tub wants to merge 15 commits intoapache:masterfrom
tub:python-streaming-2-core-v2
Draft

[python] Add StreamReadBuilder and AsyncStreamingTableScan#7350
tub wants to merge 15 commits intoapache:masterfrom
tub:python-streaming-2-core-v2

Conversation

@tub
Copy link
Contributor

@tub tub commented Mar 5, 2026

Summary

  • Add StreamReadBuilder for configuring streaming reads with consumer registration and sharding
  • Add AsyncStreamingTableScan for async iteration over incremental table changes
  • Add streaming read support to FileStoreTable, Table, FormatTable, IcebergTable
  • Add acceptance test for incremental diff reads
  • Add streaming documentation to python-api.md
  • Add oss/lance extras to setup.py

Stacked PR series

This is PR 2/5 in the Python streaming read series:

  • PR 1a: Caching infrastructure + utilities
  • PR 1b: Scanners, sharding, row kind
  • PR 1c: Consumer management
  • PR 2 (this): Core streaming (~2717 lines)
  • PR 3: CLI (paimon tail)

Incremental diff (vs 1c): python-streaming-1c-consumer...tub:paimon:python-streaming-2-core-v2

Test plan

  • flake8 passes on all changed files
  • python -m pytest passes
  • New tests: streaming_table_scan_test.py, stream_read_builder_sharding_test.py, incremental_diff_acceptance_test.py

tub added a commit to tub/paimon that referenced this pull request Mar 9, 2026
Remove LRU caches from ManifestFileManager, ManifestListManager, and
SnapshotManager — they have near-zero hit rates in practice (batch reads
create new manager instances; streaming reads see unique manifest names
per snapshot). Caching will be re-added in PR apache#7350 where streaming
actually benefits.

Remove find_next_scannable and get_snapshots_batch from SnapshotManager
as they have zero callers on this branch. They will be added where
needed in downstream PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub added a commit to tub/paimon that referenced this pull request Mar 9, 2026
Remove LRU caches from ManifestFileManager, ManifestListManager, and
SnapshotManager — they have near-zero hit rates in practice (batch reads
create new manager instances; streaming reads see unique manifest names
per snapshot). Caching will be re-added in PR apache#7350 where streaming
actually benefits.

Remove find_next_scannable and get_snapshots_batch from SnapshotManager
as they have zero callers on this branch. They will be added where
needed in downstream PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub added a commit to tub/paimon that referenced this pull request Mar 9, 2026
Remove LRU caches from ManifestFileManager, ManifestListManager, and
SnapshotManager — they have near-zero hit rates in practice (batch reads
create new manager instances; streaming reads see unique manifest names
per snapshot). Caching will be re-added in PR apache#7350 where streaming
actually benefits.

Remove find_next_scannable and get_snapshots_batch from SnapshotManager
as they have zero callers on this branch. They will be added where
needed in downstream PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub added a commit to tub/paimon that referenced this pull request Mar 9, 2026
Remove LRU caches from ManifestFileManager, ManifestListManager, and
SnapshotManager — they have near-zero hit rates in practice (batch reads
create new manager instances; streaming reads see unique manifest names
per snapshot). Caching will be re-added in PR apache#7350 where streaming
actually benefits.

Remove find_next_scannable and get_snapshots_batch from SnapshotManager
as they have zero callers on this branch. They will be added where
needed in downstream PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub added a commit to tub/paimon that referenced this pull request Mar 9, 2026
Remove LRU caches from ManifestFileManager, ManifestListManager, and
SnapshotManager — they have near-zero hit rates in practice (batch reads
create new manager instances; streaming reads see unique manifest names
per snapshot). Caching will be re-added in PR apache#7350 where streaming
actually benefits.

Remove find_next_scannable and get_snapshots_batch from SnapshotManager
as they have zero callers on this branch. They will be added where
needed in downstream PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub added a commit to tub/paimon that referenced this pull request Mar 9, 2026
Remove LRU caches from ManifestFileManager, ManifestListManager, and
SnapshotManager — they have near-zero hit rates in practice (batch reads
create new manager instances; streaming reads see unique manifest names
per snapshot). Caching will be re-added in PR apache#7350 where streaming
actually benefits.

Remove find_next_scannable and get_snapshots_batch from SnapshotManager
as they have zero callers on this branch. They will be added where
needed in downstream PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub and others added 15 commits March 10, 2026 11:04
- Add FollowUpScanner hierarchy (base, delta, changelog)
- Add IncrementalDiffScanner for diff-based streaming reads
- Add sharding support to FileScanner
- Add row kind support to TableRead for changelog streams

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…olidate tests, fix parallelism

- Collapse repetitive module/class/method docstrings to one-liners in all
  scanner files (follow_up_scanner, delta, changelog, incremental_diff)
- Remove TDD process commentary from test docstrings
- Consolidate DeltaFollowUpScanner false-case tests into one parameterized test
- Remove misleading commit_kind from ChangelogFollowUpScanner test mocks
- Extract duplicated mock helpers to module-level functions
- Fix max(8, ...) parallelism bug: respect user-configured parallelism
- Remove obvious/redundant inline comments
- Standardize license headers to comment style, merge double docstrings
- Add clarifying docstring to ManifestListManager.read_all

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the include_row_kind feature out of this PR into a separate
branch (python-streaming-1b2-row-kind) to keep the scanners PR
focused on scanners and sharding only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add include_row_kind option to TableRead that prepends a _row_kind
string column to Arrow output. For RecordBatchReader (append-only
tables) all rows default to "+I"; for RowIterator (primary-key
tables) row kind is read per-row via OffsetRow.get_row_kind().

The feature is opt-in (include_row_kind=False by default) so
existing read paths are unaffected. StreamReadBuilder in the next
PR will enable it for changelog/streaming reads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Consumer dataclass for tracking consumption progress
- Add ConsumerManager for persisting/loading/expiring consumers
- Add unit tests for consumer operations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add StreamReadBuilder for configuring streaming reads
- Add AsyncStreamingTableScan with consumer registration and sharding
- Add streaming support to FileStoreTable, Table, FormatTable, IcebergTable
- Add acceptance test for incremental diff reads
- Add streaming documentation to python-api.md
- Add oss/lance extras to setup.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tEquals

Extract shared base class for ManifestFileCacheTest and ManifestListCacheTest,
add _make_snapshot() helper, and fix deprecated assertEquals (removed in 3.12).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rim docs, remove ChangelogProducer

- Upgrade cachetools to >=7,<8 for cachedmethod(info=True) support
- Remove ChangelogProducer enum (belongs in apache#7348 scanners branch)
- Replace manual cache hit/miss counters with @cachedmethod(info=True)
  decorator on ManifestFileManager, ManifestListManager, SnapshotManager
- Trim verbose docstrings across identifier, file_io, pyarrow_file_io,
  manifest_list_manager, and snapshot_manager
- Update cache tests to use cache_info() instead of manual counters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cachetools 7.x requires Python >=3.10 but the project supports 3.6+.
Drop info=True and explicit key= from @cachedmethod (both 7.x-only
features) while keeping the decorator itself (available since 4.x).

Replace cache_info()-based test assertions with unittest.mock spies on
file_io.new_input_stream, testing the actual caching effect without any
production code counters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…olidate tests, fix parallelism

- Collapse repetitive module/class/method docstrings to one-liners in all
  scanner files (follow_up_scanner, delta, changelog, incremental_diff)
- Remove TDD process commentary from test docstrings
- Consolidate DeltaFollowUpScanner false-case tests into one parameterized test
- Remove misleading commit_kind from ChangelogFollowUpScanner test mocks
- Extract duplicated mock helpers to module-level functions
- Fix max(8, ...) parallelism bug: respect user-configured parallelism
- Remove obvious/redundant inline comments
- Standardize license headers to comment style, merge double docstrings
- Add clarifying docstring to ManifestListManager.read_all

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sistency

Align consumer module with the one-liner docstring style used across the rest of the streaming PR stack. Replace os.path.join with f-string path construction for consistency with paimon-python conventions. Add tests for _validate_consumer_id rejection cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove LRU caches from ManifestFileManager, ManifestListManager, and
SnapshotManager — they have near-zero hit rates in practice (batch reads
create new manager instances; streaming reads see unique manifest names
per snapshot). Caching will be re-added in PR apache#7350 where streaming
actually benefits.

Remove find_next_scannable and get_snapshots_batch from SnapshotManager
as they have zero callers on this branch. They will be added where
needed in downstream PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ming

These methods were removed from the 1a-caching branch per review but are
needed here where streaming_table_scan.py actually uses them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert manifest_file_manager.py to upstream/master (caching split no longer needed)
- Restore original docstring for get_snapshot_by_id
- Revert assertEquals -> assertEqual drive-by change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tub tub force-pushed the python-streaming-2-core-v2 branch from a1d9372 to dfbdb01 Compare March 10, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant