
[python] Add row kind support for TableRead#7394

Draft
tub wants to merge 6 commits into apache:master from tub:python-streaming-1b2-row-kind

tub commented Mar 10, 2026

Summary

  • Add include_row_kind parameter to TableRead for streaming change tracking
  • Prepend a _row_kind string column (+I, -D, +U, -U) to Arrow batches when enabled
  • Support row kind for both RecordBatchReader (default +I) and OffsetRow-based readers (from RowKind)
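The column-prepending behaviour summarized above can be sketched without Arrow. This is a minimal illustration, not the PR's code: a batch is modelled as a plain dict of column name to list (a stand-in for an Arrow record batch), and `RowKind`, `ROW_KIND_COLUMN`, and `prepend_row_kind` are hypothetical names chosen for the example. Only the column name `_row_kind` and the four short strings are taken from the summary.

```python
from enum import Enum
from typing import Dict, List, Optional, Sequence

ROW_KIND_COLUMN = "_row_kind"  # column name taken from the PR summary


class RowKind(Enum):
    """Hypothetical enum; member names are assumptions, values are from the summary."""
    INSERT = "+I"
    UPDATE_BEFORE = "-U"
    UPDATE_AFTER = "+U"
    DELETE = "-D"


def prepend_row_kind(
    columns: Dict[str, List],
    kinds: Optional[Sequence[RowKind]] = None,
) -> Dict[str, List]:
    """Return a new column dict with the row kind column placed first.

    When `kinds` is None every row is treated as an insert, mirroring the
    "default +I" case described for batch-based readers.
    """
    num_rows = len(next(iter(columns.values()))) if columns else 0
    if kinds is None:
        kinds = [RowKind.INSERT] * num_rows
    if len(kinds) != num_rows:
        raise ValueError("one row kind is required per row")

    # dicts keep insertion order, so the new column ends up first
    result: Dict[str, List] = {ROW_KIND_COLUMN: [k.value for k in kinds]}
    result.update(columns)
    return result
```

Calling `prepend_row_kind({"id": [1, 2]})` yields a dict whose first key is `_row_kind` with the value `["+I", "+I"]`; passing explicit kinds covers the per-row case.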

Stacked PR series

This is PR 1b part 2 in the Python streaming read series.

Incremental diff (vs 1b): tub/paimon@python-streaming-1b-scanners...tub:paimon:python-streaming-1b2-row-kind

Test plan

  • flake8 passes
  • python -m pytest passes
  • Manually verify row kind column appears in streaming reads
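The manual verification step could be backed by a small check along these lines. This is a sketch under assumptions: `check_row_kind_column` is a hypothetical helper, and it expects the batch as a plain dict of column name to list of values rather than an Arrow object.

```python
from typing import Dict, List

VALID_ROW_KINDS = {"+I", "-U", "+U", "-D"}  # values listed in the PR summary


def check_row_kind_column(columns: Dict[str, List]) -> List[str]:
    """Return a list of problems; an empty list means the batch looks right."""
    names = list(columns)
    if not names or names[0] != "_row_kind":
        # Without the column there is nothing else worth checking.
        return ["first column is not _row_kind"]

    problems: List[str] = []
    unexpected = sorted({repr(k) for k in columns["_row_kind"] if k not in VALID_ROW_KINDS})
    if unexpected:
        problems.append("unexpected row kinds: " + ", ".join(unexpected))

    # Every column, including _row_kind, should have one entry per row.
    if len({len(values) for values in columns.values()}) != 1:
        problems.append("columns have different lengths")
    return problems
```

An empty result for every batch read with `include_row_kind` enabled would indicate that the column is present, comes first, and holds only the four expected values.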

🤖 Generated with Claude Code

tub and others added 6 commits March 10, 2026 11:04
- Add FollowUpScanner hierarchy (base, delta, changelog)
- Add IncrementalDiffScanner for diff-based streaming reads
- Add sharding support to FileScanner
- Add row kind support to TableRead for changelog streams

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…olidate tests, fix parallelism

- Collapse repetitive module/class/method docstrings to one-liners in all
  scanner files (follow_up_scanner, delta, changelog, incremental_diff)
- Remove TDD process commentary from test docstrings
- Consolidate DeltaFollowUpScanner false-case tests into one parameterized test
- Remove misleading commit_kind from ChangelogFollowUpScanner test mocks
- Extract duplicated mock helpers to module-level functions
- Fix max(8, ...) parallelism bug: respect user-configured parallelism
- Remove obvious/redundant inline comments
- Standardize license headers to comment style, merge double docstrings
- Add clarifying docstring to ManifestListManager.read_all

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the include_row_kind feature out of this PR into a separate
branch (python-streaming-1b2-row-kind) to keep the scanners PR
focused on scanners and sharding only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add include_row_kind option to TableRead that prepends a _row_kind
string column to Arrow output. For RecordBatchReader (append-only
tables) all rows default to "+I"; for RowIterator (primary-key
tables) row kind is read per-row via OffsetRow.get_row_kind().

The feature is opt-in (include_row_kind=False by default) so
existing read paths are unaffected. StreamReadBuilder in the next
PR will enable it for changelog/streaming reads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
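
The opt-in, per-row path described in the commit message above can be illustrated roughly as follows. This is not the PR's implementation: `FakeRow` is a stand-in for an OffsetRow-like object, `rows_to_columns` is a hypothetical helper, and `get_row_kind()` is assumed here to return the short string directly, which may differ from the real method's return type.

```python
from typing import Dict, Iterable, List, Sequence


class FakeRow:
    """Stand-in for an OffsetRow-like object; not the real Paimon class."""

    def __init__(self, values: Sequence, row_kind: str = "+I"):
        self.values = tuple(values)
        self._row_kind = row_kind

    def get_row_kind(self) -> str:
        # Assumption: returns the short string form such as "+I" or "-D".
        return self._row_kind


def rows_to_columns(
    rows: Iterable[FakeRow],
    field_names: Sequence[str],
    include_row_kind: bool = False,
) -> Dict[str, List]:
    """Pivot rows into columns, adding _row_kind first only when requested."""
    columns: Dict[str, List] = {}
    if include_row_kind:
        columns["_row_kind"] = []
    for name in field_names:
        columns[name] = []

    for row in rows:
        if include_row_kind:
            columns["_row_kind"].append(row.get_row_kind())
        for name, value in zip(field_names, row.values):
            columns[name].append(value)
    return columns
```

With the flag left at its default of `False` the output has exactly the data columns, which is the sense in which existing read paths stay unaffected.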