Skip to content

[phase-31 3/4] Writer + pipeline wiring#6244

Merged
g-talbot merged 26 commits intomainfrom
gtt/phase-31-writer-wiring
Apr 8, 2026
Merged

[phase-31 3/4] Writer + pipeline wiring#6244
g-talbot merged 26 commits intomainfrom
gtt/phase-31-writer-wiring

Conversation

@g-talbot
Copy link
Copy Markdown
Contributor

Summary

Wire TableConfig into ParquetWriter sort path and add self-describing Parquet file metadata for compaction (Phase 31 Metadata Foundation, PR 3 of 4).

Stacks on gtt/phase-31-compaction-metadata (PR #6243).

What's included

storage/writer.rs (rewritten):

  • ParquetWriter::new() takes &TableConfig, resolves sort field names to physical columns
  • sort_batch() uses resolved fields with per-column ASC/DESC direction
  • SS-1 debug_assert verification: re-sort output and check identity permutation
  • build_compaction_key_value_metadata(): embeds sort_fields, window_start, window_duration, num_merge_ops, row_keys (base64+JSON) in Parquet kv_metadata
  • SS-5 verify_ss5_kv_consistency(): kv entries must match source struct
  • write_to_file_with_metadata() replaces write_to_file()
  • prepare_write() shared prep for both bytes and file write paths
  • resolve_sort_fields(): parse sort schema, map to ParquetField, skip missing columns

storage/config.rs:

  • to_writer_properties_with_metadata(sorting_cols, kv_metadata) accepts dynamic sort columns and optional KV metadata
  • to_writer_properties() delegates with empty defaults
  • Removed static sorting_columns() method (now in writer)

storage/split_writer.rs:

  • ParquetSplitWriter::new() takes &TableConfig parameter

quickwit-indexing (5 files):

  • All ParquetSplitWriter::new() callers updated with &TableConfig::default()

Verification

  • cargo build -p quickwit-parquet-engine -p quickwit-indexing
  • cargo test -p quickwit-parquet-engine -- storage:: ✅ (23 tests)
  • cargo clippy -p quickwit-parquet-engine --all-features --tests

Test plan

  • Writer sorts by TableConfig-driven fields (test_write_sorts_data)
  • Compaction KV metadata embedded and read back (test_write_to_file_with_compaction_metadata)
  • No qh.* keys when metadata=None (test_write_to_file_without_metadata_has_no_qh_keys)
  • Pre-Phase-31 splits produce empty KV vec
  • RowKeys base64+proto roundtrip
  • META-07 self-describing Parquet roundtrip (cold file reconstruction)
  • Clippy clean

🤖 Generated with Claude Code

@g-talbot g-talbot force-pushed the gtt/phase-31-compaction-metadata branch from c012908 to 780c585 Compare March 31, 2026 20:41
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch 2 times, most recently from 3bbfb71 to 95c3596 Compare March 31, 2026 20:55
@g-talbot g-talbot force-pushed the gtt/phase-31-compaction-metadata branch 2 times, most recently from 08577b5 to 955f230 Compare March 31, 2026 21:03
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch 2 times, most recently from 179ccd2 to ed6d687 Compare March 31, 2026 21:08
@g-talbot g-talbot force-pushed the gtt/phase-31-compaction-metadata branch from 955f230 to 3e73d80 Compare March 31, 2026 21:08
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch from ed6d687 to a4d0d36 Compare March 31, 2026 21:26
@g-talbot g-talbot force-pushed the gtt/phase-31-compaction-metadata branch from 3e73d80 to 2f78fe8 Compare March 31, 2026 21:26
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch from a4d0d36 to f05d4e7 Compare March 31, 2026 21:31
@g-talbot g-talbot force-pushed the gtt/phase-31-compaction-metadata branch from 2f78fe8 to 2703ca5 Compare March 31, 2026 21:31
Base automatically changed from gtt/phase-31-compaction-metadata to gtt/phase-31-sort-schema March 31, 2026 21:31
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch from f05d4e7 to 4a0507e Compare March 31, 2026 21:33
@g-talbot g-talbot force-pushed the gtt/phase-31-sort-schema branch from 2703ca5 to 8fce718 Compare March 31, 2026 21:33
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch from 4a0507e to bc9458d Compare March 31, 2026 21:40
@g-talbot g-talbot force-pushed the gtt/phase-31-sort-schema branch from 8fce718 to 018a265 Compare March 31, 2026 21:40
…, window, TableConfig

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch from 0d97e82 to c48b0de Compare April 1, 2026 16:59
@g-talbot g-talbot force-pushed the gtt/phase-31-compaction-metadata branch from a8dc7a2 to 4ce16a3 Compare April 1, 2026 16:59
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch from c48b0de to ef0ba36 Compare April 1, 2026 19:24
@g-talbot g-talbot force-pushed the gtt/phase-31-compaction-metadata branch from 4ce16a3 to b89e965 Compare April 1, 2026 19:24
@g-talbot g-talbot force-pushed the gtt/phase-31-compaction-metadata branch from b89e965 to b9566a6 Compare April 1, 2026 20:18
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch from ef0ba36 to df6e699 Compare April 1, 2026 20:18
g-talbot and others added 2 commits April 1, 2026 16:24
… model, field lookup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire TableConfig-driven sort order into ParquetWriter and add
self-describing Parquet file metadata for compaction:

- ParquetWriter::new() takes &TableConfig, resolves sort fields at
  construction via parse_sort_fields() + ParquetField::from_name()
- sort_batch() uses resolved fields with per-column direction (ASC/DESC)
- SS-1 debug_assert verification: re-sort and check identity permutation
- build_compaction_key_value_metadata(): embeds sort_fields, window_start,
  window_duration, num_merge_ops, row_keys (base64) in Parquet kv_metadata
- SS-5 verify_ss5_kv_consistency(): kv_metadata matches source struct
- write_to_file_with_metadata() replaces write_to_file()
- prepare_write() shared method for bytes and file paths
- ParquetWriterConfig gains to_writer_properties_with_metadata()
- ParquetSplitWriter passes TableConfig through
- All callers in quickwit-indexing updated with TableConfig::default()
- 23 storage tests pass including META-07 self-describing roundtrip

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/phase-31-compaction-metadata branch from b9566a6 to b6eb595 Compare April 1, 2026 20:50
@g-talbot g-talbot force-pushed the gtt/phase-31-writer-wiring branch from df6e699 to 76b703a Compare April 1, 2026 20:50
@mattmkim
Copy link
Copy Markdown
Contributor

mattmkim commented Apr 6, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76b703ad24

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

g-talbot and others added 11 commits April 6, 2026 21:11
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
Resolve merge conflicts by taking main's versions of otel_metrics.rs
and arrow_metrics.rs (the PR didn't modify these files — conflicts
came from the base branch divergence). Kept PR's table_config module
export in quickwit-parquet-engine/src/lib.rs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-existing splits were serialized before the parquet_file field was
added, so their JSON doesn't contain it. Adding #[serde(default)]
makes deserialization fall back to empty string for old splits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the commit timeout fires and the accumulator contains only
zero-column batches, union_fields is empty and concat_batches fails
with "must either specify a row count or at least one column".
Now flush_internal treats empty union_fields the same as empty
pending_batches — resets state and returns None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base automatically changed from gtt/phase-31-compaction-metadata to main April 8, 2026 13:37
- Resolve Cargo.lock/Cargo.toml merge conflicts
- P1 (sort column lookup): Already addressed by sort fields tag_ prefix
  fix — sort field names now match Parquet column names
- P2 (window_start at epoch 0): Remove time_range.start_secs > 0 guard
  so window_start is computed for all batches when window_duration > 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot merged commit 7645703 into main Apr 8, 2026
8 checks passed
@g-talbot g-talbot deleted the gtt/phase-31-writer-wiring branch April 8, 2026 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants