Skip to content

feat(parquet): add metrics for parquet reader observability#258

Open
duanyyyyyyy wants to merge 4 commits intoalibaba:mainfrom
duanyyyyyyy:duanyan/parquet_metrics
Open

feat(parquet): add metrics for parquet reader observability#258
duanyyyyyyy wants to merge 4 commits intoalibaba:mainfrom
duanyyyyyyy:duanyan/parquet_metrics

Conversation

@duanyyyyyyy
Copy link
Copy Markdown
Contributor

Add row groups, rows, batch count, and latency metrics to ParquetFileBatchReader, matching the observability level of the ORC reader.

Purpose

Linked issue: close #xxx

Tests

API and Format

Documentation

Generative AI tooling

Opus 4.6

duanyyyyyyy and others added 4 commits April 29, 2026 11:25
Add row groups, rows, batch count, and latency metrics to ParquetFileBatchReader,
matching the observability level of the ORC reader.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move parquet read metric constants into the existing ParquetMetrics
class in parquet_format_defs.h, rather than defining a duplicate class
in a new header. Fixes build error introduced in d5a8bfa.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
static inline const char READ_ROWS[] = "parquet.read.rows";
static inline const char READ_BATCH_COUNT[] = "parquet.read.batch.count";
static inline const char READ_NEXT_BATCH_LATENCY_MS[] = "parquet.read.next_batch.latency.ms";
};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: for high-level categories, consider using . as the separator to represent hierarchy (e.g., parquet.read), while using - to separate words within the same level (e.g., batch-count).

read_next_batch_latency_ms_ += timer.Get();
metrics_->SetCounter(ParquetMetrics::READ_ROWS, read_rows_);
metrics_->SetCounter(ParquetMetrics::READ_BATCH_COUNT, read_batch_count_);
metrics_->SetCounter(ParquetMetrics::READ_NEXT_BATCH_LATENCY_MS, read_next_batch_latency_ms_);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a counter for READ_NEXT_BATCH_LATENCY_MS seems inappropriate. Since the prefetch reader aggregates metrics across all sub-readers, the resulting value is not very meaningful for a latency metric. If the goal is to capture end-to-end latency, I think it would be more appropriate to record it as a histogram at the framework level, rather than aggregating sub-reader metrics.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bug seems also exist in the orc format, we will fix this in the future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants