chore: native_datafusion to report scan task input metrics #3842
comphead wants to merge 12 commits into apache:main from
Conversation
mbutrovich
left a comment
Not sure this solution is robust enough.
```scala
 */
def findMetric(name: String): Option[SQLMetric] = {
  metrics.get(name).orElse {
    children.iterator.map(_.findMetric(name)).collectFirst { case Some(m) => m }
```
Doesn't this just return the first match it finds with the metric name? Can't multiple plans have nodes that have "output_rows"?
mm, what if we try to restrict output_rows to scan nodes?
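The concern above can be sketched with a toy metric tree (plain Scala, with a hypothetical `Node` standing in for `CometMetricNode`): a depth-first `findMetric` returns the first node carrying the requested name, so with a Filter above a Scan both exposing `output_rows`, the lookup silently picks the filter's value.

```scala
// Toy model of a metric tree; Node and findMetric only loosely mirror
// CometMetricNode -- this is an illustration, not the real implementation.
case class Node(name: String, metrics: Map[String, Long], children: Seq[Node]) {
  def findMetric(key: String): Option[Long] =
    metrics.get(key).orElse(
      children.iterator.map(_.findMetric(key)).collectFirst { case Some(v) => v })
}

val scan   = Node("Scan",   Map("output_rows" -> 10000L, "bytes_scanned" -> 4096L), Nil)
val filter = Node("Filter", Map("output_rows" -> 42L), Seq(scan))

// The DFS hits the Filter's metric first -- not the scan's actual row count.
assert(filter.findMetric("output_rows") == Some(42L))
// bytes_scanned is unique to the scan node, so this lookup is unambiguous.
assert(filter.findMetric("bytes_scanned") == Some(4096L))
```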
```scala
val outputRowsMetric = nativeMetrics.findMetric("output_rows")
if (bytesScannedMetric.isDefined || outputRowsMetric.isDefined) {
  val inputMetrics = ctx.taskMetrics().inputMetrics
  bytesScannedMetric.foreach(m => inputMetrics.setBytesRead(m.value))
```
foreach already handles the None case for the metric lookup, so wrapping this in an if is unnecessary. You save ctx.taskMetrics().inputMetrics, but the result is oddly-structured conditional logic.
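The point above can be shown in two lines of plain Scala: `Option.foreach` is a no-op on `None`, so a surrounding `isDefined` guard adds nothing semantically (variable names here are illustrative).

```scala
// foreach on None simply skips the body; no guard is needed.
val bytesScannedMetric: Option[Long] = None
val outputRowsMetric: Option[Long]   = Some(5L)

var bytesRead = 0L
var recordsRead = 0L
bytesScannedMetric.foreach(v => bytesRead = v)   // None: nothing happens
outputRowsMetric.foreach(v => recordsRead = v)   // Some: value is applied
assert(bytesRead == 0L && recordsRead == 5L)
```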
```scala
subqueries.foreach(sub => CometScalarSubquery.removeSubquery(it.id, sub))

nativeMetrics.metrics
  .get("bytes_scanned")
```
Before ac6b869 the logic was looking into the children (and collected just the first match).
Now it looks only in the root node.
Shouldn't it look into all Scan nodes? Or maybe the logic should be moved to CometNativeScanExec#doExecuteColumnar()?
This is the part @mbutrovich was concerned about: output_rows is too widely used a name across other plan nodes and can be confused with them, so currently I'm using the closest node to find metrics.
```scala
spark.sparkContext.listenerBus.waitUntilEmpty()

withSQLConf(confs: _*) {
  sql("SELECT * FROM tbl").collect()
```
Suggested change:
```diff
-      sql("SELECT * FROM tbl").collect()
+      sql("SELECT * FROM tbl WHERE _1 > 5000").collect()
```
add a filter to make it more realistic
Thanks @martin-g, but why would the filter be needed? I'd prefer to keep the repro as simple as possible.
A filter would show the discrepancy/incorrect values when scan isn't the first child node.
```scala
val (cometBytes, cometRecords) = collectInputMetrics(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION)
```
Suggested change:
```diff
 val (cometBytes, cometRecords) = collectInputMetrics(
+  CometConf.COMET_ENABLED.key -> "true",
   CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION)
```
CometConf.COMET_ENABLED.key -> "true" is enabled at the test level by default, but I think we should ensure the Comet operators were actually applied.
mbutrovich
left a comment
I'm still not sure this works, and a more elaborate test would confirm. The metric propagation is in CometExecRDD.compute(), which runs for all Comet operators, not just native scan. bytes_scanned is safe because it only exists in nativeScanMetrics. But output_rows exists in baselineMetrics too, so for any CometExec plan (e.g., Filter -> Scan), recordsRead gets set to the post-filter count rather than actual records read from storage.
A test with a WHERE clause would expose this since output_rows and actual scan count would diverge.
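The divergence described above can be simulated in plain Scala (no Spark required): with Filter -> Scan, a root-level `output_rows` lookup sees the post-filter count, while `recordsRead` should reflect what the scan actually read from storage.

```scala
// Simulated Filter -> Scan: the scan reads 10000 rows, the predicate keeps
// only values above 5000, so root-level output_rows diverges from the
// actual number of records read.
val scanned  = (0 until 10000).toSeq        // rows the scan reads from storage
val filtered = scanned.filter(_ > 5000)     // rows surviving the WHERE clause

val rootOutputRows    = filtered.size.toLong // what a root-level lookup reports
val actualRecordsRead = scanned.size.toLong  // what inputMetrics should report
assert(rootOutputRows != actualRecordsRead)  // the WHERE clause exposes the bug
```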
Right! You can make it an anonymous class instance like:
```scala
CometExecRDD(...) {
  override def compute(split: Partition, context: TaskContext): Iterator[ColumnarBatch] = {
    val res = super.compute(split, context)
    // new logic here
    res
  }
}
```
This is actually a neat way to isolate scan metrics collection to the scan only.
The code now correctly returns input metrics (rows/bytes) for the native scan node (checked this visually); however, the test cannot capture such a scenario correctly. Working on whether the test can be improved.
@mbutrovich @martin-g PTAL the
```diff
   encryptedFilePaths: Seq[String] = Seq.empty,
-  shuffleScanIndices: Set[Int] = Set.empty)
+  shuffleScanIndices: Set[Int] = Set.empty,
+  hasNativeScan: Boolean = false)
```
I see it being passed in, but is hasNativeScan ever used?
Good catch, it is a leftover from experimenting.
```scala
  }
}

// Called via JNI from `comet_metric_node.rs`
```
Is that the only place this will ever be called from? Otherwise I'm not sure the comment is necessary.
The IDE highlights the method as unused because it is only called via JNI, so it could be accidentally cleaned up. Added a comment to clarify.
```scala
withTempPath { dir =>
  val rng = new scala.util.Random(42)
  spark
    .createDataFrame((0 until totalRows).map(_ => (rng.nextInt(), rng.nextLong())))
```
Can we use a smaller range for the random values, or just a shuffle from 0 to totalRows? That way we'd know exactly the amount of data we should get back. Right now it's likely selecting every row.
mbutrovich
left a comment
I'm still not convinced of the generality of this approach. CometNativeScanExec is an input source in foreachUntilCometInput (line 598 of operators.scala), so CometNativeExec.doExecuteColumnar() always calls executeColumnar() on it (line 516) and the scan always gets its own CometExecRDD. This means @martin-g's anonymous subclass approach works and might be the cleaner solution since it scopes the metric propagation to scan RDDs without needing the leafNode traversal. leafNode is fragile for branching metric trees (e.g., unions) since it always follows children.head.
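The `children.head` fragility mentioned above can be illustrated with a hypothetical plan-tree model: following only the head child reaches a single branch of a union, while a full traversal would have to visit (and somehow combine) every scan leaf.

```scala
// Hypothetical plan tree; Plan and leafNode are stand-ins for illustration,
// not Comet's actual classes.
case class Plan(name: String, children: Seq[Plan])

// Mimics a leafNode that always follows children.head.
def leafNode(p: Plan): Plan =
  if (p.children.isEmpty) p else leafNode(p.children.head)

// A full traversal that collects every leaf.
def allLeaves(p: Plan): Seq[Plan] =
  if (p.children.isEmpty) Seq(p) else p.children.flatMap(allLeaves)

val union = Plan("Union", Seq(Plan("ScanA", Nil), Plan("ScanB", Nil)))
assert(leafNode(union).name == "ScanA")                       // ScanB is silently ignored
assert(allLeaves(union).map(_.name) == Seq("ScanA", "ScanB")) // both scans visible
```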
I tried options with overriding compute and so I had to apply the logic to
Which issue does this PR close?
Closes #3735
Prerequisites for #3817
Rationale for this change
Problem
When using Comet's native_datafusion scan (CometNativeScanExec), Spark's task-level input metrics (bytesRead, recordsRead) are always zero. These metrics feed the "Input" column in the Spark UI Stages tab and are aggregated by AppStatusListener for job-level reporting.
Standard Spark reports input metrics in FileScanRDD.compute() by reading Hadoop FileSystem thread-local statistics via SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(). Since the native DataFusion scan reads Parquet files entirely in Rust, it never touches Hadoop's Java I/O layer, so those thread-local counters are never incremented.
What Comet already tracks
The native side already tracks the relevant data; these flow back to the JVM via CometMetricNode.set_all_from_bytes() and appear as SQL-level metrics in the Spark UI operator details. However, they were never propagated to the task-level TaskMetrics.inputMetrics.
Solution
In the existing TaskCompletionListener inside CometExecRDD.compute(), after closing the iterator, read the final values of bytes_scanned and output_rows from the CometMetricNode tree and set them on TaskContext.taskMetrics().inputMetrics. This adds zero per-batch overhead -- metrics are written once at task completion.
A findMetric helper on CometMetricNode performs a depth-first search through the metric tree, so it works whether the scan is standalone (CometNativeScanExec creates the RDD directly) or wrapped inside a larger native plan (CometNativeExec with Filter/Project above the scan).
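A minimal plain-Scala sketch of the propagation described above, with simplified stand-ins (`InputMetrics`, `MetricNode`) for Spark's TaskMetrics.inputMetrics and Comet's CometMetricNode: at task completion, the final metric values are read once from the tree and copied onto the task's input metrics.

```scala
// Simplified stand-ins; the real types live in Spark and Comet.
class InputMetrics { var bytesRead = 0L; var recordsRead = 0L }

case class MetricNode(metrics: Map[String, Long], children: Seq[MetricNode]) {
  // Depth-first search through the metric tree, first match wins.
  def findMetric(key: String): Option[Long] =
    metrics.get(key).orElse(
      children.iterator.map(_.findMetric(key)).collectFirst { case Some(v) => v })
}

// Runs once per task, at completion -- zero per-batch overhead.
def onTaskCompletion(root: MetricNode, input: InputMetrics): Unit = {
  root.findMetric("bytes_scanned").foreach(v => input.bytesRead = v)
  root.findMetric("output_rows").foreach(v => input.recordsRead = v)
}

// Scan wrapped inside a larger native plan: the DFS still finds its metrics.
val tree = MetricNode(Map.empty,
  Seq(MetricNode(Map("bytes_scanned" -> 1024L, "output_rows" -> 100L), Nil)))
val im = new InputMetrics
onTaskCompletion(tree, im)
assert(im.bytesRead == 1024L && im.recordsRead == 100L)
```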
Changes
What changes are included in this PR?
How are these changes tested?