[AURON #2229] bugfix Silent fallback to empty partitions may cause data loss#2230
[AURON #2229] bugfix Silent fallback to empty partitions may cause data loss#2230guixiaowen wants to merge 5 commits into
Conversation
| s"Failed to obtain input partitions via reflection for ${exec.getClass.getName}.", | ||
| t) | ||
| Seq.empty | ||
| throw new IllegalStateException( |
There was a problem hiding this comment.
Since we are throwing an exception anyways, do we need to catch it in L189?
| try { | ||
| inputPartitions(exec) | ||
| } catch { | ||
| case e: IllegalStateException => |
There was a problem hiding this comment.
is it possible to catch any unexpected exceptions (like fs or metadata errors)?
weiqingy
left a comment
There was a problem hiding this comment.
Thanks for picking this up. Two questions on the failure path inline.
| try { | ||
| inputPartitions(exec) | ||
| } catch { | ||
| case e: Throwable => | ||
| logWarning(s"Get Partition error: ${e.getMessage}") | ||
| return None | ||
| } |
There was a problem hiding this comment.
This catch converts the IllegalStateException thrown at L200 into return None, so on a planning failure the effective behavior is a fallback to Spark's scan path (plus two WARN logs) rather than the "fail fast with a clear exception" the description mentions — no exception reaches the caller. Falling back to Spark is arguably the safer outcome, since the query still returns correct results, so this may well be what you want — worth confirming it's intentional rather than a hard failure.
If Spark-fallback is the intent, the throw-at-L200-then-catch-here round-trip is using an exception purely for control flow. Would signaling failure through the return type read more directly — e.g. inputPartitions returning Option[Seq[InputPartition]], with None = planning failed and Some(empty) = genuinely empty table, and plan() matching on it? That also answers the question on the other thread about whether this catch is removable: it isn't — dropping it would route a failure back into the empty-partition branch and reintroduce the empty-plan case.
Minor: this catches Throwable, while the sibling catch at L196 uses NonFatal. Matching NonFatal here would avoid folding fatal errors like OutOfMemoryError into a silent Spark fallback.
| logWarning( | ||
| s"Failed to obtain input partitions via reflection for ${exec.getClass.getName}.", | ||
| t) |
There was a problem hiding this comment.
This catch now throws on a reflective failure, which is what routes plan() to the Spark fallback. The V2 batch path earlier in this method (the exec.scan.toBatch / planInputPartitions() block, ~L158-165) still catches Throwable and returns Seq.empty, then falls through to reflection. So if V2 planning fails for a real reason — an fs or metadata error — that failure is masked, and if reflection then yields empty, plan() builds a native scan with empty partitions: the same silent-empty outcome #2229 is about, one level up. Is the V2 swallow-to-empty intentional, or should a V2 planning failure surface the same way this one now does?
… #2229
Which issue does this PR close?
Closes #2229
Rationale for this change
The current implementation silently falls back to Seq.empty when both DataSource V2 and reflective partition retrieval fail. This may lead to empty input partitions without any failure signal, potentially causing silent data loss or empty query results.
Problem Description
In inputPartitions(exec: BatchScanExec), when partition planning fails:
V2 API failure → caught and ignored
Reflection failure → caught and ignored
Final fallback → Seq.empty
catch { case t: Throwable => logWarning(...) Seq.empty }and
catch { case t: Throwable => logWarning(...) Seq.empty }Risk
Returning an empty partition list is dangerous because:
Silent data loss
Downstream execution may interpret empty partitions as:
no data available
valid empty scan result
→ leads to incorrect empty query results without failure
Hard to detect in production
Only warning logs are emitted
No exception or job failure
Easy to miss in large pipelines
A scan failure (cannot plan partitions) is semantically different from:
dataset is actually empty
But both are currently indistinguishable.
What changes are included in this PR?
Expected Behavior
When both:
V2 planInputPartitions
reflection-based partition retrieval
fail, the system should:
Fail fast with a clear exception:
throw new IllegalStateException( s"Failed to plan input partitions for ${exec.getClass.getName}")
Are there any user-facing changes?
Nothing.
How was this patch tested?
Depends on existing unit tests.