[SPARK-55631][SQL] ALTER TABLE must invalidate cache for DSv2 tables#54427
[SPARK-55631][SQL] ALTER TABLE must invalidate cache for DSv2 tables#54427aokolnychyi wants to merge 1 commit intoapache:masterfrom
Conversation
| table.catalog, | ||
| table.identifier, | ||
| a.changes, | ||
| recacheTable(table, includeTimeTravel = false)) :: Nil |
There was a problem hiding this comment.
We may debate that only some table changes should trigger a refresh but it is very fragile and dangerous. For instance, Iceberg still allows revert of table state by setting a predefined table property (not something I encourage). This must invalidate the cache in Spark too. To sum up, it is the safest call to recache but I can be convinced otherwise.
There was a problem hiding this comment.
Shall we add a SQL conf and migration guide for the new behavior? There can be users who wants the old behavior.
There was a problem hiding this comment.
The problem is that the old behavior breaks correctness. What if someone modified a comment or a property and does SHOW TABLE and doesn't see the changes because the cache wasn't invalidated? Or worse. The schema is not updated. So you evolve the schema but it is not reflected?
There was a problem hiding this comment.
For behavior changes, we always add a flag spark.sql.legacy.ABC. See https://spark.apache.org/docs/latest/sql-migration-guide.html for details. Most of the legacy behaviors doesn't make sense.
There was a problem hiding this comment.
Note: this is a general suggestion. Since the previous DSV2 cache behavior are not gated with a config, I am also fine to merge without that.
szehon-ho
left a comment
There was a problem hiding this comment.
it makes sense / safer to invalidate to me. I guess it is for V2 (pr title could reflect it)
|
|
||
| val result = sql(s"SELECT * FROM $t ORDER BY id") | ||
| assertCached(result) | ||
| checkAnswer(result, Seq(Row(1, "a", null), Row(2, "b", null))) |
There was a problem hiding this comment.
what was the error message before this PR?
What changes were proposed in this pull request?
This PR makes ALTER TABLE command invalidate cache.
Why are the changes needed?
These changes are needed to reflect changes made in the session correctly.
Does this PR introduce any user-facing change?
ALTER TABLE commands will now invalidate cache.
How was this patch tested?
This PR comes with test that would previously fail.
Was this patch authored or co-authored using generative AI tooling?
No.