Optimization: Prune manifest in snapshot overwrite operations#3011
Optimization: Prune manifest in snapshot overwrite operations#3011Fokko merged 3 commits intoapache:mainfrom
Conversation
e3aaa5e to
190bf4d
Compare
190bf4d to
7525c9c
Compare
geruh
left a comment
There was a problem hiding this comment.
Thanks for raising this @gabeiglio! I went through the core behavior here and tested out some of the pruning logic. but ultimately, the existing integration tests like test_overwrite_partitioned_table, and test_delete_overwrite would catch regressions. Just a few inline comments from my side.
| write_manifest_list, | ||
| ) | ||
| from pyiceberg.partitioning import ( | ||
| PartitionSpec, |
There was a problem hiding this comment.
This moved a bunch of the imports to multi line for some reason
There was a problem hiding this comment.
its the linter, so every commit the pre-commit hook will flatten it again
There was a problem hiding this comment.
yeah looks cosmetic since it's linting alr committed code
Fokko
left a comment
There was a problem hiding this comment.
This makes sense to me @gabeiglio Thanks for fixing this 👍
|
Let's goo! Thanks @gabeiglio for working on this, this is super valuable. Thanks @geruh and @yingjianwu98 for the review 🚀 |
Rationale for this change
Doing some performance tests for overwriting partitions, we noticed that PyIceberg took double the time it usually takes java based implementation, we noticed that
_exisiting_manifestsdoes not take advantage of manifest pruning before reading all Manifest EntriesIn this PR I:
Are these changes tested?
I believe current tests in tests/integration/test_writes.py cover all cases
Are there any user-facing changes?
Nope