feat(lance): Implement canWrite() in HoodieSparkLanceWriter with configurable max file size for Lance #18341
Open
wombatu-kun wants to merge 2 commits into apache:master from
Member
Ack, will review tomorrow!
Collaborator
Thanks @wombatu-kun for the help! @voonhous can you review this if you get a chance, since gonna be ooto. Once back will review
be23647 to 57c518e
…igurable max file size for Lance
rahil-c reviewed Mar 20, 2026
      HoodieStorage storage,
      boolean populateMetaFields,
      Option<BloomFilter> bloomFilterOpt) {
    this(file, sparkSchema, instantTime, taskContextSupplier, storage, populateMetaFields, bloomFilterOpt, Long.MAX_VALUE);
Collaborator
Shouldn't we have a reasonable default for maxFileSize here rather than Long.MAX_VALUE?
Contributor
Author
ok, use LANCE_MAX_FILE_SIZE.defaultValue() instead of Long.MAX_VALUE
Collaborator
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18341 +/- ##
============================================
+ Coverage 68.41% 68.46% +0.04%
- Complexity 27408 27485 +77
============================================
Files 2423 2427 +4
Lines 132458 132684 +226
Branches 15972 15995 +23
============================================
+ Hits 90623 90840 +217
- Misses 34784 34792 +8
- Partials 7051 7052 +1
Collaborator
@voonhous can you also take a pass?
Describe the issue this Pull Request addresses
Closes #17684
Summary and Changelog
Implement canWrite() in HoodieSparkLanceWriter analogously to HoodieBaseParquetWriter.canWrite() by tracking cumulative Arrow buffer sizes in the base class and adding periodic size-limit checks in the Spark writer.
HoodieStorageConfig: Added LANCE_MAX_FILE_SIZE config property (key hoodie.lance.max.file.size, default 120 MB) and a lanceMaxFileSize(long) builder method, consistent with the existing Parquet/ORC/HFile config entries.
HoodieBaseLanceWriter: Added totalFlushedDataSize field and getDataSize() accessor. In flushBatch(), after arrowWriter.finishBatch() sets the row count, the method now iterates over root.getFieldVectors() and accumulates vector.getBufferSize() into totalFlushedDataSize before writing to Lance. This provides an uncompressed Arrow buffer size estimate analogous to ParquetWriter.getDataSize().
HoodieSparkLanceWriter:
  MIN_RECORDS_FOR_SIZE_CHECK = 100 and MAX_RECORDS_FOR_SIZE_CHECK = 10000 constants (mirrors the Parquet constants).
  maxFileSize and recordCountForNextSizeCheck fields.
  maxFileSize; the no-arg secondary constructor now delegates with Long.MAX_VALUE (no limit); a new secondary constructor accepting explicit maxFileSize is added for use by HoodieInternalRowFileWriterFactory.
  canWrite() implementation: checks periodically based on recordCountForNextSizeCheck, computes average record size from getDataSize() / writtenCount, returns false when within two average records of maxFileSize, and adaptively schedules the next check.
HoodieSparkFileWriterFactory: Reads LANCE_MAX_FILE_SIZE from config and passes it to the HoodieSparkLanceWriter constructor.
HoodieInternalRowFileWriterFactory: method getInternalRowFileWriter reads LANCE_MAX_FILE_SIZE and passes it (through newLanceInternalRowFileWriter) to the new HoodieSparkLanceWriter constructor.
Impact
Adds a proper implementation that checks whether the file has reached a size threshold and, if so, rolls the write over to a new file.
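For reviewers who want the periodic check spelled out, here is a minimal, self-contained sketch of the canWrite() logic as the changelog describes it. It is not the code from this PR: the class name SizeLimitSketch, the write(long) method, and the getWrittenCount() accessor are hypothetical stand-ins, and the running dataSize field stands in for getDataSize(). Only the two constants, maxFileSize, and recordCountForNextSizeCheck come from the description above.

```java
// Hypothetical stand-in for the writer; only the size-check logic is modelled.
final class SizeLimitSketch {
    // Constants named in the PR description (mirroring the Parquet writer).
    static final long MIN_RECORDS_FOR_SIZE_CHECK = 100;
    static final long MAX_RECORDS_FOR_SIZE_CHECK = 10000;

    private final long maxFileSize;
    private long writtenCount = 0;
    private long dataSize = 0; // stand-in for getDataSize()
    private long recordCountForNextSizeCheck = MIN_RECORDS_FOR_SIZE_CHECK;

    SizeLimitSketch(long maxFileSize) {
        this.maxFileSize = maxFileSize;
    }

    // Hypothetical write: only tracks the record count and byte total.
    void write(long recordBytes) {
        writtenCount++;
        dataSize += recordBytes;
    }

    long getWrittenCount() {
        return writtenCount;
    }

    boolean canWrite() {
        // Skip the (potentially costly) size check until the scheduled record count.
        if (writtenCount < recordCountForNextSizeCheck) {
            return true;
        }
        // Guard against a zero average when records are tiny.
        long avgRecordSize = Math.max(dataSize / writtenCount, 1);
        // Stop when within two average records of the limit.
        if (dataSize > maxFileSize - 2 * avgRecordSize) {
            return false;
        }
        // Schedule the next check about halfway to the estimated limit,
        // clamped to [MIN_RECORDS_FOR_SIZE_CHECK, MAX_RECORDS_FOR_SIZE_CHECK].
        long remainingRecords = maxFileSize / avgRecordSize - writtenCount;
        recordCountForNextSizeCheck = writtenCount + Math.min(
            Math.max(MIN_RECORDS_FOR_SIZE_CHECK, remainingRecords / 2),
            MAX_RECORDS_FOR_SIZE_CHECK);
        return true;
    }
}
```

A caller would loop with `while (writer.canWrite()) writer.write(bytes);` and open a new file once canWrite() returns false. In this sketch the check is periodic, so the file can exceed maxFileSize by up to one check interval of records; that is the usual trade-off of this style of adaptive scheduling rather than something confirmed for the actual Lance writer.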
Risk Level
none
Documentation Update
Need to add the LANCE_MAX_FILE_SIZE config property (hoodie.lance.max.file.size, default 120 MB)
Contributor's checklist