Background
When running ExpireSnapshots, iceberg-cpp may need to clean up files that are no longer referenced by expired snapshots. These files can include:
- data files
- delete files
- manifest files
- manifest list files
- statistics files
Currently, file deletion in iceberg-cpp is primarily based on single-file deletion through FileIO::DeleteFile.
When a large number of files need to be deleted, deleting them one by one can be inefficient, especially for object stores or remote filesystems where each delete request may involve non-trivial network latency.
Java Iceberg already has a similar abstraction:
SupportsBulkOperations#deleteFiles
This allows cleanup logic to use bulk deletion when supported by the underlying FileIO, and fall back to regular per-file deletion otherwise.
iceberg-cpp should consider adding a similar mechanism.
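To make the "bulk when supported, per-file otherwise" pattern concrete, here is a minimal C++ sketch of how cleanup logic could choose between the two paths. Everything in it is an assumption for illustration: `Status` is a simplified stand-in for the iceberg-cpp type, the C++ `SupportsBulkOperations` interface only mirrors the idea of the Java abstraction, and `DeleteAll`, `PlainFileIO` and `BulkFileIO` are hypothetical names that do not exist in iceberg-cpp.

```cpp
#include <optional>
#include <string>
#include <vector>

// Simplified stand-in for iceberg-cpp's Status (assumption):
// an empty `error` means success.
struct Status {
  std::optional<std::string> error;
  bool has_value() const { return !error.has_value(); }
};

class FileIO {
 public:
  virtual ~FileIO() = default;
  virtual Status DeleteFile(const std::string& file_location) = 0;
};

// Hypothetical opt-in capability, modeled on the idea behind Java's
// SupportsBulkOperations. Not an existing iceberg-cpp interface.
class SupportsBulkOperations {
 public:
  virtual ~SupportsBulkOperations() = default;
  virtual Status DeleteFiles(const std::vector<std::string>& file_locations) = 0;
};

// Hypothetical helper: use bulk deletion when `io` opts in,
// otherwise fall back to one DeleteFile call per location.
inline Status DeleteAll(FileIO& io, const std::vector<std::string>& file_locations) {
  if (auto* bulk = dynamic_cast<SupportsBulkOperations*>(&io)) {
    return bulk->DeleteFiles(file_locations);
  }
  for (const auto& file_location : file_locations) {
    auto status = io.DeleteFile(file_location);
    if (!status.has_value()) {
      return status;
    }
  }
  return {};
}

// Hypothetical FileIO without bulk support; counts per-file requests.
class PlainFileIO : public FileIO {
 public:
  Status DeleteFile(const std::string&) override {
    ++single_calls;
    return {};
  }
  int single_calls = 0;
};

// Hypothetical FileIO that opts in to bulk deletion.
class BulkFileIO : public FileIO, public SupportsBulkOperations {
 public:
  Status DeleteFile(const std::string&) override {
    ++single_calls;
    return {};
  }
  Status DeleteFiles(const std::vector<std::string>&) override {
    ++bulk_calls;
    return {};
  }
  int single_calls = 0;
  int bulk_calls = 0;
};
```

With three locations, `DeleteAll` issues three `DeleteFile` calls against `PlainFileIO` and a single `DeleteFiles` call against `BulkFileIO`. A capability check like this is only one possible shape; a virtual method with a default implementation avoids the cast entirely.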
Current Problem
iceberg-cpp does not currently have a unified bulk deletion entry point.
FileIO currently exposes single-file deletion:
virtual Status DeleteFile(const std::string& file_location);
As a result:
- Deleting many files can be slow.
- ExpireSnapshots cannot take advantage of storage-native bulk deletion.
- FileIO implementations do not have a common extension point for optimized deletion.
- There is no clear API layer for adding parallel deletion fallback in the future.
Proposed Approach
This can be implemented incrementally.
Step 1: Add a bulk delete API to FileIO
Add a new bulk deletion entry point, for example:
virtual Status DeleteFiles(std::span<const std::string> file_locations);
The initial implementation can provide a backward-compatible default fallback:
virtual Status DeleteFiles(std::span<const std::string> file_locations) {
  for (const auto& file_location : file_locations) {
    auto status = DeleteFile(file_location);
    if (!status.has_value()) {
      return status;
    }
  }
  return {};
}
The goal of this step is to:
- add a unified API
- preserve backward compatibility
- avoid requiring every FileIO implementation to immediately support native bulk deletion
- prepare for future optimizations
This step should only add the API and sequential fallback. It should not introduce parallel deletion yet, and it does not need to modify ExpireSnapshots.
Step 2: Use FileIO::DeleteFiles in ExpireSnapshots
Step 3: Add optimized deletion implementations