Support bulk and parallel file deletion for ExpireSnapshots #658

@slfan1989

Background

When running ExpireSnapshots, iceberg-cpp may need to clean up files that were referenced only by the expired snapshots and are no longer reachable from any retained snapshot. These files can include:

  • data files
  • delete files
  • manifest files
  • manifest list files
  • statistics files

Currently, file deletion in iceberg-cpp is primarily based on single-file deletion through:

FileIO::DeleteFile(...)

When a large number of files need to be deleted, deleting them one by one can be inefficient, especially for object stores or remote filesystems where each delete request may involve non-trivial network latency.

Java Iceberg already has a similar abstraction:

SupportsBulkOperations#deleteFiles

This allows cleanup logic to use bulk deletion when supported by the underlying FileIO, and fall back to regular per-file deletion otherwise.
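As a rough illustration of that dispatch pattern in C++, the sketch below checks whether a FileIO also implements a bulk interface and falls back to per-file deletion otherwise. All types here are hypothetical stand-ins written for this example (including the C++ `SupportsBulkOperations` mix-in and the `DeleteAll` helper); none of them are existing iceberg-cpp APIs, and deletion results are reduced to `bool` for brevity.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a FileIO with single-file deletion only.
class FileIO {
 public:
  virtual ~FileIO() = default;
  // Returns true on success.
  virtual bool DeleteFile(const std::string& location) = 0;
};

// Hypothetical mix-in playing the role of Java's SupportsBulkOperations.
class SupportsBulkOperations {
 public:
  virtual ~SupportsBulkOperations() = default;
  // Returns true if every file was deleted.
  virtual bool DeleteFiles(const std::vector<std::string>& locations) = 0;
};

// Uses bulk deletion when the FileIO supports it, per-file deletion otherwise.
bool DeleteAll(FileIO& io, const std::vector<std::string>& locations) {
  if (auto* bulk = dynamic_cast<SupportsBulkOperations*>(&io)) {
    return bulk->DeleteFiles(locations);
  }
  bool ok = true;
  for (const auto& location : locations) {
    ok = io.DeleteFile(location) && ok;  // keep going after a failure
  }
  return ok;
}

// Test double: counts one call per deleted file.
struct PerFileIO : FileIO {
  int calls = 0;
  bool DeleteFile(const std::string&) override { ++calls; return true; }
};

// Test double: counts one call per bulk request.
struct BulkIO : FileIO, SupportsBulkOperations {
  int calls = 0;
  bool DeleteFile(const std::string&) override { ++calls; return true; }
  bool DeleteFiles(const std::vector<std::string>&) override { ++calls; return true; }
};

void Demo() {
  const std::vector<std::string> files = {"a.parquet", "b.parquet", "c.parquet"};
  PerFileIO per_file;
  BulkIO bulk;
  assert(DeleteAll(per_file, files) && per_file.calls == 3);  // one call per file
  assert(DeleteAll(bulk, files) && bulk.calls == 1);          // one bulk call
}
```

The call counters only make the difference in request count visible; a real implementation would issue storage requests instead.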

iceberg-cpp should consider adding a similar mechanism.

Current Problem

iceberg-cpp does not currently have a unified bulk deletion entry point.

FileIO currently exposes single-file deletion:

virtual Status DeleteFile(const std::string& file_location);

As a result:

  1. Deleting many files can be slow.
  2. ExpireSnapshots cannot take advantage of storage-native bulk deletion.
  3. FileIO implementations do not have a common extension point for optimized deletion.
  4. There is no clear API layer for adding parallel deletion fallback in the future.

Proposed Approach

This can be implemented incrementally.

Step 1: Add a bulk delete API to FileIO

Add a new bulk deletion entry point, for example:

virtual Status DeleteFiles(std::span<const std::string> file_locations);

The initial implementation can provide a backward-compatible default fallback:

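// Default fallback: delete sequentially through the existing single-file API.
virtual Status DeleteFiles(std::span<const std::string> file_locations) {
  for (const auto& file_location : file_locations) {
    auto status = DeleteFile(file_location);
    // Stop at the first failure and propagate it; later files are not attempted.
    if (!status.has_value()) {
      return status;
    }
  }
  return {};
}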

The goal of this step is to:

  • add a unified API
  • preserve backward compatibility
  • avoid requiring every FileIO implementation to immediately support native bulk deletion
  • prepare for future optimizations

This step should only add the API and sequential fallback. It should not introduce parallel deletion yet, and it does not need to modify ExpireSnapshots.
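To make the behavior of the sequential fallback concrete, here is a minimal self-contained sketch. It assumes a simplified `Status` stand-in that models only `has_value()`, takes a `std::vector` instead of `std::span` so it builds as C++17, and uses a hypothetical `InMemoryFileIO` that inherits the default `DeleteFiles` unchanged. None of these are the actual iceberg-cpp definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for iceberg-cpp's Status: empty error means success.
struct Status {
  std::optional<std::string> error;
  bool has_value() const { return !error.has_value(); }
};

class FileIO {
 public:
  virtual ~FileIO() = default;
  virtual Status DeleteFile(const std::string& file_location) = 0;

  // Sequential fallback: stops at the first failure and returns it.
  virtual Status DeleteFiles(const std::vector<std::string>& file_locations) {
    for (const auto& file_location : file_locations) {
      auto status = DeleteFile(file_location);
      if (!status.has_value()) {
        return status;
      }
    }
    return {};
  }
};

// Hypothetical in-memory store that relies on the inherited fallback.
class InMemoryFileIO : public FileIO {
 public:
  explicit InMemoryFileIO(std::set<std::string> files) : files_(std::move(files)) {}

  Status DeleteFile(const std::string& file_location) override {
    if (files_.erase(file_location) == 0) {
      return Status{"not found: " + file_location};
    }
    return {};
  }

  std::size_t size() const { return files_.size(); }

 private:
  std::set<std::string> files_;
};

void DemoFallback() {
  InMemoryFileIO io(std::set<std::string>{"a", "b", "c"});
  const std::vector<std::string> existing = {"a", "b"};
  assert(io.DeleteFiles(existing).has_value());
  assert(io.size() == 1);

  // "missing" fails first, so "c" is never attempted and stays in the store.
  const std::vector<std::string> mixed = {"missing", "c"};
  assert(!io.DeleteFiles(mixed).has_value());
  assert(io.size() == 1);
}
```

One consequence worth noting: because this fallback returns on the first error, files later in the list are left untouched. Whether a bulk API should instead continue and report all failures is a design choice the proposal leaves open.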

Step 2: Use FileIO::DeleteFiles in ExpireSnapshots

Step 3: Add optimized deletion implementations
