Fix Iceberg REST catalog commit safety under post-commit network failure#1982
Fix Iceberg REST catalog commit safety under post-commit network failure#1982il9ue wants to merge 3 commits into
Conversation
On INSERT via a REST catalog, a commit whose response is lost was retried, got a 409, collapsed to a bool false, and triggered cleanup that deleted the files the now-live snapshot referenced. Replace the bool with a typed CommitOutcome and, on failure, re-read the catalog to classify whether our snapshot actually landed before deciding to clean up.
SequencesWhat happens on an INSERT when the commit response is lost:
How the pieces fitThe whole design hangs on one contract: the The catalog layer decides the outcome. Three implementations satisfy the contract:
The caller layer acts on the outcome. Four call sites all call the same
TestsBoth tests use a fault proxy (
|
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b837ae8819
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "preserving written files to avoid corrupting a possibly-committed snapshot. " | ||
| "Orphaned files may need manual cleanup if the commit did not actually land.", | ||
| namespace_name, table_name); | ||
| return false; |
There was a problem hiding this comment.
Abort ambiguous insert commits instead of retrying
When CommitOutcome::Unknown means the first catalog POST may already have landed but the verification read failed, returning false sends control back to IcebergStorageSink::finalizeBuffers' retry loop. At this point generateNextMetadata has already mutated metadata and the writers still hold the same data files, so the next initializeMetadata attempt can append those same files again on top of the possibly-committed snapshot instead of surfacing the ambiguous result to the user.
Useful? React with 👍 / 👎.
| "Iceberg mutation commit for {}.{} is of unknown status after a lost response; " | ||
| "preserving written files to avoid corrupting a possibly-committed snapshot.", | ||
| namespace_name, table_name); | ||
| return false; |
There was a problem hiding this comment.
Abort ambiguous mutation commits instead of retrying
For CommitOutcome::Unknown, the original mutation commit may have succeeded, but returning false makes the outer mutation loop retry from the latest catalog state. If the first commit did land, a retry can apply the same DELETE/UPDATE a second time (for example an UPDATE x = x + 1 can be applied again) rather than preserving files and failing the ambiguous operation, so this should not be treated like a clean conflict retry.
Useful? React with 👍 / 👎.
Found while working on #1609 (TRUNCATE storage-leak cleanup).
RestCatalog::updateMetadatareturned a boolean result, which folded two outcomes that need different handling into one value:Both came back as
false, so the caller ran cleanup in either case and deleted the manifest/data files the live snapshot still references. That's the same cleanup-on-failure pattern I'd otherwise be carrying into TRUNCATE.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix possible data loss in Iceberg writes via REST catalog when a commit response is lost
Documentation entry for user-facing changes
When the network drops the response to a catalog commit that actually succeeded, ClickHouse retries the commit, sees the retry fail, and runs cleanup that deletes object-storage files the catalog at that point still references, leading to possible table corruption. The fix runs cleanup only when the commit was cleanly rejected; ambiguous outcomes leave the files in place.
CI/CD Options
Exclude tests:
Regression jobs to run:
Tests
Two integration tests under
tests/integration/test_storage_iceberg_no_spark/, both asserting one invariant: no catalog-referenced snapshot may point at a manifest list missing from object storage.test_insert_commit_response_loss_is_handled: the fix, end to end. A fault proxy lets the commit POST reach the catalog (so it lands) then drops the response; ClickHouse retries and gets a 409. With the fix the INSERT completes, no snapshot loses its manifest list, and the row is readable. Without the fix this same test reproduces the corruption.test_insert_without_fault_is_clean_baseline: control. With no fault armed the same INSERT must succeed cleanly, proving the corruption above comes from the injected response loss, not from the build or the harness.Files:
test_catalog_commit_safety.py: the two tests above.catalog_fault_proxy.py: a transparent reverse proxy in front of the REST catalog. Forwards everything by default; when armed, it forwards a commit POST upstream (so the catalog commits) then RSTs the client without returning the response, reproducing the "commit succeeded, response lost" window.docker_compose_rest_proxy.yml: runs the proxy as arest-proxyservice on the cluster network and publishes its control port.conftest.py: merges that compose file into the cluster so the proxy starts with the suite. ClickHouse talks to the catalog through the proxy; pyiceberg talks to the real catalog directly, so verification sees true committed state.