fix(#593): write integrity for obj_store_lib — multipart retry + s3dlio 0.9.104#40
Closed
russfellows wants to merge 1 commit into
Closed
fix(#593): write integrity for obj_store_lib — multipart retry + s3dlio 0.9.104#40russfellows wants to merge 1 commit into
russfellows wants to merge 1 commit into
Conversation
Addresses silent write-corruption reported in mlcommons/storage#593. Data written during datagen or checkpointing could be silently truncated; the write appeared to succeed but the stored object was shorter than expected. This commit wires in the two-layer retry/verification architecture introduced in s3dlio 0.9.104. ## obj_store_lib.py — multipart upload retry (Python layer) ObjStoreLibStorage._mpu_upload_with_retry() is the Python-side retry wrapper for MultipartUploadWriter (used for objects at or above S3DLIO_MULTIPART_THRESHOLD_MB, default 16 MiB): - On RuntimeError from writer.write() or writer.close(), calls writer.abort() to free the in-progress upload slot on the server, then sleeps S3DLIO_MPU_RETRY_DELAY_S seconds and retries with a fresh MultipartUploadWriter. - After S3DLIO_MPU_MAX_RETRIES total attempts, raises RuntimeError chained to the original exception so the root cause is not lost. - Logs a WARNING on each retry and ERROR on final failure. ## obj_store_lib.py — single-part PUT retry (Rust layer, transparent) For objects below the multipart threshold, put_data() calls self._s3dlio.put_bytes(id, payload). As of s3dlio 0.9.104, put_bytes() / put_bytes_async() internally run put_verified_with_retry: after every PUT it issues a HEAD to verify the stored byte count, and retries automatically (up to S3DLIO_PUT_MAX_RETRIES, default 3) if there is a mismatch. No Python-layer retry is needed or added for this path — the Rust layer handles it transparently. ## obj_store_lib.py — environment variable documentation Class-level constants block fully annotated with # comments explaining every write-path environment variable, its type, default, allowable values, and the interaction between the two retry layers: S3DLIO_MULTIPART_THRESHOLD_MB (int ≥ 0 MiB, default 16) S3DLIO_MPU_MAX_RETRIES (int ≥ 1, default 3) S3DLIO_MPU_RETRY_DELAY_S (float ≥ 0 s, default 5) S3DLIO_PUT_MAX_RETRIES (int ≥ 1, default 3, Rust layer) S3DLIO_PUT_RETRY_DELAY_MS (int ≥ 0 ms, default 1000, Rust layer) _mpu_upload_with_retry() docstring expanded to NumPy-style with full Parameters, Retry policy, and Raises sections. ## pyproject.toml — bump minimum s3dlio version to 0.9.104 s3dlio>=0.9.102 → s3dlio>=0.9.104 0.9.104 is the first release that includes put_verified_with_retry (single-part integrity) and the multipart __exit__ error-propagation and stored-size verification fixes. Pinning below this version would silently omit the Rust-layer half of the write-integrity guarantee. ## README.md Added two bullet points under "Storage Backends" and "Correctness Fixes" summarising the write-integrity changes for storage#593. ## Tests 84 passed, 1 skipped (dftracer skip is pre-existing) via: uv run pytest tests/test_fast_ci.py
Author
|
Still working on this one. DO NOT MERGE YET! |
6 tasks
Author
|
Superseded by #41 — retargeted from |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses silent write corruption reported in mlcommons/storage#593. Data written during datagen or checkpointing could be silently truncated — the write appeared to succeed but the stored object was shorter than expected.
This PR wires in a two-layer write-integrity architecture that covers every object write path in
ObjStoreLibStorage:put_bytes()(single-part)S3DLIO_MULTIPART_THRESHOLD_MBMultipartUploadWriter_mpu_upload_with_retry()What changed
dlio_benchmark/storage/obj_store_lib.py_mpu_upload_with_retry()— new multipart retry wrapperWraps every
MultipartUploadWritersession with automatic retry onRuntimeError:writer.write()orwriter.close(), callswriter.abort()to free the in-progress upload slot on the server (abort errors are silently swallowed so they do not mask the original failure).WARNINGwith the attempt count and sleepsS3DLIO_MPU_RETRY_DELAY_Sseconds.MultipartUploadWriterand retries.S3DLIO_MPU_MAX_RETRIEStotal attempts, raisesRuntimeErrorchained to the original exception — the root cause is never lost.Single-part path — no Python retry needed
put_data()callsself._s3dlio.put_bytes(id, payload)unchanged. As of s3dlio 0.9.104,put_bytes()internally runsput_verified_with_retry: it issues aHEADafter everyPUTto confirm the stored byte count, deletes and retries on mismatch (up toS3DLIO_PUT_MAX_RETRIES, default 3). This is entirely transparent at the Python level.Environment variable documentation
The class-level constants block is fully annotated explaining all five write-path variables, their types, defaults, allowable values, and how the two retry layers interact:
S3DLIO_MULTIPART_THRESHOLD_MB16S3DLIO_MPU_MAX_RETRIES3S3DLIO_MPU_RETRY_DELAY_S5S3DLIO_PUT_MAX_RETRIES3S3DLIO_PUT_RETRY_DELAY_MS1000_mpu_upload_with_retry()docstring expanded to NumPy-style with Parameters, Retry policy, and Raises sections.pyproject.tomlMinimum s3dlio version bumped from
>=0.9.102to>=0.9.104. Version 0.9.104 is the first release that includesput_verified_with_retry(single-part integrity) and the multipart__exit__error-propagation and stored-size verification fixes. Pinning below this would silently omit the Rust-layer half of the write-integrity guarantee.README.mdTwo new bullet points summarising the write-integrity and correctness fixes for storage#593.
Relationship to s3dlio
The Rust-layer fixes (single-part PUT verification + multipart
__exit__error propagation + stored-size check afterCompleteMultipartUpload) shipped in russfellows/s3dlio PR #145 and are published as s3dlio 0.9.104 on PyPI. This DLIO PR is the companion that:Test results
The 1 skip (
test_dftracer_core) is pre-existing and unrelated to this change.Checklist
s3dlio>=0.9.104inpyproject.toml— Rust-layer integrity guarantee always presentuv.lockregenerated from PyPI (no local wheel path override)_mpu_upload_with_retry()docstring complete with NumPy-style Parameters / Raisesmlcommons/DLIO_local_changes, notargonne-lcf/dlio_benchmark