fix(#593): write integrity for obj_store_lib — multipart retry + s3dlio 0.9.104#40

Closed
russfellows wants to merge 1 commit into main from fix/593-write-integrity-s3dlio-0.9.104

@russfellows

Summary

Addresses silent write corruption reported in mlcommons/storage#593. Data written during datagen or checkpointing could be silently truncated — the write appeared to succeed but the stored object was shorter than expected.

This PR wires in a two-layer write-integrity architecture that covers every object write path in ObjStoreLibStorage:

| Write path | Object size | Where retry lives |
| --- | --- | --- |
| put_bytes() (single-part) | below S3DLIO_MULTIPART_THRESHOLD_MB | Rust layer in s3dlio, transparent to Python |
| MultipartUploadWriter | at or above threshold | Python layer, _mpu_upload_with_retry() |

What changed

dlio_benchmark/storage/obj_store_lib.py

_mpu_upload_with_retry() — new multipart retry wrapper

Wraps every MultipartUploadWriter session with automatic retry on RuntimeError:

  1. On failure from writer.write() or writer.close(), calls writer.abort() to free the in-progress upload slot on the server (abort errors are silently swallowed so they do not mask the original failure).
  2. Logs a WARNING with the attempt count and sleeps S3DLIO_MPU_RETRY_DELAY_S seconds.
  3. Creates a fresh MultipartUploadWriter and retries.
  4. After S3DLIO_MPU_MAX_RETRIES total attempts, raises RuntimeError chained to the original exception — the root cause is never lost.
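
The four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `upload_with_retry`, `make_writer`, and the parameter names are hypothetical, and the writer object is only assumed to expose the `write()`, `close()`, and `abort()` methods named in the steps.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def upload_with_retry(make_writer: Callable[[], object], payload: bytes,
                      max_retries: int = 3, retry_delay_s: float = 5.0) -> None:
    """Upload payload with a fresh writer per attempt, retrying on RuntimeError.

    make_writer is a hypothetical zero-argument factory standing in for
    whatever constructs a MultipartUploadWriter in the real code.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be >= 1")
    last_exc = None
    for attempt in range(1, max_retries + 1):
        writer = make_writer()  # step 3: every attempt gets a fresh writer
        try:
            writer.write(payload)
            writer.close()
            return
        except RuntimeError as exc:
            last_exc = exc
            try:
                writer.abort()  # step 1: release the in-progress upload
            except Exception:
                pass  # abort errors must not mask the original failure
            if attempt < max_retries:
                # step 2: warn with the attempt count, then wait
                logger.warning("multipart upload attempt %d/%d failed: %s",
                               attempt, max_retries, exc)
                time.sleep(retry_delay_s)
    # step 4: give up, chaining the last underlying error
    logger.error("multipart upload failed after %d attempts", max_retries)
    raise RuntimeError(
        f"multipart upload failed after {max_retries} attempts") from last_exc
```

Taking a factory rather than a writer instance keeps the "fresh writer per attempt" rule explicit, and `raise ... from last_exc` is what preserves the root cause in the traceback.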

Single-part path — no Python retry needed

put_data() calls self._s3dlio.put_bytes(id, payload) unchanged. As of s3dlio 0.9.104, put_bytes() internally runs put_verified_with_retry: after every PUT it issues a HEAD to confirm the stored byte count, and on a mismatch it deletes the object and retries (up to S3DLIO_PUT_MAX_RETRIES total attempts, default 3). This is entirely transparent at the Python level.
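
To make the verify-then-retry idea concrete, here is a small Python sketch of the behaviour described above. It is not the Rust implementation in s3dlio: `put_verified` and its `put`, `head_size`, and `delete` callables are hypothetical stand-ins for the storage operations, and only the default values are taken from this description.

```python
import time
from typing import Callable, Optional


def put_verified(put: Callable[[str, bytes], None],
                 head_size: Callable[[str], Optional[int]],
                 delete: Callable[[str], None],
                 key: str, payload: bytes,
                 max_retries: int = 3, retry_delay_ms: int = 1000) -> None:
    """PUT, then confirm the stored size; delete and retry on a mismatch."""
    if max_retries < 1:
        raise ValueError("max_retries must be >= 1")
    expected = len(payload)
    stored = None
    for attempt in range(1, max_retries + 1):
        put(key, payload)
        stored = head_size(key)  # size the store reports for the object
        if stored == expected:
            return
        delete(key)  # do not leave a truncated object behind
        if attempt < max_retries:
            time.sleep(retry_delay_ms / 1000.0)
    raise RuntimeError(
        f"{key}: stored {stored} bytes, expected {expected}, "
        f"after {max_retries} attempts")
```

A size comparison catches truncation, which is the failure mode reported in storage#593; it would not detect corrupted bytes of the correct length.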

Environment variable documentation

The class-level constants block is fully annotated with comments covering all five write-path variables (type, default, allowable values) and how the two retry layers interact:

| Variable | Layer | Default | What it controls |
| --- | --- | --- | --- |
| S3DLIO_MULTIPART_THRESHOLD_MB | Python | 16 | Object size in MiB at or above which multipart is used |
| S3DLIO_MPU_MAX_RETRIES | Python | 3 | Total multipart attempts before raising |
| S3DLIO_MPU_RETRY_DELAY_S | Python | 5 | Seconds between multipart retries |
| S3DLIO_PUT_MAX_RETRIES | Rust | 3 | Total single-part PUT attempts before raising |
| S3DLIO_PUT_RETRY_DELAY_MS | Rust | 1000 | Milliseconds between single-part retries |
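
As an illustration of how the three Python-layer variables might be read and validated, here is a minimal sketch. The helper names `load_mpu_settings` and `uses_multipart` are hypothetical and not part of obj_store_lib.py; the defaults and allowed ranges follow the values listed in this PR. The two Rust-layer variables are left out because s3dlio reads them itself.

```python
import os
from typing import Mapping


def load_mpu_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Parse the Python-layer write-path variables, applying the documented defaults."""
    threshold_mb = int(env.get("S3DLIO_MULTIPART_THRESHOLD_MB", "16"))
    max_retries = int(env.get("S3DLIO_MPU_MAX_RETRIES", "3"))
    retry_delay_s = float(env.get("S3DLIO_MPU_RETRY_DELAY_S", "5"))
    if threshold_mb < 0:
        raise ValueError("S3DLIO_MULTIPART_THRESHOLD_MB must be >= 0")
    if max_retries < 1:
        raise ValueError("S3DLIO_MPU_MAX_RETRIES must be >= 1")
    if retry_delay_s < 0:
        raise ValueError("S3DLIO_MPU_RETRY_DELAY_S must be >= 0")
    return {
        "threshold_bytes": threshold_mb * 1024 * 1024,  # MiB to bytes
        "max_retries": max_retries,
        "retry_delay_s": retry_delay_s,
    }


def uses_multipart(size_bytes: int, settings: dict) -> bool:
    """Objects at or above the threshold take the multipart path."""
    return size_bytes >= settings["threshold_bytes"]
```

Passing the environment as a mapping keeps the parsing testable without touching the real process environment.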

_mpu_upload_with_retry() docstring expanded to NumPy-style with Parameters, Retry policy, and Raises sections.

pyproject.toml

Minimum s3dlio version bumped from >=0.9.102 to >=0.9.104. Version 0.9.104 is the first release that includes put_verified_with_retry (single-part integrity) and the multipart __exit__ error-propagation and stored-size verification fixes. Pinning below this would silently omit the Rust-layer half of the write-integrity guarantee.
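
The pin itself lives in pyproject.toml, but the same floor could also be checked at runtime. The sketch below is a hypothetical guard, not something this PR adds: `parse_release`, `meets_minimum`, and `installed_s3dlio_ok` are illustrative names, and the parser only handles plain dotted numeric releases such as 0.9.104.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Turn '0.9.104' into (0, 9, 104), stopping at the first non-numeric piece."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.9.104") -> bool:
    """Compare release tuples; does not interpret pre-release suffixes."""
    return parse_release(installed) >= parse_release(minimum)


def installed_s3dlio_ok(minimum: str = "0.9.104") -> bool:
    """True if an installed s3dlio distribution satisfies the minimum version."""
    try:
        return meets_minimum(version("s3dlio"), minimum)
    except PackageNotFoundError:
        return False
```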

README.md

Two new bullet points summarising the write-integrity and correctness fixes for storage#593.


Relationship to s3dlio

The Rust-layer fixes (single-part PUT verification + multipart __exit__ error propagation + stored-size check after CompleteMultipartUpload) shipped in russfellows/s3dlio PR #145 and are published as s3dlio 0.9.104 on PyPI. This DLIO PR is the companion that:

  • Adds the Python-side multipart retry loop that completes the write-integrity story.
  • Bumps the minimum s3dlio pin to ensure the Rust half is always present.

Test results

uv run pytest tests/test_fast_ci.py
84 passed, 1 skipped in 17.69s

The 1 skip (test_dftracer_core) is pre-existing and unrelated to this change.


Checklist

  • s3dlio>=0.9.104 in pyproject.toml — Rust-layer integrity guarantee always present
  • uv.lock regenerated from PyPI (no local wheel path override)
  • All five write-path environment variables documented with type, default, and allowable values
  • _mpu_upload_with_retry() docstring complete with NumPy-style Parameters / Raises
  • Fast CI: 84 passed, 1 skipped (pre-existing)
  • PR targets mlcommons/DLIO_local_changes, not argonne-lcf/dlio_benchmark

Addresses silent write-corruption reported in mlcommons/storage#593.
Data written during datagen or checkpointing could be silently truncated;
the write appeared to succeed but the stored object was shorter than
expected.  This commit wires in the two-layer retry/verification
architecture introduced in s3dlio 0.9.104.

## obj_store_lib.py — multipart upload retry (Python layer)

ObjStoreLibStorage._mpu_upload_with_retry() is the Python-side retry
wrapper for MultipartUploadWriter (used for objects at or above
S3DLIO_MULTIPART_THRESHOLD_MB, default 16 MiB):

- On RuntimeError from writer.write() or writer.close(), calls
  writer.abort() to free the in-progress upload slot on the server,
  then sleeps S3DLIO_MPU_RETRY_DELAY_S seconds and retries with a
  fresh MultipartUploadWriter.
- After S3DLIO_MPU_MAX_RETRIES total attempts, raises RuntimeError
  chained to the original exception so the root cause is not lost.
- Logs a WARNING on each retry and ERROR on final failure.

## obj_store_lib.py — single-part PUT retry (Rust layer, transparent)

For objects below the multipart threshold, put_data() calls
self._s3dlio.put_bytes(id, payload).  As of s3dlio 0.9.104,
put_bytes() / put_bytes_async() internally run put_verified_with_retry:
after every PUT it issues a HEAD to verify the stored byte count, and
retries automatically (up to S3DLIO_PUT_MAX_RETRIES, default 3) if
there is a mismatch.  No Python-layer retry is needed or added for
this path — the Rust layer handles it transparently.

## obj_store_lib.py — environment variable documentation

Class-level constants block fully annotated with # comments explaining
every write-path environment variable, its type, default, allowable
values, and the interaction between the two retry layers:

  S3DLIO_MULTIPART_THRESHOLD_MB  (int ≥ 0 MiB, default 16)
  S3DLIO_MPU_MAX_RETRIES         (int ≥ 1, default 3)
  S3DLIO_MPU_RETRY_DELAY_S       (float ≥ 0 s, default 5)
  S3DLIO_PUT_MAX_RETRIES         (int ≥ 1, default 3, Rust layer)
  S3DLIO_PUT_RETRY_DELAY_MS      (int ≥ 0 ms, default 1000, Rust layer)

_mpu_upload_with_retry() docstring expanded to NumPy-style with full
Parameters, Retry policy, and Raises sections.

## pyproject.toml — bump minimum s3dlio version to 0.9.104

  s3dlio>=0.9.102  →  s3dlio>=0.9.104

0.9.104 is the first release that includes put_verified_with_retry
(single-part integrity) and the multipart __exit__ error-propagation
and stored-size verification fixes.  Pinning below this version would
silently omit the Rust-layer half of the write-integrity guarantee.

## README.md

Added two bullet points under "Storage Backends" and "Correctness
Fixes" summarising the write-integrity changes for storage#593.

## Tests

84 passed, 1 skipped (dftracer skip is pre-existing) via:
  uv run pytest tests/test_fast_ci.py
russfellows requested a review from a team June 30, 2026 20:44
russfellows marked this pull request as draft June 30, 2026 21:10
russfellows self-assigned this Jun 30, 2026
@russfellows

Author

Still working on this one.

DO NOT MERGE YET!

@russfellows

Author

Superseded by #41 — retargeted from fix/593-write-integrity-s3dlio-0.9.104 to fix/593-write-integrity-s3dlio-0.9.106 after s3dlio 0.9.106 changed write verification from always-on to opt-in (default off). Same underlying storage#593 fix, plus the opt-in flag wiring.

russfellows closed this Jul 1, 2026
russfellows deleted the fix/593-write-integrity-s3dlio-0.9.104 branch July 1, 2026 03:35