fix(#583): bridge URI scheme between DLIO preflight (bare) and checkpoint writer (qualified)#586
Merged
Merged
Conversation
…cheme Adds 21 tests covering the asymmetric-scheme requirement issue #583 identifies: DLIO's ObjStoreLibStorage preflight wants bare bucket/prefix (prepends scheme itself, the #392 pattern), while the streaming writer in mlpstorage_py/checkpointing/storage_writers/ wants a scheme-qualified URI to dispatch the right backend. Locks the planned bridge: tests/unit/test_dlio_object_storage.py — TestAddCheckpointParamsSchemeStripping - strips s3 / az / gs schemes in object mode (mirrors #459 for storage_root) - leaves bare paths alone - leaves file-mode mismatches alone (preserves _check_storage_scheme_consistency guardrail that catches --file --checkpoint-folder s3://… user mistakes) - leaves direct_fs (--o-direct) alone (local path, no preflight issue) - sets MLPSTORAGE_CHECKPOINT_URI_SCHEME env when stripping - does NOT set env for bare paths, file mode, or empty checkpoint_folder tests/unit/test_checkpoint_writer_scheme.py — new file, three classes: - TestNormalizeCheckpointURI: the helper itself (no scheme + env set → prepend; otherwise unchanged; empty env value treated as unset) - TestStorageWriterFactoryNormalization: factory runs the helper before dispatch on both auto-detect and explicit backend='s3dlio' paths; bare path + no env still defaults to FileStorageWriter (existing behavior) - TestS3DLIOStorageWriterNormalization: writer __init__ runs helper too; unsupported-scheme error still reachable when env unset Also extends the dep-stub block to handle find_spec raising after a parent module has been MagicMock'd (same pattern as the #568 fix to test_capacity_gate.py). 14 fail as expected with AttributeError / ImportError / wrong dispatch; 7 pass (negative-case tests verifying no regression). Fix lands next.
…lified)
Object-mode checkpointing (mlpstorage closed checkpointing run object
--checkpoint-folder s3://...) cannot write to S3 today. Two consumers
of checkpoint_folder want opposite forms:
- DLIO's ObjStoreLibStorage._preflight prepends the scheme itself
(the #392 pattern), so it wants a bare bucket/prefix. Feeding it
s3://... produces s3://s3://... and aborts the run.
- mlpstorage's streaming-checkpoint writer (S3DLIOStorageWriter)
auto-dispatches by URI scheme, so it wants the scheme-qualified
form. A bare path lands on FileStorageWriter (silently writing to
the local FS) or raises Unsupported URI scheme.
#459 fixed the same double-prefix bug for storage.storage_root by
stripping the scheme in process_dlio_params. This PR extends the same
fix to checkpoint_folder, conditioned on object storage mode so the
file-mode mismatch guardrail in _check_storage_scheme_consistency keeps
catching --file --checkpoint-folder s3://... user mistakes.
Bridge to the writer is an env var: MLPSTORAGE_CHECKPOINT_URI_SCHEME.
mlpstorage sets it in the parent process before launching DLIO; DLIO
inherits, forks the writer; the writer reads it in
_normalize_checkpoint_uri and reconstructs s3://bucket/... at dispatch
time. Both the StorageWriterFactory and S3DLIOStorageWriter call sites
go through the helper.
Conditions on storage_type in {'s3','s3_torch'} — the same signal #581
keys on in DLIOBenchmark._is_object_storage(). After #581 lands this
can refactor to call the helper directly.
direct_fs (--o-direct) is unaffected: its namespace is a local path,
no preflight double-prefix problem to solve.
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
russfellows
approved these changes
Jun 29, 2026
russfellows
left a comment
Contributor
There was a problem hiding this comment.
Last one for today... maybe.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #583. Recreates @gaikwadabhishek's analysis as a CLA-clean implementation — credit for the bug report and root-cause writeup is his.
The bug
mlpstorage closed checkpointing run object --checkpoint-folder s3://…cannot write to S3. Two consumers ofcheckpoint_folderwant opposite forms:ObjStoreLibStorage._preflight) prepends the scheme itself (the Datagen to S3 fails using s3dlio #392 pattern), so it wants a bare bucket/prefix. Feeding its3://…producess3://s3://…and aborts.S3DLIOStorageWriter) auto-dispatches by URI scheme, so it wants the scheme-qualified form. A bare path either lands silently onFileStorageWriter(writing to the local FS instead of S3 — data integrity failure) or raisesUnsupported URI scheme.Neither workaround works today — both shapes break one of the two consumers.
The fix
Mirror the #459 storage_root fix for
checkpoint_folder, plus a one-env-var bridge to the writer subprocess:CheckpointingBenchmark.add_checkpoint_params— strip the URI scheme fromcheckpoint_folderso DLIO preflight gets a bare namespace. SetMLPSTORAGE_CHECKPOINT_URI_SCHEME=<scheme>in the parent env. Conditioned onstorage.storage_type in {'s3','s3_torch'}so the file-mode mismatch guardrail in_check_storage_scheme_consistency(which catches--file --checkpoint-folder s3://…user mistakes) keeps working.direct_fs(--o-direct) is unaffected — its namespace is a local path.storage_writers/_normalize_checkpoint_uri— new helper. If URI lacks a scheme AND env var is set, prepend{scheme}://. No-op otherwise.StorageWriterFactory.createandS3DLIOStorageWriter.__init__— both call the helper before dispatch so the writer subprocess (forked from DLIO, inherits env) reconstructs the qualified URI cleanly.Test plan
tests/unit/test_dlio_object_storage.py(10 new inTestAddCheckpointParamsSchemeStripping) and a newtests/unit/test_checkpoint_writer_scheme.py(11 tests: 6 for the helper, 4 for factory, 2 for writer init). 14 fail withAttributeError/ImportError/ wrong dispatch; 7 pass (negative-case regression guards).tests/: 2406 passed, 13 deselectedmlpstorage_py/tests: 780 passed, 1 xfailedvdb_benchmark/tests: 144 passedkv_cache_benchmark/tests: 238 passedmlpstorage closed checkpointing run object --checkpoint-folder s3://… --model llama3-8b …) writes a full checkpoint to the bucket without[E401]orUnsupported URI scheme.Relationship to other open PRs
End-to-end object-mode checkpointing also requires:
validation_helpers._is_object_storagefix for_is_object_storage()misses theobjectpositional (data_access_protocol) → object-mode runs fail[E401] Data directory not found/ checkpoint parent-dir checks #584) — without it the[E401] Data directory not foundvalidation error fires before reaching this code path. Independent fix, different file.object(S3) mode —statvfsruns on thes3://URI #568) — uses the samestorage.storage_typesignal as this PR but insideDLIOBenchmark._is_object_storage(). This PR's inline check could refactor to call that helper once both land. Different functions in the same file; rebases cleanly either order.All three rebase cleanly regardless of merge order; together they unblock object-mode runs end-to-end.
Files
mlpstorage_py/benchmarks/dlio.py—add_checkpoint_paramsscheme strip + env var; import forCHECKPOINT_URI_SCHEME_ENV.mlpstorage_py/checkpointing/storage_writers/__init__.py— new_normalize_checkpoint_urihelper + factory wiring;CHECKPOINT_URI_SCHEME_ENVconstant.mlpstorage_py/checkpointing/storage_writers/s3dlio_writer.py— call the helper in__init__.tests/unit/test_dlio_object_storage.py— newTestAddCheckpointParamsSchemeStripping(10 tests).tests/unit/test_checkpoint_writer_scheme.py— new file (3 test classes, 11 tests).