fix(#583): bridge URI scheme between DLIO preflight (bare) and checkpoint writer (qualified) #586

Merged
russfellows merged 5 commits into main from fix/583-checkpoint-folder-scheme on Jun 29, 2026

Conversation

@FileSystemGuy

Contributor

Fixes #583. Recreates @gaikwadabhishek's analysis as a CLA-clean implementation — credit for the bug report and root-cause writeup is his.

The bug

mlpstorage closed checkpointing run object --checkpoint-folder s3://… cannot write to S3. Two consumers of checkpoint_folder want opposite forms:

  • DLIO preflight (ObjStoreLibStorage._preflight) prepends the scheme itself (the #392 pattern; see "Datagen to S3 fails using s3dlio"), so it wants a bare bucket/prefix. Feeding it s3://… produces s3://s3://… and aborts.
  • mlpstorage's streaming-checkpoint writer (S3DLIOStorageWriter) auto-dispatches by URI scheme, so it wants the scheme-qualified form. A bare path either lands silently on FileStorageWriter (writing to the local FS instead of S3 — data integrity failure) or raises Unsupported URI scheme.

Neither workaround works today — both shapes break one of the two consumers.
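
The conflict can be illustrated with a toy model. This is only a sketch: `preflight_namespace` and `pick_writer` are hypothetical stand-ins for the two consumers, not the real DLIO or mlpstorage code, and the set of object schemes is an assumption.

```python
from urllib.parse import urlparse

# Assumption: the object-store schemes the dispatcher recognises.
OBJECT_SCHEMES = {"s3", "az", "gs"}


def preflight_namespace(checkpoint_folder: str, scheme: str = "s3") -> str:
    """Hypothetical stand-in for a preflight that always prepends the scheme."""
    return f"{scheme}://{checkpoint_folder}"


def pick_writer(uri: str) -> str:
    """Hypothetical stand-in for a factory that dispatches on the URI scheme."""
    scheme = urlparse(uri).scheme
    if scheme == "":
        return "file"      # bare path falls back to the local filesystem
    if scheme in OBJECT_SCHEMES:
        return "object"
    raise ValueError(f"Unsupported URI scheme: {scheme}")


# The qualified form breaks the preflight by doubling the scheme ...
doubled = preflight_namespace("s3://bucket/ckpt")
# ... while the bare form sends the writer to the wrong backend.
wrong_backend = pick_writer("bucket/ckpt")
```

With a single `checkpoint_folder` value, one of the two calls always gets the form it cannot handle.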

The fix

Mirror the #459 storage_root fix for checkpoint_folder, plus a one-env-var bridge to the writer subprocess:

  1. CheckpointingBenchmark.add_checkpoint_params — strip the URI scheme from checkpoint_folder so DLIO preflight gets a bare namespace. Set MLPSTORAGE_CHECKPOINT_URI_SCHEME=<scheme> in the parent env. Conditioned on storage.storage_type in {'s3','s3_torch'} so the file-mode mismatch guardrail in _check_storage_scheme_consistency (which catches --file --checkpoint-folder s3://… user mistakes) keeps working. direct_fs (--o-direct) is unaffected — its namespace is a local path.
  2. storage_writers/_normalize_checkpoint_uri — new helper. If URI lacks a scheme AND env var is set, prepend {scheme}://. No-op otherwise.
  3. StorageWriterFactory.create and S3DLIOStorageWriter.__init__ — both call the helper before dispatch so the writer subprocess (forked from DLIO, inherits env) reconstructs the qualified URI cleanly.
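
The strip-then-reconstruct round trip in the steps above might look roughly like the following. This is a minimal sketch, not the merged code: the function names, signatures, and the explicit `env` argument are assumptions made so the example is self-contained. Only the env var name, the constant name, and the `{'s3','s3_torch'}` condition come from this PR.

```python
import os
from typing import MutableMapping, Optional

CHECKPOINT_URI_SCHEME_ENV = "MLPSTORAGE_CHECKPOINT_URI_SCHEME"
OBJECT_STORAGE_TYPES = {"s3", "s3_torch"}


def strip_checkpoint_scheme(
    checkpoint_folder: str,
    storage_type: str,
    env: MutableMapping[str, str],
) -> str:
    """Parent side (hypothetical): strip the scheme in object mode only."""
    if storage_type not in OBJECT_STORAGE_TYPES or "://" not in checkpoint_folder:
        # File mode, bare paths, and empty values pass through untouched,
        # so a scheme/mode mismatch can still be caught elsewhere.
        return checkpoint_folder
    scheme, _, bare = checkpoint_folder.partition("://")
    env[CHECKPOINT_URI_SCHEME_ENV] = scheme  # remember it for the writer
    return bare


def normalize_checkpoint_uri(
    uri: str, env: Optional[MutableMapping[str, str]] = None
) -> str:
    """Writer side (hypothetical): re-qualify a bare URI from the env var."""
    env = os.environ if env is None else env
    scheme = env.get(CHECKPOINT_URI_SCHEME_ENV, "")
    if "://" in uri or not scheme:
        return uri  # already qualified, or env unset/empty: no-op
    return f"{scheme}://{uri}"


# Round trip with an explicit dict standing in for the process environment.
fake_env: dict = {}
bare = strip_checkpoint_scheme("s3://bucket/ckpt", "s3", fake_env)
qualified = normalize_checkpoint_uri(bare, fake_env)
```

Passing the environment as an argument is a choice made here for testability; the description above says the real helper reads the inherited process environment.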

Test plan

  • RED commit (5d2b5cc) adds 21 tests across tests/unit/test_dlio_object_storage.py (10 new in TestAddCheckpointParamsSchemeStripping) and a new tests/unit/test_checkpoint_writer_scheme.py (11 tests: 6 for the helper, 4 for factory, 2 for writer init). 14 fail with AttributeError / ImportError / wrong dispatch; 7 pass (negative-case regression guards).
  • GREEN commit (22a99d4) makes all 21 pass. No regressions:
    • tests/: 2406 passed, 13 deselected
    • mlpstorage_py/tests: 780 passed, 1 xfailed
    • vdb_benchmark/tests: 144 passed
    • kv_cache_benchmark/tests: 238 passed
  • On-cluster: reporter's repro (mlpstorage closed checkpointing run object --checkpoint-folder s3://… --model llama3-8b …) writes a full checkpoint to the bucket without [E401] or Unsupported URI scheme.

Relationship to other open PRs

End-to-end object-mode checkpointing also requires:

All three rebase cleanly regardless of merge order; together they unblock object-mode runs end-to-end.

Files

  • mlpstorage_py/benchmarks/dlio.py: add_checkpoint_params scheme strip + env var; import for CHECKPOINT_URI_SCHEME_ENV.
  • mlpstorage_py/checkpointing/storage_writers/__init__.py — new _normalize_checkpoint_uri helper + factory wiring; CHECKPOINT_URI_SCHEME_ENV constant.
  • mlpstorage_py/checkpointing/storage_writers/s3dlio_writer.py — call the helper in __init__.
  • tests/unit/test_dlio_object_storage.py — new TestAddCheckpointParamsSchemeStripping (10 tests).
  • tests/unit/test_checkpoint_writer_scheme.py — new file (3 test classes, 11 tests).

…cheme

Adds 21 tests covering the asymmetric-scheme requirement issue #583
identifies: DLIO's ObjStoreLibStorage preflight wants bare bucket/prefix
(prepends scheme itself, the #392 pattern), while the streaming writer
in mlpstorage_py/checkpointing/storage_writers/ wants a scheme-qualified
URI to dispatch the right backend.

Locks the planned bridge:

tests/unit/test_dlio_object_storage.py — TestAddCheckpointParamsSchemeStripping
  - strips s3 / az / gs schemes in object mode (mirrors #459 for storage_root)
  - leaves bare paths alone
  - leaves file-mode mismatches alone (preserves _check_storage_scheme_consistency
    guardrail that catches --file --checkpoint-folder s3://… user mistakes)
  - leaves direct_fs (--o-direct) alone (local path, no preflight issue)
  - sets MLPSTORAGE_CHECKPOINT_URI_SCHEME env when stripping
  - does NOT set env for bare paths, file mode, or empty checkpoint_folder

tests/unit/test_checkpoint_writer_scheme.py — new file, three classes:
  - TestNormalizeCheckpointURI: the helper itself (no scheme + env set
    → prepend; otherwise unchanged; empty env value treated as unset)
  - TestStorageWriterFactoryNormalization: factory runs the helper
    before dispatch on both auto-detect and explicit backend='s3dlio' paths;
    bare path + no env still defaults to FileStorageWriter (existing behavior)
  - TestS3DLIOStorageWriterNormalization: writer __init__ runs helper too;
    unsupported-scheme error still reachable when env unset

Also extends the dep-stub block to handle find_spec raising after a
parent module has been MagicMock'd (same pattern as the #568 fix to
test_capacity_gate.py).

14 fail as expected with AttributeError / ImportError / wrong dispatch;
7 pass (negative-case tests verifying no regression). Fix lands next.
…lified)

Object-mode checkpointing (mlpstorage closed checkpointing run object
--checkpoint-folder s3://...) cannot write to S3 today. Two consumers
of checkpoint_folder want opposite forms:

  - DLIO's ObjStoreLibStorage._preflight prepends the scheme itself
    (the #392 pattern), so it wants a bare bucket/prefix. Feeding it
    s3://... produces s3://s3://... and aborts the run.
  - mlpstorage's streaming-checkpoint writer (S3DLIOStorageWriter)
    auto-dispatches by URI scheme, so it wants the scheme-qualified
    form. A bare path lands on FileStorageWriter (silently writing to
    the local FS) or raises Unsupported URI scheme.

#459 fixed the same double-prefix bug for storage.storage_root by
stripping the scheme in process_dlio_params. This PR extends the same
fix to checkpoint_folder, conditioned on object storage mode so the
file-mode mismatch guardrail in _check_storage_scheme_consistency keeps
catching --file --checkpoint-folder s3://... user mistakes.

Bridge to the writer is an env var: MLPSTORAGE_CHECKPOINT_URI_SCHEME.
mlpstorage sets it in the parent process before launching DLIO; DLIO
inherits, forks the writer; the writer reads it in
_normalize_checkpoint_uri and reconstructs s3://bucket/... at dispatch
time. Both the StorageWriterFactory and S3DLIOStorageWriter call sites
go through the helper.
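
The env-var handoff described above relies on ordinary environment inheritance by child processes. A rough, self-contained sketch of that mechanism follows. The child script and `run_child_with_scheme` are hypothetical and only mimic the writer side; the real chain goes mlpstorage, then DLIO, then the writer.

```python
import os
import subprocess
import sys

CHECKPOINT_URI_SCHEME_ENV = "MLPSTORAGE_CHECKPOINT_URI_SCHEME"

# Hypothetical child: reads the inherited variable and re-qualifies a bare path.
CHILD_CODE = """
import os
scheme = os.environ.get("MLPSTORAGE_CHECKPOINT_URI_SCHEME", "")
bare = "bucket/ckpt"
print(f"{scheme}://{bare}" if scheme else bare)
"""


def run_child_with_scheme(scheme):
    """Launch a child interpreter with (or without) the scheme variable set."""
    env = dict(os.environ)                    # copy, so the parent is unchanged
    env.pop(CHECKPOINT_URI_SCHEME_ENV, None)  # start from a clean slate
    if scheme:
        env[CHECKPOINT_URI_SCHEME_ENV] = scheme
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `run_child_with_scheme("s3")` returns the qualified URI seen by the child, while `run_child_with_scheme(None)` returns the bare path, matching the no-op behaviour when the variable is unset.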

Conditions on storage_type in {'s3','s3_torch'} — the same signal #581
keys on in DLIOBenchmark._is_object_storage(). After #581 lands this
can refactor to call the helper directly.

direct_fs (--o-direct) is unaffected: its namespace is a local path,
no preflight double-prefix problem to solve.
@FileSystemGuy requested a review from a team on June 29, 2026 22:35
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@russfellows left a comment

Contributor

Last one for today... maybe.

@russfellows merged commit 237add7 into main on Jun 29, 2026
3 checks passed
@russfellows deleted the fix/583-checkpoint-folder-scheme branch on June 29, 2026 23:07
@github-actions locked and limited conversation to collaborators on Jun 29, 2026