fix(#584): _is_object_storage recognises data_access_protocol=='object' #585
Merged
russfellows merged 3 commits on Jun 29, 2026
validation_helpers._is_object_storage decided object-vs-local using only two signals: an explicit `--params storage.storage_type=s3`, or an `s3://`-scheme on data_dir / checkpoint_folder. It did NOT consult `args.data_access_protocol == 'object'`, the canonical signal set by the `object` positional, and the signal every other site already keys on (CAP-01 gate per #568/#579, dlio.py, run_summary, cli_parser, every CLI module).

For a bare `mlpstorage … run object --data-dir data/unet3d …`:
- data_access_protocol='object' is set at parse time ✓
- storage.storage_type=s3 is injected onto DLIOBenchmark.params_dict AFTER parsing (mlpstorage_py/benchmarks/dlio.py:202-203), so it never reaches args.params at validation time
- the bare data_dir has no s3:// scheme

So _is_object_storage returned False, _validate_paths fell through to os.path.exists() on a bucket-key string, and the run died with `[E401] Data directory not found`. Only --skip-validation could unblock it, which also disables the MPI/SSH/DLIO checks: too broad.

The fix is one line: check data_access_protocol first, then fall through to the legacy signals. Strictly additive: every config that returned True before still does.

Tests:
- TestIsObjectStorage (9 cases): canonical signal triggers True, file mode does NOT misclassify, missing attr falls through to legacy signals, all three legacy signals still work, unrelated --params don't flip the gate, empty args don't raise.
- TestValidatePathsObjectStorageBypass (3 cases): the exact #584 reproducer config now returns no errors; file mode with a missing data_dir still errors (guardrail against bypass leakage); checkpoint parent-dir check is also bypassed for object mode.

Resolves #584.
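The signal ordering described above can be pictured with a minimal sketch. This is not the project's actual code: the function name `is_object_storage_sketch` is hypothetical, and the sketch assumes `args` is an `argparse.Namespace` whose `params` attribute is a flat list of `key=value` strings (the real layout of `args.params` may differ).

```python
from argparse import Namespace


def is_object_storage_sketch(args) -> bool:
    """Hypothetical stand-in for validation_helpers._is_object_storage."""
    # Canonical signal first: set at parse time by the `object` positional.
    if getattr(args, "data_access_protocol", None) == "object":
        return True

    # Legacy signal: explicit --params storage.storage_type=s3 (or =object).
    # Assumption: params is a flat list of "key=value" strings.
    for entry in getattr(args, "params", None) or []:
        key, _, value = str(entry).partition("=")
        if key.strip() == "storage.storage_type" and value.strip() in ("s3", "object"):
            return True

    # Legacy signal: s3:// scheme on data_dir or checkpoint_folder.
    for attr in ("data_dir", "checkpoint_folder"):
        path = getattr(args, attr, None)
        if isinstance(path, str) and path.startswith("s3://"):
            return True

    return False


# The #584 shape: object positional, bare relative data_dir, no --params.
reproducer = Namespace(data_access_protocol="object", data_dir="data/unet3d")
print(is_object_storage_sketch(reproducer))
```

Because the new check only adds a way to return True and runs before the legacy checks, any configuration the legacy signals accepted is still accepted.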
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
russfellows approved these changes on Jun 29, 2026
russfellows (Contributor) left a comment:
Not sure all this is needed to determine if we have object storage, but Ok.
Summary
Resolves #584.
`validation_helpers._is_object_storage(args)` decided object-vs-local using only two signals:
- `--params storage.storage_type=s3` present in `args.params`
- `s3://`-scheme prefix on `args.data_dir` / `args.checkpoint_folder`

It did not consult `args.data_access_protocol == 'object'`, the canonical signal set by the `object` positional, and the signal every other site already keys on (CAP-01 gate per #568/#579, `dlio.py`, `run_summary`, `cli_parser`, every CLI module).

For a bare `mlpstorage … run object --data-dir data/unet3d …`:
- `data_access_protocol='object'` is set at parse time ✓
- `storage.storage_type=s3` is injected onto `DLIOBenchmark.params_dict` after parsing (`mlpstorage_py/benchmarks/dlio.py:202-203`), so it never reaches `args.params` at validation time
- the bare `--data-dir` has no `s3://` scheme

So `_is_object_storage` returned `False`, `_validate_paths` fell through to `os.path.exists()` on what is conceptually a bucket key, and the run died with `[E401] Data directory not found`. Only `--skip-validation` could unblock it, which also disables the MPI / SSH / DLIO checks: too broad.

The fix
One line, additive: check `data_access_protocol` first, then fall through to the legacy signals. Every config that returned `True` before still does.

Tests
12 new tests, all pass; full unit suite stays green.
`TestIsObjectStorage` (9 cases):
- canonical signal (`data_access_protocol='object'`) → `True` (the #584 regression lock)
- `data_access_protocol='file'` → `False` (must not misclassify file mode)
- missing `data_access_protocol` attr → falls through to legacy signals
- `--params storage.storage_type=s3` → `True` (and `=object` variant)
- `s3://` on `data_dir` → `True`
- `s3://` on `checkpoint_folder` → `True`
- unrelated `--params` entries don't flip the gate
- empty `Namespace()` doesn't raise

`TestValidatePathsObjectStorageBypass` (3 cases, integration through `_validate_paths`):
- the #584 reproducer config (`object` positional, bare relative `data_dir`, no `--params`) → no errors
- file mode with a missing `data_dir` still errors (the bypass must not leak)
- checkpoint parent-dir check is also bypassed for object mode

Verification
- `uv run pytest tests/unit/test_validation_helpers.py -v` → 46 passed (34 existing + 12 new).
- `uv run pytest tests/unit -q` → 2303 passed. No regressions.
- The reproducer (`mlpstorage closed training unet3d run object --data-dir data/unet3d …`, no `--skip-validation`) should now pass environment validation.

Notes
There is a second `_is_object_storage`: `DLIOBenchmark._is_object_storage()` in `benchmarks/dlio.py`, scoped to the CAP-01 capacity gate with access to `params_dict`. Same name, different scope. They could converge into a shared helper eventually, but #584 is the user-facing CLI-validation bug and is solved in isolation here.
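The path-validation bypass that `TestValidatePathsObjectStorageBypass` exercises can be illustrated with a small sketch. The names `validate_paths_sketch` and `is_object_mode_sketch` are hypothetical, the gate is reduced to the canonical signal only, and only the `data_dir` check is modelled; the real `_validate_paths` likely does more (for example, the checkpoint parent-dir check).

```python
import os
from argparse import Namespace


def is_object_mode_sketch(args) -> bool:
    # Simplified gate: only the canonical signal, for illustration.
    return getattr(args, "data_access_protocol", None) == "object"


def validate_paths_sketch(args) -> list:
    """Hypothetical stand-in for _validate_paths; returns a list of error strings."""
    errors = []
    if is_object_mode_sketch(args):
        # Bucket keys are not local paths, so local existence checks are skipped.
        return errors
    data_dir = getattr(args, "data_dir", None)
    if data_dir and not os.path.exists(data_dir):
        # The error code and message come from the PR text; the suffix is an assumption.
        errors.append(f"[E401] Data directory not found: {data_dir}")
    return errors


# Object mode with a bare relative key: no local path errors expected.
object_args = Namespace(data_access_protocol="object", data_dir="no-such-dir-gr584/unet3d")
# File mode with the same missing directory: the guardrail still fires.
file_args = Namespace(data_access_protocol="file", data_dir="no-such-dir-gr584/unet3d")

print(validate_paths_sketch(object_args))
print(validate_paths_sketch(file_args))
```

The second call is the guardrail case: the bypass applies only when the object gate is true, so a missing local directory in file mode is still reported.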