Skip to content

fix(#584): _is_object_storage recognises data_access_protocol=='object'#585

Merged
russfellows merged 3 commits into
mainfrom
fix/584-is-object-storage-data-access-protocol
Jun 29, 2026
Merged

fix(#584): _is_object_storage recognises data_access_protocol=='object'#585
russfellows merged 3 commits into
mainfrom
fix/584-is-object-storage-data-access-protocol

Conversation

@FileSystemGuy

Copy link
Copy Markdown
Contributor

Summary

Resolves #584.

validation_helpers._is_object_storage(args) decided object-vs-local using only two signals:

  1. --params storage.storage_type=s3 present in args.params
  2. s3://-scheme prefix on args.data_dir / args.checkpoint_folder

It did not consult args.data_access_protocol == 'object' — the canonical signal set by the object positional, and the signal every other site already keys on (CAP-01 gate per #568/#579, dlio.py, run_summary, cli_parser, every CLI module).

For a bare mlpstorage … run object --data-dir data/unet3d …:

  • data_access_protocol='object' is set at parse time ✓
  • storage.storage_type=s3 is injected onto DLIOBenchmark.params_dict after parsing (mlpstorage_py/benchmarks/dlio.py:202-203), so it never reaches args.params at validation time
  • the bare --data-dir has no s3:// scheme

_is_object_storage returned False, _validate_paths fell through to os.path.exists() on what is conceptually a bucket key, and the run died with [E401] Data directory not found. Only --skip-validation could unblock it, which also disables the MPI / SSH / DLIO checks — too broad.

The fix

One line, additive — check data_access_protocol first, then fall through to the legacy signals. Every config that returned True before still does.

def _is_object_storage(args) -> bool:
    if getattr(args, 'data_access_protocol', None) == 'object':
        return True
    # … existing legacy signals unchanged …

Tests

12 new tests, all pass; full unit suite stays green.

TestIsObjectStorage (9 cases):

TestValidatePathsObjectStorageBypass (3 cases — integration through _validate_paths):

Verification

  • uv run pytest tests/unit/test_validation_helpers.py -v → 46 passed (34 existing + 12 new).
  • uv run pytest tests/unit -q2303 passed. No regressions.
  • On-cluster: reporter's repro (mlpstorage closed training unet3d run object --data-dir data/unet3d …, no --skip-validation) should now pass environment validation.

Notes

validation_helpers._is_object_storage decided object-vs-local using only
two signals: an explicit `--params storage.storage_type=s3`, or an
`s3://`-scheme on data_dir / checkpoint_folder. It did NOT consult
`args.data_access_protocol == 'object'` — the canonical signal set by
the `object` positional, and the signal every other site already keys
on (CAP-01 gate per #568/#579, dlio.py, run_summary, cli_parser, every
CLI module).

For a bare `mlpstorage … run object --data-dir data/unet3d …`:
  - data_access_protocol='object' is set at parse time ✓
  - storage.storage_type=s3 is injected onto DLIOBenchmark.params_dict
    AFTER parsing (mlpstorage_py/benchmarks/dlio.py:202-203), never
    reaches args.params at validation time
  - the bare data_dir has no s3:// scheme

So _is_object_storage returned False, _validate_paths fell through to
os.path.exists() on a bucket-key string, and the run died with
`[E401] Data directory not found`. Only --skip-validation could
unblock it, which also disables the MPI/SSH/DLIO checks — too broad.

The fix is one line: check data_access_protocol first, then fall
through to the legacy signals. Strictly additive — every config that
returned True before still does.

Tests:
- TestIsObjectStorage (9 cases): canonical signal triggers True, file
  mode does NOT misclassify, missing attr falls through to legacy
  signals, all three legacy signals still work, unrelated --params
  don't flip the gate, empty args don't raise.
- TestValidatePathsObjectStorageBypass (3 cases): the exact #584
  reproducer config now returns no errors; file mode with a missing
  data_dir still errors (guardrail against bypass leakage); checkpoint
  parent-dir check is also bypassed for object mode.

Resolves #584.
@FileSystemGuy FileSystemGuy requested a review from a team June 29, 2026 22:14
@github-actions

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@russfellows russfellows left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure all this is needed to determine if we have object storage, but Ok.

@russfellows russfellows merged commit 5f8c349 into main Jun 29, 2026
3 checks passed
@russfellows russfellows deleted the fix/584-is-object-storage-data-access-protocol branch June 29, 2026 23:02
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 29, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

2 participants