Skip to content

fix(reader): O_DIRECT path triggers on uri_scheme=direct (mlcommons/storage#567)#39

Merged
russfellows merged 1 commit into
mainfrom
fix/direct-fs-storage-library-sentinel-mismatch
Jun 29, 2026
Merged

fix(reader): O_DIRECT path triggers on uri_scheme=direct (mlcommons/storage#567)#39
russfellows merged 1 commit into
mainfrom
fix/direct-fs-storage-library-sentinel-mismatch

Conversation

@FileSystemGuy

@FileSystemGuy FileSystemGuy commented Jun 29, 2026

Copy link
Copy Markdown

Summary

Fixes a storage_library sentinel mismatch that made --o-direct / storage_type=direct_fs crash on every local NPZ / NPY / JPEG workload (UNet3D, ResNet, …) in the warmup batch with FileNotFoundError: 'direct:///mnt/…/foo.npz'.

  • dlio_benchmark/utils/config.py validates that storage_type=direct_fs requires storage_options.storage_library == "s3dlio".
  • dlio_benchmark/reader/_local_fs_iterable_mixin.py previously selected the s3dlio O_DIRECT prefetch path only when storage_options.storage_library == "direct".

No value of storage_library satisfied both, so _use_direct was always False for direct_fs runs. The buffered path was taken, plain open() was handed a direct://… URI, and the run crashed.

The fix

Select the O_DIRECT path when EITHER:

  • storage_options.uri_scheme == "direct"canonical signal, matches what direct_fs validation produces (and what mlpstorage's --o-direct emits).
  • storage_options.storage_library == "direct" — legacy single-knob form, preserved for backward compatibility.

uri_scheme is the semantically honest gate: it directly asks "is this a direct:// URI?" — exactly the precondition _prefetch_direct needs. storage_library is a coarser library-selector that must keep meaning something different for real S3 callers, where storage_library=s3dlio + uri_scheme=s3 is correct and must NOT trigger local O_DIRECT.

Test plan

  • New unit suite tests/test_direct_fs_iterable_mixin_gate.py — 13 tests, all pass:
    • Locks the exact storage#567 config shape (storage_library=s3dlio + uri_scheme=direct) → _use_direct=True.
    • uri_scheme=direct alone is sufficient.
    • Legacy storage_library=direct alone still works.
    • storage_library=s3dlio + uri_scheme in {s3, file} does NOT spuriously trigger local O_DIRECT (S3 / file:// regression guards).
    • Empty / None storage_options → buffered fallback, no crash.
    • Bookkeeping counters seeded on every path.
  • Fast CI suite: tests/test_fast_ci.py — 84 passed, 1 skipped.
  • Adjacent suites unaffected: test_data_generator_improvements.py, test_odirect_preflight.py, test_issue_regressions.py, test_skip_listing_config.py — 47 passed.
  • On-cluster (recommended before merge): reproducer from mlcommons/storage#567 should now reach the workload instead of crashing in warmup.

Linked

  • mlcommons/storage#567 — the original report, with full FileNotFoundError trace and the s3dlio / config.py vs. mixin disagreement diagnosed.
  • Once this lands, the downstream pin in mlcommons/storage's pyproject.toml should be bumped (the file already has a comment block for tracking DLIO PR history at pyproject.toml:131-145).

Notes for reviewers

  • Behavior change is strictly additive: every config that selected _use_direct=True before still does. Only the previously-broken direct_fs shape becomes reachable.
  • Docstrings (module-level and _localfs_init) updated to describe both signals and explain why uri_scheme is the canonical one.
  • Marked draft because the issue reporter still owes a re-test confirming the reproducer holds on mlcommons/storage@6553c0e. Open for early review meanwhile.

_LocalFSIterableMixin._localfs_init previously selected the s3dlio
O_DIRECT prefetch path only when storage_options.storage_library was
exactly "direct". But storage_type=direct_fs (the configuration
mlcommons/storage reaches via --o-direct) is *required* by
utils/config.py to set storage_library=s3dlio, NOT "direct". The two
gates disagreed: no value of storage_library satisfied both. The
buffered path was taken, plain open() was handed a direct://... URI,
and every local NPZ / NPY / JPEG run crashed in the warmup batch
with FileNotFoundError.

Fix: select the O_DIRECT path when EITHER
  - storage_options.uri_scheme == "direct"   (canonical signal —
    matches what direct_fs validation produces), OR
  - storage_options.storage_library == "direct"  (legacy single-knob
    form, kept for backward compatibility).

uri_scheme is the semantically honest gate: it asks "is this a
direct:// URI?" — which is precisely what _prefetch_direct requires.
storage_library is a coarser library-selector that has to mean
something different to S3 callers (where storage_library=s3dlio +
uri_scheme=s3 is correct and must NOT trigger local O_DIRECT).

Regression tests in tests/test_direct_fs_iterable_mixin_gate.py lock:
  - the exact storage#567 config shape (storage_library=s3dlio +
    uri_scheme=direct) reaches the O_DIRECT path,
  - the legacy single-knob form still works,
  - storage_library=s3dlio + uri_scheme in {s3, file} does NOT
    spuriously trigger local O_DIRECT,
  - bookkeeping counters are seeded on every code path.

Resolves mlcommons/storage#567.
@FileSystemGuy

Copy link
Copy Markdown
Author

@russfellows You've been very productive today! Could you please do the needful here too?

@russfellows russfellows left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems correct, I guess now that we have someone to test O_DIRECT we should be able to make progress. Fingers crossed.

@russfellows russfellows merged commit f16fe43 into main Jun 29, 2026
7 checks passed
@russfellows russfellows deleted the fix/direct-fs-storage-library-sentinel-mismatch branch June 29, 2026 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants