Skip to content

fix(#538): wire StorageType.DIRECT_FS for --o-direct local O_DIRECT I/O#38

Merged
FileSystemGuy merged 1 commit into
mainfrom
feat/538-storage-type-direct
Jun 26, 2026
Merged

fix(#538): wire StorageType.DIRECT_FS for --o-direct local O_DIRECT I/O#38
FileSystemGuy merged 1 commit into
mainfrom
feat/538-storage-type-direct

Conversation

@russfellows

Copy link
Copy Markdown

Summary

Fix #538 (companion to mlcommons/storage PR): wire StorageType.DIRECT_FS through
DLIO so --o-direct --file uses local O_DIRECT I/O via s3dlio's direct:// URI
scheme rather than hitting an S3 endpoint.

  • StorageFactory: DIRECT_FSObjStoreLibStorage (was wrongly FileStorage)
  • torch_data_loader._s3_types: added DIRECT_FS so the s3dlio async reader is used during training
  • config.py: validation that direct_fs requires storage_library=s3dlio; validation that storage_type=s3 with a local storage_root path is rejected with a clear error
  • config.py: write_threads auto-sizing treats DIRECT_FS as I/O-bound (same as S3)
  • obj_store_lib._preflight(): auto-creates a missing checkpoint subdirectory when its parent directory exists and is writable (first run has no checkpoint dir yet)

New tests (DLIO)

# Test What it checks
10 test_creates_missing_dir_when_parent_exists Preflight auto-creates missing checkpoint model subdirectory when parent is writable

Two existing preflight tests were also updated to use two-level-deep missing paths so they still raise ValueError after the auto-create logic was added.

Test plan

  • pytest tests/test_fast_ci.py tests/test_odirect_preflight.py — 92/92 pass

🤖 Generated with Claude Code

Introduces a clean distinction between:
  storage_type=local_fs  — standard POSIX I/O (unchanged)
  storage_type=direct_fs — O_DIRECT local FS via s3dlio direct:// URI
  storage_type=s3        — S3 object storage (always a bucket, never a path)

Previously, --o-direct in mlp-storage injected storage_type=s3 to route
through ObjStoreLibStorage, which is conceptually wrong: s3 always means
an S3 bucket. A local filesystem path as storage_root with storage_type=s3
is now explicitly rejected with a clear error message.

Changes:
- storage_factory.py: DIRECT_FS → ObjStoreLibStorage (not FileStorage),
  so s3dlio handles all reads/writes with the direct:// URI scheme
- torch_data_loader.py: add DIRECT_FS to _s3_types so the s3dlio iterable
  reader is selected for training data loading under --o-direct
- config.py: validate DIRECT_FS (requires s3dlio library); validate that
  storage_type=s3 is never combined with a local path in storage_root;
  auto-size write_threads as I/O-bound for DIRECT_FS (same as S3)
- obj_store_lib.py: preflight auto-creates checkpoint model subdirectory
  if the parent exists and is writable (first run has no checkpoint yet)
- test_odirect_preflight.py: update missing-dir tests to use two-level
  deep paths; add test confirming auto-create behavior

Fixes: mlcommons/storage#538

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@russfellows

Copy link
Copy Markdown
Author

@FileSystemGuy , can you check this update, approve and merge if you're OK with it.

This SHOULD fix the issue with --o-direct attempting to use object in some cases. The "--o-direct" flag is now an OPTIONAL feature flag to "--file". It will fail if used with "--object" along with other checks and test conditions. This needs to go in first so that we can then merge in the mlp-storage PR.

@FileSystemGuy FileSystemGuy merged commit 252a54b into main Jun 26, 2026
7 checks passed
@FileSystemGuy FileSystemGuy deleted the feat/538-storage-type-direct branch June 26, 2026 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants