feat: object storage, --mpi-btl flag, DLRM/Flux/UNet3D configs, sweep scripts, bug fixes (v3.0.2)#378

Open
russfellows wants to merge 1 commit into mlcommons:main from russfellows:pr/squash-to-mlcommons

@russfellows russfellows commented May 15, 2026

PR Summary: russfellows/mlc-storage → mlcommons/storage (v3.0.2)

Branch: pr/squash-to-mlcommons
Base: mlcommons/storage:main
Version: 2.0.0b1 → 3.0.2
Author: Russell Fellows
Date: May 15, 2026
Tests: 138 passed, 0 failed (was 112 passed, 13 failed on clean main)

Single squash commit — 32 development commits collapsed to 1 for reviewability.
80 files changed, 4,176 insertions, 7,128 deletions.


Issues Fixed

This PR fixes 8 recent open issues on mlcommons/storage:

| Issue | Title | Fix location |
| --- | --- | --- |
| #362 | Training stuck at epoch 1, no NVMe reads | dlio_benchmark/reader/reader_factory.py |
| #363 | collect_cluster_info() missing required results_dir | benchmarks/base.py |
| #364 | Flux AU limited by Parquet deserialization throughput | dlio_benchmark/reader/reader_factory.py + s3dlio |
| #365 | Checkpointing split-phase reports wrong operation counts | benchmarks/base.py |
| #367 | reportgen crashes with AttributeError on Namespace.file | cli_parser.py |
| #369 | orte_init failed — No permission (-17) in containers/root | utils.py, common_args.py |
| #371 | --params storage.storage_type=direct_fs silently uses page cache | dlio_benchmark/checkpointing/pytorch_checkpointing.py |
| #372 | 32 GB hard cap blocks large-memory runs (256 GB / 512 GB hosts) | dlio_benchmark/utils/config.py |

Bug Fix Details

fix #369: --mpi-btl {auto,vader,tcp} — MPI broken in containers and as root

Symptom:

orte_init failed → getting local rank failed → Returned value No permission (-17)

Root cause: a prior commit added --mca btl ^vader unconditionally to all single-host mpirun commands. Disabling the Vader shared-memory transport causes OpenMPI's ORTE rank initialization to fail in container and root environments.

Fix: New --mpi-btl choice flag:

| --mpi-btl | MPI flag injected | When to use |
| --- | --- | --- |
| auto (default) | (none) | Works on most systems, including containers and root |
| vader | --mca btl vader,self | Force POSIX shared-memory; may fail in containers/root |
| tcp | --mca btl tcp,self | TCP loopback; universal; recommended for containers/root |

The default auto restores pre-regression behavior. The selected BTL is logged at INFO on every run.

# Default (auto) — just works, including in containers
mlpstorage training run ... --allow-run-as-root

# Explicit TCP if still seeing issues
mlpstorage training run ... --allow-run-as-root --mpi-btl tcp

Files: mlpstorage_py/cli/common_args.py, mlpstorage_py/utils.py, mlpstorage_py/benchmarks/dlio.py, tests/unit/test_utils.py
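
The mapping in the table above can be sketched as a small helper. This is a minimal illustration, not the code in mlpstorage_py/utils.py: the names `btl_mca_args`, `build_parser`, and the logger name are assumptions, and only the flag values come from the table.

```python
import argparse
import logging

logger = logging.getLogger("mpi_btl_sketch")  # hypothetical logger name

# Values taken from the table above; "auto" injects nothing.
BTL_FLAGS = {
    "auto": [],
    "vader": ["--mca", "btl", "vader,self"],
    "tcp": ["--mca", "btl", "tcp,self"],
}


def btl_mca_args(choice: str) -> list[str]:
    """Return the extra mpirun arguments for a --mpi-btl choice."""
    if choice not in BTL_FLAGS:
        raise ValueError(f"unknown --mpi-btl value: {choice!r}")
    logger.info("MPI BTL selection: %s", choice)
    return list(BTL_FLAGS[choice])


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the --mpi-btl choice flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mpi-btl", choices=sorted(BTL_FLAGS), default="auto")
    return parser


args = build_parser().parse_args(["--mpi-btl", "tcp"])
# The "-n 4" rank count is an arbitrary value for the demo.
mpirun_cmd = ["mpirun", "-n", "4", *btl_mca_args(args.mpi_btl)]
```

Keeping `auto` as an empty list means the default command line is identical to one built without the flag, which matches the stated goal of restoring pre-regression behavior.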


fix #363: collect_cluster_info() missing required results_dir

Benchmark._collect_cluster_information() called collect_cluster_info() without the required positional argument results_dir, causing:

WARNING: MPI cluster info collection failed: collect_cluster_info() missing 1 required positional argument: 'results_dir'

This propagated as None into reportgen, causing a downstream crash:

[INVALID] None: Check check_num_files_train failed: 'NoneType' has no attribute 'total_memory_bytes'

Fix: Pass results_dir, shared_staging_dir, and ssh_username to collect_cluster_info(). Added TestCollectClusterInfoSignatureBinding regression tests so future signature drift is caught at unit-test time.

Files: mlpstorage_py/benchmarks/base.py, mlpstorage_py/tests/test_benchmarks.py
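
One way to catch this kind of signature drift without running MPI is to bind the call arguments against the function signature. The sketch below is not the TestCollectClusterInfoSignatureBinding test itself: `collect_cluster_info` here is a stand-in with an assumed signature (the source only states that results_dir is a required positional argument), and `call_binds` is a hypothetical helper.

```python
import inspect


def collect_cluster_info(results_dir, shared_staging_dir=None, ssh_username=None):
    """Stand-in with an assumed signature; not the real implementation."""
    return {"results_dir": results_dir}


def call_binds(func, *args, **kwargs) -> bool:
    """True if func could be called with these arguments, without calling it."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# The pre-fix call shape: no results_dir supplied.
old_call_ok = call_binds(collect_cluster_info)

# The post-fix call shape described above.
new_call_ok = call_binds(
    collect_cluster_info,
    results_dir="/tmp/results",
    shared_staging_dir=None,
    ssh_username=None,
)
```

Because `Signature.bind` raises the same TypeError a real call would, a unit test built this way fails as soon as the caller and the callee disagree.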


fix #365: CLI override_parameters not reflected in metadata.json

Problem: The submission checker reads num_checkpoints_write / num_checkpoints_read from metadata['parameters'] (the YAML defaults). CLI overrides such as override_parameters.num_checkpoints_write=10 landed in metadata['override_parameters'] only, which the checker ignores. A 10-write + 10-read split-phase run would be aggregated to 20+20 and marked INVALID.

Fix: Added _apply_dotted_overrides(params, overrides) static method in Benchmark that merges dotted-key CLI overrides into metadata['parameters']. The raw override_parameters dict is still emitted unchanged for audit.

Note: PR #370 (crossmeta/zettalane) addresses the same root cause. That PR is blocked pending CLA signature. This implementation is independent and functionally equivalent.

Files: mlpstorage_py/benchmarks/base.py, mlpstorage_py/rules/models.py
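
The merge step can be sketched as follows. This is an illustration of merging dotted keys into a nested dict, not the actual _apply_dotted_overrides method: the function body, the `checkpoint.*` key names, and the default values are all assumptions made for the example.

```python
import copy


def apply_dotted_overrides(params: dict, overrides: dict) -> dict:
    """Return a copy of params with dotted-key overrides merged in."""
    merged = copy.deepcopy(params)
    for dotted_key, value in overrides.items():
        node = merged
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            # Create intermediate levels; replace non-dict values on the path.
            if not isinstance(node.get(part), dict):
                node[part] = {}
            node = node[part]
        node[leaf] = value
    return merged


# Hypothetical key names and made-up default values, for illustration only.
yaml_defaults = {"checkpoint": {"num_checkpoints_write": 1, "num_checkpoints_read": 1}}
cli_overrides = {
    "checkpoint.num_checkpoints_write": 10,
    "checkpoint.num_checkpoints_read": 10,
}
metadata = {
    "parameters": apply_dotted_overrides(yaml_defaults, cli_overrides),
    "override_parameters": cli_overrides,  # kept unchanged for audit
}
```

Working on a deep copy leaves the YAML defaults untouched, so the same defaults can be reused across runs.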


fix #367: reportgen crashes with AttributeError on Namespace.file

The reportgen, history, and lockfile subcommands do not call add_storage_type_arguments(), so their Namespace objects have no .file or .object attribute. The unconditional read and del in parse_arguments() crashed with AttributeError.

Fix: Guard the --file/--object consolidation block with hasattr() checks. New unit tests in tests/unit/test_cli.py cover all subcommand types.

Files: mlpstorage_py/cli_parser.py, tests/unit/test_cli.py
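
A guard of this shape might look like the sketch below. The function name `consolidate_storage_flags` and the `storage_mode` attribute are hypothetical; the real consolidation logic in parse_arguments() is not shown in this PR description and may differ.

```python
import argparse


def consolidate_storage_flags(args: argparse.Namespace) -> argparse.Namespace:
    """Fold --file/--object into one attribute, only when both flags exist."""
    if hasattr(args, "file") and hasattr(args, "object"):
        # Hypothetical consolidation: the real logic may differ.
        if args.object:
            args.storage_mode = "object"
        elif args.file:
            args.storage_mode = "file"
        del args.file
        del args.object
    return args


# A benchmark subcommand defines both flags; reportgen defines neither.
benchmark_args = consolidate_storage_flags(argparse.Namespace(file=True, object=False))
reportgen_args = consolidate_storage_flags(argparse.Namespace(results_dir="/tmp/results"))
```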


fix #372: 32 GB hard cap blocks large-memory runs

On 256 GB / 512 GB hosts the hardcoded BUDGET_MB = 32 * 1024 artificially rejects valid configurations:

Exception: Memory budget exceeded: reader.read_threads=2 x comm_size=64 = 128
workers, estimated ~64 GB (hard cap: 32 GB). Reduce read_threads to at most 1.

On a 377 GB host running 64 B200 ranks × 2 read_threads, the cap limited throughput to ~2.3 GB/s (well below a Gen5 NVMe's 14 GB/s).

Fix: BUDGET_MB = psutil.virtual_memory().total // (1024 * 1024) — scales with the machine.

File: dlio_benchmark/utils/config.py (in pinned dlio_benchmark fork)
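
The effect of the change can be checked with a little arithmetic. The sketch below is not the code in config.py: the 512 MB per-worker estimate is inferred from the error message above (128 workers, about 64 GB), `host_budget_mb` uses `os.sysconf` as a Unix standard-library stand-in for the psutil call, and all function names are assumptions.

```python
import os

MB = 1024 * 1024
PER_WORKER_MB = 512  # assumption inferred from "128 workers, ~64 GB"


def host_budget_mb() -> int:
    """Total physical memory in MB (Unix stdlib stand-in for psutil)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // MB


def check_memory_budget(read_threads: int, comm_size: int, budget_mb: int) -> int:
    """Return the estimated footprint in MB, or raise if it exceeds the budget."""
    workers = read_threads * comm_size
    estimated_mb = workers * PER_WORKER_MB
    if estimated_mb > budget_mb:
        raise MemoryError(
            f"{workers} workers need ~{estimated_mb // 1024} GB, "
            f"budget is {budget_mb // 1024} GB"
        )
    return estimated_mb


old_cap_mb = 32 * 1024    # the previous hard cap
big_host_mb = 377 * 1024  # the 377 GB host mentioned above
estimate = check_memory_budget(read_threads=2, comm_size=64, budget_mb=big_host_mb)
```

With the old 32 GB cap the same call raises, which is the rejection shown in the exception above; with a budget derived from host memory it passes.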


fix #362 / #364: Training stuck at epoch 1; Flux AU limited by CPU Parquet decode

reader_factory.py routed LOCAL_FS + Parquet to ParquetReader, which calls pf.read_row_group(): full PyArrow deserialization on every read. That path is entirely CPU-bound; it saturates the Python GIL and starves the DataLoader workers. Symptom: the benchmark reaches "Starting epoch 1" and issues no NVMe I/O while CPU usage sits at 88–95%.

Fix: Route LOCAL_FS + Parquet to the new ParquetReaderFileIterable — raw byte-range reads via a 64-thread ThreadPoolExecutor with no PyArrow decode.

Results (c6in.16xlarge, data on tmpfs, issue #364):

| Accelerators | Before (AU) | After (AU) | Throughput |
| --- | --- | --- | --- |
| 4 | 54.38% | 99.79% | 141.80 MB/s ✅ |
| 8 | | 99.68% | 283.07 MB/s ✅ |

File: dlio_benchmark/reader/reader_factory.py (in pinned dlio_benchmark fork)
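
The general technique of raw byte-range reads from a thread pool can be sketched with the standard library. This is not ParquetReaderFileIterable: the function name `read_ranges`, the demo file, and the chosen ranges are assumptions, and a real reader would take its offsets from the Parquet footer metadata.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_ranges(path: str, ranges: list[tuple[int, int]], max_workers: int = 64) -> list[bytes]:
    """Read (offset, length) byte ranges concurrently, with no decoding."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # os.pread takes an explicit offset, so threads can share one fd;
            # the blocking read releases the GIL while the kernel does the I/O.
            futures = [pool.submit(os.pread, fd, length, offset) for offset, length in ranges]
            return [f.result() for f in futures]
    finally:
        os.close(fd)


# Demo on a throwaway file standing in for a Parquet file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 16)  # 4096 bytes
    demo_path = tmp.name

chunks = read_ranges(demo_path, [(0, 4), (256, 4), (4092, 4)])
os.unlink(demo_path)
```

Since nothing is decoded, the Python side does almost no work per read, which is the property the fix relies on to keep the storage device busy rather than the CPU.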


fix #371: direct_fs checkpointing silently uses page cache

After PR #359 renamed mlpstorage to mlpstorage_py, one import path in dlio_benchmark was missed. The code then fell back silently to SimpleStreamingCheckpointing, which ignores backend='direct_fs' entirely and uses plain open(). Result: the page cache was never bypassed, even when explicitly requested.

Fix: One-line import correction — from mlpstorage_py.checkpointing import StreamingCheckpointing. Confirmed with free -h that page cache no longer grows during the write phase.

File: dlio_benchmark/checkpointing/pytorch_checkpointing.py (in pinned dlio_benchmark fork)
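
The PR description does not show how the fallback is wired. Assuming it is a try/except around the import, the sketch below shows how a stale module path can go unnoticed and how a warning makes the fallback visible. `load_class`, `SimpleFallback`, and the module path used in the demo are all hypothetical.

```python
import importlib
import logging

logger = logging.getLogger("checkpoint_import_sketch")  # hypothetical name


class SimpleFallback:
    """Stand-in for a fallback class that ignores the requested backend."""


def load_class(module_name: str, class_name: str, fallback: type) -> type:
    """Import module_name.class_name, warning loudly if the fallback is used."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        logger.warning("falling back to %s: %s", fallback.__name__, exc)
        return fallback


# A stale (made-up) module path resolves to the fallback, now with a warning.
stale = load_class(
    "package_name_before_rename_xyz.checkpointing",
    "StreamingCheckpointing",
    SimpleFallback,
)
# A path that exists resolves to the requested class.
found = load_class("collections", "OrderedDict", SimpleFallback)
```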


New Features

Full S3 / Object Storage Integration

Three client libraries supported — select per-workload via storage.storage_options.storage_library:

| Library | Install | Notes |
| --- | --- | --- |
| s3dlio | pip install s3dlio | Recommended: Rust-backed, multi-endpoint load balancing, off-GIL Parquet decode |
| s3torchconnector | pip install s3torchconnector | PyTorch only |
| minio | pip install minio | MinIO Python SDK |

  • Multi-library object-store checkpointing (PT_OBJ_SAVE checkpoint type)
  • Parquet reader via s3dlio: row-group granular iteration, 32-thread Tokio prefetch, 2,138 MB/s on 7-endpoint cluster
  • Iterable DataLoader for NPZ/NPY/JPEG/PNG with O_DIRECT local FS path
  • Universal --file / --object flags for single-flag pipeline invocation
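
As an illustration of the per-workload selection, the nested key storage.storage_options.storage_library can be read and validated as below. The dict mirrors the dotted key path from the text above; the `storage_library` helper and its error messages are assumptions, not part of the PR.

```python
SUPPORTED_LIBRARIES = ("s3dlio", "s3torchconnector", "minio")


def storage_library(config: dict) -> str:
    """Read storage.storage_options.storage_library from a nested config dict."""
    try:
        name = config["storage"]["storage_options"]["storage_library"]
    except KeyError as exc:
        raise KeyError("storage.storage_options.storage_library is not set") from exc
    if name not in SUPPORTED_LIBRARIES:
        raise ValueError(
            f"unsupported storage_library {name!r}; expected one of {SUPPORTED_LIBRARIES}"
        )
    return name


# Hypothetical workload config, reduced to the one key discussed above.
workload = {"storage": {"storage_options": {"storage_library": "s3dlio"}}}
selected = storage_library(workload)
```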

uv Workflow

Full [project] table in pyproject.toml + uv.lock with Linux-only resolution (s3dlio ships Linux-only wheels).


New Workload Configs

| File | Description |
| --- | --- |
| configs/dlio/workload/unet3d_b200.yaml | UNet3D on NVIDIA B200 (new) |
| configs/dlio/workload/dlrm_b200.yaml | DLRM on NVIDIA B200 (updated) |
| configs/dlio/workload/dlrm_datagen.yaml | DLRM data generation (updated) |
| configs/dlio/workload/flux_datagen.yaml | Flux data generation (updated) |

New Test & Sweep Scripts

  • tests/object-store/sweeps/ — NP/RT sweep scripts for DLRM, Flux, RetinaNet, UNet3D
  • tests/object-store/run_dlrm_bench.sh, run_flux_bench.sh
  • tests/object-store/gen_retinanet_jpeg.sh, gen_unet3d_npz.sh, test_retinanet.sh, test_unet3d.sh
  • tests/unit/test_cli.py, tests/unit/test_utils.py
  • 138 unit tests pass (was 112 passing, 13 failing before this PR)
  • Old stale scripts archived to tests/object-store/old-archive/

New Performance Documentation

| File | Contents |
| --- | --- |
| docs/DATALOADER_ARCHITECTURE.md | DataLoader design: iterable vs map-style, O_DIRECT, off-GIL |
| docs/DLRM_NP_Scaling_Results.md | DLRM NP scaling on object storage |
| docs/Flux_NP_ReadThreads_Scaling_Results.md | Flux NP × read_threads scaling study |
| docs/RetinaNet_NP_Scaling_Results.md | RetinaNet NP scaling (TorchIterableDatasetSimple) |
| docs/UNet3D_NP_Scaling_Results.md | UNet3D NP scaling results |

Dependency Note

dlio-benchmark is pinned to russfellows/dlio_benchmark@21c0723 (v3.0.2, includes fix #372).
mlcommons/storage already references russfellows/dlio_benchmark (branch ref) — this PR refines that to a specific pinned commit.
Will update to point to mlcommons/DLIO_local_changes once PR #20 is merged there.

… scripts, bug fixes (v3.0.2)

Squash of all russfellows development since last upstream sync (ancestor 258483b).

Bug fixes:
- fix mlcommons#369: replace --disable-vader-btl with --mpi-btl {auto,vader,tcp} choice flag
  (was unconditionally blocking OpenMPI on containers/root; auto is now the safe default)
- fix mlcommons#363: pass results_dir to collect_cluster_info
- fix mlcommons#365, mlcommons#372: metadata override propagation, test suite fixes, env lock
- fix mlcommons#349: guard --file/--object consolidation for non-benchmark subcommands
- resolve all 129 unit test failures; update tests for mlpstorage_py rename

Features:
- Universal --file/--object flags and progress spinner improvements
- S3 / object storage: s3dlio, s3torchconnector, minio backends fully integrated
- Multi-library object-store checkpointing (PT_OBJ_SAVE)
- Parquet reader/generator via s3dlio (row-group granular, off-GIL Rust decode)
- uv workflow: pyproject.toml [project] table + uv.lock (Linux-only resolution)
- s3dlio>=0.9.100 from PyPI (was branch-pinned)
- dgen-py>=0.2.4, pyarrow>=21.0.0

New workload configs:
- configs/dlio/workload/dlrm_b200.yaml (updated)
- configs/dlio/workload/unet3d_b200.yaml (new)
- configs/dlio/workload/dlrm_datagen.yaml, flux_datagen.yaml (updated)

New docs (performance results):
- docs/DATALOADER_ARCHITECTURE.md
- docs/DLRM_NP_Scaling_Results.md
- docs/Flux_NP_ReadThreads_Scaling_Results.md
- docs/RetinaNet_NP_Scaling_Results.md
- docs/UNet3D_NP_Scaling_Results.md

New test scripts:
- tests/object-store/sweeps/ — NP/RT sweep scripts for all workloads
- tests/object-store/run_dlrm_bench.sh, run_flux_bench.sh
- tests/object-store/gen_retinanet_jpeg.sh, gen_unet3d_npz.sh
- tests/unit/test_cli.py, tests/unit/test_utils.py (138 tests pass)

Cleanup:
- tests/object-store/old-archive/ — archived stale scripts
- Removed superseded perf result docs and analysis files

Dependency note: dlio-benchmark is currently pinned to russfellows/dlio_benchmark@21c0723.
Will update to mlcommons/DLIO_local_changes once PR #20 is merged there.
@russfellows russfellows requested a review from a team May 15, 2026 23:26
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@russfellows
Contributor Author

Note: We may want, or need, to review the YAML config files. Some changes were made for testability and to check different conditions. These configurations will work, but may NOT represent the test scenarios we want to use.

@dslik, @idevasena and @FileSystemGuy: we should discuss. We can either modify the configs here in this PR, or do so afterwards.