feat: object storage, --mpi-btl flag, DLRM/Flux/UNet3D configs, sweep scripts, bug fixes (v3.0.2) by russfellows · Pull Request #378 · mlcommons/storage

russfellows · 2026-05-15T23:26:10Z

PR Summary: russfellows/mlc-storage → mlcommons/storage (v3.0.2)

Branch: pr/squash-to-mlcommons
Base: mlcommons/storage:main
Version: 2.0.0b1 → 3.0.2
Author: Russell Fellows
Date: May 15, 2026
Tests: 138 passed, 0 failed (was 112 passed, 13 failed on clean main)

Single squash commit — 32 development commits collapsed to 1 for reviewability.
80 files changed, 4,176 insertions, 7,128 deletions.

Issues Fixed

Of the 8 most recent open issues on mlcommons/storage, 7 are fixed by this PR:

Issue	Title	Fix location
#362	Training stuck at epoch 1, no NVMe reads	`dlio_benchmark` — `reader_factory.py`
#363	`collect_cluster_info()` missing required `results_dir`	`benchmarks/base.py`
#364	Flux AU limited by Parquet deserialization throughput	`dlio_benchmark` — `reader_factory.py` + s3dlio
#365	Checkpointing split-phase reports wrong operation counts	`benchmarks/base.py`
#367	`reportgen` crashes with `AttributeError` on `Namespace.file`	`cli_parser.py`
#369	`orte_init` failed — No permission (-17) in containers/root	`utils.py`, `common_args.py`
#371	`--params storage.storage_type=direct_fs` silently uses page cache	`dlio_benchmark` — `pytorch_checkpointing.py`
#372	32 GB hard cap blocks large-memory runs (256 GB / 512 GB hosts)	`dlio_benchmark` — `utils/config.py`

Bug Fix Details

fix #369: `--mpi-btl {auto,vader,tcp}` — MPI broken in containers and as root

Symptom:

orte_init failed → getting local rank failed → Returned value No permission (-17)

Root cause: a prior commit added --mca btl ^vader unconditionally to all single-host mpirun commands. Disabling the Vader shared-memory transport causes OpenMPI's ORTE rank initialization to fail in container and root environments.

Fix: New --mpi-btl choice flag:

`--mpi-btl`	MPI flag injected	When to use
`auto` (default)	(none)	Works on most systems, including containers and root
`vader`	`--mca btl vader,self`	Force POSIX shared-memory; may fail in containers/root
`tcp`	`--mca btl tcp,self`	TCP loopback; universal; recommended for containers/root

The default auto restores pre-regression behavior. The selected BTL is logged at INFO on every run.

# Default (auto) — just works, including in containers
mlpstorage training run ... --allow-run-as-root

# Explicit TCP if still seeing issues
mlpstorage training run ... --allow-run-as-root --mpi-btl tcp
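The mapping in the table above can be sketched in a few lines of Python. This is only an illustration of the choice-flag idea, not the code in mlpstorage_py/utils.py: the names `MPI_BTL_FLAGS`, `build_parser`, and `mpi_btl_args` are hypothetical, and only the flag values are taken from the table.

```python
import argparse
import logging

logger = logging.getLogger("mpi_btl_sketch")

# Choice -> extra mpirun arguments, copied from the table above.
# The dictionary name and the helpers below are hypothetical.
MPI_BTL_FLAGS = {
    "auto": [],                               # inject nothing, let OpenMPI decide
    "vader": ["--mca", "btl", "vader,self"],  # force shared-memory transport
    "tcp": ["--mca", "btl", "tcp,self"],      # force TCP loopback
}


def build_parser():
    """Parser exposing a single --mpi-btl choice flag with 'auto' as default."""
    parser = argparse.ArgumentParser(prog="mpi-btl-sketch")
    parser.add_argument("--mpi-btl", choices=list(MPI_BTL_FLAGS), default="auto")
    return parser


def mpi_btl_args(choice):
    """Return the mpirun arguments for a choice and log the selection at INFO."""
    logger.info("MPI BTL selection: %s", choice)
    return list(MPI_BTL_FLAGS[choice])


args = build_parser().parse_args(["--mpi-btl", "tcp"])
mpirun_prefix = ["mpirun", "-n", "1"] + mpi_btl_args(args.mpi_btl)
```

With `auto`, the helper returns an empty list, so the mpirun command line is left untouched.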

Files: mlpstorage_py/cli/common_args.py, mlpstorage_py/utils.py, mlpstorage_py/benchmarks/dlio.py, tests/unit/test_utils.py

fix #363: `collect_cluster_info()` missing required `results_dir`

Benchmark._collect_cluster_information() called collect_cluster_info() without the required positional argument results_dir, causing:

WARNING: MPI cluster info collection failed: collect_cluster_info() missing 1 required positional argument: 'results_dir'

This propagated as None into reportgen, causing a downstream crash:

[INVALID] None: Check check_num_files_train failed: 'NoneType' has no attribute 'total_memory_bytes'

Fix: Pass results_dir, shared_staging_dir, and ssh_username to collect_cluster_info(). Added TestCollectClusterInfoSignatureBinding regression tests so future signature drift is caught at unit-test time.
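A signature-binding regression test of this kind can be sketched with `inspect.signature(...).bind(...)`, which raises `TypeError` when a call would not match the callee's signature. The stand-in function below uses an assumed signature (the real `collect_cluster_info()` may differ), and `call_binds` is a hypothetical helper.

```python
import inspect


def collect_cluster_info(results_dir, shared_staging_dir=None, ssh_username=None):
    """Stand-in with an assumed signature; the real function may differ."""
    return {"results_dir": results_dir}


def call_binds(func, *args, **kwargs):
    """Return True if func(*args, **kwargs) would bind, without calling func."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# The pre-fix call passed no arguments and therefore fails to bind.
buggy_call_ok = call_binds(collect_cluster_info)

# The fixed call passes all three arguments named in the fix description.
fixed_call_ok = call_binds(
    collect_cluster_info,
    results_dir="/tmp/results",
    shared_staging_dir=None,
    ssh_username=None,
)
```

Because binding is checked without executing the function, a test like this needs no MPI or SSH environment.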

Files: mlpstorage_py/benchmarks/base.py, mlpstorage_py/tests/test_benchmarks.py

fix #365: CLI `override_parameters` not reflected in `metadata.json`

Problem: The submission checker reads num_checkpoints_write / num_checkpoints_read from metadata['parameters'] (the YAML defaults). CLI overrides such as override_parameters.num_checkpoints_write=10 landed in metadata['override_parameters'] only, which the checker ignores. A 10-write + 10-read split-phase run would be aggregated to 20+20 and marked INVALID.

Fix: Added _apply_dotted_overrides(params, overrides) static method in Benchmark that merges dotted-key CLI overrides into metadata['parameters']. The raw override_parameters dict is still emitted unchanged for audit.
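The merge can be pictured roughly as below. This is a sketch under assumptions, not the method added to Benchmark: it assumes that metadata['parameters'] is a nested dict and that each dotted key names a path into it. The function name and the `checkpoint.*` keys are hypothetical.

```python
import copy


def apply_dotted_overrides(params, overrides):
    """Return a copy of params with dotted-key overrides merged in.

    "a.b.c" is treated as params["a"]["b"]["c"]; missing intermediate
    dicts are created. The input dict is not modified.
    """
    merged = copy.deepcopy(params)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                child = {}
                node[part] = child
            node = child
        node[leaf] = value
    return merged


# Hypothetical YAML defaults and a write-only phase of a split-phase run.
yaml_defaults = {"checkpoint": {"num_checkpoints_write": 10, "num_checkpoints_read": 10}}
cli_overrides = {"checkpoint.num_checkpoints_read": 0}

metadata = {
    "parameters": apply_dotted_overrides(yaml_defaults, cli_overrides),
    "override_parameters": cli_overrides,  # raw overrides kept for audit
}
```

A checker that reads only `metadata["parameters"]` would then see the values the run actually used.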

Note: PR #370 (crossmeta/zettalane) addresses the same root cause. That PR is blocked pending CLA signature. This implementation is independent and functionally equivalent.

Files: mlpstorage_py/benchmarks/base.py, mlpstorage_py/rules/models.py

fix #367: `reportgen` crashes with `AttributeError` on `Namespace.file`

The reportgen, history, and lockfile subcommands do not call add_storage_type_arguments(), so their Namespace objects have no .file or .object attribute. The unconditional read and del in parse_arguments() crashed with AttributeError.

Fix: Guard the --file/--object consolidation block with hasattr() checks. New unit tests in tests/unit/test_cli.py cover all subcommand types.
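The guard can be illustrated as follows. This is a minimal sketch: the function name `consolidate_storage_flags` and the `storage_mode` attribute are hypothetical, since the PR text does not show what the consolidation block produces.

```python
from argparse import Namespace


def consolidate_storage_flags(args):
    """Fold --file/--object into one attribute, only if both were registered."""
    if hasattr(args, "file") and hasattr(args, "object"):
        # Hypothetical consolidation result; the real attribute may differ.
        args.storage_mode = "object" if args.object else "file"
        del args.file
        del args.object
    return args


# A benchmark subcommand registers both flags.
benchmark_args = consolidate_storage_flags(Namespace(file=True, object=False))

# reportgen-style subcommands never register them, so the guard skips the block
# instead of raising AttributeError.
reportgen_args = consolidate_storage_flags(Namespace(command="reportgen"))
```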

Files: mlpstorage_py/cli_parser.py, tests/unit/test_cli.py

fix #372: 32 GB hard cap blocks large-memory runs

On 256 GB / 512 GB hosts the hardcoded BUDGET_MB = 32 * 1024 artificially rejects valid configurations:

Exception: Memory budget exceeded: reader.read_threads=2 x comm_size=64 = 128
workers, estimated ~64 GB (hard cap: 32 GB). Reduce read_threads to at most 1.

On a 377 GB host running 64 B200 ranks × 2 read_threads, the cap limited throughput to ~2.3 GB/s (well below a Gen5 NVMe's 14 GB/s).

Fix: BUDGET_MB = psutil.virtual_memory().total // (1024 * 1024) — scales with the machine.
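The arithmetic behind the error message can be reproduced with a short sketch. The per-worker estimate of 512 MB is inferred from the message itself (about 64 GB for 128 workers), and `max_read_threads` is a hypothetical helper. The real check in dlio_benchmark/utils/config.py may compute its estimate differently.

```python
def max_read_threads(budget_mb, comm_size, per_worker_mb=512):
    """Largest read_threads such that read_threads * comm_size workers fit the budget."""
    return budget_mb // (per_worker_mb * comm_size)


# Old behavior: fixed 32 GB budget.
OLD_BUDGET_MB = 32 * 1024

# New behavior: budget equals total RAM. On a real host this would be
# psutil.virtual_memory().total // (1024 * 1024); here the 377 GB host
# from the text is hard-coded so the sketch needs no third-party import.
HOST_BUDGET_MB = 377 * 1024

old_limit = max_read_threads(OLD_BUDGET_MB, comm_size=64)
new_limit = max_read_threads(HOST_BUDGET_MB, comm_size=64)
```

Under these assumptions the fixed cap allows one read thread per rank at 64 ranks, while the RAM-scaled budget allows eleven.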

File: dlio_benchmark/utils/config.py (in pinned dlio_benchmark fork)

fix #362 / #364: Training stuck at epoch 1; Flux AU limited by CPU Parquet decode

reader_factory.py routed LOCAL_FS + Parquet to ParquetReader, which calls pf.read_row_group() and therefore performs a full PyArrow deserialization on every read. That path is entirely CPU-bound: it saturates the Python GIL and starves the DataLoader workers. Symptom: the benchmark reaches "Starting epoch 1" and then issues no NVMe I/O while CPU utilization sits at 88–95%.

Fix: Route LOCAL_FS + Parquet to the new ParquetReaderFileIterable — raw byte-range reads via a 64-thread ThreadPoolExecutor with no PyArrow decode.
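The general technique of reading raw byte ranges on a thread pool with no decode step can be sketched as below. This is not the ParquetReaderFileIterable implementation: `read_byte_ranges` is a hypothetical helper, and it relies on `os.pread`, which is available on POSIX systems only.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_byte_ranges(path, ranges, max_workers=64):
    """Read (offset, length) ranges from one file concurrently, returning raw bytes.

    os.pread takes an explicit offset, so threads can share one descriptor
    without seeking. It can return fewer bytes than requested near end of file.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves the order of the input ranges.
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), ranges))
    finally:
        os.close(fd)


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
chunks = read_byte_ranges(tmp.name, [(0, 4), (4, 6)])
os.unlink(tmp.name)
```

Because the bytes are never decoded, the work done in each thread is mostly a blocking read system call rather than Python-level computation.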

Results (c6in.16xlarge, data on tmpfs, issue #364):

Accelerators	Before (AU)	After (AU)	Throughput
4	54.38%	99.79%	141.80 MB/s ✅
8	—	99.68%	283.07 MB/s ✅

File: dlio_benchmark/reader/reader_factory.py (in pinned dlio_benchmark fork)

fix #371: `direct_fs` checkpointing silently uses page cache

After PR #359 renamed mlpstorage → mlpstorage_py, one import path in dlio_benchmark was missed. The stale import failed, so the code silently fell back to SimpleStreamingCheckpointing, which ignores backend='direct_fs' entirely and uses plain open(). Result: the page cache was never bypassed, even when direct_fs was explicitly requested.

Fix: One-line import correction — from mlpstorage_py.checkpointing import StreamingCheckpointing. Confirmed with free -h that page cache no longer grows during the write phase.
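The failure mode, where a stale import path quietly selects a fallback class, can be illustrated with a small sketch. The `load_class` helper and `SimpleFallback` class are hypothetical and do not reproduce the dlio_benchmark code. The warning log is an illustrative addition showing one way such a fallback could be made visible.

```python
import importlib
import logging

logger = logging.getLogger("checkpoint_import_sketch")


def load_class(module_name, class_name, fallback):
    """Import class_name from module_name, returning fallback if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Without this log line the fallback would be completely silent.
        logger.warning(
            "Could not import %s (%s); falling back to %s",
            module_name, exc, fallback.__name__,
        )
        return fallback
    return getattr(module, class_name)


class SimpleFallback:
    """Hypothetical stand-in for a fallback implementation."""


# A module path that no longer exists (as after a package rename) selects the fallback.
stale = load_class(
    "no_such_package_after_rename.checkpointing",
    "StreamingCheckpointing",
    SimpleFallback,
)
```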

File: dlio_benchmark/checkpointing/pytorch_checkpointing.py (in pinned dlio_benchmark fork)

New Features

Full S3 / Object Storage Integration

Three client libraries supported — select per-workload via storage.storage_options.storage_library:

Library	Install	Notes
`s3dlio`	`pip install s3dlio`	Recommended: Rust-backed, multi-endpoint load balancing, off-GIL Parquet decode
`s3torchconnector`	`pip install s3torchconnector`	PyTorch only
`minio`	`pip install minio`	MinIO Python SDK

Multi-library object-store checkpointing (PT_OBJ_SAVE checkpoint type)
Parquet reader via s3dlio: row-group granular iteration, 32-thread Tokio prefetch, 2,138 MB/s on 7-endpoint cluster
Iterable DataLoader for NPZ/NPY/JPEG/PNG with O_DIRECT local FS path
Universal --file / --object flags for single-flag pipeline invocation

uv Workflow

Full [project] table in pyproject.toml + uv.lock with Linux-only resolution (s3dlio ships Linux-only wheels).

New Workload Configs

File	Description
`configs/dlio/workload/unet3d_b200.yaml`	UNet3D on NVIDIA B200 (new)
`configs/dlio/workload/dlrm_b200.yaml`	DLRM on NVIDIA B200 (updated)
`configs/dlio/workload/dlrm_datagen.yaml`	DLRM data generation (updated)
`configs/dlio/workload/flux_datagen.yaml`	Flux data generation (updated)

New Test & Sweep Scripts

tests/object-store/sweeps/ — NP/RT sweep scripts for DLRM, Flux, RetinaNet, UNet3D
tests/object-store/run_dlrm_bench.sh, run_flux_bench.sh
tests/object-store/gen_retinanet_jpeg.sh, gen_unet3d_npz.sh, test_retinanet.sh, test_unet3d.sh
tests/unit/test_cli.py, tests/unit/test_utils.py
138 unit tests pass (was 112 passing, 13 failing before this PR)
Old stale scripts archived to tests/object-store/old-archive/

New Performance Documentation

File	Contents
`docs/DATALOADER_ARCHITECTURE.md`	DataLoader design — iterable vs map-style, O_DIRECT, off-GIL
`docs/DLRM_NP_Scaling_Results.md`	DLRM NP scaling on object storage
`docs/Flux_NP_ReadThreads_Scaling_Results.md`	Flux NP × read_threads scaling study
`docs/RetinaNet_NP_Scaling_Results.md`	RetinaNet NP scaling (TorchIterableDatasetSimple)
`docs/UNet3D_NP_Scaling_Results.md`	UNet3D NP scaling results

Dependency Note

dlio-benchmark is pinned to russfellows/dlio_benchmark@21c0723 (v3.0.2, includes fix #372).
mlcommons/storage already references russfellows/dlio_benchmark (branch ref) — this PR refines that to a specific pinned commit.
Will update to point to mlcommons/DLIO_local_changes once PR #20 is merged there.

… scripts, bug fixes (v3.0.2)

Squash of all russfellows development since last upstream sync (ancestor 258483b).

Bug fixes:
- fix mlcommons#369: replace --disable-vader-btl with --mpi-btl {auto,vader,tcp} choice flag (was unconditionally blocking OpenMPI on containers/root; auto is now the safe default)
- fix mlcommons#363: pass results_dir to collect_cluster_info
- fix mlcommons#365, mlcommons#372: metadata override propagation, test suite fixes, env lock
- fix mlcommons#349: guard --file/--object consolidation for non-benchmark subcommands
- resolve all 129 unit test failures; update tests for mlpstorage_py rename

Features:
- Universal --file/--object flags and progress spinner improvements
- S3 / object storage: s3dlio, s3torchconnector, minio backends fully integrated
- Multi-library object-store checkpointing (PT_OBJ_SAVE)
- Parquet reader/generator via s3dlio (row-group granular, off-GIL Rust decode)
- uv workflow: pyproject.toml [project] table + uv.lock (Linux-only resolution)
- s3dlio>=0.9.100 from PyPI (was branch-pinned)
- dgen-py>=0.2.4, pyarrow>=21.0.0

New workload configs:
- configs/dlio/workload/dlrm_b200.yaml (updated)
- configs/dlio/workload/unet3d_b200.yaml (new)
- configs/dlio/workload/dlrm_datagen.yaml, flux_datagen.yaml (updated)

New docs (performance results):
- docs/DATALOADER_ARCHITECTURE.md
- docs/DLRM_NP_Scaling_Results.md
- docs/Flux_NP_ReadThreads_Scaling_Results.md
- docs/RetinaNet_NP_Scaling_Results.md
- docs/UNet3D_NP_Scaling_Results.md

New test scripts:
- tests/object-store/sweeps/: NP/RT sweep scripts for all workloads
- tests/object-store/run_dlrm_bench.sh, run_flux_bench.sh
- tests/object-store/gen_retinanet_jpeg.sh, gen_unet3d_npz.sh
- tests/unit/test_cli.py, tests/unit/test_utils.py (138 tests pass)

Cleanup:
- tests/object-store/old-archive/: archived stale scripts
- Removed superseded perf result docs and analysis files

Dependency note: dlio-benchmark is currently pinned to russfellows/dlio_benchmark@21c0723.
Will update to mlcommons/DLIO_local_changes once PR #20 is merged there.

github-actions · 2026-05-15T23:26:19Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

russfellows · 2026-05-20T05:51:40Z

Note: We may want to, or need to, review the YAML config files. Some changes were made for testability and for checking different conditions. These configurations will work, but may NOT represent the test scenarios we want to use.

@dslik, @idevasena and @FileSystemGuy: we should discuss. We can either modify the configs here in this PR, or do so afterwards.

russfellows requested a review from a team May 15, 2026 23:26

FileSystemGuy approved these changes May 16, 2026

russfellows wants to merge 1 commit into mlcommons:main from russfellows:pr/squash-to-mlcommons
