feat: object storage, --mpi-btl flag, DLRM/Flux/UNet3D configs, sweep scripts, bug fixes (v3.0.2) #378
Open
russfellows wants to merge 1 commit into
… scripts, bug fixes (v3.0.2)

Squash of all russfellows development since last upstream sync (ancestor 258483b).

Bug fixes:
- fix mlcommons#369: replace --disable-vader-btl with --mpi-btl {auto,vader,tcp} choice flag (was unconditionally blocking OpenMPI on containers/root; auto is now the safe default)
- fix mlcommons#363: pass results_dir to collect_cluster_info
- fix mlcommons#365, mlcommons#372: metadata override propagation, test suite fixes, env lock
- fix mlcommons#349: guard --file/--object consolidation for non-benchmark subcommands
- resolve all 129 unit test failures; update tests for mlpstorage_py rename

Features:
- Universal --file/--object flags and progress spinner improvements
- S3 / object storage: s3dlio, s3torchconnector, minio backends fully integrated
- Multi-library object-store checkpointing (PT_OBJ_SAVE)
- Parquet reader/generator via s3dlio (row-group granular, off-GIL Rust decode)
- uv workflow: pyproject.toml [project] table + uv.lock (Linux-only resolution)
- s3dlio>=0.9.100 from PyPI (was branch-pinned)
- dgen-py>=0.2.4, pyarrow>=21.0.0

New workload configs:
- configs/dlio/workload/dlrm_b200.yaml (updated)
- configs/dlio/workload/unet3d_b200.yaml (new)
- configs/dlio/workload/dlrm_datagen.yaml, flux_datagen.yaml (updated)

New docs (performance results):
- docs/DATALOADER_ARCHITECTURE.md
- docs/DLRM_NP_Scaling_Results.md
- docs/Flux_NP_ReadThreads_Scaling_Results.md
- docs/RetinaNet_NP_Scaling_Results.md
- docs/UNet3D_NP_Scaling_Results.md

New test scripts:
- tests/object-store/sweeps/: NP/RT sweep scripts for all workloads
- tests/object-store/run_dlrm_bench.sh, run_flux_bench.sh
- tests/object-store/gen_retinanet_jpeg.sh, gen_unet3d_npz.sh
- tests/unit/test_cli.py, tests/unit/test_utils.py (138 tests pass)

Cleanup:
- tests/object-store/old-archive/: archived stale scripts
- Removed superseded perf result docs and analysis files

Dependency note: dlio-benchmark is currently pinned to russfellows/dlio_benchmark@21c0723. Will update to mlcommons/DLIO_local_changes once PR #20 is merged there.
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
FileSystemGuy approved these changes on May 16, 2026
Note: We may want or need to review the YAML config files. Some changes were made for testability and to check different conditions. These configurations will work, but may NOT represent the test scenarios we want to use. @dslik, @idevasena and @FileSystemGuy: we should discuss. We can either modify the configs here in this PR, or do so afterwards.
PR Summary: russfellows/mlc-storage → mlcommons/storage (v3.0.2)
Branch: pr/squash-to-mlcommons
Base: mlcommons/storage:main
Version: 2.0.0b1 → 3.0.2
Author: Russell Fellows
Date: May 15, 2026
Tests: 138 passed, 0 failed (was 112 passed, 13 failed on clean main)

Issues Fixed
Of the 8 most recent open issues on mlcommons/storage, 7 are fixed by this PR:
- dlio_benchmark: reader_factory.py
- collect_cluster_info() missing required results_dir (benchmarks/base.py)
- dlio_benchmark: reader_factory.py + s3dlio
- benchmarks/base.py
- reportgen crashes with AttributeError on Namespace.file (cli_parser.py)
- orte_init failed: No permission (-17) in containers/root (utils.py, common_args.py)
- --params storage.storage_type=direct_fs silently uses page cache (dlio_benchmark: pytorch_checkpointing.py)
- dlio_benchmark: utils/config.py

Bug Fix Details
fix #369: --mpi-btl {auto,vader,tcp} - MPI broken in containers and as root

Symptom: orte_init failed with No permission (-17) in containers and when running as root.

Root cause: a prior commit added --mca btl ^vader unconditionally to all single-host mpirun commands. Disabling the Vader shared-memory transport causes OpenMPI's ORTE rank initialization to fail in container and root environments.

Fix: New --mpi-btl choice flag:

- auto (default)
- vader: --mca btl vader,self
- tcp: --mca btl tcp,self

The default auto restores pre-regression behavior. The selected BTL is logged at INFO on every run.

Files: mlpstorage_py/cli/common_args.py, mlpstorage_py/utils.py, mlpstorage_py/benchmarks/dlio.py, tests/unit/test_utils.py

fix #363:
collect_cluster_info() missing required results_dir

Benchmark._collect_cluster_information() called collect_cluster_info() without the required positional argument results_dir, causing the call to fail.

This propagated as None into reportgen, causing a downstream crash.

Fix: Pass results_dir, shared_staging_dir, and ssh_username to collect_cluster_info(). Added TestCollectClusterInfoSignatureBinding regression tests so future signature drift is caught at unit-test time.

Files: mlpstorage_py/benchmarks/base.py, mlpstorage_py/tests/test_benchmarks.py

fix #365: CLI
override_parameters not reflected in metadata.json

Problem: The submission checker reads num_checkpoints_write / num_checkpoints_read from metadata['parameters'] (the YAML defaults). CLI overrides such as override_parameters.num_checkpoints_write=10 landed in metadata['override_parameters'] only, which the checker ignores. A 10-write + 10-read split-phase run would be aggregated to 20+20 and marked INVALID.

Fix: Added _apply_dotted_overrides(params, overrides) static method in Benchmark that merges dotted-key CLI overrides into metadata['parameters']. The raw override_parameters dict is still emitted unchanged for audit.

Files: mlpstorage_py/benchmarks/base.py, mlpstorage_py/rules/models.py

fix #367:
reportgen crashes with AttributeError on Namespace.file

The reportgen, history, and lockfile subcommands do not call add_storage_type_arguments(), so their Namespace objects have no .file or .object attribute. The unconditional read and del in parse_arguments() crashed with AttributeError.

Fix: Guard the --file/--object consolidation block with hasattr() checks. New unit tests in tests/unit/test_cli.py cover all subcommand types.

Files: mlpstorage_py/cli_parser.py, tests/unit/test_cli.py

fix #372: 32 GB hard cap blocks large-memory runs
On 256 GB / 512 GB hosts the hardcoded BUDGET_MB = 32 * 1024 artificially rejects valid configurations.

On a 377 GB host running 64 B200 ranks × 2 read_threads, the cap limited throughput to ~2.3 GB/s (well below a Gen5 NVMe's 14 GB/s).

Fix: BUDGET_MB = psutil.virtual_memory().total // (1024 * 1024), which scales with the machine.

File: dlio_benchmark/utils/config.py (in pinned dlio_benchmark fork)

fix #362 / #364: Training stuck at epoch 1; Flux AU limited by CPU Parquet decode
reader_factory.py routed LOCAL_FS + Parquet to ParquetReader, which calls pf.read_row_group(), a full PyArrow deserialization on every read. That path is entirely CPU-bound, saturates the Python GIL, and starves DataLoader workers. Symptom: the benchmark reaches "Starting epoch 1" and makes no NVMe I/O while CPU pegs at 88–95%.

Fix: Route LOCAL_FS + Parquet to the new ParquetReaderFileIterable, which performs raw byte-range reads via a 64-thread ThreadPoolExecutor with no PyArrow decode.

Results (c6in.16xlarge, data on tmpfs, issue #364):

File: dlio_benchmark/reader/reader_factory.py (in pinned dlio_benchmark fork)

fix #371:
direct_fs checkpointing silently uses page cache

After PR #359 renamed mlpstorage → mlpstorage_py, one import path in dlio_benchmark was missed. SimpleStreamingCheckpointing (the silent fallback) ignores backend='direct_fs' entirely and uses plain open(). Result: page cache was never bypassed even when explicitly requested.

Fix: One-line import correction: from mlpstorage_py.checkpointing import StreamingCheckpointing. Confirmed with free -h that page cache no longer grows during the write phase.

File: dlio_benchmark/checkpointing/pytorch_checkpointing.py (in pinned dlio_benchmark fork)

New Features
Full S3 / Object Storage Integration
Three client libraries are supported; select one per workload via storage.storage_options.storage_library:

- s3dlio: pip install s3dlio
- s3torchconnector: pip install s3torchconnector
- minio: pip install minio

- Multi-library object-store checkpointing (PT_OBJ_SAVE checkpoint type)
- --file/--object flags for single-flag pipeline invocation

uv Workflow
Full [project] table in pyproject.toml + uv.lock with Linux-only resolution (s3dlio ships Linux-only wheels).

New Workload Configs
- configs/dlio/workload/unet3d_b200.yaml
- configs/dlio/workload/dlrm_b200.yaml
- configs/dlio/workload/dlrm_datagen.yaml
- configs/dlio/workload/flux_datagen.yaml

New Test & Sweep Scripts

- tests/object-store/sweeps/: NP/RT sweep scripts for DLRM, Flux, RetinaNet, UNet3D
- tests/object-store/run_dlrm_bench.sh, run_flux_bench.sh
- tests/object-store/gen_retinanet_jpeg.sh, gen_unet3d_npz.sh, test_retinanet.sh, test_unet3d.sh
- tests/unit/test_cli.py, tests/unit/test_utils.py
- tests/object-store/old-archive/

New Performance Documentation

- docs/DATALOADER_ARCHITECTURE.md
- docs/DLRM_NP_Scaling_Results.md
- docs/Flux_NP_ReadThreads_Scaling_Results.md
- docs/RetinaNet_NP_Scaling_Results.md
- docs/UNet3D_NP_Scaling_Results.md

Dependency Note
dlio-benchmark is pinned to russfellows/dlio_benchmark@21c0723 (v3.0.2, includes fix #372). mlcommons/storage already references russfellows/dlio_benchmark (branch ref); this PR refines that to a specific pinned commit.

Will update to point to mlcommons/DLIO_local_changes once PR #20 is merged there.
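
Illustrative sketch for fix #369: the three-way --mpi-btl choice can be pictured as an argparse choices flag plus a lookup from the selected value to mpirun MCA arguments. The helper names add_mpi_btl_argument and btl_mca_args are hypothetical, and treating auto as "emit no BTL arguments" is an assumption drawn from the statement that auto restores pre-regression behavior. The code in mlpstorage_py/utils.py may be organized differently.

```python
import argparse
import logging

logger = logging.getLogger("mpi_btl_sketch")

# vader and tcp values are taken from the PR summary.
# ASSUMPTION: "auto" adds no BTL arguments and lets OpenMPI choose.
_BTL_ARGS = {
    "auto": [],
    "vader": ["--mca", "btl", "vader,self"],
    "tcp": ["--mca", "btl", "tcp,self"],
}


def add_mpi_btl_argument(parser):
    """Register the choice flag (hypothetical helper name)."""
    parser.add_argument(
        "--mpi-btl",
        choices=sorted(_BTL_ARGS),
        default="auto",
        help="OpenMPI byte-transfer layer to request for mpirun",
    )


def btl_mca_args(choice):
    """Return the extra mpirun arguments for the chosen BTL and log the choice."""
    logger.info("MPI BTL selection: %s", choice)
    return list(_BTL_ARGS[choice])


parser = argparse.ArgumentParser(prog="mlpstorage-sketch")
add_mpi_btl_argument(parser)
args = parser.parse_args(["--mpi-btl", "tcp"])
mpirun_cmd = ["mpirun", "-n", "4", *btl_mca_args(args.mpi_btl)]
```

A choices flag rejects unknown values at parse time, which is one reason it is a safer replacement for a boolean such as --disable-vader-btl.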
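
Illustrative sketch for fix #363: one way to catch signature drift at unit-test time is to bind the call's arguments against the callee's signature without executing it. The stand-in collect_cluster_info below is hypothetical; only its three parameter names come from the PR text, and the defaults for shared_staging_dir and ssh_username are assumptions. This is not the TestCollectClusterInfoSignatureBinding test itself, only the general technique.

```python
import inspect


def collect_cluster_info(results_dir, shared_staging_dir=None, ssh_username=None):
    """Hypothetical stand-in; the real function lives in mlpstorage_py."""
    return {
        "results_dir": results_dir,
        "shared_staging_dir": shared_staging_dir,
        "ssh_username": ssh_username,
    }


def call_binds(func, *args, **kwargs):
    """Return True if the arguments satisfy func's signature, without calling func."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# The pre-fix call shape: no arguments at all.
buggy_call_ok = call_binds(collect_cluster_info)

# The post-fix call shape: all three values passed by keyword.
fixed_call_ok = call_binds(
    collect_cluster_info,
    results_dir="/tmp/results",
    shared_staging_dir=None,
    ssh_username=None,
)
```

Because the check only binds arguments, it needs no cluster, SSH access, or results directory to run.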
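
Illustrative sketch for fix #365: merging dotted-key overrides into a nested parameters dict might look like the following. This is not the _apply_dotted_overrides implementation from the PR. The function body, the checkpoint key prefix, and the shape of the YAML defaults are all assumptions chosen for illustration.

```python
import copy


def apply_dotted_overrides(params, overrides):
    """Return a copy of params with each 'a.b.c' override written to params['a']['b']['c']."""
    merged = copy.deepcopy(params)  # leave the caller's defaults untouched
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                # Create (or replace a scalar with) an intermediate mapping.
                child = {}
                node[part] = child
            node = child
        node[leaf] = value
    return merged


# ASSUMPTION: hypothetical layout of the YAML defaults and the override key.
yaml_defaults = {"checkpoint": {"num_checkpoints_write": 1, "num_checkpoints_read": 1}}
cli_overrides = {"checkpoint.num_checkpoints_write": 10}

metadata = {
    # Merged view, which is what a checker reading 'parameters' would see.
    "parameters": apply_dotted_overrides(yaml_defaults, cli_overrides),
    # Raw overrides kept unchanged for audit.
    "override_parameters": cli_overrides,
}
```

Working on a deep copy keeps the YAML defaults reusable across runs while still emitting the effective values.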
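
Illustrative sketch for fix #367: the hasattr() guard can be shown with plain argparse.Namespace objects. What the real consolidation block in parse_arguments() computes is not described in this summary, so the storage_mode attribute and the function name consolidate_storage_flags are hypothetical. Only the guard pattern is the point.

```python
import argparse


def consolidate_storage_flags(args):
    """Fold --file/--object into one attribute, but only when both exist."""
    if hasattr(args, "file") and hasattr(args, "object"):
        # ASSUMPTION: 'storage_mode' is an invented attribute for illustration.
        args.storage_mode = "object" if args.object else "file"
        del args.file
        del args.object
    return args


# A benchmark-style namespace that carries both flags.
bench_args = consolidate_storage_flags(
    argparse.Namespace(command="training", file=False, object=True)
)

# A reportgen-style namespace that never registered the flags.
report_args = consolidate_storage_flags(argparse.Namespace(command="reportgen"))
```

Without the guard, the del statements would raise AttributeError on the second namespace, which matches the crash described for reportgen, history, and lockfile.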
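
Illustrative sketch for fix #362 / #364: raw byte-range reads through a thread pool, with no decode step, can be approximated as below. This is not ParquetReaderFileIterable; the function read_byte_ranges and the sample file are hypothetical, and the sketch assumes a POSIX platform where os.pread is available. The worker count of 64 mirrors the figure quoted in the fix description.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_byte_ranges(path, ranges, max_workers=64):
    """Read (offset, length) ranges from path in parallel; returns raw bytes in input order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # os.pread takes an explicit offset, so threads can share one
            # descriptor without seeking or locking.
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), ranges))
    finally:
        os.close(fd)


# Build a small 4096-byte sample file whose byte at offset i is i % 256.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 16)
    sample_path = tmp.name

chunks = read_byte_ranges(sample_path, [(0, 4), (256, 4), (4092, 4)])
os.unlink(sample_path)
```

The returned chunks are undecoded bytes. For a storage benchmark that is the intent, since the goal is to exercise I/O rather than Parquet deserialization.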