[OMNIML-4021]: align local JSONL loading with HF datasets path + keep original behaviour#1345

Open
shengliangxu wants to merge 5 commits into main from shengliangx/local-jsonl-dataset

Conversation

@shengliangxu
Collaborator

@shengliangxu shengliangxu commented Apr 24, 2026

Summary

Local .jsonl paths fed to get_dataset_samples / get_dataset_dataloader previously went through a text-key-only reader, while HF dataset names flowed through an auto-preprocess pipeline that recognizes messages / conversations / prompt / text / input columns. This split meant a calibration dataset behaved differently depending on whether it lived on HF Hub or on disk.

This PR routes local .jsonl through HF's json builder and the same auto-preprocess pipeline so the format is detected from columns, not the source. The legacy text-field reader is preserved as a fallback for files where the HF builder fails (e.g. PyArrow schema unification across heterogeneous rows). Existing callers passing a plain {"text": ...} JSONL file keep working unchanged; chat-shaped JSONL now works without a separate code path.
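The column-driven detection described above can be sketched roughly as follows. This is a minimal illustration under the assumption that detection just scans for the first recognized column; `detect_format` is a hypothetical helper, not the actual `_auto_preprocess_sample` implementation:

```python
# Hypothetical sketch of the column-based detection; the real
# _auto_preprocess_sample in dataset_utils.py may differ in detail.
RECOGNIZED_KEYS = ("messages", "conversations", "prompt", "text", "input")

def detect_format(sample: dict) -> str:
    """Return the first recognized column name, regardless of whether
    the row came from HF Hub or a local .jsonl file."""
    for key in RECOGNIZED_KEYS:
        if key in sample:
            return key
    raise ValueError(f"Unrecognized columns: {sorted(sample)}")

print(detect_format({"messages": [{"role": "user", "content": "hi"}]}))  # messages
print(detect_format({"text": "plain calibration text"}))                 # text
```

Because the decision depends only on columns, a chat-shaped row and a plain-text row take the same code path whether they were loaded from the Hub or from a local file.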

Changes

  • modelopt/torch/utils/dataset_utils.py
    • Route .jsonl paths through load_dataset(path="json", data_files=...) + _auto_preprocess_sample.
    • Wrap the HF load/preprocess block in try/except; on failure for .jsonl, fall back to get_jsonl_text_samples. If the fallback also fails, re-raise the original HF error so the diagnostic is preserved.
    • Update docstrings; clarify the splits = ["train"] invariant for HF's file-based builders.
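The fallback ordering in the second bullet can be sketched like this. The helper and loader names here are illustrative stand-ins, not the actual API in dataset_utils.py:

```python
import warnings

def load_jsonl_with_fallback(path, hf_loader, legacy_loader):
    """Try the HF json builder first; on failure, fall back to the
    legacy text-key reader. If the fallback also fails, re-raise the
    original HF error so its diagnostic is preserved."""
    try:
        return hf_loader(path)
    except Exception as hf_err:
        try:
            samples = legacy_loader(path)
        except Exception:
            raise hf_err  # surface the original HF diagnostic, not the fallback's
        warnings.warn(f"HF json builder failed for {path}; used legacy text reader")
        return samples

# Simulated loaders: the HF builder hits a schema-unification error,
# while the legacy text-key reader succeeds.
def failing_hf_loader(path):
    raise RuntimeError("PyArrow schema unification failed")

def legacy_text_loader(path):
    return [{"text": "calibration line"}]

print(load_jsonl_with_fallback("calib.jsonl", failing_hf_loader, legacy_text_loader))
```

Re-raising the saved HF exception rather than the fallback's keeps the more actionable error message in front of the user when neither path can read the file.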

Tests

tests/unit/torch/utils/test_dataset_utils.py — three new test classes (18 cases, all passing):

  • TestLocalJsonlLoading — text / messages / conversations / prompt+completion / input+output columns; num_samples honored; tools kwarg forwarded; ValueError on unrecognized columns; legacy text-key fallback on schema-unification failure.
  • TestGetDatasetDataloaderBlending — single JSONL, list of JSONL files concatenated, mixed-format JSONL files blended, length-mismatch assertion.
  • TestHfTinyDataset — uses hf-internal-testing/dataset_with_data_files (10 rows x {train, test}) for end-to-end coverage: single split, multiple splits, default split, HF -> JSONL -> reload round-trip, two-HF-dataset blending, HF + local-JSONL mixing.

Test plan

  • pytest tests/unit/torch/utils/test_dataset_utils.py — 28 passed
  • mypy modelopt/torch/utils/dataset_utils.py — no new errors
  • ruff check — clean

Summary by CodeRabbit

  • New Features

    • JSONL files now load through HuggingFace's dataset pipeline with consistent preprocessing (text/chat/prompt extraction).
    • Support mixing JSONL paths and dataset identifiers in a single request.
  • Bug Fixes

    • Added fallback to legacy JSONL reader if HuggingFace loading fails, with warning.
  • Documentation

    • Updated function docstrings to document JSONL and mixed dataset support.
  • Tests

    • Comprehensive test coverage for JSONL and mixed dataset loading scenarios.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 24, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

Walkthrough

Enhanced JSONL file loading in get_dataset_samples to use HuggingFace's load_dataset with auto-preprocessing (chat templates, column extraction). Fallback to line-by-line reading if loading fails. Updated docstrings for both functions. Comprehensive unit tests validate JSONL handling, preprocessing logic, dataloader concatenation, and source mixing.

Changes

Cohort / File(s) Summary
Implementation Updates
modelopt/torch/utils/dataset_utils.py
Modified get_dataset_samples to detect JSONL inputs and load via HuggingFace load_dataset(..., path="json", data_files=...) instead of legacy line-by-line reader. Added try/except wrapper for graceful fallback to text field extraction when HF loading or preprocessing fails. Updated docstrings for both get_dataset_samples and get_dataset_dataloader to document JSONL support and mixed dataset/JSONL list behavior.
Test Coverage
tests/unit/torch/utils/test_dataset_utils.py
Extensive new unit tests for JSONL preprocessing (text, messages, prompt/completion, input/output extraction), chat template rendering with tools forwarding, error handling, and fallback behavior on schema inference failures. Validates dataloader concatenation across single JSONL, multiple JSONL sources, mixed JSONL+HF datasets, and split handling. End-to-end tests confirm consistency between HF dataset samples and locally dumped JSONL reloading.

Sequence Diagram

sequenceDiagram
    participant Client as Code
    participant Detector as get_dataset_samples
    participant HF as HuggingFace<br/>load_dataset
    participant Preprocess as Auto-preprocess<br/>(chat/text/columns)
    participant Fallback as Fallback Reader<br/>(line-by-line)
    participant Return as Return Samples

    Client->>Detector: Call with JSONL path
    Detector->>Detector: Detect JSONL format
    Detector->>HF: load_dataset(path="json",<br/>data_files=...)
    
    alt HF Loading & Preprocessing Success
        HF->>Preprocess: Stream dataset rows
        Preprocess->>Preprocess: Extract/render columns<br/>(text, messages, prompt+completion, etc.)
        Preprocess->>Return: Formatted samples
    else HF Loading or Preprocessing Fails
        HF--xDetector: Fail (schema inference,<br/>unification issues)
        Detector->>Fallback: Fallback: read JSONL<br/>line-by-line
        Fallback->>Fallback: Extract 'text' field
        Fallback->>Detector: Emit warning
        Detector->>Return: Backward-compatible samples
    end
    
    Return->>Client: Samples list

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title references the ticket ID and clearly describes the main change: aligning local JSONL loading with HF datasets while preserving original behavior.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed The only production file changed is modelopt/torch/utils/dataset_utils.py, which contains no security anti-patterns listed in SECURITY.md.


@github-actions
Contributor

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1345/

Built to branch gh-pages at 2026-04-24 23:37 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@shengliangxu shengliangxu changed the title feat: align local JSONL loading with HF datasets path + keep original behaviour [OMNIML-4021]: align local JSONL loading with HF datasets path + keep original behaviour Apr 24, 2026
@shengliangxu shengliangxu marked this pull request as ready for review April 24, 2026 23:40
@shengliangxu shengliangxu requested a review from a team as a code owner April 24, 2026 23:40
@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 90.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.85%. Comparing base (7c80d85) to head (8f2f0ae).

Files with missing lines Patch % Lines
modelopt/torch/utils/dataset_utils.py 90.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1345      +/-   ##
==========================================
+ Coverage   75.63%   75.85%   +0.22%     
==========================================
  Files         471      471              
  Lines       50323    50336      +13     
==========================================
+ Hits        38060    38181     +121     
+ Misses      12263    12155     -108     
Flag Coverage Δ
examples 41.59% <50.00%> (+1.23%) ⬆️
gpu 58.37% <50.00%> (-0.82%) ⬇️
regression 14.79% <0.00%> (+0.07%) ⬆️
unit 52.83% <90.00%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)

445-506: Consider adding a network-skip marker for CI reliability.

These tests download from hf-internal-testing/* datasets, which are stable but still involve network I/O. If CI environments have unreliable network access, these tests may flake.

Consider marking them with a custom pytest marker (e.g., @pytest.mark.network) so they can be selectively skipped in constrained environments, while still running in standard CI.

Example marker usage:

@pytest.mark.network
class TestHfTinyDataset:
    """End-to-end coverage with a real (tiny) HF dataset."""
    ...

Then register the marker in pytest.ini or pyproject.toml:

markers =
    network: tests that require network access to HuggingFace Hub
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 445 - 506, Add a
network-skippable marker to the TestHfTinyDataset test class so CI can skip HF
Hub network tests when desired: annotate the TestHfTinyDataset class with a
pytest marker such as `@pytest.mark.network` (referencing the TestHfTinyDataset
class and tests like test_load_single_split_directly,
test_dataloader_blending_two_hf_datasets, etc.), and add the corresponding
marker declaration ("network: tests that require network access to HuggingFace
Hub") to pytest.ini or pyproject.toml so pytest recognizes it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e2e1cb6a-ec09-41ed-ab40-d1b508f61467

📥 Commits

Reviewing files that changed from the base of the PR and between 7c80d85 and 8f2f0ae.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py
