[OMNIML-4021]: align local JSONL loading with HF datasets path + keep original behaviour #1345
shengliangxu wants to merge 5 commits into main
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
📝 Walkthrough
Enhanced JSONL file loading.
Sequence Diagram
sequenceDiagram
participant Client as Code
participant Detector as get_dataset_samples
participant HF as HuggingFace<br/>load_dataset
participant Preprocess as Auto-preprocess<br/>(chat/text/columns)
participant Fallback as Fallback Reader<br/>(line-by-line)
participant Return as Return Samples
Client->>Detector: Call with JSONL path
Detector->>Detector: Detect JSONL format
Detector->>HF: load_dataset(path="json",<br/>data_files=...)
alt HF Loading & Preprocessing Success
HF->>Preprocess: Stream dataset rows
Preprocess->>Preprocess: Extract/render columns<br/>(text, messages, prompt+completion, etc.)
Preprocess->>Return: Formatted samples
else HF Loading or Preprocessing Fails
HF--xDetector: Fail (schema inference,<br/>unification issues)
Detector->>Fallback: Fallback: read JSONL<br/>line-by-line
Fallback->>Fallback: Extract 'text' field
Fallback->>Detector: Emit warning
Detector->>Return: Backward-compatible samples
end
Return->>Client: Samples list
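The fallback branch in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the code in `dataset_utils.py`: `load_with_fallback`, `read_jsonl_text_field`, and the injected `primary` callable are hypothetical names, and `primary` stands in for the HF `load_dataset` + auto-preprocess path so the sketch runs without the `datasets` package.

```python
import json
import warnings


def read_jsonl_text_field(path, num_samples):
    """Legacy-style reader (hypothetical): one JSON object per line, keep only 'text'."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(samples) >= num_samples:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            samples.append(json.loads(line)["text"])
    return samples


def load_with_fallback(path, num_samples, primary):
    """Try `primary` first; for .jsonl paths, fall back to the text-field reader.

    `primary` is an assumed stand-in for the HF loading + preprocessing path.
    """
    try:
        return primary(path, num_samples)
    except Exception as primary_err:
        if not str(path).endswith(".jsonl"):
            raise  # no fallback for other sources
        try:
            samples = read_jsonl_text_field(path, num_samples)
        except Exception:
            # The fallback failed too: surface the original error so the
            # more informative diagnostic is the one the caller sees.
            raise primary_err from None
        warnings.warn(
            f"Primary loader failed for {path} ({primary_err}); "
            "fell back to the line-by-line 'text' reader."
        )
        return samples
```

Passing the primary loader in as an argument is only a convenience for this sketch; it keeps the control flow (try, fall back, warn, or re-raise the first error) visible without depending on any particular loader.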
Estimated Code Review Effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Codecov Report
@@            Coverage Diff             @@
##             main    #1345      +/-   ##
==========================================
+ Coverage   75.63%   75.85%   +0.22%
==========================================
  Files         471      471
  Lines       50323    50336      +13
==========================================
+ Hits        38060    38181     +121
+ Misses      12263    12155     -108
🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)
445-506: Consider adding a network-skip marker for CI reliability.
These tests download from `hf-internal-testing/*` datasets, which are stable but still involve network I/O. If CI environments have unreliable network access, these tests may flake.
Consider marking them with a custom pytest marker (e.g., `@pytest.mark.network`) so they can be selectively skipped in constrained environments, while still running in standard CI.
Example marker usage:
    @pytest.mark.network
    class TestHfTinyDataset:
        """End-to-end coverage with a real (tiny) HF dataset."""
        ...
Then configure `pytest.ini` or `pyproject.toml`:
    markers =
        network: tests that require network access to HuggingFace Hub
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 445 - 506, Add a network-skippable marker to the TestHfTinyDataset test class so CI can skip HF Hub network tests when desired: annotate the TestHfTinyDataset class with a pytest marker such as `@pytest.mark.network` (referencing the TestHfTinyDataset class and tests like test_load_single_split_directly, test_dataloader_blending_two_hf_datasets, etc.), and add the corresponding marker declaration ("network: tests that require network access to HuggingFace Hub") to pytest.ini or pyproject.toml so pytest recognizes it.
ℹ️ Review info
📒 Files selected for processing (2)
modelopt/torch/utils/dataset_utils.py
tests/unit/torch/utils/test_dataset_utils.py
Summary
Local `.jsonl` paths fed to `get_dataset_samples` / `get_dataset_dataloader` previously went through a `text`-key-only reader, while HF dataset names flowed through an auto-preprocess pipeline that recognizes `messages` / `conversations` / `prompt` / `text` / `input` columns. This split meant a calibration dataset behaved differently depending on whether it lived on HF Hub or on disk.

This PR routes local `.jsonl` through HF's `json` builder and the same auto-preprocess pipeline so the format is detected from columns, not the source. The legacy `text`-field reader is preserved as a fallback for files where the HF builder fails (e.g. PyArrow schema unification across heterogeneous rows). Existing callers passing a plain `{"text": ...}` JSONL file keep working unchanged; chat-shaped JSONL now works without a separate code path.

Changes
`modelopt/torch/utils/dataset_utils.py`
- Route `.jsonl` paths through `load_dataset(path="json", data_files=...)` + `_auto_preprocess_sample`.
- `try`/`except`: on failure for `.jsonl`, fall back to `get_jsonl_text_samples`. If the fallback also fails, re-raise the original HF error so the diagnostic is preserved.
- `splits = ["train"]` invariant for HF's file-based builders.

Tests
`tests/unit/torch/utils/test_dataset_utils.py`: three new test classes (18 cases, all passing):
- `TestLocalJsonlLoading`: text / messages / conversations / prompt+completion / input+output columns; `num_samples` honored; `tools` kwarg forwarded; ValueError on unrecognized columns; legacy `text`-key fallback on schema-unification failure.
- `TestGetDatasetDataloaderBlending`: single JSONL, list of JSONL files concatenated, mixed-format JSONL files blended, length-mismatch assertion.
- `TestHfTinyDataset`: uses `hf-internal-testing/dataset_with_data_files` (10 rows x {train, test}) for end-to-end coverage: single split, multiple splits, default split, HF -> JSONL -> reload round-trip, two-HF-dataset blending, HF + local-JSONL mixing.

Test plan
- `pytest tests/unit/torch/utils/test_dataset_utils.py`: 28 passed
- `mypy modelopt/torch/utils/dataset_utils.py`: no new errors
- `ruff check`: clean
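
To make "the format is detected from columns, not the source" concrete, here is a rough sketch of column-based detection for a single row. It is not the PR's `_auto_preprocess_sample`: the function names, the precedence order (taken from the order the Summary lists the columns), the `role`/`content` and `from`/`value` message keys, and the plain-text rendering in place of a tokenizer chat template are all assumptions made for illustration.

```python
def render_turns(turns):
    """Hypothetical stand-in for a chat template: one 'role: content' line per turn.

    Accepts either role/content or from/value keys (an assumption, not the PR's rule).
    """
    lines = []
    for turn in turns:
        role = turn.get("role", turn.get("from"))
        content = turn.get("content", turn.get("value"))
        lines.append(f"{role}: {content}")
    return "\n".join(lines)


def format_row(row):
    """Pick a rendering based on which columns the row has (assumed precedence)."""
    if "messages" in row:
        return render_turns(row["messages"])
    if "conversations" in row:
        return render_turns(row["conversations"])
    if "prompt" in row:
        # prompt + completion: plain concatenation in this sketch
        return row["prompt"] + row.get("completion", "")
    if "text" in row:
        return row["text"]
    if "input" in row:
        # input + output: plain concatenation in this sketch
        return row["input"] + row.get("output", "")
    raise ValueError(f"Unrecognized columns: {sorted(row)}")
```

Because the decision depends only on the row's keys, the same function would apply whether the row came from an HF Hub dataset or a local `.jsonl` file, which is the behaviour the Summary describes.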