fix(customizer): lone-root fallback accepts .json in addition to .jsonl by aray12 · Pull Request #286 · NVIDIA-NeMo/nemo-platform

aray12 · 2026-06-11T20:48:27Z

Summary

discover_dataset_files fallback (preparation.py) used glob("*.jsonl"), missing .json files even though every other discovery path (patterns, subdir walk) treated both extensions equally. A lone data.json at fileset root with no train/val naming pattern would raise DatasetFormatError.
Replaced the glob with iterdir() + suffix check for (".jsonl", ".json") to match the behavior of the other paths.
Mirrored the fix in Studio's partitionDatasetFiles so the UI discovery stays in sync with customizer.

Test plan

3 new Python unit tests: lone .json → training, mixed .jsonl+.json → both claimed, multiple .json → all claimed
Studio spec updated: flipped "does NOT park .json" test and added mixed .jsonl+.json case
All 38 Python tests pass (uv run pytest services/customizer/tests/tasks/training/test_datasets.py)
All 13 Studio tests pass (pnpm --filter nemo-studio-ui test src/hooks/useDatasetFileDiscovery/index.spec.ts)

Closes ASTD-109

Summary by CodeRabbit

Release Notes

Bug Fixes
- Enhanced dataset file discovery to include both .json and .jsonl files in fallback scenarios when configured training/validation patterns don't match any files.
- Improved fallback behavior: single unmatched files are automatically designated as training data; multiple files are all treated as training data.
Tests
- Added comprehensive test coverage for new dataset discovery fallback behavior with mixed file types.

….jsonl The discover_dataset_files fallback glob was .jsonl-only, while every other discovery path (TRAIN_PATTERNS, VAL_PATTERNS, subdir walk) already treated .json and .jsonl as equivalent. A user uploading data.json at the fileset root with no train/val pattern in the name would hit a DatasetFormatError even though the file is valid. Fixes the fallback to use iterdir() with suffix check instead of glob("*.jsonl"), and mirrors the change in Studio's partitionDatasetFiles so the UI stays in sync with customizer's discovery rules. Closes ASTD-109 Signed-off-by: Alex Ray <alray@nvidia.com>

coderabbitai · 2026-06-11T20:53:00Z

📝 Walkthrough

Walkthrough

Both the frontend file partitioning and backend dataset discovery logic expanded fallback behavior to recognize .json files alongside .jsonl files when no training/validation patterns match, with corresponding unit tests validating single, mixed, and multiple file scenarios.

Changes

Expand dataset file discovery to include JSON format in fallback

Layer / File(s)	Summary
Frontend dataset file partitioning with JSON support `web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts`, `web/packages/studio/src/hooks/useDatasetFileDiscovery/index.spec.ts`	`partitionDatasetFiles` now includes both `.jsonl` and `.json` root files in `unmatchedRootJsonl` fallback. Tests cover single `.json`, mixed `.json`/`.jsonl`, and prior behavior.
Backend dataset discovery with JSON support `services/customizer/src/nmp/customizer/tasks/training/datasets/preparation.py`, `services/customizer/tests/tasks/training/test_datasets.py`	`discover_dataset_files()` fallback scans for both `.jsonl` and `.json` root files, treating one file as training or multiple as all training with filename logging. Three new tests validate single `.json`, mixed `.jsonl`/`.json`, and multiple `.json` scenarios.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: extending the lone-root fallback to accept both .json and .jsonl files instead of just .jsonl.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch astd-109-customizer-lone-root-fallback-should-also-accept-json-not

⚔️ Resolve merge conflicts

Resolve merge conflict in branch astd-109-customizer-lone-root-fallback-should-also-accept-json-not

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts (2)

104-107: ⚡ Quick win

Update comment to reflect both extensions.

Comment says "unmatched root .jsonl files" but the fallback now handles both .jsonl and .json files (line 25 DATASET_EXTENSIONS).

📝 Suggested fix

-  // Customizer fallback (preparation.py:324-336): when no train/val patterns
-  // matched at all, ALL unmatched root .jsonl files are claimed as training.
-  // Single file is unambiguous; multiple files trigger a warning in customizer
-  // but still get treated as training (the merged set is auto-split for val).
+  // Customizer fallback (preparation.py:324-336): when no train/val patterns
+  // matched at all, ALL unmatched root .jsonl/.json files are claimed as training.
+  // Single file is unambiguous; multiple files trigger a warning in customizer
+  // but still get treated as training (the merged set is auto-split for val).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts` around lines
104 - 107, Update the outdated comment in useDatasetFileDiscovery/index.ts to
mention both .jsonl and .json extensions: the fallback behavior (originally
described as "unmatched root .jsonl files") now applies to all extensions listed
in DATASET_EXTENSIONS, so change the wording to refer to "unmatched root dataset
files (e.g., .jsonl and .json)" or similar and ensure the comment refers to
DATASET_EXTENSIONS rather than only .jsonl so it accurately documents the
behavior used by the discovery logic.

5-5: ⚡ Quick win

Use import type for FilesetFileOutput in both index.ts and index.spec.ts.

FilesetFileOutput is only used in type positions in both files (index.ts lines 13-14, 34-36, 39-42 and index.spec.ts line 7 return type), never as a runtime value. Per coding guidelines, type-only imports must use import type.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts` at line 5,
Replace the runtime import with a type-only import for FilesetFileOutput: change
the import of FilesetFileOutput from '`@nemo/sdk/generated/platform/schema`' to an
"import type" form in both useDatasetFileDiscovery/index.ts and its test
index.spec.ts, since FilesetFileOutput is used only in type positions (e.g., the
return types and type annotations in functions referenced in the diff) and
should not be emitted at runtime.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts`:
- Around line 104-107: Update the outdated comment in
useDatasetFileDiscovery/index.ts to mention both .jsonl and .json extensions:
the fallback behavior (originally described as "unmatched root .jsonl files")
now applies to all extensions listed in DATASET_EXTENSIONS, so change the
wording to refer to "unmatched root dataset files (e.g., .jsonl and .json)" or
similar and ensure the comment refers to DATASET_EXTENSIONS rather than only
.jsonl so it accurately documents the behavior used by the discovery logic.
- Line 5: Replace the runtime import with a type-only import for
FilesetFileOutput: change the import of FilesetFileOutput from
'`@nemo/sdk/generated/platform/schema`' to an "import type" form in both
useDatasetFileDiscovery/index.ts and its test index.spec.ts, since
FilesetFileOutput is used only in type positions (e.g., the return types and
type annotations in functions referenced in the diff) and should not be emitted
at runtime.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b761a9a6-2bd7-4ef3-86b8-759f41e53659

📥 Commits

Reviewing files that changed from the base of the PR and between 53b6c8b and 1048bc5.

📒 Files selected for processing (4)

services/customizer/src/nmp/customizer/tasks/training/datasets/preparation.py
services/customizer/tests/tasks/training/test_datasets.py
web/packages/studio/src/hooks/useDatasetFileDiscovery/index.spec.ts
web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts

aray12 requested review from a team as code owners June 11, 2026 20:48

github-actions Bot added the fix label Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(customizer): lone-root fallback accepts .json in addition to .jsonl#286

fix(customizer): lone-root fallback accepts .json in addition to .jsonl#286
aray12 wants to merge 1 commit into
mainfrom
astd-109-customizer-lone-root-fallback-should-also-accept-json-not

aray12 commented Jun 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 11, 2026

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aray12 commented Jun 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 11, 2026

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aray12 commented Jun 11, 2026 •

edited by coderabbitai Bot

Loading