Skip to content

fix(customizer): lone-root fallback accepts .json in addition to .jsonl#286

Open
aray12 wants to merge 1 commit into
mainfrom
astd-109-customizer-lone-root-fallback-should-also-accept-json-not
Open

fix(customizer): lone-root fallback accepts .json in addition to .jsonl#286
aray12 wants to merge 1 commit into
mainfrom
astd-109-customizer-lone-root-fallback-should-also-accept-json-not

Conversation

@aray12

@aray12 aray12 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

  • discover_dataset_files fallback (preparation.py) used glob("*.jsonl"), missing .json files even though every other discovery path (patterns, subdir walk) treated both extensions equally. A lone data.json at fileset root with no train/val naming pattern would raise DatasetFormatError.
  • Replaced the glob with iterdir() + suffix check for (".jsonl", ".json") to match the behavior of the other paths.
  • Mirrored the fix in Studio's partitionDatasetFiles so the UI discovery stays in sync with customizer.

Test plan

  • 3 new Python unit tests: lone .json → training, mixed .jsonl+.json → both claimed, multiple .json → all claimed
  • Studio spec updated: flipped "does NOT park .json" test and added mixed .jsonl+.json case
  • All 38 Python tests pass (uv run pytest services/customizer/tests/tasks/training/test_datasets.py)
  • All 13 Studio tests pass (pnpm --filter nemo-studio-ui test src/hooks/useDatasetFileDiscovery/index.spec.ts)

Closes ASTD-109

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced dataset file discovery to include both .json and .jsonl files in fallback scenarios when configured training/validation patterns don't match any files.
    • Improved fallback behavior: single unmatched files are automatically designated as training data; multiple files are all treated as training data.
  • Tests

    • Added comprehensive test coverage for new dataset discovery fallback behavior with mixed file types.

….jsonl

The discover_dataset_files fallback glob was .jsonl-only, while every
other discovery path (TRAIN_PATTERNS, VAL_PATTERNS, subdir walk) already
treated .json and .jsonl as equivalent. A user uploading data.json at
the fileset root with no train/val pattern in the name would hit a
DatasetFormatError even though the file is valid.

Fixes the fallback to use iterdir() with suffix check instead of
glob("*.jsonl"), and mirrors the change in Studio's partitionDatasetFiles
so the UI stays in sync with customizer's discovery rules.

Closes ASTD-109

Signed-off-by: Alex Ray <alray@nvidia.com>
@aray12 aray12 requested review from a team as code owners June 11, 2026 20:48
@github-actions github-actions Bot added the fix label Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Both the frontend file partitioning and backend dataset discovery logic expanded fallback behavior to recognize .json files alongside .jsonl files when no training/validation patterns match, with corresponding unit tests validating single, mixed, and multiple file scenarios.

Changes

Expand dataset file discovery to include JSON format in fallback

Layer / File(s) Summary
Frontend dataset file partitioning with JSON support
web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts, web/packages/studio/src/hooks/useDatasetFileDiscovery/index.spec.ts
partitionDatasetFiles now includes both .jsonl and .json root files in unmatchedRootJsonl fallback. Tests cover single .json, mixed .json/.jsonl, and prior behavior.
Backend dataset discovery with JSON support
services/customizer/src/nmp/customizer/tasks/training/datasets/preparation.py, services/customizer/tests/tasks/training/test_datasets.py
discover_dataset_files() fallback scans for both .jsonl and .json root files, treating one file as training or multiple as all training with filename logging. Three new tests validate single .json, mixed .jsonl/.json, and multiple .json scenarios.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: extending the lone-root fallback to accept both .json and .jsonl files instead of just .jsonl.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch astd-109-customizer-lone-root-fallback-should-also-accept-json-not
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch astd-109-customizer-lone-root-fallback-should-also-accept-json-not

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts (2)

104-107: ⚡ Quick win

Update comment to reflect both extensions.

Comment says "unmatched root .jsonl files" but the fallback now handles both .jsonl and .json files (line 25 DATASET_EXTENSIONS).

📝 Suggested fix
-  // Customizer fallback (preparation.py:324-336): when no train/val patterns
-  // matched at all, ALL unmatched root .jsonl files are claimed as training.
-  // Single file is unambiguous; multiple files trigger a warning in customizer
-  // but still get treated as training (the merged set is auto-split for val).
+  // Customizer fallback (preparation.py:324-336): when no train/val patterns
+  // matched at all, ALL unmatched root .jsonl/.json files are claimed as training.
+  // Single file is unambiguous; multiple files trigger a warning in customizer
+  // but still get treated as training (the merged set is auto-split for val).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts` around lines
104 - 107, Update the outdated comment in useDatasetFileDiscovery/index.ts to
mention both .jsonl and .json extensions: the fallback behavior (originally
described as "unmatched root .jsonl files") now applies to all extensions listed
in DATASET_EXTENSIONS, so change the wording to refer to "unmatched root dataset
files (e.g., .jsonl and .json)" or similar and ensure the comment refers to
DATASET_EXTENSIONS rather than only .jsonl so it accurately documents the
behavior used by the discovery logic.

5-5: ⚡ Quick win

Use import type for FilesetFileOutput in both index.ts and index.spec.ts.

FilesetFileOutput is only used in type positions in both files (index.ts lines 13-14, 34-36, 39-42 and index.spec.ts line 7 return type), never as a runtime value. Per coding guidelines, type-only imports must use import type.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts` at line 5,
Replace the runtime import with a type-only import for FilesetFileOutput: change
the import of FilesetFileOutput from '`@nemo/sdk/generated/platform/schema`' to an
"import type" form in both useDatasetFileDiscovery/index.ts and its test
index.spec.ts, since FilesetFileOutput is used only in type positions (e.g., the
return types and type annotations in functions referenced in the diff) and
should not be emitted at runtime.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts`:
- Around line 104-107: Update the outdated comment in
useDatasetFileDiscovery/index.ts to mention both .jsonl and .json extensions:
the fallback behavior (originally described as "unmatched root .jsonl files")
now applies to all extensions listed in DATASET_EXTENSIONS, so change the
wording to refer to "unmatched root dataset files (e.g., .jsonl and .json)" or
similar and ensure the comment refers to DATASET_EXTENSIONS rather than only
.jsonl so it accurately documents the behavior used by the discovery logic.
- Line 5: Replace the runtime import with a type-only import for
FilesetFileOutput: change the import of FilesetFileOutput from
'`@nemo/sdk/generated/platform/schema`' to an "import type" form in both
useDatasetFileDiscovery/index.ts and its test index.spec.ts, since
FilesetFileOutput is used only in type positions (e.g., the return types and
type annotations in functions referenced in the diff) and should not be emitted
at runtime.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b761a9a6-2bd7-4ef3-86b8-759f41e53659

📥 Commits

Reviewing files that changed from the base of the PR and between 53b6c8b and 1048bc5.

📒 Files selected for processing (4)
  • services/customizer/src/nmp/customizer/tasks/training/datasets/preparation.py
  • services/customizer/tests/tasks/training/test_datasets.py
  • web/packages/studio/src/hooks/useDatasetFileDiscovery/index.spec.ts
  • web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant