fix(customizer): lone-root fallback accepts .json in addition to .jsonl #286
aray12 wants to merge 1 commit into
The discover_dataset_files fallback glob was .jsonl-only, while every
other discovery path (TRAIN_PATTERNS, VAL_PATTERNS, subdir walk) already
treated .json and .jsonl as equivalent. A user uploading data.json at
the fileset root with no train/val pattern in the name would hit a
DatasetFormatError even though the file is valid.
Fixes the fallback to use iterdir() with suffix check instead of
glob("*.jsonl"), and mirrors the change in Studio's partitionDatasetFiles
so the UI stays in sync with customizer's discovery rules.
Closes ASTD-109
Signed-off-by: Alex Ray <alray@nvidia.com>
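To make the difference concrete, here is a minimal sketch of the two fallback strategies described in the commit message. It is not the code from preparation.py: the function names `fallback_glob_jsonl` and `fallback_iterdir` and the constant `DATASET_SUFFIXES` are hypothetical, and the real `discover_dataset_files` also handles train/val patterns and subdirectories.

```python
import tempfile
from pathlib import Path

# Assumption: mirrors the two extensions the other discovery paths accept.
DATASET_SUFFIXES = (".jsonl", ".json")


def fallback_glob_jsonl(root: Path) -> list[Path]:
    """Old behavior as described: only *.jsonl files at the root are found."""
    return sorted(root.glob("*.jsonl"))


def fallback_iterdir(root: Path) -> list[Path]:
    """New behavior as described: any root-level file with a dataset suffix."""
    return sorted(
        p for p in root.iterdir() if p.is_file() and p.suffix in DATASET_SUFFIXES
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A lone data.json at the fileset root, with no train/val hint in its name.
    (root / "data.json").write_text('[{"prompt": "hi", "completion": "hello"}]')
    (root / "README.md").write_text("not a dataset")

    old_result = [p.name for p in fallback_glob_jsonl(root)]
    new_result = [p.name for p in fallback_iterdir(root)]

print(old_result)  # []
print(new_result)  # ['data.json']
```

With the glob-based version the lone `data.json` is invisible, which is the situation that would lead to the `DatasetFormatError` described above. The suffix check picks it up and still ignores non-dataset files.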
📝 Walkthrough
Both the frontend file partitioning and backend dataset discovery logic expanded fallback behavior to recognize .json files in addition to .jsonl.
Changes
Expand dataset file discovery to include JSON format in fallback
🚥 Pre-merge checks | ✅ 5 passed
🧹 Nitpick comments (2)
web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts (2)
104-107: ⚡ Quick win: Update comment to reflect both extensions.
The comment says "unmatched root .jsonl files", but the fallback now handles both .jsonl and .json files (line 25, DATASET_EXTENSIONS).
📝 Suggested fix
-  // Customizer fallback (preparation.py:324-336): when no train/val patterns
-  // matched at all, ALL unmatched root .jsonl files are claimed as training.
-  // Single file is unambiguous; multiple files trigger a warning in customizer
-  // but still get treated as training (the merged set is auto-split for val).
+  // Customizer fallback (preparation.py:324-336): when no train/val patterns
+  // matched at all, ALL unmatched root .jsonl/.json files are claimed as training.
+  // Single file is unambiguous; multiple files trigger a warning in customizer
+  // but still get treated as training (the merged set is auto-split for val).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts` around lines 104 - 107, Update the outdated comment in useDatasetFileDiscovery/index.ts to mention both .jsonl and .json extensions: the fallback behavior (originally described as "unmatched root .jsonl files") now applies to all extensions listed in DATASET_EXTENSIONS, so change the wording to refer to "unmatched root dataset files (e.g., .jsonl and .json)" or similar and ensure the comment refers to DATASET_EXTENSIONS rather than only .jsonl so it accurately documents the behavior used by the discovery logic.
5-5: ⚡ Quick win: Use import type for FilesetFileOutput in both index.ts and index.spec.ts.
FilesetFileOutput is only used in type positions in both files (index.ts lines 13-14, 34-36, 39-42 and index.spec.ts line 7 return type), never as a runtime value. Per coding guidelines, type-only imports must use import type.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts` at line 5, Replace the runtime import with a type-only import for FilesetFileOutput: change the import of FilesetFileOutput from '`@nemo/sdk/generated/platform/schema`' to an "import type" form in both useDatasetFileDiscovery/index.ts and its test index.spec.ts, since FilesetFileOutput is used only in type positions (e.g., the return types and type annotations in functions referenced in the diff) and should not be emitted at runtime.
Source: Coding guidelines
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b761a9a6-2bd7-4ef3-86b8-759f41e53659
📒 Files selected for processing (4)
services/customizer/src/nmp/customizer/tasks/training/datasets/preparation.py
services/customizer/tests/tasks/training/test_datasets.py
web/packages/studio/src/hooks/useDatasetFileDiscovery/index.spec.ts
web/packages/studio/src/hooks/useDatasetFileDiscovery/index.ts
Summary
- discover_dataset_files fallback (preparation.py) used glob("*.jsonl"), missing .json files even though every other discovery path (patterns, subdir walk) treated both extensions equally. A lone data.json at fileset root with no train/val naming pattern would raise DatasetFormatError.
- The fallback now uses iterdir() + suffix check for (".jsonl", ".json") to match the behavior of the other paths.
- The change is mirrored in Studio's partitionDatasetFiles so the UI discovery stays in sync with customizer.
Test plan
- .json → training, mixed .jsonl + .json → both claimed, multiple .json → all claimed
- .jsonl + .json case
- (uv run pytest services/customizer/tests/tasks/training/test_datasets.py)
- (pnpm --filter nemo-studio-ui test src/hooks/useDatasetFileDiscovery/index.spec.ts)
Closes ASTD-109
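The three backend cases named in the test plan (a single .json file, mixed .jsonl and .json, multiple .json files) could be exercised roughly as follows. This is only a sketch: `claim_root_training_files` is a hypothetical stand-in for the fallback branch, not the function tested in test_datasets.py, and the file names are made up.

```python
import tempfile
from pathlib import Path


def claim_root_training_files(root: Path) -> list[str]:
    # Hypothetical stand-in for the fallback: claim every root-level
    # .jsonl/.json file as training data.
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix in (".jsonl", ".json")
    )


# (files present at the fileset root, files expected to be claimed as training)
CASES = [
    (["data.json"], ["data.json"]),                  # single .json
    (["a.jsonl", "b.json"], ["a.jsonl", "b.json"]),  # mixed .jsonl + .json
    (["x.json", "y.json"], ["x.json", "y.json"]),    # multiple .json
]

results = []
for names, expected in CASES:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in names:
            (root / name).write_text("{}\n")
        results.append(claim_root_training_files(root) == expected)

print(results)  # [True, True, True]
```

In a real suite these cases would more naturally be written as a pytest parametrized test against the actual discovery function. The loop form is used here only to keep the sketch dependency-free.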
Summary by CodeRabbit
Release Notes
Bug Fixes
- Dataset file discovery now recognizes both .json and .jsonl files in fallback scenarios when configured training/validation patterns don't match any files.
Tests