Rename CustomPack to PresetPack and refine interface #1618
Open
jayhenry wants to merge 22 commits into InternLM:main from
Conversation
- _load_pack_config_jsonl: parse [dataset_path(str), sample_idx, char_start, char_end, token_start_offset]
- Replace _load_pack_config_npy with _load_pack_config_parquet using load_mixed_dict_from_parquet
- Update _load_pack_config dispatch: .jsonl -> JSONL loader, .parquet -> Parquet loader
- Rewrite test helpers (_write_jsonl_pack, _write_parquet_pack) for new format
- Add loader-level unit tests (5 tests, all passing)
- Mark Feature 1 as passes: true in feature_list.json

Made-with: Cursor
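The JSONL format described above could be parsed roughly as follows. This is a minimal sketch, not the PR's actual code; the function name and error handling are assumptions:

```python
import json

def load_pack_config_jsonl(path):
    """Parse a JSONL pack config: one 5-element JSON array per line,
    [dataset_path(str), sample_idx, char_start, char_end, token_start_offset].
    Sketch only; the real _load_pack_config_jsonl lives in the PR."""
    slices = []
    with open(path) as f:
        for lineno, line in enumerate(f):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if len(record) != 5:
                raise ValueError(
                    f"line {lineno}: expected 5 elements, got {len(record)}")
            ds_path, s_idx, c_start, c_end, tok_off = record
            slices.append(
                (str(ds_path), int(s_idx), int(c_start), int(c_end), int(tok_off)))
    return slices
```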
- Build _path_to_ds_idx mapping from ds.path; replace dataset_id int with dataset_path str lookup
- Remove 'skip' strategy from short_pack_strategy and long_pack_strategy
- Fix token count: use ds.num_tokens[s_idx] directly (remove double-indexing via ds.sampled)
- New char-range validation: both -1 OK (plain DataItem); else char_start>=0 and char_end>char_start
- pack_infos stores 6-tuples (ds_idx, s_idx, char_start, char_end, token_start_offset, max_tokens)
- Add _FakeDataset with .path attribute; 8 new validation unit tests (13 total, all passing)
- Mark Feature 2 as passes: true in feature_list.json

Made-with: Cursor
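The char-range rule above can be sketched as a small classifier (function name is an assumption, not the PR's code):

```python
def classify_char_range(char_start, char_end):
    """Char-range validation sketch: both offsets -1 means a plain
    DataItem; otherwise require char_start >= 0 and char_end > char_start
    for a long-text slice. Anything else is rejected."""
    if char_start == -1 and char_end == -1:
        return "plain"
    if char_start >= 0 and char_end > char_start:
        return "long_text"
    raise ValueError(f"invalid char range [{char_start}, {char_end})")
```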
- Replace old token-slicing logic with DataItem/LongTextDataItem consistency check
- For char_start==-1: verify item is plain DataItem (no 'char_start' key)
- For char_start!=-1: verify LongTextDataItem fields match pack config exactly
- Retain truncation via max_tokens for long_pack_strategy='truncate'
- Update _FakeDataset with long_text_meta support for LongTextDataItem testing
- Add 6 TestGetitem unit tests (DataItem JSONL/Parquet, LongTextDataItem, mixed, error cases)
- Updated feature_list.json: marked feature InternLM#3 as passing

Made-with: Cursor
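A minimal sketch of that consistency check, assuming items are plain dicts (the helper name and dict-based access are assumptions):

```python
def check_item_matches_slice(item, char_start, char_end):
    """Consistency check sketch: a plain DataItem must not carry a
    'char_start' key; a LongTextDataItem must match the pack-config
    char range exactly. Returns the item unchanged on success."""
    if char_start == -1:
        if "char_start" in item:
            raise ValueError("pack config expects a plain DataItem")
        return item
    if item.get("char_start") != char_start or item.get("char_end") != char_end:
        raise ValueError("LongTextDataItem fields do not match pack config")
    return item
```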
…tom pack integration
- Add disable_filter: bool = False to JsonlDataset.__init__; skips num_tokens==0 and max_length filters when True
- Add disable_filter: bool = False field to DatasetConfig; forwarded to JsonlDataset.build()
- Add DataloaderConfig._force_custom_pack_settings: forces sample_ratio=1.0, enable_sequential_sampler=True, disable_filter=True for all datasets when pack_level='custom', with warnings on overrides
- Call _force_custom_pack_settings in DataloaderConfig.build before build_datasets
- Add TestDisableFilter and TestDataloaderConfigCustomMode
- Updated feature_list.json: marked feature InternLM#4 as passing (4/4 features complete)

Made-with: Cursor
- Add token_end_offset as 6th element to _PackSlice and _ValidatedSlice
- Update _load_pack_config_jsonl: require 6 elements, parse token_end_offset
- Update _load_pack_config_parquet: unpack 6th element token_end_offset
- Update _validate_pack: validate token_end_offset > token_start_offset >= 0; compute n_tokens = token_end_offset - token_start_offset (no longer reads ds.num_tokens); truncate adjusts token_end_offset of last slice
- Update __getitem__: compute max_tokens = tok_end - tok_off from validated slice
- Update all test fixtures to 6-element format; add 2 new validation tests
- All 25 tests pass; marked feature InternLM#5 as passing

Made-with: Cursor
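The token-span rule can be sketched as follows (helper name and the `max_tokens` clamp parameter are assumptions, standing in for the 'truncate' adjustment on the last slice):

```python
def token_span_len(token_start_offset, token_end_offset, max_tokens=None):
    """Validate token_end_offset > token_start_offset >= 0 and return
    n_tokens = token_end_offset - token_start_offset. When max_tokens is
    given, clamp the end offset, mimicking long_pack_strategy='truncate'."""
    if not (token_end_offset > token_start_offset >= 0):
        raise ValueError(
            f"need token_end_offset > token_start_offset >= 0, "
            f"got [{token_start_offset}, {token_end_offset})")
    if max_tokens is not None:
        token_end_offset = min(token_end_offset, token_start_offset + max_tokens)
    return token_end_offset - token_start_offset
```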
…nizeFn
- For char_start==-1 (plain TokenizeFn): slice input_ids/labels[tok_off:tok_end] so token_start_offset is correctly applied (not just tok_end truncation)
- For char_start!=-1 (LongTextTokenizeFn): return item as-is after field-match validation since it is pre-truncated at tokenize time
- Add test_plain_tokenizefn_token_start_offset_applied: verifies non-zero token_start_offset slicing on plain DataItem
- Add test_longtextdataitem_no_extra_truncation: verifies no re-slicing occurs
- All 27 tests pass; marked feature InternLM#6 as passing

Made-with: Cursor
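The plain-TokenizeFn path boils down to slicing both sequences with the same window. A sketch, assuming dict-shaped items (the function name is hypothetical):

```python
def slice_plain_item(item, tok_off, tok_end):
    """Slice input_ids and labels with [tok_off:tok_end] so a non-zero
    token_start_offset is actually applied, not just end truncation.
    LongTextDataItems bypass this: they are pre-truncated at tokenize time."""
    return {
        "input_ids": item["input_ids"][tok_off:tok_end],
        "labels": item["labels"][tok_off:tok_end],
        "num_tokens": tok_end - tok_off,
    }
```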
…rized validation
- Add load_config(path, mmap=True) loading boundaries.npy, samples.npy, paths.npy
- Rewrite __init__ to store mmap'd arrays; remove old JSONL/Parquet machinery
- Replace _validate_pack Python loop with _validate_arrays vectorized numpy checks
- Move long_pack_strategy='truncate' handling from __init__ to __getitem__
- Update test helpers to write NPY directory format; update all fixtures
- Updated feature_list.json: marked feature InternLM#7 as passing

Made-with: Cursor
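The vectorized validation replaces a per-slice Python loop with whole-array numpy checks. A sketch, assuming a per-row column layout of (char_start, char_end, token_start_offset, token_end_offset); the real _validate_arrays may differ:

```python
import numpy as np

def validate_arrays(samples):
    """Vectorized validation sketch over an (N, 4) int array with columns
    (char_start, char_end, token_start_offset, token_end_offset):
    char ranges must be either (-1, -1) or a proper half-open range, and
    token spans must satisfy end > start >= 0."""
    c0, c1, t0, t1 = samples.T
    plain = (c0 == -1) & (c1 == -1)
    long_text = (c0 >= 0) & (c1 > c0)
    if not np.all(plain | long_text):
        raise ValueError("invalid char ranges in pack config")
    if not np.all((t0 >= 0) & (t1 > t0)):
        raise ValueError("invalid token spans in pack config")
```

One pass of boolean masks over mmap'd arrays avoids both the Python-loop cost and materializing the config in memory.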
…mparison
- Add generate_stress_pack_config: greedy packing with uniform [200,16000] token lengths
- Add _MockDataset: satisfies JsonlDataset interface without file I/O
- Add TestStress with 3 tests:
- test_generate_stress_pack_config: validates NPY directory output
- test_multiprocess_getitem: 8 fork'd processes with random index sampling,
reports init time, RSS/PSS deltas, and __getitem__ latency per rank
- test_mmap_memory_saving: two subprocesses compare load_config RSS/PSS/elapsed
for mmap=True (0.2MB, 0.7ms) vs mmap=False (24MB, 7.5ms)
- Updated feature_list.json: marked feature InternLM#8 as passing (8/8 complete)
Made-with: Cursor
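The grouping step of a greedy packer like generate_stress_pack_config can be sketched as follows (the real generator also draws uniform [200, 16000] lengths and writes the NPY directory; this shows only the packing logic, with an assumed function name):

```python
def greedy_pack(lengths, max_tokens):
    """Greedy packing sketch: walk samples in order and fill each pack
    until the next sample would overflow max_tokens, then start a new
    pack. Returns lists of sample indices, one list per pack."""
    packs, cur, cur_len = [], [], 0
    for idx, n in enumerate(lengths):
        if cur and cur_len + n > max_tokens:
            packs.append(cur)
            cur, cur_len = [], 0
        cur.append(idx)
        cur_len += n
    if cur:
        packs.append(cur)
    return packs
```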
…son and PackConfig TypedDict
- Rename custom_pack.py -> preset_pack.py, CustomPackDataset -> PresetPackDataset
- Replace paths.npy (allow_pickle) with paths.json for security
- Store paths as list[str] instead of object ndarray (equivalent memory, cleaner types)
- Introduce PackConfig TypedDict for precise return type of load_config()
- Update __init__.py, config.py, custom_sampler.py, run_test.sh accordingly
- Rename test file to test_preset_pack_dataset.py and update all references

Made-with: Cursor
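After this commit, load_config could look roughly like this. A sketch under stated assumptions: the array file names and PackConfig fields mirror the commit messages, but exact dtypes and shapes are guesses:

```python
import json
import os
from typing import List, TypedDict

import numpy as np

class PackConfig(TypedDict):
    # sketch of the PackConfig TypedDict; field types are assumptions
    boundaries: np.ndarray
    samples: np.ndarray
    paths: List[str]

def load_config(path: str, mmap: bool = True) -> PackConfig:
    """Load the NPY-directory pack config. paths.json replaces the old
    allow_pickle paths.npy (no pickle on load, so no arbitrary-code risk);
    numeric arrays are memory-mapped when mmap=True."""
    mode = "r" if mmap else None
    with open(os.path.join(path, "paths.json")) as f:
        paths = json.load(f)
    return PackConfig(
        boundaries=np.load(os.path.join(path, "boundaries.npy"), mmap_mode=mode),
        samples=np.load(os.path.join(path, "samples.npy"), mmap_mode=mode),
        paths=paths,
    )
```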
- Introduced a new test case comparing the performance of the mmap fast path versus the slow path when using enable_mmap_shared with multiple local processes.
- The test validates that the fast path does not write large metadata to temporary storage, while the slow path does, ensuring correct behavior under both configurations.
- Utilized a tracking mechanism to log the number of bytes saved, for accurate performance measurement.
@claude review
No description provided.