Rename CustomPack to PresetPack and refine interface #1618
Open
jayhenry wants to merge 22 commits into InternLM:main from
Conversation
- _load_pack_config_jsonl: parse [dataset_path(str), sample_idx, char_start, char_end, token_start_offset]
- Replace _load_pack_config_npy with _load_pack_config_parquet using load_mixed_dict_from_parquet
- Update _load_pack_config dispatch: .jsonl -> JSONL loader, .parquet -> Parquet loader
- Rewrite test helpers (_write_jsonl_pack, _write_parquet_pack) for new format
- Add loader-level unit tests (5 tests, all passing)
- Mark Feature 1 as passes: true in feature_list.json

Made-with: Cursor
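The JSONL format described above could be parsed roughly as follows. This is a minimal sketch, not the PR's actual code; the function name and error handling are assumptions:

```python
import json

def load_pack_config_jsonl(path):
    """Parse a JSONL pack config: one 5-element JSON array per line,
    [dataset_path(str), sample_idx, char_start, char_end, token_start_offset].
    Sketch only; the real _load_pack_config_jsonl lives in the PR."""
    slices = []
    with open(path) as f:
        for lineno, line in enumerate(f):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if len(record) != 5:
                raise ValueError(
                    f"line {lineno}: expected 5 elements, got {len(record)}")
            ds_path, s_idx, c_start, c_end, tok_off = record
            slices.append(
                (str(ds_path), int(s_idx), int(c_start), int(c_end), int(tok_off)))
    return slices
```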
- Build _path_to_ds_idx mapping from ds.path; replace dataset_id int with dataset_path str lookup
- Remove 'skip' strategy from short_pack_strategy and long_pack_strategy
- Fix token count: use ds.num_tokens[s_idx] directly (remove double-indexing via ds.sampled)
- New char-range validation: both -1 OK (plain DataItem); else char_start>=0 and char_end>char_start
- pack_infos stores 6-tuples (ds_idx, s_idx, char_start, char_end, token_start_offset, max_tokens)
- Add _FakeDataset with .path attribute; 8 new validation unit tests (13 total, all passing)
- Mark Feature 2 as passes: true in feature_list.json

Made-with: Cursor
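The char-range rule above can be sketched as a small classifier (function name is an assumption, not the PR's code):

```python
def classify_char_range(char_start, char_end):
    """Char-range validation sketch: both offsets -1 means a plain
    DataItem; otherwise require char_start >= 0 and char_end > char_start
    for a long-text slice. Anything else is rejected."""
    if char_start == -1 and char_end == -1:
        return "plain"
    if char_start >= 0 and char_end > char_start:
        return "long_text"
    raise ValueError(f"invalid char range [{char_start}, {char_end})")
```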
- Replace old token-slicing logic with DataItem/LongTextDataItem consistency check
- For char_start==-1: verify item is plain DataItem (no 'char_start' key)
- For char_start!=-1: verify LongTextDataItem fields match pack config exactly
- Retain truncation via max_tokens for long_pack_strategy='truncate'
- Update _FakeDataset with long_text_meta support for LongTextDataItem testing
- Add 6 TestGetitem unit tests (DataItem JSONL/Parquet, LongTextDataItem, mixed, error cases)
- Updated feature_list.json: marked feature InternLM#3 as passing

Made-with: Cursor
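A minimal sketch of that consistency check, assuming items are plain dicts (the helper name and dict-based access are assumptions):

```python
def check_item_matches_slice(item, char_start, char_end):
    """Consistency check sketch: a plain DataItem must not carry a
    'char_start' key; a LongTextDataItem must match the pack-config
    char range exactly. Returns the item unchanged on success."""
    if char_start == -1:
        if "char_start" in item:
            raise ValueError("pack config expects a plain DataItem")
        return item
    if item.get("char_start") != char_start or item.get("char_end") != char_end:
        raise ValueError("LongTextDataItem fields do not match pack config")
    return item
```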
…tom pack integration
- Add disable_filter: bool = False to JsonlDataset.__init__; skips num_tokens==0 and max_length filters when True
- Add disable_filter: bool = False field to DatasetConfig; forwarded to JsonlDataset.build()
- Add DataloaderConfig._force_custom_pack_settings: forces sample_ratio=1.0, enable_sequential_sampler=True, disable_filter=True for all datasets when pack_level='custom', with warnings on overrides
- Call _force_custom_pack_settings in DataloaderConfig.build before build_datasets
- Add TestDisableFilter and TestDataloaderConfigCustomMode
- Updated feature_list.json: marked feature InternLM#4 as passing (4/4 features complete)

Made-with: Cursor
- Add token_end_offset as 6th element to _PackSlice and _ValidatedSlice
- Update _load_pack_config_jsonl: require 6 elements, parse token_end_offset
- Update _load_pack_config_parquet: unpack 6th element token_end_offset
- Update _validate_pack: validate token_end_offset > token_start_offset >= 0; compute n_tokens = token_end_offset - token_start_offset (no longer reads ds.num_tokens); truncate adjusts token_end_offset of last slice
- Update __getitem__: compute max_tokens = tok_end - tok_off from validated slice
- Update all test fixtures to 6-element format; add 2 new validation tests
- All 25 tests pass; marked feature InternLM#5 as passing

Made-with: Cursor
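The token-span rule can be sketched as follows (helper name and the `max_tokens` clamp parameter are assumptions, standing in for the 'truncate' adjustment on the last slice):

```python
def token_span_len(token_start_offset, token_end_offset, max_tokens=None):
    """Validate token_end_offset > token_start_offset >= 0 and return
    n_tokens = token_end_offset - token_start_offset. When max_tokens is
    given, clamp the end offset, mimicking long_pack_strategy='truncate'."""
    if not (token_end_offset > token_start_offset >= 0):
        raise ValueError(
            f"need token_end_offset > token_start_offset >= 0, "
            f"got [{token_start_offset}, {token_end_offset})")
    if max_tokens is not None:
        token_end_offset = min(token_end_offset, token_start_offset + max_tokens)
    return token_end_offset - token_start_offset
```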
…nizeFn
- For char_start==-1 (plain TokenizeFn): slice input_ids/labels[tok_off:tok_end] so token_start_offset is correctly applied (not just tok_end truncation)
- For char_start!=-1 (LongTextTokenizeFn): return item as-is after field-match validation since it is pre-truncated at tokenize time
- Add test_plain_tokenizefn_token_start_offset_applied: verifies non-zero token_start_offset slicing on plain DataItem
- Add test_longtextdataitem_no_extra_truncation: verifies no re-slicing occurs
- All 27 tests pass; marked feature InternLM#6 as passing

Made-with: Cursor
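The plain-TokenizeFn path boils down to slicing both sequences with the same window. A sketch, assuming dict-shaped items (the function name is hypothetical):

```python
def slice_plain_item(item, tok_off, tok_end):
    """Slice input_ids and labels with [tok_off:tok_end] so a non-zero
    token_start_offset is actually applied, not just end truncation.
    LongTextDataItems bypass this: they are pre-truncated at tokenize time."""
    return {
        "input_ids": item["input_ids"][tok_off:tok_end],
        "labels": item["labels"][tok_off:tok_end],
        "num_tokens": tok_end - tok_off,
    }
```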
…rized validation
- Add load_config(path, mmap=True) loading boundaries.npy, samples.npy, paths.npy
- Rewrite __init__ to store mmap'd arrays; remove old JSONL/Parquet machinery
- Replace _validate_pack Python loop with _validate_arrays vectorized numpy checks
- Move long_pack_strategy='truncate' handling from __init__ to __getitem__
- Update test helpers to write NPY directory format; update all fixtures
- Updated feature_list.json: marked feature InternLM#7 as passing

Made-with: Cursor
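The vectorized validation replaces a per-slice Python loop with whole-array numpy checks. A sketch, assuming a per-row column layout of (char_start, char_end, token_start_offset, token_end_offset); the real _validate_arrays may differ:

```python
import numpy as np

def validate_arrays(samples):
    """Vectorized validation sketch over an (N, 4) int array with columns
    (char_start, char_end, token_start_offset, token_end_offset):
    char ranges must be either (-1, -1) or a proper half-open range, and
    token spans must satisfy end > start >= 0."""
    c0, c1, t0, t1 = samples.T
    plain = (c0 == -1) & (c1 == -1)
    long_text = (c0 >= 0) & (c1 > c0)
    if not np.all(plain | long_text):
        raise ValueError("invalid char ranges in pack config")
    if not np.all((t0 >= 0) & (t1 > t0)):
        raise ValueError("invalid token spans in pack config")
```

One pass of boolean masks over mmap'd arrays avoids both the Python-loop cost and materializing the config in memory.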
…mparison
- Add generate_stress_pack_config: greedy packing with uniform [200,16000] token lengths
- Add _MockDataset: satisfies JsonlDataset interface without file I/O
- Add TestStress with 3 tests:
- test_generate_stress_pack_config: validates NPY directory output
- test_multiprocess_getitem: 8 fork'd processes with random index sampling,
reports init time, RSS/PSS deltas, and __getitem__ latency per rank
- test_mmap_memory_saving: two subprocesses compare load_config RSS/PSS/elapsed
for mmap=True (0.2MB, 0.7ms) vs mmap=False (24MB, 7.5ms)
- Updated feature_list.json: marked feature InternLM#8 as passing (8/8 complete)
Made-with: Cursor
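The grouping step of a greedy packer like generate_stress_pack_config can be sketched as follows (the real generator also draws uniform [200, 16000] lengths and writes the NPY directory; this shows only the packing logic, with an assumed function name):

```python
def greedy_pack(lengths, max_tokens):
    """Greedy packing sketch: walk samples in order and fill each pack
    until the next sample would overflow max_tokens, then start a new
    pack. Returns lists of sample indices, one list per pack."""
    packs, cur, cur_len = [], [], 0
    for idx, n in enumerate(lengths):
        if cur and cur_len + n > max_tokens:
            packs.append(cur)
            cur, cur_len = [], 0
        cur.append(idx)
        cur_len += n
    if cur:
        packs.append(cur)
    return packs
```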
…son and PackConfig TypedDict
- Rename custom_pack.py -> preset_pack.py, CustomPackDataset -> PresetPackDataset
- Replace paths.npy (allow_pickle) with paths.json for security
- Store paths as list[str] instead of object ndarray (equivalent memory, cleaner types)
- Introduce PackConfig TypedDict for precise return type of load_config()
- Update __init__.py, config.py, custom_sampler.py, run_test.sh accordingly
- Rename test file to test_preset_pack_dataset.py and update all references

Made-with: Cursor
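After this commit, load_config could look roughly like this. A sketch under stated assumptions: the array file names and PackConfig fields mirror the commit messages, but exact dtypes and shapes are guesses:

```python
import json
import os
from typing import List, TypedDict

import numpy as np

class PackConfig(TypedDict):
    # sketch of the PackConfig TypedDict; field types are assumptions
    boundaries: np.ndarray
    samples: np.ndarray
    paths: List[str]

def load_config(path: str, mmap: bool = True) -> PackConfig:
    """Load the NPY-directory pack config. paths.json replaces the old
    allow_pickle paths.npy (no pickle on load, so no arbitrary-code risk);
    numeric arrays are memory-mapped when mmap=True."""
    mode = "r" if mmap else None
    with open(os.path.join(path, "paths.json")) as f:
        paths = json.load(f)
    return PackConfig(
        boundaries=np.load(os.path.join(path, "boundaries.npy"), mmap_mode=mode),
        samples=np.load(os.path.join(path, "samples.npy"), mmap_mode=mode),
        paths=paths,
    )
```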
- Introduced a new test case comparing the performance of the mmap fast path versus the slow path when using enable_mmap_shared with multiple local processes.
- The test validates that the fast path does not write large metadata to temporary storage, while the slow path does, ensuring correct behavior under both configurations.
- Utilized a tracking mechanism to log the number of bytes saved, for accurate performance measurement.
@claude review
No description provided.