Refactor Apriel2 cache, add Qwen2 converter, and conversation format for SFT #422
base: main
Conversation
Cache improvements:
- Add methods to _AttentionCache and _SSMCache (reset, reorder, crop, batch_repeat, batch_select, is_initialized, batch_size)
- Add _iter_caches() helper to flatten stochastic layer dicts
- Simplify Apriel2Cache methods using the new abstractions
- Fix sliding-window attention mask sizes (cumulative_length tracking)
- Localize KDA tuple handling in _SSMCache

Test improvements:
- Split tests into contract tests (vs HuggingFace) and Apriel2-specific tests
- Add shared fixtures to conftest.py
- Add edge-case tests for SSM tuple operations
- Remove duplicated fixture definitions

Qwen2 converter:
- Add Qwen2/Qwen2.5 to Apriel2 config conversion
- Add weight-mapping plan for Qwen2 models

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
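A minimal sketch of the per-layer cache interface these bullets describe. The method and helper names come from the commit message; the tensor layout, method bodies, and the `_iter_caches` flattening logic are illustrative assumptions, not the PR's actual code.

```python
import torch


class _AttentionCache:
    """Per-layer KV cache; method names from the commit, bodies illustrative."""

    def __init__(self) -> None:
        self.key: torch.Tensor | None = None
        self.value: torch.Tensor | None = None
        self.cumulative_length = 0  # tracked so sliding-window masks are sized correctly

    def is_initialized(self) -> bool:
        return self.key is not None

    def batch_size(self) -> int:
        assert self.key is not None
        return self.key.shape[0]

    def reset(self) -> None:
        self.key = self.value = None
        self.cumulative_length = 0

    def reorder(self, beam_idx: torch.Tensor) -> None:
        # Reorder the batch dimension, e.g. for beam search.
        if self.key is not None:
            self.key = self.key.index_select(0, beam_idx)
            self.value = self.value.index_select(0, beam_idx)

    def crop(self, max_length: int) -> None:
        # Drop cached positions past max_length, assuming a
        # (batch, heads, seq, head_dim) layout. batch_repeat/batch_select omitted.
        if self.key is not None:
            self.key = self.key[:, :, :max_length]
            self.value = self.value[:, :, :max_length]


def _iter_caches(layers):
    """Flatten per-layer caches; stochastic layers hold a dict of sub-mixer caches."""
    for layer in layers:
        if isinstance(layer, dict):
            yield from layer.values()
        else:
            yield layer
```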
This commit adds comprehensive support for per-layer bias configurations in Apriel2 conversions and improves the surgery/conversion infrastructure.

Key changes:

**Per-layer bias configuration:**
- Support weight-specific bias settings (query_layer.bias.enabled, etc.)
- Bias inheritance for stochastic mixer sub-mixers
- Proper handling of the Qwen-style bias pattern (QKV bias, no O bias)

**Surgery and conversion improvements:**
- Document the monoidal structure in compose_configs and plan_surgery
- Fix non-gated MLP handling (gate_proj only when gated=True)
- Fix vision_encoder=None handling in converters
- Change to relative imports in apriel2 modules for portability

**Test infrastructure:**
- Add requires_fastllm decorator for Fast-LLM-dependent tests
- Fix autouse fixture scoping (module-scoped for proper ordering)
- Add comprehensive integration tests with parameterized inputs
- Test all conversion stages: Qwen2 -> Apriel2 -> Supernet -> Roundtrip
- Parameterized test inputs for batch size, padding, and generation length

**Integration test structure:**
- TestConfigPreservation: verify config correctness at each stage
- TestNumericalEquivalence: verify logits and generation match
- 24 tests covering 3 stages × 3 input variations × 2 checks

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
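A hedged sketch of what a Qwen-style per-layer bias block might look like after conversion. Only `query_layer.bias.enabled` appears verbatim in the commit message; the sibling keys (`key_layer`, `value_layer`, `dense_layer`) and the surrounding structure are illustrative assumptions, not the actual schema.

```python
# Illustrative Apriel2-style mixer config fragment: Qwen-style biases
# (bias on Q/K/V projections, none on the output projection).
qwen_style_attention = {
    "type": "attention",
    "query_layer": {"bias": {"enabled": True}},   # key name from the commit
    "key_layer": {"bias": {"enabled": True}},     # assumed sibling key
    "value_layer": {"bias": {"enabled": True}},   # assumed sibling key
    "dense_layer": {"bias": {"enabled": False}},  # no O bias (Qwen pattern)
}
```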
Resolved conflict in fast_llm/models/gpt/conversion/apriel2.py:
- Kept our per-layer MLP bias handling
- Kept our gated vs non-gated MLP support
Enable automatic loss-masking span computation for chat/conversation datasets using HuggingFace's {% generation %}...{% endgeneration %} markers (a template sketch follows this commit message). This allows preparing SFT data (e.g., Tulu 3) with proper masking of non-assistant content.
- Add ConversationSourceConfig with `type: conversation` for chat data
- Add validate_chat_template() to verify tokenizer has generation markers
- Add apply_chat_template_with_spans() for text + masking span extraction
- Tokenizer must have built-in chat template with generation markers
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
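A hedged illustration of the generation markers in action: a toy ChatML-style template (not the PR's actual template) and HuggingFace's `apply_chat_template` with `return_assistant_tokens_mask=True`, which requires exactly these markers. The model checkpoint is an arbitrary example.

```python
from transformers import AutoTokenizer

# Toy template; the {% generation %}...{% endgeneration %} block marks the
# assistant's tokens as trainable. Illustrative only.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'assistant' %}"
    "<|im_start|>assistant\n"
    "{% generation %}{{ message['content'] }}<|im_end|>\n{% endgeneration %}"
    "{% else %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endif %}"
    "{% endfor %}"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
out = tokenizer.apply_chat_template(
    [
        {"role": "user", "content": "Hi!"},
        {"role": "assistant", "content": "Hello."},
    ],
    chat_template=CHAT_TEMPLATE,
    return_assistant_tokens_mask=True,  # needs {% generation %} markers
    return_dict=True,
)
# out["assistant_masks"][i] == 1 exactly for tokens inside {% generation %} blocks;
# everything else is a candidate for loss masking.
```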
- Inline _apply_chat_template into apply_chat_template_with_spans
- Revert unnecessary test refactoring in test_tokenizer.py
- Remove trivial config tests from test_preparator.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename test_plan_composition_torture.py → test_conversion_e2e.py (reflects its actual purpose: end-to-end integration tests)
- Rename test_algebraic_properties.py → test_plan_execution.py (clearer: tests plan execution and algebraic composition laws)
- Remove stale NOTE comments referencing deleted tests
- Fix fixture naming collision: attention_config → attention_config_dict in TestMarkovianProperty to avoid shadowing conftest fixtures
- Consolidate shared fixtures in conftest.py

Test organization now follows a clear separation:
- test_compose_configs.py: config dict composition (structure/completeness)
- test_plan_execution.py: plan execution (weight transfer/correctness)
- test_conversion_e2e.py: full pipeline integration tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace apply_chat_template_with_spans with tokenize_chat (O(n) token-level)
- Add _mask_to_spans helper to convert boolean mask to loss masking spans
- Fix chat template docs: entire assistant turn must be in {% generation %}
- Add parameterized tests with exact expected tokens and trainable indices
- Add prepare_tulu3.yaml and train_supernet_qwen2.yaml examples
- Document performance tuning (~8k tokens/s, ~61GB memory, ~25h for 1B tokens)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
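A hedged sketch of the boolean-mask-to-spans conversion this commit describes. The name `_mask_to_spans` is from the commit message; the body is an assumption, using the (begin, end)-exclusive convention discussed in the review thread below.

```python
def _mask_to_spans(train_mask: list[bool]) -> list[tuple[int, int]]:
    """Convert a per-token trainable mask into loss-masking spans over the
    non-trainable tokens, as (begin, end) with an exclusive end.

    Illustrative sketch; a later commit moves this helper to the tokenizer
    module as _train_mask_to_loss_spans.
    """
    spans: list[tuple[int, int]] = []
    begin: int | None = None
    for i, trainable in enumerate(train_mask):
        if not trainable and begin is None:
            begin = i  # entering a masked run
        elif trainable and begin is not None:
            spans.append((begin, i))  # leaving a masked run
            begin = None
    if begin is not None:
        spans.append((begin, len(train_mask)))  # mask runs to the end
    return spans


# Single pass over the tokens, hence O(n):
assert _mask_to_spans([False, False, True, True, False]) == [(0, 2), (4, 5)]
```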
- Refactor config.py with clearer algebraic structure documentation
- Document State (S), Partial Surgery (P), and Transition Spec (T) types
- Clarify monoid structure and action laws for config composition
- Update activation_distillation_factor from 0.1 to 0.8 in the small example

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
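For reference, the monoid-action laws this commit documents can be stated compactly. A hedged rendering using the commit's S/P naming; the operation symbols are mine, not config.py's:

```latex
% S = config state, P = partial surgery with identity e and composition \cdot.
% Surgeries act on states via apply : S \times P \to S, satisfying
\[
  \mathrm{apply}(s, e) = s, \qquad
  \mathrm{apply}(s,\ p_1 \cdot p_2) = \mathrm{apply}\bigl(\mathrm{apply}(s, p_1),\ p_2\bigr).
\]
```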
- range.py: use append() instead of extend() for tuple pairs. The extend() call was flattening tuples into individual integers, causing "cannot unpack non-iterable numpy.int64" errors when iterating over ranges.
- model.py: fix attribute name from output_layer to head. The config uses 'head' for the language-model head configuration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
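A minimal reproduction of the append()/extend() bug this commit fixes; the variable names are illustrative.

```python
pair = (3, 7)

buggy = []
buggy.extend(pair)   # flattens the tuple: buggy == [3, 7]

fixed = []
fixed.append(pair)   # keeps the pair intact: fixed == [(3, 7)]

for begin, end in fixed:   # unpacks fine
    pass
# for begin, end in buggy: -> TypeError: cannot unpack non-iterable int
# (in range.py the values were numpy.int64, hence the reported error text)
```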
Integration tests should run on realistic hardware. Roundtrip tests (Apriel2 -> Fast-LLM -> Apriel2) now skip when CUDA is unavailable.

Changes:
- Add a CUDA check to the roundtrip_converted fixture
- Lazy-load the roundtrip fixture in converted_model to avoid eager evaluation
- Apriel2 and supernet tests still run on CPU (16 tests)
- Roundtrip tests skip on CPU-only CI (8 tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
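A hedged sketch of the CUDA gate described above. The fixture names match the commit message; the bodies and the `_run_roundtrip_conversion` helper are assumptions.

```python
import pytest
import torch


@pytest.fixture(scope="module")
def roundtrip_converted():
    # Skip the whole roundtrip path on CPU-only machines (e.g. CI).
    if not torch.cuda.is_available():
        pytest.skip("Roundtrip conversion tests require CUDA")
    return _run_roundtrip_conversion()  # hypothetical helper


@pytest.fixture
def converted_model(request):
    # Lazy-load so the roundtrip fixture (and its CUDA skip) is only
    # evaluated for tests that actually request the roundtrip stage.
    stage = request.param
    if stage == "roundtrip":
        return request.getfixturevalue("roundtrip_converted")
    return request.getfixturevalue(f"{stage}_converted")
```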
I'd suggest splitting the PR into smaller pieces to make it easier to understand.
I looked at the data-preparation changes: the loss-masking spans need a fix. Otherwise looks good!
```diff
-@config_class()
+@config_class(registry=True)
 class LanguageModelSourceConfig(Config):
```
Should loss_masking_spans be moved to TextSourceConfig, since it's not relevant for ConversationSourceConfig?
And the same question for chosen/rejected_span and images, unless we plan to also support those in ConversationSourceConfig?
yeah, this is cleaner!
```python
@pytest.mark.parametrize(
    ("train_mask", "expected_loss_spans"),
```
Fast-LLM expects loss-masking spans in the format (begin, last), not (begin, end), so we would need to subtract 1 from every span end.
See: https://servicenow.github.io/Fast-LLM/recipes/instruction-finetuning/#step-2-format-the-dataset-into-a-chat-template and https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/data/preparator/gpt_memmap/prepare.py#L222
I think the current implementation is correct. tokenize_chat returns the (begin, end) exclusive format, which is exactly what the internal pipeline expects; (begin, last) is only required externally.
Indeed, you're right!
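For readers following the thread, a tiny illustration of the two span conventions discussed above; the helper name is hypothetical.

```python
# Internal convention: (begin, end), end-exclusive, like Python slices.
internal_span = (4, 9)  # covers tokens 4..8


# External convention: (begin, last), last-inclusive.
def to_external(span: tuple[int, int]) -> tuple[int, int]:
    begin, end = span
    return (begin, end - 1)


assert to_external(internal_span) == (4, 8)  # same five tokens 4..8
```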
- Split LanguageModelSourceConfig into an abstract base + DocumentSourceConfig
- Remove the has_conversation property, use isinstance checks instead
- Move _mask_to_spans to the tokenizer module as _train_mask_to_loss_spans
- tokenize_chat now returns (tokens, loss_masking_spans) directly
- Safer BOS/EOS handling: check anywhere in the tokens, not just first/last
- Remove the unused add_generation_prompt parameter from tokenize_chat

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Apriel2 Cache Refactor
Qwen2 Converter
Per-Layer Bias Support
Surgery and Conversion Improvements
SSM Layer Consolidation
Conversation Format for SFT Data Preparation
Test plan
🤖 Generated with https://claude.com/claude-code