
Conversation

@tscholak (Collaborator) commented on Dec 15, 2025

Summary

Apriel2 Cache Refactor

  • Add methods to _AttentionCache and _SSMCache (reset, reorder, crop, batch_repeat, batch_select, is_initialized, batch_size)
  • Add _iter_caches() helper to flatten stochastic layer dicts
  • Simplify Apriel2Cache methods using new abstractions
  • Fix sliding window attention mask sizes (cumulative_length tracking)
  • Localize KDA tuple handling in _SSMCache
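A minimal sketch of the _iter_caches() pattern described above (illustrative only, not the actual implementation): flattening per-layer caches, including the dicts used by stochastic layers, turns the container-level operations into simple loops.

```python
# Illustrative sketch: names follow the bullets above, the class layout is assumed.
def _iter_caches(layer_caches):
    for cache in layer_caches:
        if isinstance(cache, dict):  # stochastic layer: one cache per sub-mixer
            yield from cache.values()
        else:
            yield cache

class Apriel2Cache:
    def __init__(self, layer_caches):
        self.layer_caches = layer_caches

    def reset(self):
        for cache in _iter_caches(self.layer_caches):
            cache.reset()

    def batch_select(self, indices):
        for cache in _iter_caches(self.layer_caches):
            cache.batch_select(indices)
```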

Qwen2 Converter

  • Add Qwen2/Qwen2.5 to Apriel2 config conversion (conversion/qwen2/ submodule)
  • Add weight mapping plan for Qwen2 models
  • Support Qwen-style bias pattern (QKV bias, no O bias)

Per-Layer Bias Support

  • Support weight-specific bias settings (query_layer.bias.enabled, etc.)
  • Bias inheritance for stochastic mixer submixers
  • Fix non-gated MLP handling (gate_proj only when gated=True)
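For illustration, the Qwen-style bias pattern could be expressed as a per-weight override roughly like this. Only query_layer.bias.enabled is named above; the remaining keys are hypothetical, assumed for symmetry, and not taken from the actual schema.

```python
# Hypothetical config fragment, not the verified Apriel2 schema.
# Qwen-style attention bias: biases on Q/K/V projections, none on the output projection.
attention_bias_overrides = {
    "query_layer": {"bias": {"enabled": True}},
    "key_layer": {"bias": {"enabled": True}},    # assumed name
    "value_layer": {"bias": {"enabled": True}},  # assumed name
    "dense_layer": {"bias": {"enabled": False}}, # assumed name for the O projection
}
```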

Surgery and Conversion Improvements

  • Document monoidal structure with State/Partial Surgery/Transition Spec types
  • Clarify algebraic laws (monoid operation, action law) in compose_configs
  • Fix vision_encoder=None handling in converters
  • Change to relative imports in apriel2 modules for portability
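The documented algebra can be summarized with a toy sketch (dict merge stands in for the real composition; the names are illustrative, not the compose_configs API):

```python
# Partial surgeries (P) form a monoid under composition and act on config states (S).
def combine(p1: dict, p2: dict) -> dict:
    """Monoid operation on partial surgeries."""
    return {**p1, **p2}

def apply(state: dict, surgery: dict) -> dict:
    """Action of a partial surgery on a full config state."""
    return {**state, **surgery}

identity = {}
state = {"num_layers": 24, "heads": 16}
p1, p2 = {"heads": 32}, {"window": 4096}

assert combine(identity, p1) == combine(p1, identity) == p1          # identity law
assert apply(apply(state, p1), p2) == apply(state, combine(p1, p2))  # action law
```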

SSM Layer Consolidation

  • Remove DiscreteMamba2 (unused)
  • Consolidate Mamba config (previously split across MambaConfig, MambaBaseConfig, Mamba2Config)
  • Add varlen support for Mamba (via optional position_indices parameter)

Conversation Format for SFT Data Preparation

  • Add ConversationSourceConfig (type: conversation) for chat datasets (Tulu 3, ShareGPT, etc.)
  • Reduce tokenization from O(n²) to O(n) using HuggingFace's return_assistant_tokens_mask
  • Add tokenize_chat() method returning (tokens, train_mask) directly
  • Add _mask_to_spans() helper to convert boolean mask to loss masking spans
  • Fix chat template documentation: entire assistant turn must be inside {% generation %} markers (following SmolLM3 pattern)
  • Add prepare_tulu3.yaml and train_supernet_qwen2.yaml training examples
  • Document performance tuning (~8k tokens/s, ~61GB memory, ~25h for 1B tokens)
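A simplified sketch of the approach described above (not the actual Fast-LLM code; it assumes a transformers tokenizer whose chat template contains {% generation %} markers):

```python
def tokenize_chat(tokenizer, messages):
    # Single O(n) pass: the generation markers let HuggingFace return a
    # per-token assistant mask alongside the token ids.
    out = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_assistant_tokens_mask=True,
    )
    tokens = out["input_ids"]
    train_mask = out["assistant_masks"]  # 1 where the loss should be computed
    return tokens, train_mask

def _mask_to_spans(train_mask):
    """Convert a boolean train mask into (begin, end) end-exclusive spans
    covering the masked (non-assistant) regions."""
    spans, start = [], None
    for i, trainable in enumerate(train_mask):
        if not trainable and start is None:
            start = i
        elif trainable and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(train_mask)))
    return spans
```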

Test plan

  • Split cache tests into contract tests (vs HuggingFace) and Apriel2-specific tests
  • Add integration tests for conversion: Qwen2 → Apriel2 → Supernet → Roundtrip
  • Add compose_configs and plan_surgery tests
  • Add expression plan tests (test_expr_plan.py)
  • Add plan execution tests (test_plan_execution.py)
  • Add tokenizer chat template validation and span extraction tests
  • Add parameterized test_tokenize_chat with exact expected tokens and trainable indices
  • Add test_mask_to_spans for span conversion
  • Manual test: prepared Tulu 3 dataset (939K samples, ~6 minutes) with Qwen2 tokenizer
  • Manual test: training run with ~8k tokens/s throughput
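As an illustration of the span-conversion test shape (the parametrize signature matches the snippet quoted in the review thread below; the cases themselves are made up):

```python
import pytest

@pytest.mark.parametrize(
    ("train_mask", "expected_loss_spans"),
    [
        ([0, 0, 1, 1, 0], [(0, 2), (4, 5)]),  # masked prefix and trailing turn
        ([1, 1, 1], []),                      # fully trainable: nothing masked
        ([0, 0, 0], [(0, 3)]),                # nothing trainable: one full span
    ],
)
def test_mask_to_spans(train_mask, expected_loss_spans):
    # assumes the _mask_to_spans sketch shown earlier
    assert _mask_to_spans(train_mask) == expected_loss_spans
```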

🤖 Generated with https://claude.com/claude-code

tscholak and others added 13 commits December 14, 2025 02:06
Cache improvements:
- Add methods to _AttentionCache and _SSMCache (reset, reorder, crop,
  batch_repeat, batch_select, is_initialized, batch_size)
- Add _iter_caches() helper to flatten stochastic layer dicts
- Simplify Apriel2Cache methods using new abstractions
- Fix sliding window attention mask sizes (cumulative_length tracking)
- Localize KDA tuple handling in _SSMCache

Test improvements:
- Split tests into contract tests (vs HuggingFace) and Apriel2-specific
- Add shared fixtures to conftest.py
- Add edge case tests for SSM tuple operations
- Remove duplicated fixture definitions

Qwen2 converter:
- Add Qwen2/Qwen2.5 to Apriel2 config conversion
- Add weight mapping plan for Qwen2 models

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit adds comprehensive support for per-layer bias configurations
in Apriel2 conversions and improves the surgery/conversion infrastructure.

Key changes:

**Per-layer bias configuration:**
- Support weight-specific bias settings (query_layer.bias.enabled, etc.)
- Bias inheritance for stochastic mixer submixers
- Proper handling of Qwen-style bias pattern (QKV bias, no O bias)

**Surgery and conversion improvements:**
- Document monoidal structure in compose_configs and plan_surgery
- Fix non-gated MLP handling (gate_proj only when gated=True)
- Fix vision_encoder=None handling in converters
- Change to relative imports in apriel2 modules for portability

**Test infrastructure:**
- Add requires_fastllm decorator for Fast-LLM dependent tests
- Fix autouse fixture scoping (module-scoped for proper ordering)
- Add comprehensive integration tests with parameterized inputs
- Test all conversion stages: Qwen2 -> Apriel2 -> Supernet -> Roundtrip
- Parameterized test inputs for batch size, padding, and generation length

**Integration test structure:**
- TestConfigPreservation: Verify config correctness at each stage
- TestNumericalEquivalence: Verify logits and generation match
- 24 tests covering 3 stages × 3 input variations × 2 checks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolved conflict in fast_llm/models/gpt/conversion/apriel2.py:
- Kept our per-layer MLP bias handling
- Kept our gated vs non-gated MLP support
Enable automatic loss masking span computation for chat/conversation
datasets using HuggingFace's {% generation %}...{% endgeneration %}
markers. This allows preparing SFT data (e.g., Tulu 3) with proper
masking of non-assistant content.

- Add ConversationSourceConfig with `type: conversation` for chat data
- Add validate_chat_template() to verify tokenizer has generation markers
- Add apply_chat_template_with_spans() for text + masking span extraction
- Tokenizer must have built-in chat template with generation markers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Inline _apply_chat_template into apply_chat_template_with_spans
- Revert unnecessary test refactoring in test_tokenizer.py
- Remove trivial config tests from test_preparator.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename test_plan_composition_torture.py → test_conversion_e2e.py
  (reflects actual purpose: end-to-end integration tests)
- Rename test_algebraic_properties.py → test_plan_execution.py
  (clearer: tests plan execution and algebraic composition laws)
- Remove stale NOTE comments referencing deleted tests
- Fix fixture naming collision: attention_config → attention_config_dict
  in TestMarkovianProperty to avoid shadowing conftest fixtures
- Consolidate shared fixtures in conftest.py

Test organization now follows clear separation:
- test_compose_configs.py: Config dict composition (structure/completeness)
- test_plan_execution.py: Plan execution (weight transfer/correctness)
- test_conversion_e2e.py: Full pipeline integration tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace apply_chat_template_with_spans with tokenize_chat (O(n) token-level)
- Add _mask_to_spans helper to convert boolean mask to loss masking spans
- Fix chat template docs: entire assistant turn must be in {% generation %}
- Add parameterized tests with exact expected tokens and trainable indices
- Add prepare_tulu3.yaml and train_supernet_qwen2.yaml examples
- Document performance tuning (~8k tokens/s, ~61GB memory, ~25h for 1B tokens)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Refactor config.py with clearer algebraic structure documentation
- Document State (S), Partial Surgery (P), and Transition Spec (T) types
- Clarify monoid structure and action laws for config composition
- Update activation_distillation_factor from 0.1 to 0.8 in small example

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- range.py: Use append() instead of extend() for tuple pairs. The extend()
  call was flattening tuples into individual integers, causing "cannot
  unpack non-iterable numpy.int64" errors when iterating over ranges (see
  the sketch after this list).

- model.py: Fix attribute name from output_layer to head. The config
  uses 'head' for the language model head configuration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Integration tests should run on realistic hardware. Roundtrip tests
(Apriel2 -> Fast-LLM -> Apriel2) now skip when CUDA is unavailable.

Changes:
- Add CUDA check to roundtrip_converted fixture
- Lazy-load roundtrip fixture in converted_model to avoid eager evaluation
- Apriel2 and supernet tests still run on CPU (16 tests)
- Roundtrip tests skip on CPU-only CI (8 tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@tscholak tscholak requested a review from RaymondLi0 December 19, 2025 19:55
@tscholak tscholak marked this pull request as ready for review December 19, 2025 19:55
@RaymondLi0 (Contributor) left a comment

I'd suggest splitting the PR into smaller pieces to make it easier to understand.
I looked at the data-preparation changes: the loss-masking spans need a fix. Otherwise, looks good!


-@config_class()
+@config_class(registry=True)
 class LanguageModelSourceConfig(Config):
@RaymondLi0 (Contributor):

Should loss_masking_spans be moved to TextSourceConfig, since it's not relevant for ConversationSourceConfig?

@RaymondLi0 (Contributor):

And same question for chosen/rejected_span and images. Unless we plan to also support those in ConversationSourceConfig?

@tscholak (Collaborator, Author):

yeah, this is cleaner!



@pytest.mark.parametrize(
("train_mask", "expected_loss_spans"),
@RaymondLi0 (Contributor) commented on Dec 19, 2025:

Fast-llm expects loss-masking spans in the format (begin, last), not (begin, end).
So we would need to add a -1 to all span-last values.
see: https://servicenow.github.io/Fast-LLM/recipes/instruction-finetuning/#step-2-format-the-dataset-into-a-chat-template and https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/data/preparator/gpt_memmap/prepare.py#L222

@tscholak (Collaborator, Author):

I think the current implementation is correct. tokenize_chat returns spans in the (begin, end) exclusive format, which is exactly what the internal pipeline expects; the (begin, last) format is only required externally.
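(Illustration of the two conventions, assuming tokens 2 through 4 are masked:)

```python
# Tokens at indices 2, 3, 4 are masked from the loss.
exclusive_span = (2, 5)  # (begin, end): end is one past the last masked token (internal format)
inclusive_span = (2, 4)  # (begin, last): index of the last masked token (external format)
assert exclusive_span[1] - 1 == inclusive_span[1]
```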

@RaymondLi0 (Contributor):

Indeed, you're right!

tscholak and others added 2 commits December 20, 2025 22:25
- Split LanguageModelSourceConfig into abstract base + DocumentSourceConfig
- Remove has_conversation property, use isinstance checks instead
- Move _mask_to_spans to tokenizer module as _train_mask_to_loss_spans
- tokenize_chat now returns (tokens, loss_masking_spans) directly
- Safer BOS/EOS handling: check anywhere in tokens, not just first/last
- Remove unused add_generation_prompt parameter from tokenize_chat

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>