
Conversation

@tscholak (Collaborator) commented on Dec 15, 2025

Summary

Apriel2 Cache Refactor

  • Add methods to _AttentionCache and _SSMCache (reset, reorder, crop, batch_repeat, batch_select, is_initialized, batch_size)
  • Add _iter_caches() helper to flatten stochastic layer dicts
  • Simplify Apriel2Cache methods using new abstractions
  • Fix sliding window attention mask sizes (cumulative_length tracking)
  • Localize KDA tuple handling in _SSMCache
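A minimal sketch of the _iter_caches() pattern described above (illustrative only, not the actual implementation): flattening per-layer caches, including the dicts used by stochastic layers, turns the container-level operations into simple loops.

```python
# Illustrative sketch: names follow the bullets above, the class layout is assumed.
def _iter_caches(layer_caches):
    for cache in layer_caches:
        if isinstance(cache, dict):  # stochastic layer: one cache per sub-mixer
            yield from cache.values()
        else:
            yield cache

class Apriel2Cache:
    def __init__(self, layer_caches):
        self.layer_caches = layer_caches

    def reset(self):
        for cache in _iter_caches(self.layer_caches):
            cache.reset()

    def batch_select(self, indices):
        for cache in _iter_caches(self.layer_caches):
            cache.batch_select(indices)
```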

Qwen2 Converter

  • Add Qwen2/Qwen2.5 to Apriel2 config conversion (conversion/qwen2/ submodule)
  • Add weight mapping plan for Qwen2 models
  • Support Qwen-style bias pattern (QKV bias, no O bias)

Per-Layer Bias Support

  • Support weight-specific bias settings (query_layer.bias.enabled, etc.)
  • Bias inheritance for stochastic mixer submixers
  • Fix non-gated MLP handling (gate_proj only when gated=True)
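For illustration, the Qwen-style bias pattern could be expressed as a per-weight override roughly like this. Only query_layer.bias.enabled is named above; the remaining keys are hypothetical, assumed for symmetry, and not taken from the actual schema.

```python
# Hypothetical config fragment, not the verified Apriel2 schema.
# Qwen-style attention bias: biases on Q/K/V projections, none on the output projection.
attention_bias_overrides = {
    "query_layer": {"bias": {"enabled": True}},
    "key_layer": {"bias": {"enabled": True}},    # assumed name
    "value_layer": {"bias": {"enabled": True}},  # assumed name
    "dense_layer": {"bias": {"enabled": False}}, # assumed name for the O projection
}
```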

Surgery and Conversion Improvements

  • Document monoidal structure with State/Partial Surgery/Transition Spec types
  • Clarify algebraic laws (monoid operation, action law) in compose_configs
  • Fix vision_encoder=None handling in converters
  • Change to relative imports in apriel2 modules for portability
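The documented algebra can be summarized with a toy sketch (dict merge stands in for the real composition; the names are illustrative, not the compose_configs API):

```python
# Partial surgeries (P) form a monoid under composition and act on config states (S).
def combine(p1: dict, p2: dict) -> dict:
    """Monoid operation on partial surgeries."""
    return {**p1, **p2}

def apply(state: dict, surgery: dict) -> dict:
    """Action of a partial surgery on a full config state."""
    return {**state, **surgery}

identity = {}
state = {"num_layers": 24, "heads": 16}
p1, p2 = {"heads": 32}, {"window": 4096}

assert combine(identity, p1) == combine(p1, identity) == p1          # identity law
assert apply(apply(state, p1), p2) == apply(state, combine(p1, p2))  # action law
```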

SSM Layer Consolidation

  • Remove DiscreteMamba2 (unused)
  • Consolidate Mamba config (previously split across MambaConfig, MambaBaseConfig, Mamba2Config)
  • Add varlen support for Mamba (via optional position_indices parameter)

Conversation Format for SFT Data Preparation

  • Add ConversationSourceConfig (type: conversation) for chat datasets (Tulu 3, ShareGPT, etc.)
  • Reduce tokenization from O(n²) to O(n) using HuggingFace's return_assistant_tokens_mask
  • Add tokenize_chat() method returning (tokens, train_mask) directly
  • Add _mask_to_spans() helper to convert boolean mask to loss masking spans
  • Fix chat template documentation: entire assistant turn must be inside {% generation %} markers (following SmolLM3 pattern)
  • Add prepare_tulu3.yaml and train_supernet_qwen2.yaml training examples
  • Document performance tuning (~8k tokens/s, ~61GB memory, ~25h for 1B tokens)
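A simplified sketch of the approach described above (not the actual Fast-LLM code; it assumes a transformers tokenizer whose chat template contains {% generation %} markers):

```python
def tokenize_chat(tokenizer, messages):
    # Single O(n) pass: the generation markers let HuggingFace return a
    # per-token assistant mask alongside the token ids.
    out = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_assistant_tokens_mask=True,
    )
    tokens = out["input_ids"]
    train_mask = out["assistant_masks"]  # 1 where the loss should be computed
    return tokens, train_mask

def _mask_to_spans(train_mask):
    """Convert a boolean train mask into (begin, end) end-exclusive spans
    covering the masked (non-assistant) regions."""
    spans, start = [], None
    for i, trainable in enumerate(train_mask):
        if not trainable and start is None:
            start = i
        elif trainable and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(train_mask)))
    return spans
```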

Test plan

  • Split cache tests into contract tests (vs HuggingFace) and Apriel2-specific tests
  • Add integration tests for conversion: Qwen2 → Apriel2 → Supernet → Roundtrip
  • Add compose_configs and plan_surgery tests
  • Add expression plan tests (test_expr_plan.py)
  • Add plan execution tests (test_plan_execution.py)
  • Add tokenizer chat template validation and span extraction tests
  • Add parameterized test_tokenize_chat with exact expected tokens and trainable indices
  • Add test_mask_to_spans for span conversion
  • Manual test: prepared Tulu 3 dataset (939K samples, ~6 minutes) with Qwen2 tokenizer
  • Manual test: training run with ~8k tokens/s throughput
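As an illustration of the span-conversion test shape (the parametrize signature matches the snippet quoted in the review thread below; the cases themselves are made up):

```python
import pytest

@pytest.mark.parametrize(
    ("train_mask", "expected_loss_spans"),
    [
        ([0, 0, 1, 1, 0], [(0, 2), (4, 5)]),  # masked prefix and trailing turn
        ([1, 1, 1], []),                      # fully trainable: nothing masked
        ([0, 0, 0], [(0, 3)]),                # nothing trainable: one full span
    ],
)
def test_mask_to_spans(train_mask, expected_loss_spans):
    # assumes the _mask_to_spans sketch shown earlier
    assert _mask_to_spans(train_mask) == expected_loss_spans
```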

🤖 Generated with https://claude.com/claude-code

tscholak and others added 13 commits December 14, 2025 02:06
Cache improvements:
- Add methods to _AttentionCache and _SSMCache (reset, reorder, crop,
  batch_repeat, batch_select, is_initialized, batch_size)
- Add _iter_caches() helper to flatten stochastic layer dicts
- Simplify Apriel2Cache methods using new abstractions
- Fix sliding window attention mask sizes (cumulative_length tracking)
- Localize KDA tuple handling in _SSMCache

Test improvements:
- Split tests into contract tests (vs HuggingFace) and Apriel2-specific
- Add shared fixtures to conftest.py
- Add edge case tests for SSM tuple operations
- Remove duplicated fixture definitions

Qwen2 converter:
- Add Qwen2/Qwen2.5 to Apriel2 config conversion
- Add weight mapping plan for Qwen2 models

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit adds comprehensive support for per-layer bias configurations
in Apriel2 conversions and improves the surgery/conversion infrastructure.

Key changes:

**Per-layer bias configuration:**
- Support weight-specific bias settings (query_layer.bias.enabled, etc.)
- Bias inheritance for stochastic mixer submixers
- Proper handling of Qwen-style bias pattern (QKV bias, no O bias)

**Surgery and conversion improvements:**
- Document monoidal structure in compose_configs and plan_surgery
- Fix non-gated MLP handling (gate_proj only when gated=True)
- Fix vision_encoder=None handling in converters
- Change to relative imports in apriel2 modules for portability

**Test infrastructure:**
- Add requires_fastllm decorator for Fast-LLM dependent tests
- Fix autouse fixture scoping (module-scoped for proper ordering)
- Add comprehensive integration tests with parameterized inputs
- Test all conversion stages: Qwen2 -> Apriel2 -> Supernet -> Roundtrip
- Parameterized test inputs for batch size, padding, and generation length

**Integration test structure:**
- TestConfigPreservation: Verify config correctness at each stage
- TestNumericalEquivalence: Verify logits and generation match
- 24 tests covering 3 stages × 3 input variations × 2 checks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolved conflict in fast_llm/models/gpt/conversion/apriel2.py:
- Kept our per-layer MLP bias handling
- Kept our gated vs non-gated MLP support
Enable automatic loss masking span computation for chat/conversation
datasets using HuggingFace's {% generation %}...{% endgeneration %}
markers. This allows preparing SFT data (e.g., Tulu 3) with proper
masking of non-assistant content.

- Add ConversationSourceConfig with `type: conversation` for chat data
- Add validate_chat_template() to verify tokenizer has generation markers
- Add apply_chat_template_with_spans() for text + masking span extraction
- Tokenizer must have built-in chat template with generation markers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Inline _apply_chat_template into apply_chat_template_with_spans
- Revert unnecessary test refactoring in test_tokenizer.py
- Remove trivial config tests from test_preparator.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename test_plan_composition_torture.py → test_conversion_e2e.py
  (reflects actual purpose: end-to-end integration tests)
- Rename test_algebraic_properties.py → test_plan_execution.py
  (clearer: tests plan execution and algebraic composition laws)
- Remove stale NOTE comments referencing deleted tests
- Fix fixture naming collision: attention_config → attention_config_dict
  in TestMarkovianProperty to avoid shadowing conftest fixtures
- Consolidate shared fixtures in conftest.py

Test organization now follows clear separation:
- test_compose_configs.py: Config dict composition (structure/completeness)
- test_plan_execution.py: Plan execution (weight transfer/correctness)
- test_conversion_e2e.py: Full pipeline integration tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace apply_chat_template_with_spans with tokenize_chat (O(n) token-level)
- Add _mask_to_spans helper to convert boolean mask to loss masking spans
- Fix chat template docs: entire assistant turn must be in {% generation %}
- Add parameterized tests with exact expected tokens and trainable indices
- Add prepare_tulu3.yaml and train_supernet_qwen2.yaml examples
- Document performance tuning (~8k tokens/s, ~61GB memory, ~25h for 1B tokens)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Refactor config.py with clearer algebraic structure documentation
- Document State (S), Partial Surgery (P), and Transition Spec (T) types
- Clarify monoid structure and action laws for config composition
- Update activation_distillation_factor from 0.1 to 0.8 in small example

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- range.py: Use append() instead of extend() for tuple pairs. The extend()
  call was flattening tuples into individual integers, causing "cannot
  unpack non-iterable numpy.int64" errors when iterating over ranges (see
  the sketch after this list).

- model.py: Fix attribute name from output_layer to head. The config
  uses 'head' for the language model head configuration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Integration tests should run on realistic hardware. Roundtrip tests
(Apriel2 -> Fast-LLM -> Apriel2) now skip when CUDA is unavailable.

Changes:
- Add CUDA check to roundtrip_converted fixture
- Lazy-load roundtrip fixture in converted_model to avoid eager evaluation
- Apriel2 and supernet tests still run on CPU (16 tests)
- Roundtrip tests skip on CPU-only CI (8 tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@tscholak tscholak requested a review from RaymondLi0 December 19, 2025 19:55
@tscholak tscholak marked this pull request as ready for review December 19, 2025 19:55
@RaymondLi0 (Contributor) left a comment

I'd suggest splitting the PR into smaller pieces to make it easier to understand.
I looked at the data-preparation changes: the loss-masking spans need a fix. Otherwise, looks good!


-@config_class()
+@config_class(registry=True)
 class LanguageModelSourceConfig(Config):
@RaymondLi0 (Contributor):

Should loss_masking_spans be moved to TextSourceConfig, since it's not relevant for ConversationSourceConfig?

@RaymondLi0 (Contributor):

And same question for chosen/rejected_span and images. Unless we plan to also support those in ConversationSourceConfig?

@tscholak (Collaborator, Author):

yeah, this is cleaner!



@pytest.mark.parametrize(
("train_mask", "expected_loss_spans"),
@RaymondLi0 (Contributor) commented on Dec 19, 2025:

Fast-llm expects loss-masking spans in the format (begin, last), not (begin, end).
So we would need to add a -1 to all span-last values.
see: https://servicenow.github.io/Fast-LLM/recipes/instruction-finetuning/#step-2-format-the-dataset-into-a-chat-template and https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/data/preparator/gpt_memmap/prepare.py#L222

@tscholak (Collaborator, Author):

I think the current implementation is correct. tokenize_chat returns spans in the (begin, end) exclusive format, which is exactly what the internal pipeline expects; the (begin, last) format is only required externally.
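(Illustration of the two conventions, assuming tokens 2 through 4 are masked:)

```python
# Tokens at indices 2, 3, 4 are masked from the loss.
exclusive_span = (2, 5)  # (begin, end): end is one past the last masked token (internal format)
inclusive_span = (2, 4)  # (begin, last): index of the last masked token (external format)
assert exclusive_span[1] - 1 == inclusive_span[1]
```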

@RaymondLi0 (Contributor):

Indeed, you're right!

tscholak and others added 2 commits December 20, 2025 22:25
- Split LanguageModelSourceConfig into abstract base + DocumentSourceConfig
- Remove has_conversation property, use isinstance checks instead
- Move _mask_to_spans to tokenizer module as _train_mask_to_loss_spans
- tokenize_chat now returns (tokens, loss_masking_spans) directly
- Safer BOS/EOS handling: check anywhere in tokens, not just first/last
- Remove unused add_generation_prompt parameter from tokenize_chat

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>