feat: add DeepSeek-V2 / V2-Lite / Coder-V2 architecture adapter #1408
mukund1985 wants to merge 2 commits into
…RoPE)

Closes TransformerLensOrg#1400.

DeepSeek-V2, V2-Lite, and Coder-V2 all use DeepseekV2ForCausalLM. This adds a bridge adapter covering three V2-specific differences from V3:

1. Complex-exponential RoPE: V2's rotary embedding returns freqs_cis (a complex tensor via torch.polar) rather than a (cos, sin) tuple.
   - RotaryEmbeddingBridge.forward() now passes complex tensors through without raising, leaving them for the attention bridge to consume.
   - MLAAttentionBridge.forward() detects complex position_embeddings and dispatches to a new _apply_rotary_complex() helper that mirrors DeepSeek-V2's apply_rotary_emb (view_as_complex, multiply, flatten).
2. Optional Q LoRA path: V2-Lite sets q_lora_rank=None, skipping q_a_proj/q_a_layernorm/q_b_proj and using q_proj directly instead. All three Q-path submodules are marked optional=True in the adapter; q_a_layernorm uses GeneralizedComponent (which already supports optional) rather than RMSNormalizationBridge. MLAAttentionBridge already branches on q_lora_rank at runtime.
3. Gate not hookable: DeepseekV2Moe.forward() routes via nn.functional.linear(..., self.gate.weight) rather than self.gate(hidden_states), so the gate module's forward() is never called and bridge hooks cannot fire. The gate is omitted from MoEBridge submodules; shared_experts uses __call__ and hooks fine.

Files changed:
- supported_architectures/deepseek_v2.py (new)
- supported_architectures/__init__.py: register adapter
- factories/architecture_adapter_factory.py: map DeepseekV2ForCausalLM
- generalized_components/mla_attention.py: complex RoPE support
- generalized_components/rotary_embedding.py: complex tensor pass-through
- tests/integration/model_bridge/test_deepseek_v2_adapter.py (new, 17 tests)

@jlarson4 tagging you as you have been reviewing the related MLA/bridge work. Happy to take any feedback on the approach here, particularly the complex RoPE handling in

I will review this later today! In the future, would you please comment on issues you wish to take on? This allows other contributors to see what you are working on so we aren't repeatedly duplicating efforts and allows me to get you properly assigned to the issue.

@mukund1985 This looks good to me. Excellent solution for branching the two different variations of the rotary implementation. I can merge this once the formatting CI issues are cleared up. Please run

@jlarson4 done, both addressed in commit

@jlarson4 just dug into the
These are SwigPy C-extension deprecation warnings, not from any code in this PR. I confirmed they reproduce identically on

Closes #1400.
Summary
Adds TransformerBridge support for DeepseekV2ForCausalLM (DeepSeek-V2, DeepSeek-V2-Lite, DeepSeek-Coder-V2). The adapter follows the same pattern as the existing V3 adapter but handles three V2-specific differences.

What changed
New files
- transformer_lens/model_bridge/supported_architectures/deepseek_v2.py: the adapter
- tests/integration/model_bridge/test_deepseek_v2_adapter.py: 17 integration tests covering both V2-full and V2-Lite configs

Modified files
- supported_architectures/__init__.py and factories/architecture_adapter_factory.py: register the adapter
- generalized_components/mla_attention.py: complex RoPE support (see below)
- generalized_components/rotary_embedding.py: complex tensor pass-through

V2-specific differences from V3
1. Complex-exponential RoPE
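As a dependency-free illustration of why the two rotary conventions contrasted in this section agree numerically, the sketch below rotates adjacent channel pairs both ways, using plain Python complex numbers in place of tensors. The helper names and the adjacent-pair layout are assumptions made for illustration; they are not taken from the V2 modeling code or from this PR.

```python
import cmath
import math


def rotate_complex(x, angles):
    """Rotate adjacent channel pairs by multiplying with unit complex numbers.

    Plain-Python stand-in for the view_as_complex -> multiply -> flatten steps.
    """
    out = []
    for i, theta in enumerate(angles):
        pair = complex(x[2 * i], x[2 * i + 1])    # view two reals as one complex
        rotated = pair * cmath.rect(1.0, theta)   # unit-modulus factor e^{i*theta}
        out.extend([rotated.real, rotated.imag])  # flatten back to reals
    return out


def rotate_cos_sin(x, angles):
    """Rotate the same pairs with explicit cos/sin terms."""
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([a * c - b * s, a * s + b * c])
    return out


x = [1.0, 0.0, 0.5, -2.0]
angles = [math.pi / 2, 0.3]

via_complex = rotate_complex(x, angles)
via_cos_sin = rotate_cos_sin(x, angles)

# Both formulations describe the same 2-D rotation of each pair.
assert all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(via_complex, via_cos_sin))
```

The two functions differ only in how the rotation is represented, which is why a bridge can branch on the type of the position embeddings without changing the result of the attention computation.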
V3 returns (cos, sin) from its rotary embedding. V2 returns freqs_cis = torch.polar(ones, freqs), a complex tensor, and applies rotation via complex multiplication rather than the standard (q * cos) + (rotate_half(q) * sin) formula.

Two small changes to shared components:
- RotaryEmbeddingBridge.forward(): when the original component returns a complex tensor, pass it through instead of raising. The hook_cos/hook_sin pattern does not apply here.
- MLAAttentionBridge.forward(): detects complex position_embeddings and dispatches to a new _apply_rotary_complex() helper that mirrors apply_rotary_emb from the V2 modeling code (view_as_complex → multiply → flatten). The (cos, sin) path for V3 is unchanged. KV-cache kwargs are guarded so cos/sin are only included when present.

2. Optional Q LoRA path (V2-Lite)
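A minimal sketch of the skip-when-absent behaviour this section covers, using plain Python objects rather than the bridge classes. `SubmoduleSpec`, `Q_PATH_SPECS`, and `resolve_submodules` are hypothetical names invented for illustration, and treating `q_proj` as optional is an assumption; the real bridge setup may differ in detail.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class SubmoduleSpec:
    name: str
    optional: bool = False


# Hypothetical spec list. Each checkpoint variant carries only one Q path,
# so every entry here is optional (q_proj being optional is an assumption).
Q_PATH_SPECS = [
    SubmoduleSpec("q_proj", optional=True),
    SubmoduleSpec("q_a_proj", optional=True),
    SubmoduleSpec("q_a_layernorm", optional=True),
    SubmoduleSpec("q_b_proj", optional=True),
]


def resolve_submodules(module, specs):
    """Return the spec names present on `module`; raise only for missing required ones."""
    present = []
    for spec in specs:
        if getattr(module, spec.name, None) is not None:
            present.append(spec.name)
        elif not spec.optional:
            raise AttributeError(f"required submodule {spec.name!r} is missing")
    return present


# Toy attention modules: the V2-Lite style (q_lora_rank=None) exposes q_proj only,
# while the full-size style exposes the three LoRA modules.
lite_attn = SimpleNamespace(q_proj=object())
full_attn = SimpleNamespace(q_a_proj=object(), q_a_layernorm=object(), q_b_proj=object())

lite_present = resolve_submodules(lite_attn, Q_PATH_SPECS)
full_present = resolve_submodules(full_attn, Q_PATH_SPECS)
```

Marking the entries optional lets a single adapter definition serve both variants, at the cost of not failing fast if a checkpoint is missing both Q paths.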
V2-Lite sets q_lora_rank=None, skipping Q compression and using q_proj directly. The three Q-path submodules (q_a_proj, q_a_layernorm, q_b_proj) are marked optional=True so bridge setup skips them when absent. q_a_layernorm uses GeneralizedComponent (which already supports optional) rather than RMSNormalizationBridge: the layernorm call happens inside MLAAttentionBridge.forward() via the stored HF module, so a plain wrapper suffices. MLAAttentionBridge already branches on q_lora_rank at runtime.

3. Gate not hookable
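The mechanism behind this section can be reproduced with a toy, dependency-free stand-in for a hookable module. `ToyModule`, `ToyGate`, and the two routing functions are hypothetical and only mimic the relevant behaviour of `torch.nn.Module`, where forward hooks run as part of `__call__`.

```python
class ToyModule:
    """Tiny stand-in for a hookable module: hooks run inside __call__ only."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        self._forward_hooks.append(hook)

    def __call__(self, x):
        out = self.forward(x)
        for hook in self._forward_hooks:
            hook(self, x, out)
        return out


class ToyGate(ToyModule):
    def __init__(self, weight):
        super().__init__()
        self.weight = weight  # one row of weights per expert

    def forward(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


def route_via_call(gate, x):
    # Goes through __call__, so registered hooks fire.
    return gate(x)


def route_via_weight(gate, x):
    # Reads gate.weight directly, like a functional linear call on the weight:
    # same numbers, but __call__ is bypassed and no hook runs.
    return [sum(w * v for w, v in zip(row, x)) for row in gate.weight]


gate = ToyGate([[1.0, 0.0], [0.5, 0.5]])
captured = []
gate.register_forward_hook(lambda module, inputs, output: captured.append(output))

logits_call = route_via_call(gate, [2.0, 4.0])
hooks_after_call = len(captured)

logits_direct = route_via_weight(gate, [2.0, 4.0])
hooks_after_direct = len(captured)  # unchanged: the direct path never triggers the hook
```

Both routes produce identical logits, so the only observable difference is that the direct path leaves `captured` untouched, which is the situation a wrapper around the gate module would face.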
DeepseekV2Moe.forward() routes via nn.functional.linear(hidden_states, self.gate.weight) rather than self.gate(hidden_states), so the gate module's forward() is never called and hooks attached to it never fire. The gate is omitted from MoEBridge submodules. shared_experts is called via __call__ and hooks correctly.

Tests
17 tests, all passing: