
feat: add DeepSeek-V2 / V2-Lite / Coder-V2 architecture adapter #1408

Open
mukund1985 wants to merge 2 commits into TransformerLensOrg:dev from mukund1985:feat/deepseek-v2-adapter

@mukund1985

Closes #1400.

Summary

Adds TransformerBridge support for DeepseekV2ForCausalLM (DeepSeek-V2, DeepSeek-V2-Lite, DeepSeek-Coder-V2). The adapter follows the same pattern as the existing V3 adapter but handles three V2-specific differences.

What changed

New files

  • transformer_lens/model_bridge/supported_architectures/deepseek_v2.py — the adapter
  • tests/integration/model_bridge/test_deepseek_v2_adapter.py — 17 integration tests covering both V2-full and V2-Lite configs

Modified files

  • supported_architectures/__init__.py and factories/architecture_adapter_factory.py — register the adapter
  • generalized_components/mla_attention.py — complex RoPE support (see below)
  • generalized_components/rotary_embedding.py — complex tensor pass-through

V2-specific differences from V3

1. Complex-exponential RoPE

V3 returns (cos, sin) from its rotary embedding. V2 returns freqs_cis = torch.polar(ones, freqs) — a complex tensor — and applies rotation via complex multiplication rather than the standard (q * cos) + (rotate_half(q) * sin) formula.

Two small changes to shared components:

  • RotaryEmbeddingBridge.forward(): when the original component returns a complex tensor, pass it through instead of raising. The hook_cos/hook_sin pattern does not apply here.
  • MLAAttentionBridge.forward(): detects complex position_embeddings and dispatches to a new _apply_rotary_complex() helper that mirrors apply_rotary_emb from the V2 modeling code (view_as_complex → multiply → flatten). The (cos, sin) path for V3 is unchanged. KV-cache kwargs are guarded so cos/sin are only included when present.
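
To make the view_as_complex → multiply → flatten step concrete, here is a minimal plain-Python sketch. It is not the code in this PR: it uses `cmath` on a flat list instead of torch tensors, and the helper names `rotate_complex` and `rotate_cos_sin` are hypothetical. It only illustrates that multiplying each interleaved pair by a unit complex number is the same 2D rotation as the explicit cos/sin form applied to that pair.

```python
import cmath
import math


def rotate_complex(x, angles):
    """Rotate interleaved pairs (x[0], x[1]), (x[2], x[3]), ... by `angles`.

    Plain-Python stand-in for: view as complex -> multiply -> flatten.
    """
    out = []
    for i, theta in enumerate(angles):
        z = complex(x[2 * i], x[2 * i + 1])  # view the pair as one complex number
        z = z * cmath.rect(1.0, theta)       # unit-modulus factor, like polar(1, theta)
        out.extend([z.real, z.imag])         # flatten back to a pair of reals
    return out


def rotate_cos_sin(x, angles):
    """The same per-pair rotation written with explicit cos and sin."""
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([a * c - b * s, a * s + b * c])
    return out


x = [1.0, 0.0, 0.5, -2.0]
angles = [math.pi / 2, 0.3]
via_complex = rotate_complex(x, angles)
via_cos_sin = rotate_cos_sin(x, angles)
```

Note that the pairing here is interleaved, which is what viewing the last dimension as complex implies. A half-split rotate_half pairs element i with element i + d/2 instead, so the two layouts are not interchangeable without a permutation.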

2. Optional Q LoRA path (V2-Lite)

V2-Lite sets q_lora_rank=None, skipping Q compression and using q_proj directly. The three Q-path submodules (q_a_proj, q_a_layernorm, q_b_proj) are marked optional=True so bridge setup skips them when absent. q_a_layernorm uses GeneralizedComponent (which already supports optional) rather than RMSNormalizationBridge — the layernorm call happens inside MLAAttentionBridge.forward() via the stored HF module, so a plain wrapper suffices. MLAAttentionBridge already branches on q_lora_rank at runtime.
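
As a rough illustration of what optional=True buys during setup, the sketch below resolves a list of declared submodules against the modules a model actually has. `SubmoduleSpec`, `resolve_submodules`, and the required `o_proj` entry are hypothetical names used only for this example; the real bridge setup code is not shown in this PR description.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SubmoduleSpec:
    # Hypothetical stand-in for a bridge submodule declaration.
    name: str
    optional: bool = False


def resolve_submodules(specs: List[SubmoduleSpec], available: Set[str]) -> List[str]:
    """Return the declared submodules that exist, skipping absent optional ones."""
    resolved = []
    for spec in specs:
        if spec.name in available:
            resolved.append(spec.name)
        elif not spec.optional:
            raise KeyError(f"required submodule {spec.name!r} is missing")
    return resolved


SPECS = [
    SubmoduleSpec("q_a_proj", optional=True),
    SubmoduleSpec("q_a_layernorm", optional=True),
    SubmoduleSpec("q_b_proj", optional=True),
    SubmoduleSpec("o_proj"),  # illustrative required entry, not taken from the adapter
]

# V2-full style: the Q LoRA path is present.
full_modules = {"q_a_proj", "q_a_layernorm", "q_b_proj", "o_proj"}
# V2-Lite style: q_lora_rank=None, so only q_proj exists on the Q side.
lite_modules = {"q_proj", "o_proj"}

full_resolved = resolve_submodules(SPECS, full_modules)
lite_resolved = resolve_submodules(SPECS, lite_modules)
```

With this shape, the same declaration list serves both configs, and a genuinely missing required module still fails loudly.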

3. Gate not hookable

DeepseekV2Moe.forward() routes via nn.functional.linear(hidden_states, self.gate.weight) rather than self.gate(hidden_states), so the gate module forward is never called and bridge hooks cannot be attached. The gate is omitted from MoEBridge submodules. shared_experts is called via __call__ and hooks correctly.
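
The reason hooks cannot fire is easier to see in a toy model of the call path. The sketch below is plain Python, not torch: `TinyModule` and `linear` are hypothetical stand-ins that mimic, in spirit, a module whose `__call__` runs hooks and a functional linear that only reads the weight.

```python
class TinyModule:
    """Toy module: __call__ runs forward() and then any registered hooks."""

    def __init__(self, weight):
        self.weight = weight
        self.hooks = []

    def forward(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]

    def __call__(self, x):
        out = self.forward(x)
        for hook in self.hooks:
            hook(out)
        return out


def linear(x, weight):
    """Functional matrix-vector product that never touches the module object."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


gate = TinyModule([[1.0, 0.0], [0.0, 2.0]])
seen = []
gate.hooks.append(seen.append)  # record every output that passes through __call__

x = [3.0, 4.0]
via_weight = linear(x, gate.weight)  # reads gate.weight directly, hook does not fire
fired_after_functional = len(seen)
via_call = gate(x)                   # goes through __call__, hook fires once
fired_after_call = len(seen)
```

Both paths compute the same values; only the second one passes through the point where hooks run, which is why the gate is left out of the hookable submodules.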

Tests

17 tests, all passing:

TestDeepSeekV2BridgeCreation   — block count, MLA type
TestDeepSeekV2ForwardPass      — output shape, HF output match (atol < 0.15)
TestDeepSeekV2DenseVsMoELayers — dense vs MoE hook presence, mlp hooks all layers
TestDeepSeekV2AttentionHooks   — attn hooks fire all layers, MLA latent hooks
TestDeepSeekV2LiteBridgeCreation — block count, MLA type for Lite
TestDeepSeekV2LiteForwardPass  — output shape, HF match for Lite
TestDeepSeekV2LiteNoQLatentHook — hook_q_latent absent, hook_kv_latent fires, no NaN

…RoPE)

Closes TransformerLensOrg#1400.

DeepSeek-V2, V2-Lite, and Coder-V2 all use DeepseekV2ForCausalLM.
This adds a bridge adapter covering three V2-specific differences
from V3:

1. Complex-exponential RoPE: V2's rotary embedding returns freqs_cis
   (a complex tensor via torch.polar) rather than a (cos, sin) tuple.
   - RotaryEmbeddingBridge.forward() now passes complex tensors through
     without raising, leaving them for the attention bridge to consume.
   - MLAAttentionBridge.forward() detects complex position_embeddings
     and dispatches to a new _apply_rotary_complex() helper that mirrors
     DeepSeek-V2's apply_rotary_emb (view_as_complex, multiply, flatten).

2. Optional Q LoRA path: V2-Lite sets q_lora_rank=None, skipping
   q_a_proj/q_a_layernorm/q_b_proj and using q_proj directly instead.
   All three Q-path submodules are marked optional=True in the adapter;
   q_a_layernorm uses GeneralizedComponent (which already supports
   optional) rather than RMSNormalizationBridge. MLAAttentionBridge
   already branches on q_lora_rank at runtime.

3. Gate not hookable: DeepseekV2Moe.forward() routes via
   nn.functional.linear(..., self.gate.weight) rather than
   self.gate(hidden_states), so the gate module's forward() is never
   called and bridge hooks cannot fire. The gate is omitted from
   MoEBridge submodules; shared_experts uses __call__ and hooks fine.

Files changed:
- supported_architectures/deepseek_v2.py (new)
- supported_architectures/__init__.py: register adapter
- factories/architecture_adapter_factory.py: map DeepseekV2ForCausalLM
- generalized_components/mla_attention.py: complex RoPE support
- generalized_components/rotary_embedding.py: complex tensor pass-through
- tests/integration/model_bridge/test_deepseek_v2_adapter.py (new, 17 tests)
@mukund1985

Author

@jlarson4 tagging you as you have been reviewing the related MLA/bridge work — happy to take any feedback on the approach here, particularly the complex RoPE handling in MLAAttentionBridge.

@jlarson4

Collaborator

I will review this later today! In the future, would you please comment on issues you wish to take on? This allows other contributors to see what you are working on so we aren't repeatedly duplicating efforts and allows me to get you properly assigned to the issue.

@jlarson4

Collaborator

@mukund1985 This looks good to me. Excellent solution for branching the two different variations of the rotary implementation.

I can merge this once the formatting CI issues are cleared up. Please run make check-format and uv run mypy . on your code when you have a moment.

@mukund1985

Author

@jlarson4 done — both addressed in commit 96fd927:

  • make check-format: black flagged spacing and line wraps in test_deepseek_v2_adapter.py and deepseek_v2.py, fixed via make format.
  • uv run mypy .: one error in rotary_embedding.py, where forward() was annotated -> Tuple[Tensor, Tensor] but the V2 path legitimately returns a single complex freqs_cis tensor. Widened to Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor] and updated the docstring. Both checks pass clean now.
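
For readers unfamiliar with the pattern, a widened return type like this usually pairs with an isinstance check at the call site. The sketch below is a simplified analogy only: it uses Python floats and the built-in `complex` instead of torch tensors, and `rotary_forward` and `unpack` are hypothetical names, not functions from this PR.

```python
import cmath
import math
from typing import Tuple, Union

# Stand-in for Union[Tuple[Tensor, Tensor], Tensor] using scalar types.
RotaryOutput = Union[Tuple[float, float], complex]


def rotary_forward(theta: float, use_complex: bool) -> RotaryOutput:
    """Return either a (cos, sin) pair or a single unit complex number."""
    if use_complex:
        return cmath.rect(1.0, theta)
    return (math.cos(theta), math.sin(theta))


def unpack(out: RotaryOutput) -> Tuple[float, float]:
    """Caller-side dispatch: narrow the union before using the value."""
    if isinstance(out, complex):
        return (out.real, out.imag)
    cos, sin = out
    return (cos, sin)


from_complex = unpack(rotary_forward(0.0, use_complex=True))
from_pair = unpack(rotary_forward(0.0, use_complex=False))
```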

@mukund1985

Author

@jlarson4 format and type checks are both green. The one failing check — Notebook Checks (Attribution_Patching_Demo) — does not involve any file this PR touches. Looks like a pre-existing failure in dev, possibly related to the changes merged with #1398 yesterday.

@mukund1985

Author

@jlarson4 just dug into the Attribution_Patching_Demo failure. Cell 3 fails because the import from transformer_lens.model_bridge import TransformerBridge now produces stderr (nbval flags any unexpected stderr as a failure). The stderr is:

<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

These are SwigPy C-extension deprecation warnings — not from any code in this PR. I confirmed they reproduce identically on dev itself (without our changes), so this is a pre-existing issue in the base branch, not introduced here.
