[BugFix] Unwrap AttentionMask wrapper for transformers 5.9 in sequence parallel#206

Merged
tastelikefeet merged 4 commits into modelscope:main from ys2025-AI:main on May 28, 2026
@ys2025-AI
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Description

In transformers 5.9, masking_utils.sdpa_mask receives padding_mask as an AttentionMask dataclass instead of a raw torch.Tensor. The downstream _ignore_causal_mask_sdpa then calls .all() on the wrapper, which raises AttributeError and breaks sequence-parallel training.

Changes

  • twinkle/model/transformers/strategy/sequence_parallel/__init__.py
    • Added type check in sdpa_mask to unwrap AttentionMask via .to_tensor() or .mask before forwarding to origin_sdpa.
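
For readers who want to see the shape of the change, here is a minimal sketch of the unwrapping step, not the actual patch. It assumes padding_mask is passed as a keyword argument and that the wrapper exposes either .to_tensor() or .mask, as described above. FakeAttentionMask, unwrap_padding_mask and make_patched_sdpa_mask are hypothetical names used only for illustration, and the check is duck-typed (rather than a torch.Tensor type check as in the PR) so the sketch runs without PyTorch installed.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FakeAttentionMask:
    """Hypothetical stand-in for the AttentionMask wrapper; not the real class."""
    mask: Any


def unwrap_padding_mask(padding_mask: Any) -> Any:
    """Return the raw mask held by a wrapper, or the input unchanged."""
    if padding_mask is None:
        return None
    # Prefer an explicit conversion method when the wrapper offers one.
    to_tensor = getattr(padding_mask, "to_tensor", None)
    if callable(to_tensor):
        return to_tensor()
    # Otherwise fall back to the wrapped attribute.
    if hasattr(padding_mask, "mask"):
        return padding_mask.mask
    # Anything else is assumed to already be a raw mask.
    return padding_mask


def make_patched_sdpa_mask(origin_sdpa: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap origin_sdpa so it always receives an unwrapped padding_mask."""

    def sdpa_mask(*args: Any, **kwargs: Any) -> Any:
        # Assumption: padding_mask arrives as a keyword argument.
        if "padding_mask" in kwargs:
            kwargs["padding_mask"] = unwrap_padding_mask(kwargs["padding_mask"])
        return origin_sdpa(*args, **kwargs)

    return sdpa_mask
```

With this in place, code downstream of origin_sdpa only ever sees the raw mask, so a call such as .all() is made on the tensor rather than on the wrapper.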

Experiment results

Verified on 2 × A3 NPU with Qwen3.5-27B, dp=2, fsdp=2, sp=2

ys2025-AI added 4 commits May 26, 2026 16:27
- Add kernelize_model integration to ep_fsdp2_lora and fsdp2 examples
- Support model parameter in apply_npu_patch for FLA instance patching
- Implement NPU-accelerated packed MoE experts with weight caching
- Add Qwen3.5 SparseMoeBlock forward with dual Transformers version support
- Support partial RoPE and gated RMSNorm with FP32 mode option
- Add MindSpeed Triton FLA backend integration for Qwen3.5
- Add environment variable controls for patch toggles
- Add dynamic model discovery for unknown model families
- Skip weight cache when requires_grad=True to preserve autograd graph
- Resolve underlying PyTorch model from TransformersModel wrapper in FLA patch
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the sdpa_mask function to support padding_mask when it is provided as an AttentionMask wrapper instead of a raw PyTorch Tensor. The feedback suggests using torch.is_tensor instead of isinstance to maintain style consistency with the rest of the codebase.

@tastelikefeet tastelikefeet merged commit a4256a7 into modelscope:main May 28, 2026
1 of 3 checks passed