[compat] Support Qwen2-Audio with newer transformers by MWXGOD · Pull Request #9453 · modelscope/ms-swift

MWXGOD · 2026-05-30T11:31:39Z

Qwen2-Audio Compatibility with Newer Transformers for RLHF/GRPO

PR type

Bug fix
Feature enhancement
Documentation
Test

PR information

Motivation

Qwen2-Audio is currently constrained to the older transformers 4.48-era
encoding path. This is inconvenient for RLHF workflows such as GRPO because
recent trl versions require newer transformers releases.

When Qwen2-Audio is used with a newer transformers stack, the old audio
placeholder encoding path may encode <|AUDIO|> incorrectly, which can lead to
unstable training or corrupted inference outputs. This PR proposes a small,
Qwen2-Audio-only compatibility update so that SFT, inference, and GRPO/RLHF can
run in one environment.

Summary of changes

Encode Qwen2-Audio <|AUDIO|> contexts with the Qwen2-Audio processor instead
of treating them as generic text-only contexts.
Load audio with librosa.load(..., sr=processor.feature_extractor.sampling_rate).
Keep the existing labels and loss_scale construction after input_ids are
returned by the processor.
Keep all non-Qwen2-Audio models on the existing tokenization path.
Relax the upper bound of the Qwen2-Audio requirement: transformers>=4.45,<4.49
becomes transformers>=4.48,<6.
Add a minimal transformers>=5.0 cache compatibility patch for generation.
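To make the first item concrete: "expanding" an audio placeholder roughly means replacing the single <|AUDIO|> token with one token per audio feature frame, so that the token sequence lines up with the audio embeddings. The sketch below shows the idea with plain lists only. `AUDIO_ID`, `expand_audio_placeholders`, and the frame counts are made-up for illustration; the real Qwen2-Audio processor derives the count from the feature extractor output rather than taking it as an argument.

```python
from typing import List

AUDIO_ID = 9999  # hypothetical placeholder token id, not the real vocabulary id


def expand_audio_placeholders(input_ids: List[int], frames_per_clip: List[int]) -> List[int]:
    """Replace the i-th AUDIO_ID with frames_per_clip[i] copies of itself."""
    expanded: List[int] = []
    clip_idx = 0
    for token in input_ids:
        if token == AUDIO_ID:
            if clip_idx >= len(frames_per_clip):
                raise ValueError('more placeholders than audio clips')
            expanded.extend([AUDIO_ID] * frames_per_clip[clip_idx])
            clip_idx += 1
        else:
            expanded.append(token)
    if clip_idx != len(frames_per_clip):
        raise ValueError('more audio clips than placeholders')
    return expanded


print(expand_audio_placeholders([1, AUDIO_ID, 2], frames_per_clip=[3]))
# [1, 9999, 9999, 9999, 2]
```

If the expansion is skipped or done twice, the number of placeholder tokens no longer matches the number of audio frames, which is one plausible way to end up with the corrupted outputs described in the motivation.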

Suggested implementation

Line numbers below refer to the current official snapshot and may shift slightly
after upstream changes.

1. `swift/template/base.py`

Near the top-level imports, around line 1-15, add:

import warnings
import librosa

Replace _encode_context_list, currently starting around line 1064, with an
audio-aware version:

def _encode_context_list(
    self,
    context_list: List[Context],
    loss_scale_list: Optional[List[float]] = None,
    audio_path_list: Optional[List[str]] = None,
) -> Tuple[List[int], List[int], List[float]]:
    is_binary_loss_scale = self.is_binary_loss_scale
    if is_binary_loss_scale is None:
        is_binary_loss_scale = self.loss_scale.is_binary_loss_scale
    input_ids: List[int] = []
    labels: List[int] = []
    loss_scale: List[float] = []
    if loss_scale_list is None:
        loss_scale_list = [0.] * len(context_list)

    audio_ptr = 0
    for context, loss_weight in zip(context_list, loss_scale_list):
        if isinstance(context, str) and '<|AUDIO|>' in context:
            if audio_path_list is None or audio_ptr >= len(audio_path_list):
                warnings.warn(
                    'Found <|AUDIO|> but no matching audio input; fallback to text tokenization',
                    RuntimeWarning)
                token_list = self._tokenize(context)
            else:
                sample_rate = self.processor.feature_extractor.sampling_rate
                wav, _ = librosa.load(audio_path_list[audio_ptr], sr=sample_rate, mono=True)
                encoded = self.processor(
                    text=context,
                    audio=wav,
                    sampling_rate=sample_rate,
                    return_tensors=None,
                    add_special_tokens=False,
                )
                token_list = encoded['input_ids']
                if len(token_list) > 0 and isinstance(token_list[0], list):
                    token_list = token_list[0]
                audio_ptr += 1
        else:
            token_list = self._tokenize(context) if isinstance(context, str) else context

        input_ids += token_list
        if loss_weight > 0.0:
            labels += token_list
        else:
            labels += [-100] * len(token_list)
        if not is_binary_loss_scale:
            loss_scale.extend([loss_weight] * len(token_list))
    if is_binary_loss_scale:
        loss_scale = None
    return input_ids, labels, loss_scale
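The label and loss_scale bookkeeping at the end of the loop can be tried in isolation. The following is a minimal sketch with pre-tokenized segments standing in for the tokenizer and processor; `build_labels` and the token ids are hypothetical and only mirror the masking logic of the function above.

```python
from typing import List, Optional, Tuple

IGNORE_INDEX = -100  # label value that the loss ignores, as in the function above


def build_labels(
    token_lists: List[List[int]],
    loss_weights: List[float],
    binary: bool,
) -> Tuple[List[int], List[int], Optional[List[float]]]:
    input_ids: List[int] = []
    labels: List[int] = []
    loss_scale: List[float] = []
    for tokens, weight in zip(token_lists, loss_weights):
        input_ids += tokens
        # Segments with zero weight are masked out of the loss.
        labels += tokens if weight > 0.0 else [IGNORE_INDEX] * len(tokens)
        if not binary:
            loss_scale += [weight] * len(tokens)
    return input_ids, labels, (None if binary else loss_scale)


prompt, answer = [11, 12, 13], [21, 22]
input_ids, labels, loss_scale = build_labels([prompt, answer], [0.0, 1.0], binary=False)
print(input_ids)   # [11, 12, 13, 21, 22]
print(labels)      # [-100, -100, -100, 21, 22]
print(loss_scale)  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Because the masking only depends on the length of each token list, it works the same whether a segment was tokenized as text or returned by the audio processor.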

In _encode, around line 1472 after
self._simplify_context_list(...), call the audio-aware path only for
Qwen2-Audio:

res_context_list, loss_scale_list = self._simplify_context_list(res_context_list, loss_scale_list, inputs)
if self.tokenizer.model_meta.model_type == 'qwen2_audio':
    input_ids, labels, loss_scale = self._encode_context_list(
        res_context_list, loss_scale_list, inputs.audios)
else:
    input_ids, labels, loss_scale = self._encode_context_list(res_context_list, loss_scale_list)
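The audio-aware path consumes inputs.audios in order, one entry per context that contains <|AUDIO|>. A rough sketch of that pairing, detached from the template class, is shown below; `pair_audio` is a hypothetical helper, not part of ms-swift.

```python
from typing import List, Optional, Tuple

AUDIO_TAG = '<|AUDIO|>'


def pair_audio(contexts: List[str], audio_paths: Optional[List[str]]) -> List[Tuple[str, Optional[str]]]:
    """Pair each context containing AUDIO_TAG with the next unused audio path.

    Contexts without the tag, and tagged contexts left over once the paths
    run out, are paired with None (the text-only fallback).
    """
    paths = audio_paths or []
    pairs: List[Tuple[str, Optional[str]]] = []
    ptr = 0
    for context in contexts:
        if AUDIO_TAG in context and ptr < len(paths):
            pairs.append((context, paths[ptr]))
            ptr += 1
        else:
            pairs.append((context, None))
    return pairs


print(pair_audio(['Describe: ', '<|AUDIO|>', ' and <|AUDIO|>'], ['a.wav']))
# [('Describe: ', None), ('<|AUDIO|>', 'a.wav'), (' and <|AUDIO|>', None)]
```

Like the proposed function, this consumes one path per context rather than per placeholder occurrence, so a single context containing two <|AUDIO|> tags would need extra handling.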

2. `swift/model/models/qwen.py`

If the target branch does not already import transformers, add it near the
top-level imports:

import transformers

Replace Qwen2AudioLoader, currently starting around line 1815, with:

class Qwen2AudioLoader(ModelLoader):

    @staticmethod
    def _is_transformers5() -> bool:
        return version.parse(transformers.__version__) >= version.parse('5.0.0')

    def _patch_transformers5_model(self, model: PreTrainedModel) -> PreTrainedModel:
        if not self._is_transformers5():
            return model
        generation_config = getattr(model, 'generation_config', None)
        if generation_config is not None and hasattr(generation_config, 'cache_implementation'):
            generation_config.cache_implementation = None
        _patch_hybrid_cache_device_update()
        return model

    def get_model(self, model_dir: str, *args, **kwargs) -> PreTrainedModel:
        from transformers import Qwen2AudioForConditionalGeneration
        self.auto_model_cls = self.auto_model_cls or Qwen2AudioForConditionalGeneration
        model = super().get_model(model_dir, *args, **kwargs)
        return self._patch_transformers5_model(model)

Update the Qwen2-Audio requirement, currently around line 1835:

requires=['transformers>=4.48', 'librosa'],

Add the cache helper after the Qwen2-Audio registration and before the next
model loader class:

def _patch_hybrid_cache_device_update() -> None:
    try:
        from transformers.cache_utils import HybridCache

        def update(self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx: int, *args,
                   **kwargs) -> Tuple[torch.Tensor]:
            self.key_cache[layer_idx] = self.key_cache[layer_idx].to(key_states.device)
            self.value_cache[layer_idx] = self.value_cache[layer_idx].to(value_states.device)
            return self._update_origin(key_states, value_states, layer_idx, *args, **kwargs)

        if not hasattr(HybridCache, '_update_origin'):
            HybridCache._update_origin = HybridCache.update
            HybridCache.update = update
    except ImportError:
        pass
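The `_update_origin` guard is what keeps the patch safe to apply more than once. The pattern can be illustrated without transformers; `DummyCache` and `patch_update_once` below are stand-ins invented for this sketch, and the call counter replaces the device move done by the real helper.

```python
class DummyCache:
    """Stand-in for a cache class; not the transformers HybridCache."""

    def update(self, key: str, value: str) -> tuple:
        return key, value


def patch_update_once(cls) -> None:
    """Wrap cls.update exactly once, keeping the original as cls._update_origin."""
    if hasattr(cls, '_update_origin'):
        # Already patched: re-wrapping would point _update_origin at the wrapper itself.
        return

    def update(self, key, value):
        # Pre-processing step; the real helper moves cached tensors to the right device here.
        self.calls = getattr(self, 'calls', 0) + 1
        return self._update_origin(key, value)

    cls._update_origin = cls.update
    cls.update = update


patch_update_once(DummyCache)
patch_update_once(DummyCache)  # second call is a no-op
cache = DummyCache()
print(cache.update('k', 'v'), cache.calls)
# ('k', 'v') 1
```

Without the guard, a second call would store the wrapper as `_update_origin`, and the wrapper would then call itself until the recursion limit is hit.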

Compatibility

This change is intentionally scoped to Qwen2-Audio. Other text and multimodal
models should continue to use the existing encode path.

Expected supported environments:

transformers>=4.48: preserve the existing Qwen2-Audio baseline.
transformers>=5.0: support newer trl/GRPO environments with the cache
compatibility handling above.

Experiment results

Suggested smoke tests:

Qwen2-Audio SFT encode: a sample with <|AUDIO|> produces valid input_ids,
labels, and loss_scale.
Qwen2-Audio inference: generation is readable under both 4.48-series and newer
transformers versions.
GRPO/RLHF: a recent trl version, for example trl>=0.20, can initialize
Qwen2-Audio and start rollout generation.
Regression: non-Qwen2-Audio models still use the original encode path.
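For the SFT encode item, "valid" can be pinned down as a few structural invariants that follow from the proposed _encode_context_list. The helper below is a hypothetical sketch of such a check (`check_encoded` is not an existing ms-swift function), and later template steps such as truncation may need a looser version.

```python
from typing import List, Optional

IGNORE_INDEX = -100


def check_encoded(input_ids: List[int], labels: List[int], loss_scale: Optional[List[float]]) -> None:
    """Raise AssertionError if the encode output violates basic invariants."""
    assert len(input_ids) > 0, 'empty input_ids'
    assert len(labels) == len(input_ids), 'labels/input_ids length mismatch'
    for token, label in zip(input_ids, labels):
        # Each label is either masked or a copy of the corresponding input token.
        assert label == IGNORE_INDEX or label == token, 'label is neither masked nor the token'
    if loss_scale is not None:
        assert len(loss_scale) == len(input_ids), 'loss_scale length mismatch'


check_encoded([1, 2, 3], [-100, 2, 3], None)
check_encoded([1, 2, 3], [-100, 2, 3], [0.0, 1.0, 1.0])
print('ok')
```

Running the same check on a non-audio model before and after the change would also cover part of the regression item.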

Checklist

The change is limited to Qwen2-Audio compatibility.
No task-specific metric, reward function, or private dataset logic is included.
Qwen2-Audio SFT encode has been checked with an audio sample.
Qwen2-Audio inference has been checked with newer transformers.
A GRPO/RLHF initialization or rollout smoke test has been checked with recent trl.
pre-commit run --all-files has been run before submission.

Notes for reviewers

This PR does not introduce task-specific evaluation, reward, or dataset code. It
only addresses Qwen2-Audio compatibility with the newer dependency stack needed
by RLHF/GRPO.

gemini-code-assist

Code Review

This pull request adds support for Qwen2-Audio models, including patching for transformers v5 compatibility and handling audio tokenization in the base template. The review comments highlight critical issues: a missing transformers import in qwen.py causing a NameError, a global librosa import in base.py that introduces an unwanted hard dependency, and a potential AttributeError in the audio tokenization path for non-audio models.

gemini-code-assist · 2026-05-30T11:32:47Z

+    @staticmethod
+    def _is_transformers5() -> bool:
+        return version.parse(transformers.__version__) >= version.parse('5.0.0')


The transformers module is not imported at the top of swift/model/models/qwen.py. Calling _is_transformers5() will raise a NameError at runtime. Please import transformers locally inside the method or at the module level.

Suggested change

-    @staticmethod
-    def _is_transformers5() -> bool:
-        return version.parse(transformers.__version__) >= version.parse('5.0.0')
+    @staticmethod
+    def _is_transformers5() -> bool:
+        import transformers
+        return version.parse(transformers.__version__) >= version.parse('5.0.0')

gemini-code-assist · 2026-05-30T11:32:48Z

 import hashlib
 import inspect
+import librosa
 import math
 import os
 import random
 import re
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
+import warnings
 from collections import defaultdict
 from contextlib import contextmanager, nullcontext
 from copy import deepcopy


Importing librosa globally at the top of swift/template/base.py makes it a hard dependency for all models in the repository, even if they do not use audio features. This will cause an ImportError for users who do not have librosa installed. Please remove the global import and load librosa lazily where it is actually needed.

Suggested change

 import hashlib
 import inspect
-import librosa
 import math
 import os
 import random
 import re
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
 import warnings
 from collections import defaultdict
 from contextlib import contextmanager, nullcontext
 from copy import deepcopy
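One way to keep librosa optional is to import it only inside the code path that decodes audio. The sketch below assumes a small hypothetical helper, `lazy_import`, that also turns a missing package into a readable error; `load_waveform` is likewise illustrative and simply wraps the same librosa.load call used in the PR.

```python
import importlib
from types import ModuleType


def lazy_import(name: str, hint: str = '') -> ModuleType:
    """Import `name` on first use and raise a readable error if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as e:
        raise ImportError(f'{name!r} is required for this code path. {hint}'.strip()) from e


def load_waveform(path: str, sample_rate: int):
    # librosa is only imported when an audio sample is actually decoded,
    # so text-only users never need it installed.
    librosa = lazy_import('librosa', hint='Install it with `pip install librosa`.')
    wav, _ = librosa.load(path, sr=sample_rate, mono=True)
    return wav
```

Python caches imported modules in sys.modules, so repeating the import on every sample costs little after the first call.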

gemini-code-assist · 2026-05-30T11:32:48Z

+        for context, loss_weight in zip(context_list, loss_scale_list):
+            if isinstance(context, str) and '<|AUDIO|>' in context:
+                if audio_path_list is None or audio_ptr >= len(audio_path_list):
+                    warnings.warn('Found <|AUDIO|> but no matching audio input; fallback to text tokenization',
+                                  RuntimeWarning)
+                    token_list = self._tokenize(context)
+                else:
+                    sample_rate = self.processor.feature_extractor.sampling_rate
+                    wav, _ = librosa.load(audio_path_list[audio_ptr], sr=sample_rate, mono=True)
+                    encoded = self.processor(
+                        text=context,
+                        audio=wav,
+                        sampling_rate=sample_rate,
+                        return_tensors=None,
+                        add_special_tokens=False,
+                    )
+                    token_list = encoded['input_ids']
+                    if len(token_list) > 0 and isinstance(token_list[0], list):
+                        token_list = token_list[0]
+                    audio_ptr += 1


To prevent other models from accidentally triggering the audio processing path (which would raise an AttributeError since non-audio models do not have a feature_extractor on their processor), we should restrict this block to only run when the model type is qwen2_audio. Additionally, we should import librosa locally here to avoid making it a global hard dependency.

        for context, loss_weight in zip(context_list, loss_scale_list):
            if (isinstance(context, str) and '<|AUDIO|>' in context
                    and getattr(self.tokenizer, 'model_meta', None)
                    and getattr(self.tokenizer.model_meta, 'model_type', None) == 'qwen2_audio'):
                if audio_path_list is None or audio_ptr >= len(audio_path_list):
                    warnings.warn('Found <|AUDIO|> but no matching audio input; fallback to text tokenization',
                                  RuntimeWarning)
                    token_list = self._tokenize(context)
                else:
                    import librosa
                    sample_rate = self.processor.feature_extractor.sampling_rate
                    wav, _ = librosa.load(audio_path_list[audio_ptr], sr=sample_rate, mono=True)
                    encoded = self.processor(
                        text=context,
                        audio=wav,
                        sampling_rate=sample_rate,
                        return_tensors=None,
                        add_special_tokens=False,
                    )
                    token_list = encoded['input_ids']
                    if len(token_list) > 0 and isinstance(token_list[0], list):
                        token_list = token_list[0]
                    audio_ptr += 1

MWXGOD · 2026-05-30

Thanks for the review. I updated the PR to:

import transformers locally in _is_transformers5;
remove the global librosa import and import it lazily in the Qwen2-Audio branch;
restrict the audio placeholder branch to qwen2_audio only.

hjh0119 · 2026-05-30T15:29:49Z

This is inconvenient for RLHF workflows such as GRPO because
recent trl versions require newer transformers releases.

To my knowledge, Swift's GRPO should be compatible with Transformers 4. Could you provide the specific error message?

MWXGOD · 2026-05-30T17:06:32Z

Thanks for pointing this out. I agree that my previous wording was too broad. The issue is not that Swift GRPO is generally incompatible with Transformers 4. The concrete problem I encountered is with the Qwen2-Audio + recent TRL GRPO environment. This PR is intended to address two related issues at the same time:

1. Qwen2-Audio inference with newer Transformers

With transformers==5.9.0 and trl==0.29.1, Swift GRPO starts normally, but Qwen2-Audio reports:

[transformers] Expanding inputs for audio tokens in Qwen2Audio should be done in processing.

If this warning is not addressed, Qwen2-Audio inference produces garbled outputs in my tests. This is related to the dependency-version issue mentioned in the Swift FAQ Q19, where the current workaround is to use transformers==4.48 when Qwen2-Audio inference results are garbled:

https://swift.readthedocs.io/en/latest/Instruction/Frequently-asked-questions.html#q19-issues-related-to-specific-model-dependency-versions

This PR tries to fix the newer-Transformers behavior directly by moving the <|AUDIO|> expansion to the processor-side encoding path.

2. transformers==4.48 incompatibility with recent TRL

If I downgrade transformers to 4.48, Qwen2-Audio inference becomes stable, but my GRPO environment with trl==0.29.1 fails before training starts:

AttributeError: type object 'TrainingArguments' has no attribute '_VALID_DICT_FIELDS'

The error comes from trl.experimental.cpo.CPOConfig accessing TrainingArguments._VALID_DICT_FIELDS, which is not available in transformers==4.48.

So the goal of this PR is not to claim that Swift GRPO generally requires Transformers 5. Rather, it is to make Qwen2-Audio usable in a newer Transformers + recent TRL GRPO environment, while also addressing the known Qwen2-Audio garbled-inference issue that currently requires downgrading to transformers==4.48.

Support Qwen2-Audio with newer transformers

e82f113

Address Qwen2-Audio compatibility review comments

4e55dbe

MWXGOD wants to merge 2 commits into modelscope:main from MWXGOD:fix-qwen2-audio-transformers-compat
