[compat] Support Qwen2-Audio with newer transformers #9453
Code Review
This pull request adds support for Qwen2-Audio models, including patching for transformers v5 compatibility and handling audio tokenization in the base template. The review comments highlight critical issues: a missing transformers import in qwen.py causing a NameError, a global librosa import in base.py that introduces an unwanted hard dependency, and a potential AttributeError in the audio tokenization path for non-audio models.
@staticmethod
def _is_transformers5() -> bool:
    return version.parse(transformers.__version__) >= version.parse('5.0.0')
The transformers module is not imported at the top of swift/model/models/qwen.py. Calling _is_transformers5() will raise a NameError at runtime. Please import transformers locally inside the method or at the module level.
- @staticmethod
- def _is_transformers5() -> bool:
-     return version.parse(transformers.__version__) >= version.parse('5.0.0')
+ @staticmethod
+ def _is_transformers5() -> bool:
+     import transformers
+     return version.parse(transformers.__version__) >= version.parse('5.0.0')
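The gate above compares parsed versions rather than raw strings, which matters because string comparison orders '10.0.0' before '5.0.0'. A minimal stdlib-only sketch of the same idea follows. The names `release_tuple` and `is_at_least` are hypothetical helpers, not part of Swift, and unlike `packaging.version.parse` this sketch ignores pre-release suffixes such as `.dev0`.

```python
import re


def release_tuple(version_string: str) -> tuple:
    """Return the leading dotted numeric components of a version string as ints."""
    # Hypothetical helper: '5.0.0.dev0' -> (5, 0, 0); the suffix is dropped.
    match = re.match(r'\d+(?:\.\d+)*', version_string)
    if match is None:
        raise ValueError(f'no numeric version prefix in {version_string!r}')
    return tuple(int(part) for part in match.group(0).split('.'))


def is_at_least(version_string: str, minimum: str) -> bool:
    """Compare two versions numerically, component by component."""
    left = release_tuple(version_string)
    right = release_tuple(minimum)
    # Pad the shorter tuple with zeros so that '5.0' and '5.0.0' compare equal.
    width = max(len(left), len(right))
    left = left + (0,) * (width - len(left))
    right = right + (0,) * (width - len(right))
    return left >= right


if __name__ == '__main__':
    print(is_at_least('4.48.0', '5.0.0'))
    print(is_at_least('10.0.0', '5.0.0'))
```

A numeric comparison of this kind keeps the check correct once a major version reaches two digits, which a plain string comparison would not.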
import hashlib
import inspect
import librosa
import math
import os
import random
import re
import torch
import torch.nn as nn
import torch.nn.functional as F
import warnings
from collections import defaultdict
from contextlib import contextmanager, nullcontext
from copy import deepcopy
Importing librosa globally at the top of swift/template/base.py makes it a hard dependency for all models in the repository, even if they do not use audio features. This will cause an ImportError for users who do not have librosa installed. Please remove the global import and load librosa lazily where it is actually needed.
  import hashlib
  import inspect
- import librosa
  import math
  import os
  import random
  import re
  import torch
  import torch.nn as nn
  import torch.nn.functional as F
  import warnings
  from collections import defaultdict
  from contextlib import contextmanager, nullcontext
  from copy import deepcopy
for context, loss_weight in zip(context_list, loss_scale_list):
    if isinstance(context, str) and '<|AUDIO|>' in context:
        if audio_path_list is None or audio_ptr >= len(audio_path_list):
            warnings.warn('Found <|AUDIO|> but no matching audio input; fallback to text tokenization',
                          RuntimeWarning)
            token_list = self._tokenize(context)
        else:
            sample_rate = self.processor.feature_extractor.sampling_rate
            wav, _ = librosa.load(audio_path_list[audio_ptr], sr=sample_rate, mono=True)
            encoded = self.processor(
                text=context,
                audio=wav,
                sampling_rate=sample_rate,
                return_tensors=None,
                add_special_tokens=False,
            )
            token_list = encoded['input_ids']
            if len(token_list) > 0 and isinstance(token_list[0], list):
                token_list = token_list[0]
            audio_ptr += 1
To prevent other models from accidentally triggering the audio processing path (which would raise an AttributeError since non-audio models do not have a feature_extractor on their processor), we should restrict this block to only run when the model type is qwen2_audio. Additionally, we should import librosa locally here to avoid making it a global hard dependency.
for context, loss_weight in zip(context_list, loss_scale_list):
    if (isinstance(context, str)
            and '<|AUDIO|>' in context
            and getattr(self.tokenizer, 'model_meta', None)
            and getattr(self.tokenizer.model_meta, 'model_type', None) == 'qwen2_audio'):
        if audio_path_list is None or audio_ptr >= len(audio_path_list):
            warnings.warn('Found <|AUDIO|> but no matching audio input; fallback to text tokenization',
                          RuntimeWarning)
            token_list = self._tokenize(context)
        else:
            import librosa
            sample_rate = self.processor.feature_extractor.sampling_rate
            wav, _ = librosa.load(audio_path_list[audio_ptr], sr=sample_rate, mono=True)
            encoded = self.processor(
                text=context,
                audio=wav,
                sampling_rate=sample_rate,
                return_tensors=None,
                add_special_tokens=False,
            )
            token_list = encoded['input_ids']
            if len(token_list) > 0 and isinstance(token_list[0], list):
                token_list = token_list[0]
            audio_ptr += 1
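The guard in this suggestion can be read as a small predicate. The sketch below isolates it so it can be exercised without loading a model. `should_use_audio_path` is a hypothetical name, the `SimpleNamespace` objects are stand-ins for the real tokenizer and processor, and the final `feature_extractor` check is an extra assumption that is not part of the suggested change.

```python
from types import SimpleNamespace

AUDIO_PLACEHOLDER = '<|AUDIO|>'


def should_use_audio_path(context, tokenizer, processor) -> bool:
    """Return True only when the audio branch is safe to take."""
    if not (isinstance(context, str) and AUDIO_PLACEHOLDER in context):
        return False
    model_meta = getattr(tokenizer, 'model_meta', None)
    if getattr(model_meta, 'model_type', None) != 'qwen2_audio':
        return False
    # Extra defensive check (an assumption, not in the suggested change):
    # the audio branch reads processor.feature_extractor.sampling_rate.
    return getattr(processor, 'feature_extractor', None) is not None


if __name__ == '__main__':
    audio_tokenizer = SimpleNamespace(model_meta=SimpleNamespace(model_type='qwen2_audio'))
    audio_processor = SimpleNamespace(feature_extractor=SimpleNamespace(sampling_rate=16000))
    text_tokenizer = SimpleNamespace()  # no model_meta attribute at all
    print(should_use_audio_path('<|AUDIO|> describe this', audio_tokenizer, audio_processor))
    print(should_use_audio_path('<|AUDIO|> describe this', text_tokenizer, audio_processor))
```

Keeping the condition in one place makes it straightforward to unit test the non-audio fallback separately from the processor call.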
Thanks for the review. I updated the PR to:
- import transformers locally in _is_transformers5;
- remove the global librosa import and import it lazily in the Qwen2-Audio branch;
- restrict the audio placeholder branch to qwen2_audio only.
To my knowledge, Swift's GRPO should be compatible with Transformers 4. Could you provide the specific error message?
Thanks for pointing this out. I agree that my previous wording was too broad.
The issue is not that Swift GRPO is generally incompatible with Transformers 4. The concrete problem I encountered is with the Qwen2-Audio + recent TRL GRPO environment.
This PR is intended to address two related issues at the same time:
1. Qwen2-Audio inference with newer Transformers
With transformers==5.9.0 and trl==0.29.1, Swift GRPO starts normally, but Qwen2-Audio reports:
[transformers] Expanding inputs for audio tokens in Qwen2Audio should be done in processing.
If this warning is not addressed, Qwen2-Audio inference produces garbled outputs in my tests. This is related to the dependency-version issue mentioned in the Swift FAQ Q19, where the current workaround is to use transformers==4.48 when Qwen2-Audio inference results are garbled:
https://swift.readthedocs.io/en/latest/Instruction/Frequently-asked-questions.html#q19-issues-related-to-specific-model-dependency-versions
This PR tries to fix the newer-Transformers behavior directly by moving the <|AUDIO|> expansion to the processor-side encoding path.
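To make the term concrete: "expanding" here means replacing each `<|AUDIO|>` placeholder with as many placeholder tokens as the audio clip occupies in the model input. The sketch below shows only that string-level idea. `expand_audio_placeholder` is a hypothetical helper, and the per-clip token counts are passed in by the caller as an assumption, whereas the real processor derives them from the extracted audio features.

```python
def expand_audio_placeholder(text: str, token_counts: list,
                             placeholder: str = '<|AUDIO|>') -> str:
    """Repeat each placeholder occurrence according to its clip's token count."""
    parts = text.split(placeholder)
    if len(parts) - 1 != len(token_counts):
        raise ValueError(
            f'found {len(parts) - 1} placeholders but {len(token_counts)} token counts')
    pieces = [parts[0]]
    for count, tail in zip(token_counts, parts[1:]):
        # One placeholder per audio position, followed by the text that came after it.
        pieces.append(placeholder * count)
        pieces.append(tail)
    return ''.join(pieces)


if __name__ == '__main__':
    print(expand_audio_placeholder('Audio 1: <|AUDIO|> What is said?', [3]))
```

If the expansion is skipped or done twice, the number of placeholder positions no longer matches the audio features, which is one plausible way to end up with garbled generations.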
2. transformers==4.48 incompatibility with recent TRL
If I downgrade transformers to 4.48, Qwen2-Audio inference becomes stable, but my GRPO environment with trl==0.29.1 fails before training starts:
AttributeError: type object 'TrainingArguments' has no attribute '_VALID_DICT_FIELDS'
The error comes from trl.experimental.cpo.CPOConfig accessing TrainingArguments._VALID_DICT_FIELDS, which is not available in transformers==4.48.
So the goal of this PR is not to claim that Swift GRPO generally requires Transformers 5. Rather, it is to make Qwen2-Audio usable in a newer Transformers + recent TRL GRPO environment, while also addressing the known Qwen2-Audio garbled-inference issue that currently requires downgrading to transformers==4.48.
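A mismatch like the `_VALID_DICT_FIELDS` error can be detected before training starts by testing for the attribute directly. In the sketch below, `OldTrainingArguments` and `NewTrainingArguments` are stand-in classes rather than the real `transformers.TrainingArguments`, the field value is a placeholder, and `missing_class_attrs` is a hypothetical helper.

```python
def missing_class_attrs(cls, required):
    """Return the names in `required` that `cls` does not define or inherit."""
    return [name for name in required if not hasattr(cls, name)]


class OldTrainingArguments:
    """Stand-in for a release that predates the attribute."""


class NewTrainingArguments:
    """Stand-in for a release that defines the attribute."""
    _VALID_DICT_FIELDS = ['example_field']  # placeholder value, not the real list


if __name__ == '__main__':
    required = ['_VALID_DICT_FIELDS']
    print(missing_class_attrs(OldTrainingArguments, required))
    print(missing_class_attrs(NewTrainingArguments, required))
```

In a real environment the same helper could be pointed at the installed `TrainingArguments` class to report the incompatibility with a clear message instead of an `AttributeError` deep inside config construction.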
From: jinghanhu
Date: 2026-05-30 23:30
To: modelscope/ms-swift
CC: MWXGOD; Author
Subject: Re: [modelscope/ms-swift] [compat] Support Qwen2-Audio with newer transformers (PR #9453)
hjh0119 left a comment (modelscope/ms-swift#9453)
This is inconvenient for RLHF workflows such as GRPO because
recent trl versions require newer transformers releases.
To my knowledge, Swift's GRPO should be compatible with Transformers 4. Could you provide the specific error message?
Qwen2-Audio Compatibility with Newer Transformers for RLHF/GRPO
PR type
PR information
Motivation
Qwen2-Audio is currently constrained to the older transformers 4.48-era encoding path. This is inconvenient for RLHF workflows such as GRPO because recent trl versions require newer transformers releases.

When Qwen2-Audio is used with a newer transformers stack, the old audio placeholder encoding path may encode <|AUDIO|> incorrectly, which can lead to unstable training or corrupted inference outputs. This PR proposes a small, Qwen2-Audio-only compatibility update so that SFT, inference, and GRPO/RLHF can run in one environment.
Summary of changes
- Encode <|AUDIO|> contexts with the Qwen2-Audio processor instead of treating them as generic text-only contexts.
- Load audio with librosa.load(..., sr=processor.feature_extractor.sampling_rate).
- labels and loss_scale construction after input_ids are returned by the processor.
- Update the Qwen2-Audio requirement from transformers>=4.45,<4.49 to transformers>=4.48,<6.
- Add a transformers>=5.0 cache compatibility patch for generation.
Suggested implementation
Line numbers below refer to the current official snapshot and may shift slightly
after upstream changes.
1. swift/template/base.py
Near the top-level imports, around lines 1-15, add:
Replace _encode_context_list, currently starting around line 1064, with an audio-aware version:
In _encode, around line 1472 after self._simplify_context_list(...), call the audio-aware path only for Qwen2-Audio:
2. swift/model/models/qwen.py
If the target branch does not already import transformers, add it near the top-level imports:
Replace Qwen2AudioLoader, currently starting around line 1815, with:
Update the Qwen2-Audio requirement, currently around line 1835:
Add the cache helper after the Qwen2-Audio registration and before the next model loader class:
Compatibility
This change is intentionally scoped to Qwen2-Audio. Other text and multimodal
models should continue to use the existing encode path.
Expected supported environments:
- transformers>=4.48: preserve the existing Qwen2-Audio baseline.
- transformers>=5.0: support newer trl/GRPO environments with the cache compatibility handling above.
Experiment results
Suggested smoke tests:
- <|AUDIO|> produces valid input_ids, labels, and loss_scale.
- transformers versions.
- trl version, for example trl>=0.20, can initialize Qwen2-Audio and start rollout generation.
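A check for the first smoke test could look like the sketch below. It assumes, without confirmation from the PR, that the encoded sample is a dict whose `input_ids`, `labels`, and optional `loss_scale` are per-token lists of equal length. `check_encoded` is a hypothetical helper, and the sample values are made up for illustration.

```python
def check_encoded(encoded: dict) -> None:
    """Raise ValueError if the encoded sample is structurally inconsistent."""
    input_ids = encoded['input_ids']
    labels = encoded['labels']
    loss_scale = encoded.get('loss_scale')
    if len(input_ids) == 0:
        raise ValueError('input_ids is empty')
    if not all(isinstance(token, int) for token in input_ids):
        raise ValueError('input_ids must be a flat list of ints')
    if len(labels) != len(input_ids):
        raise ValueError(f'labels length {len(labels)} != input_ids length {len(input_ids)}')
    # loss_scale is treated as optional here (an assumption of this sketch).
    if loss_scale is not None and len(loss_scale) != len(input_ids):
        raise ValueError(f'loss_scale length {len(loss_scale)} != input_ids length {len(input_ids)}')


if __name__ == '__main__':
    # Made-up token ids; -100 marks positions that are ignored by the loss.
    sample = {'input_ids': [11, 22, 33], 'labels': [-100, 22, 33], 'loss_scale': [0.0, 1.0, 1.0]}
    check_encoded(sample)
    print('encoded sample is consistent')
```

The flat-list check also catches the nested `[[...]]` shape that the processor can return, which the diff under review unwraps explicitly.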
Checklist
- transformers.
- trl.
- pre-commit run --all-files has been run before submission.
Notes for reviewers
This PR does not introduce task-specific evaluation, reward, or dataset code. It
only addresses Qwen2-Audio compatibility with the newer dependency stack needed
by RLHF/GRPO.