fix(qwen-asr): lazy-load forced_aligner and populate word-level timestamps#10054
fqscfqj wants to merge 4 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR introduces lazy-loading of the forced-aligner model variant to avoid loading timestamp/forced-alignment components unless timestamps are requested.
Changes:
- Persist base model load parameters and forced-aligner config for later on-demand loading.
- Add a cached, thread-safe `_get_ts_model()` to lazy-load a second model instance with forced aligner attached.
- Change timestamp “word” granularity output to sentence-level segments that optionally include per-word timing details (OpenAI `verbose_json`-style).
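The cached, thread-safe lazy load described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `LazyModelHolder`, `fake_loader`, and `load_count` are hypothetical names, and the loader is a plain callable standing in for the real model constructor.

```python
import threading


class LazyModelHolder:
    """Hypothetical stand-in for the servicer; not the PR's actual class."""

    def __init__(self, loader, aligner_name=None):
        self._loader = loader              # callable that builds the heavy model
        self._aligner_name = aligner_name  # None means no aligner configured
        self._ts_model = None              # cached timestamp-capable model
        self._ts_lock = threading.Lock()
        self.load_count = 0                # for illustration only

    def get_ts_model(self):
        if not self._aligner_name:
            return None                    # nothing to load
        if self._ts_model is not None:     # fast path, no lock needed
            return self._ts_model
        with self._ts_lock:
            if self._ts_model is None:     # re-check while holding the lock
                self._ts_model = self._loader(self._aligner_name)
                self.load_count += 1
            return self._ts_model


def fake_loader(name):
    # Assumption: stands in for an expensive model load.
    return {"aligner": name}


holder = LazyModelHolder(fake_loader, aligner_name="demo-aligner")
threads = [threading.Thread(target=holder.get_ts_model) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(holder.load_count)  # the loader ran once despite 8 concurrent callers
```

The second check inside the lock is what prevents two threads that both miss the fast path from loading the model twice.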
| # Select model: with or without forced aligner | ||
| if want_timestamps: | ||
| model = self._get_ts_model() | ||
| has_aligner = self._forced_aligner_name is not None | ||
| else: | ||
| model = self.model | ||
| has_aligner = False |
| if not self._forced_aligner_name: | ||
| return self.model # no aligner configured — fall back silently |
| for ts in time_stamps: | ||
| s, e, t = self._extract_word_info(ts) | ||
|
|
||
| # Detect sentence boundary via time gap | ||
| if prev_end is not None and (s - prev_end) >= threshold and buf_text: | ||
| result.append(backend_pb2.TranscriptSegment( | ||
| id=len(result), | ||
| start=int(buf_start * 1_000_000_000), | ||
| end=int(buf_end * 1_000_000_000), | ||
| text=self._smart_join(buf_text), | ||
| )) | ||
| buf_text = [] | ||
| buf_start = None | ||
|
|
||
| if buf_start is None: | ||
| buf_start = s | ||
| buf_text.append(t) | ||
| buf_end = e | ||
| if prev_end is not None and (s - prev_end) >= threshold and buf: | ||
| sentence_groups.append(buf) | ||
| buf = [] | ||
| buf.append(ts) | ||
| prev_end = e | ||
| if buf: | ||
| sentence_groups.append(buf) | ||
|
|
||
| # flush remaining | ||
| if buf_text and buf_start is not None: | ||
| result.append(backend_pb2.TranscriptSegment( | ||
| result = [] | ||
| for group in sentence_groups: | ||
| words_info = [self._extract_word_info(ts) for ts in group] |
| print(f"Lazy-loading forced_aligner: {self._forced_aligner_name}", file=sys.stderr) | ||
| self._ts_model = Qwen3ASRModel.from_pretrained( | ||
| self.model_path, **load_kwargs | ||
| ) | ||
| print("Forced-aligner model loaded", file=sys.stderr) |
…tamps

Previously, the qwen-asr backend had two issues:

1. The forced_aligner model was loaded eagerly in LoadModel, consuming extra VRAM even when the client never requests timestamps. This is wasteful since the aligner is only needed for timestamp alignment.
2. When 'word' granularity was requested, words were emitted as separate segments rather than populating the 'words' field on sentence-level segments, so the Go server's transcriptResultFromProto never saw any TranscriptWord entries and the OpenAI-format 'words' array was always empty.

Changes:

- LoadModel: do NOT pass forced_aligner to from_pretrained; save the model path and load kwargs for later use.
- _get_ts_model (new): lazy-load a second model instance with the forced_aligner attached, guarded by a threading.Lock. Only loaded on the first timestamp request; subsequent requests reuse it.
- AudioTranscription: read request.timestamp_granularities to determine granularity (word vs segment). Select the appropriate model via _get_ts_model() when timestamps are requested.
- _build_segments: for 'word' granularity, populate TranscriptWord on each sentence-level segment (gap-merged); for 'segment' granularity, return sentence-level segments without word children.

Tested with qwen3-asr-1.7b + Qwen3-ForcedAligner-0.6B on English audio. Both segment and word timestamp granularities produce correct output.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>
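The gap-merging behaviour described for `_build_segments` can be sketched roughly as below. This is an illustration under stated assumptions, not the backend's implementation: `Word`, `Segment`, and `build_segments` are hypothetical names, the threshold is passed in as a fixed value rather than computed from the data, and words are joined with a plain space instead of the backend's `_smart_join`.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Segment:
    id: int
    start: float
    end: float
    text: str
    words: list = field(default_factory=list)


def build_segments(words, threshold, with_words):
    """Group words into sentence-level segments, splitting on long silences."""
    groups, buf, prev_end = [], [], None
    for w in words:
        # A gap of at least `threshold` seconds closes the current group.
        if prev_end is not None and (w.start - prev_end) >= threshold and buf:
            groups.append(buf)
            buf = []
        buf.append(w)
        prev_end = w.end
    if buf:  # flush the last group
        groups.append(buf)
    return [
        Segment(
            id=i,
            start=g[0].start,
            end=g[-1].end,
            text=" ".join(w.text for w in g),      # assumption: plain space join
            words=list(g) if with_words else [],   # word children only on request
        )
        for i, g in enumerate(groups)
    ]


demo_words = [
    Word(0.0, 0.4, "hello"),
    Word(0.5, 0.9, "world"),
    Word(2.0, 2.3, "next"),
]
word_level = build_segments(demo_words, threshold=0.8, with_words=True)
segment_level = build_segments(demo_words, threshold=0.8, with_words=False)
```

Both calls produce the same two sentence-level segments; only the `words` list differs, which mirrors the 'word' versus 'segment' granularity split in the commit message.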
f80b898 to d37e18d
- has_aligner: derive from model identity (model is not self.model) instead of checking _forced_aligner_name, avoiding mismatch when the name is set but aligner fails to load.
- _get_ts_model: log a warning when timestamps are requested but no forced_aligner is configured, making silent fallback explicit.
- _build_segments: extract word info once and reuse, avoiding duplicate _extract_word_info calls. Added _compute_gap_threshold_from_extracted for the pre-extracted path.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>
localai-bot left a comment
Thanks for the fix — the lean-by-default model split and populating seg.words (matching what the Go side's transcriptResultFromProto reads, and OpenAI's verbose_json shape) are both the right calls. A few things to address, one of which is a blocker.
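To make the intended shape concrete, here is a rough sketch of how sentence-level segments carrying word children could be flattened into a `verbose_json`-like payload with both per-segment and top-level `words`. The input dict layout (`start_ns`, `end_ns`, `text`) and the function name `to_verbose_json` are assumptions for illustration; they are not the actual protobuf messages or the Go server's types.

```python
NS_PER_SECOND = 1_000_000_000


def to_verbose_json(segments):
    """Flatten segment dicts into a verbose_json-like payload (illustrative)."""
    out_segments, all_words = [], []
    for seg in segments:
        # Convert each word child from nanoseconds to seconds.
        words = [
            {
                "word": w["text"],
                "start": w["start_ns"] / NS_PER_SECOND,
                "end": w["end_ns"] / NS_PER_SECOND,
            }
            for w in seg.get("words", [])
        ]
        out_segments.append(
            {
                "id": seg["id"],
                "start": seg["start_ns"] / NS_PER_SECOND,
                "end": seg["end_ns"] / NS_PER_SECOND,
                "text": seg["text"],
                "words": words,
            }
        )
        all_words.extend(words)  # top-level list is the concatenation
    return {
        "text": " ".join(s["text"] for s in out_segments),
        "segments": out_segments,
        "words": all_words,
    }


demo_segments = [
    {
        "id": 0,
        "start_ns": 0,
        "end_ns": 1_500_000_000,
        "text": "hello world",
        "words": [
            {"text": "hello", "start_ns": 0, "end_ns": 500_000_000},
            {"text": "world", "start_ns": 750_000_000, "end_ns": 1_500_000_000},
        ],
    }
]
payload = to_verbose_json(demo_segments)
```

If the backend emits each word as its own segment with no word children, the top-level `words` list in a conversion like this stays empty, which is the symptom the PR describes.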
| if self._ts_model is not None: | ||
| return self._ts_model | ||
| if not self._forced_aligner_name: | ||
| if want_timestamps: |
Blocker: NameError here. want_timestamps is a local variable inside AudioTranscription, not a parameter or attribute of self, so it's undefined in _get_ts_model(). This branch is reached exactly when a client requests timestamps on a model with no forced_aligner configured (_get_ts_model() is only called under if want_timestamps:), so instead of the intended warning + graceful fallback to plain text, it raises NameError and the request fails.
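A stripped-down reproduction of the scoping problem, assuming nothing about the real backend beyond what is described above (`DemoServicer` and `transcribe` are hypothetical names):

```python
class DemoServicer:
    """Hypothetical reduction of the servicer to show the scoping issue."""

    def __init__(self):
        self._forced_aligner_name = None  # no aligner configured

    def _get_ts_model(self):
        if not self._forced_aligner_name:
            # `want_timestamps` is local to transcribe(), so this lookup fails.
            if want_timestamps:
                print("WARNING: timestamps requested but no aligner configured")
            return None

    def transcribe(self):
        want_timestamps = True  # local variable, invisible to other methods
        if want_timestamps:
            return self._get_ts_model()


caught = None
try:
    DemoServicer().transcribe()
except NameError as exc:
    caught = str(exc)
print(caught)
```

Python resolves the name in `_get_ts_model`'s own scope, then module globals and builtins; the caller's locals are never consulted, so the lookup raises instead of reaching the warning.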
Since we already know timestamps were requested if we got here, just drop the guard:
if not self._forced_aligner_name:
    print("WARNING: timestamps requested but no forced_aligner configured; "
          "returning plain text without timestamps", file=sys.stderr)
    return self.model
Fixed. Removed the want_timestamps guard entirely — _get_ts_model() now always prints the warning (it's only called from a code path that already checks want_timestamps), and returns None instead of self.model so the caller can distinguish "no aligner configured" from a real model.
Also updated AudioTranscription to handle None:
ts_model = self._get_ts_model()
if ts_model is None:
    model = self.model
    has_aligner = False
    want_timestamps = False
else:
    model = ts_model
    has_aligner = True
| if self._forced_aligner_kwargs: | ||
| load_kwargs["forced_aligner_kwargs"] = self._forced_aligner_kwargs | ||
| print(f"Lazy-loading forced_aligner: {self._forced_aligner_name}", file=sys.stderr) | ||
| self._ts_model = Qwen3ASRModel.from_pretrained( |
The VRAM table in the PR description lists the timestamps-requested path as ~4.7 GB both before and after, but this loads a second full Qwen3ASRModel (base ASR weights + aligner) while self.model (the base copy) is never freed. So once timestamps are requested you hold both — roughly self.model (~3.5 GB) + self._ts_model (~4.7 GB) ≈ 8.2 GB, which is actually worse than the old eager behavior for any workload that does request timestamps. The win is real only for pure-transcription workloads.
That may be an acceptable trade-off, but two questions:
- Does `from_pretrained(forced_aligner=...)` actually reload the full ASR backbone, or can the aligner be attached to the existing `self.model` in place? If it can attach, the duplication disappears.
- If duplication is unavoidable, should `self.model` be dropped once `_ts_model` is loaded (under the lock) to avoid double-holding VRAM?
Either way, the description's table should be corrected.
Good catch on the VRAM table — it was misleading. Fixed:
`from_pretrained(forced_aligner=...)` **does** reload the full backbone, so we now `del self.model` immediately after loading `_ts_model` to avoid holding both copies:
self._ts_model = Qwen3ASRModel.from_pretrained(self.model_path, **load_kwargs)
if self.model is not None:
    del self.model
    self.model = None
This keeps peak VRAM at ~4.7 GB (single copy) instead of ~8.2 GB (double). Updated VRAM table:
| Scenario | Before (old) | After (this PR) |
|---|---|---|
| No timestamps | ~4.7 GB (aligner always loaded) | ~3.5 GB (aligner not loaded) |
| Timestamps requested | ~4.7 GB | ~4.7 GB (lazy-loaded, base freed) |
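The figures in this thread can be cross-checked with a little arithmetic. The sketch below uses only the approximate sizes quoted in the discussion (~3.5 GB base, ~1.2 GB aligner); the strategy names and the `resident_gb` helper are hypothetical, and the numbers are steady-state resident sizes, so any transient overlap while the second copy is still loading is not modelled.

```python
BASE_GB = 3.5      # approximate base ASR model size quoted in the thread
ALIGNER_GB = 1.2   # approximate forced-aligner size quoted in the thread


def resident_gb(strategy, timestamps_requested):
    """Approximate steady-state VRAM for each loading strategy."""
    with_aligner = BASE_GB + ALIGNER_GB
    if strategy == "eager":
        return with_aligner                      # aligner always loaded
    if strategy == "lazy_keep_base":
        # Second full copy loaded while the base copy stays resident.
        return BASE_GB + with_aligner if timestamps_requested else BASE_GB
    if strategy == "lazy_free_base":
        # Base copy released once the aligner-equipped copy exists.
        return with_aligner if timestamps_requested else BASE_GB
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("eager", "lazy_keep_base", "lazy_free_base"):
    row = [round(resident_gb(strategy, ts), 1) for ts in (False, True)]
    print(strategy, row)
```

The `lazy_keep_base` row corresponds to the ~8.2 GB case raised in the review, and `lazy_free_base` matches the "After" column of the updated table.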
| # Always compute sentence-level segments via gap merging. | ||
| # Extract word info once and reuse throughout. | ||
| extracted = [self._extract_word_info(ts) for ts in time_stamps] | ||
| threshold = self._compute_gap_threshold_from_extracted(extracted) |
Now that _build_segments uses _compute_gap_threshold_from_extracted, the original _compute_gap_threshold(time_stamps) has no remaining callers — it's dead code. Either delete it, or collapse to a single implementation that extracts internally:
@staticmethod
def _compute_gap_threshold(time_stamps):
    return BackendServicer._compute_gap_threshold_from_extracted(
        [BackendServicer._extract_word_info(ts) for ts in time_stamps]
    )
Done — removed _compute_gap_threshold(time_stamps) entirely. Updated the docstring of the remaining _compute_gap_threshold_from_extracted to be self-contained.
1. Fix NameError: remove 'want_timestamps' reference from _get_ts_model() (it was a local variable in AudioTranscription, not accessible here). Now returns None when no aligner is configured, caller handles fallback.
2. Fix VRAM duplication: del self.model after _ts_model is loaded so only one full model copy is held in memory at a time.
3. Remove dead _compute_gap_threshold(time_stamps) method; all callers now use _compute_gap_threshold_from_extracted(). Update its docstring.
fbdc67e to 39933e8
Problem
The qwen-asr backend had two issues with timestamp support:
1. Eager forced_aligner loading: The `Qwen3-ForcedAligner-0.6B` model was loaded in `LoadModel` regardless of whether the client ever requests timestamps. This wastes ~1.2 GB VRAM for plain transcription requests.
2. Empty `words` array: When `timestamp_granularities[]=word` was requested, each word was emitted as a separate `TranscriptSegment` instead of populating the `words` field on sentence-level segments. The Go server's `transcriptResultFromProto` reads `s.Words` to populate both per-segment and top-level `words`, but they were always empty.

Changes

LoadModel
- Does not pass `forced_aligner` to `from_pretrained` during initial load
- Saves `model_path` and `_load_kwargs` for later use
- Initializes `_ts_model = None` and `_ts_lock = threading.Lock()`

_get_ts_model() (new)
- Guarded by `threading.Lock` for thread safety
- Returns `self.model` (base model) if no forced_aligner is configured

AudioTranscription
- Reads `request.timestamp_granularities` to determine granularity (`word` vs `segment`)
- Selects the model via `_get_ts_model()` when timestamps are requested
- Passes `return_time_stamps=True` only when a forced_aligner is available

_build_segments
- `word` granularity: gap-merges words into sentence segments and populates `TranscriptWord` on each segment
- `segment` granularity: returns sentence-level segments without word children

Testing
Tested with `qwen3-asr-1.7b` + `Qwen3-ForcedAligner-0.6B` on English audio:

VRAM Impact