fix(qwen-asr): lazy-load forced_aligner and populate word-level timestamps by fqscfqj · Pull Request #10054 · mudler/LocalAI

fqscfqj · 2026-05-29T02:29:28Z

Problem

The qwen-asr backend had two issues with timestamp support:

Eager forced_aligner loading: The Qwen3-ForcedAligner-0.6B model was loaded in LoadModel regardless of whether the client ever requests timestamps. This wastes ~1.2 GB VRAM for plain transcription requests.
Empty words array: When timestamp_granularities[]=word was requested, each word was emitted as a separate TranscriptSegment instead of populating the words field on sentence-level segments. The Go server's transcriptResultFromProto reads s.Words to populate both per-segment and top-level words — but they were always empty.

Changes

LoadModel

Do NOT pass forced_aligner to from_pretrained during initial load
Save model_path and _load_kwargs for later use
Initialize _ts_model = None and _ts_lock = threading.Lock()

`_get_ts_model()` (new)

Lazy-loads a second model instance with the forced_aligner attached
Guarded by threading.Lock for thread safety
Only loaded on the first timestamp request; subsequent requests reuse the cached instance
Returns self.model (base model) if no forced_aligner is configured
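The lazy-loading behavior listed above can be sketched as follows. This is a minimal illustration, not the backend's code: `LazyTimestampModel` and the `loader` callable are hypothetical stand-ins (the PR calls `Qwen3ASRModel.from_pretrained`), and the sketch mirrors the description at this point in the thread, where the getter returns the base model when no aligner is configured.

```python
import threading


class LazyTimestampModel:
    """Holds a base model and lazily builds an aligner-equipped one."""

    def __init__(self, base_model, aligner_name, loader):
        self.model = base_model
        self._forced_aligner_name = aligner_name
        self._loader = loader  # hypothetical stand-in for from_pretrained
        self._ts_model = None
        self._ts_lock = threading.Lock()

    def get_ts_model(self):
        # No aligner configured: fall back to the base model.
        if not self._forced_aligner_name:
            return self.model
        # Fast path: already loaded, no lock needed.
        if self._ts_model is not None:
            return self._ts_model
        with self._ts_lock:
            # Re-check inside the lock so only one thread performs the load.
            if self._ts_model is None:
                self._ts_model = self._loader(self._forced_aligner_name)
            return self._ts_model
```

The second check inside the lock matters: without it, two requests arriving together could both pass the fast-path check and each load a copy.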

AudioTranscription

Reads request.timestamp_granularities to determine granularity (word vs segment)
Selects the appropriate model via _get_ts_model() when timestamps are requested
Passes return_time_stamps=True only when a forced_aligner is available

`_build_segments`

For word granularity: gap-merges words into sentence segments and populates TranscriptWord on each segment
For segment granularity: returns sentence-level segments without word children
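As a rough sketch of the two granularities, the gap-merge can be written against plain dataclasses. The `Word` and `Segment` types, the seconds-based times, and the space-joined text are assumptions for illustration; the backend itself builds `backend_pb2.TranscriptSegment` messages and joins text with its own `_smart_join` helper.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    start: float
    end: float
    text: str


@dataclass
class Segment:
    id: int
    start: float
    end: float
    text: str
    words: list = field(default_factory=list)


def build_segments(words, threshold, granularity="segment"):
    """Group words into segments, splitting where the pause reaches threshold."""
    groups, buf, prev_end = [], [], None
    for w in words:
        # A pause of at least `threshold` seconds closes the current group.
        if prev_end is not None and (w.start - prev_end) >= threshold and buf:
            groups.append(buf)
            buf = []
        buf.append(w)
        prev_end = w.end
    if buf:
        groups.append(buf)  # flush the trailing group

    segments = []
    for group in groups:
        segments.append(Segment(
            id=len(segments),
            start=group[0].start,
            end=group[-1].end,
            text=" ".join(w.text for w in group),
            # Word children are attached only for word granularity.
            words=list(group) if granularity == "word" else [],
        ))
    return segments
```

Both granularities share the same sentence boundaries; the only difference is whether each segment carries its words.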

Testing

Tested with qwen3-asr-1.7b + Qwen3-ForcedAligner-0.6B on English audio:

# Segment timestamps
POST /v1/audio/transcriptions
  timestamp_granularities[]=segment
  response_format=verbose_json

-> segments: [{id:0, start:0.08, end:3.2, text:"The quick brown fox..."}, ...]

# Word timestamps
POST /v1/audio/transcriptions
  timestamp_granularities[]=word
  response_format=verbose_json

-> segments: [{start:0.08, end:3.2, text:"The quick brown fox...",
              words: [{start:0.08, end:0.16, text:"The"}, ...]}]

VRAM Impact

Scenario	Before	After
No timestamps requested	~3.5 GB + 1.2 GB (aligner always loaded)	~3.5 GB (aligner not loaded)
Timestamps requested	~4.7 GB	~4.7 GB (lazy-loaded on first request)

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces lazy-loading of the forced-aligner model variant to avoid loading timestamp/forced-alignment components unless timestamps are requested.

Changes:

Persist base model load parameters and forced-aligner config for later on-demand loading.
Add a cached, thread-safe _get_ts_model() to lazy-load a second model instance with forced aligner attached.
Change timestamp “word” granularity output to sentence-level segments that optionally include per-word timing details (OpenAI verbose_json-style).


+            # Select model: with or without forced aligner
+            if want_timestamps:
+                model = self._get_ts_model()
+                has_aligner = self._forced_aligner_name is not None
+            else:
+                model = self.model
+                has_aligner = False


+        if not self._forced_aligner_name:
+            return self.model  # no aligner configured — fall back silently


        for ts in time_stamps:
            s, e, t = self._extract_word_info(ts)
-
-            # Detect sentence boundary via time gap
-            if prev_end is not None and (s - prev_end) >= threshold and buf_text:
-                result.append(backend_pb2.TranscriptSegment(
-                    id=len(result),
-                    start=int(buf_start * 1_000_000_000),
-                    end=int(buf_end * 1_000_000_000),
-                    text=self._smart_join(buf_text),
-                ))
-                buf_text = []
-                buf_start = None
-
-            if buf_start is None:
-                buf_start = s
-            buf_text.append(t)
-            buf_end = e
+            if prev_end is not None and (s - prev_end) >= threshold and buf:
+                sentence_groups.append(buf)
+                buf = []
+            buf.append(ts)
            prev_end = e
+        if buf:
+            sentence_groups.append(buf)

-        # flush remaining
-        if buf_text and buf_start is not None:
-            result.append(backend_pb2.TranscriptSegment(
+        result = []
+        for group in sentence_groups:
+            words_info = [self._extract_word_info(ts) for ts in group]


+            print(f"Lazy-loading forced_aligner: {self._forced_aligner_name}", file=sys.stderr)
+            self._ts_model = Qwen3ASRModel.from_pretrained(
+                self.model_path, **load_kwargs
+            )
+            print("Forced-aligner model loaded", file=sys.stderr)


…tamps

Previously, the qwen-asr backend had two issues:

1. The forced_aligner model was loaded eagerly in LoadModel, consuming extra VRAM even when the client never requests timestamps. This is wasteful since the aligner is only needed for timestamp alignment.
2. When 'word' granularity was requested, words were emitted as separate segments rather than populating the 'words' field on sentence-level segments, so the Go server's transcriptResultFromProto never saw any TranscriptWord entries and the OpenAI-format 'words' array was always empty.

Changes:

- LoadModel: do NOT pass forced_aligner to from_pretrained; save the model path and load kwargs for later use.
- _get_ts_model (new): lazy-load a second model instance with the forced_aligner attached, guarded by a threading.Lock. Only loaded on the first timestamp request; subsequent requests reuse it.
- AudioTranscription: read request.timestamp_granularities to determine granularity (word vs segment). Select the appropriate model via _get_ts_model() when timestamps are requested.
- _build_segments: for 'word' granularity, populate TranscriptWord on each sentence-level segment (gap-merged); for 'segment' granularity, return sentence-level segments without word children.

Tested with qwen3-asr-1.7b + Qwen3-ForcedAligner-0.6B on English audio. Both segment and word timestamp granularities produce correct output.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>

- has_aligner: derive from model identity (model is not self.model) instead of checking _forced_aligner_name, avoiding mismatch when the name is set but aligner fails to load.
- _get_ts_model: log a warning when timestamps are requested but no forced_aligner is configured, making silent fallback explicit.
- _build_segments: extract word info once and reuse, avoiding duplicate _extract_word_info calls. Added _compute_gap_threshold_from_extracted for the pre-extracted path.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>

localai-bot

Thanks for the fix — the lean-by-default model split and populating seg.words (matching what the Go side's transcriptResultFromProto reads, and OpenAI's verbose_json shape) are both the right calls. A few things to address, one of which is a blocker.

localai-bot · 2026-05-29T10:23:45Z

+        if self._ts_model is not None:
+            return self._ts_model
+        if not self._forced_aligner_name:
+            if want_timestamps:


Blocker: NameError here. want_timestamps is a local variable inside AudioTranscription, not a parameter or attribute of self, so it's undefined in _get_ts_model(). This branch is reached exactly when a client requests timestamps on a model with no forced_aligner configured (_get_ts_model() is only called under if want_timestamps:), so instead of the intended warning + graceful fallback to plain text, it raises NameError and the request fails.

Since we already know timestamps were requested if we got here, just drop the guard:

    if not self._forced_aligner_name:
        print("WARNING: timestamps requested but no forced_aligner configured; "
              "returning plain text without timestamps", file=sys.stderr)
        return self.model

Fixed. Removed the want_timestamps guard entirely — _get_ts_model() now always prints the warning (it's only called from a code path that already checks want_timestamps), and returns None instead of self.model so the caller can distinguish "no aligner configured" from a real model.

Also updated AudioTranscription to handle None:

    ts_model = self._get_ts_model()
    if ts_model is None:
        model = self.model
        has_aligner = False
        want_timestamps = False
    else:
        model = ts_model
        has_aligner = True
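The fallback described in this reply can be isolated as a small pure function, which makes the three outcomes easy to check. This is only a sketch: `select_model` and its tuple return value are hypothetical and do not appear in the PR; the getter is passed as a callable so that it runs only when timestamps are requested, as in `AudioTranscription`.

```python
def select_model(base_model, get_ts_model, want_timestamps):
    """Return (model, has_aligner, want_timestamps) for one request."""
    if want_timestamps:
        ts_model = get_ts_model()  # invoked only when timestamps are requested
        if ts_model is not None:
            return ts_model, True, True
    # Timestamps were not requested, or no aligner is configured:
    # serve plain text from the base model.
    return base_model, False, False
```

Returning `None` from the getter, rather than the base model, is what lets the caller tell "no aligner configured" apart from a usable aligner-equipped model.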

localai-bot · 2026-05-29T10:23:45Z

+            if self._forced_aligner_kwargs:
+                load_kwargs["forced_aligner_kwargs"] = self._forced_aligner_kwargs
+            print(f"Lazy-loading forced_aligner: {self._forced_aligner_name}", file=sys.stderr)
+            self._ts_model = Qwen3ASRModel.from_pretrained(


The VRAM table in the PR description lists the timestamps-requested path as ~4.7 GB both before and after, but this loads a second full Qwen3ASRModel (base ASR weights + aligner) while self.model (the base copy) is never freed. So once timestamps are requested you hold both — roughly self.model (~3.5 GB) + self._ts_model (~4.7 GB) ≈ 8.2 GB, which is actually worse than the old eager behavior for any workload that does request timestamps. The win is real only for pure-transcription workloads.

That may be an acceptable trade-off, but two questions:

Does from_pretrained(forced_aligner=...) actually reload the full ASR backbone, or can the aligner be attached to the existing self.model in place? If it can attach, the duplication disappears.

If duplication is unavoidable, should self.model be dropped once _ts_model is loaded (under the lock) to avoid double-holding VRAM?

Either way, the description's table should be corrected.
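The arithmetic behind the reviewer's estimate, using only the approximate figures quoted in this thread (they are the thread's rough numbers, not measurements):

```python
BASE_GB = 3.5     # base ASR model, figure quoted in the thread
ALIGNER_GB = 1.2  # forced aligner, figure quoted in the thread

# The lazily loaded second instance carries the backbone plus the aligner.
ts_model_gb = BASE_GB + ALIGNER_GB

# If the base copy is never freed, both instances stay resident.
both_resident_gb = BASE_GB + ts_model_gb

print(f"aligner-equipped instance: ~{ts_model_gb:.1f} GB")
print(f"both instances resident:   ~{both_resident_gb:.1f} GB")
```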

Good catch on the VRAM table — it was misleading. Fixed:

from_pretrained(forced_aligner=...) **does** reload the full backbone, so we now `del self.model` immediately after loading `_ts_model` to avoid holding both copies:

    self._ts_model = Qwen3ASRModel.from_pretrained(self.model_path, **load_kwargs)
    if self.model is not None:
        del self.model
        self.model = None

This keeps peak VRAM at ~4.7 GB (single copy) instead of ~8.2 GB (double). Updated VRAM table:

Scenario	Before (old)	After (this PR)
No timestamps	~4.7 GB (aligner always loaded)	~3.5 GB (aligner not loaded)
Timestamps requested	~4.7 GB	~4.7 GB (lazy-loaded, base freed)
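A small sketch of the swap, using a weak reference to confirm the base copy becomes collectable. `FakeModel`, `Holder`, and `model_for` are hypothetical names for illustration. One assumption is labeled in the code: once the base copy is dropped, plain requests are served by the aligner-equipped instance; the thread does not show how the backend handles that case.

```python
import threading
import weakref


class FakeModel:
    """Stand-in for a loaded model; real weights would live on the GPU."""

    def __init__(self, name):
        self.name = name


class Holder:
    def __init__(self):
        self.model = FakeModel("base")
        self._ts_model = None
        self._ts_lock = threading.Lock()

    def load_ts_model(self):
        with self._ts_lock:
            if self._ts_model is None:
                self._ts_model = FakeModel("base+aligner")
                # Drop the only reference to the base copy.
                self.model = None
            return self._ts_model

    def model_for(self, want_timestamps):
        # Assumption: after the base copy is gone, plain requests
        # reuse the aligner-equipped instance.
        if want_timestamps or self.model is None:
            return self.load_ts_model()
        return self.model
```

In this ordering both copies coexist briefly while the second one loads, so the ~4.7 GB figure describes the steady state after the swap. With PyTorch on CUDA, dropping the reference may also leave freed blocks in the caching allocator until `torch.cuda.empty_cache()` is called; whether the backend does so is not shown here.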

localai-bot · 2026-05-29T10:23:45Z

+        # Always compute sentence-level segments via gap merging.
+        # Extract word info once and reuse throughout.
+        extracted = [self._extract_word_info(ts) for ts in time_stamps]
+        threshold = self._compute_gap_threshold_from_extracted(extracted)


Now that _build_segments uses _compute_gap_threshold_from_extracted, the original _compute_gap_threshold(time_stamps) has no remaining callers — it's dead code. Either delete it, or collapse to a single implementation that extracts internally:

    @staticmethod
    def _compute_gap_threshold(time_stamps):
        return BackendServicer._compute_gap_threshold_from_extracted(
            [BackendServicer._extract_word_info(ts) for ts in time_stamps]
        )

Done — removed _compute_gap_threshold(time_stamps) entirely. Updated the docstring of the remaining _compute_gap_threshold_from_extracted to be self-contained.

1. Fix NameError: remove 'want_timestamps' reference from _get_ts_model() (it was a local variable in AudioTranscription, not accessible here). Now returns None when no aligner is configured, caller handles fallback.
2. Fix VRAM duplication: del self.model after _ts_model is loaded so only one full model copy is held in memory at a time.
3. Remove dead _compute_gap_threshold(time_stamps) method; all callers now use _compute_gap_threshold_from_extracted(). Update its docstring.

Copilot AI review requested due to automatic review settings May 29, 2026 02:29

Copilot AI reviewed May 29, 2026


fqscfqj force-pushed the fix/qwen-asr-timestamps branch from f80b898 to d37e18d Compare May 29, 2026 02:31

localai-bot reviewed May 29, 2026


fqscfqj force-pushed the fix/qwen-asr-timestamps branch from fbdc67e to 39933e8 Compare May 30, 2026 01:59

Merge branch 'master' into fix/qwen-asr-timestamps

c404785

fqscfqj wants to merge 4 commits into mudler:master from fqscfqj:fix/qwen-asr-timestamps
