
fix(qwen-asr): lazy-load forced_aligner and populate word-level timestamps #10054

Open
fqscfqj wants to merge 4 commits into mudler:master from fqscfqj:fix/qwen-asr-timestamps

Contributor

@fqscfqj fqscfqj commented May 29, 2026

Problem

The qwen-asr backend had two issues with timestamp support:

  1. Eager forced_aligner loading: The Qwen3-ForcedAligner-0.6B model was loaded in LoadModel regardless of whether the client ever requests timestamps. This wastes ~1.2 GB VRAM for plain transcription requests.

  2. Empty words array: When timestamp_granularities[]=word was requested, each word was emitted as a separate TranscriptSegment instead of populating the words field on sentence-level segments. The Go server's transcriptResultFromProto reads s.Words to populate both per-segment and top-level words — but they were always empty.
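
To make the second issue concrete, here is a minimal Python sketch of the aggregation described above. It is not the Go server's actual transcriptResultFromProto; the Word and Segment dataclasses and the collect_words helper are hypothetical stand-ins for the protobuf messages and the server-side logic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:  # hypothetical stand-in for a TranscriptWord message
    start: float
    end: float
    text: str


@dataclass
class Segment:  # hypothetical stand-in for a TranscriptSegment message
    start: float
    end: float
    text: str
    words: List[Word] = field(default_factory=list)


def collect_words(segments: List[Segment]) -> List[Word]:
    """Flatten per-segment words into one top-level list, in order."""
    return [w for seg in segments for w in seg.words]


# Old behaviour: one segment per word, `words` never filled in.
old_style = [Segment(0.08, 0.16, "The"), Segment(0.16, 0.40, "quick")]

# New behaviour: one sentence-level segment carrying its words.
new_style = [
    Segment(0.08, 0.40, "The quick",
            words=[Word(0.08, 0.16, "The"), Word(0.16, 0.40, "quick")])
]

print(len(collect_words(old_style)))  # 0
print(len(collect_words(new_style)))  # 2
```

With the old output shape, a reader that only looks at each segment's words finds nothing to collect, which matches the empty array clients saw.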

Changes

LoadModel

  • Do NOT pass forced_aligner to from_pretrained during initial load
  • Save model_path and _load_kwargs for later use
  • Initialize _ts_model = None and _ts_lock = threading.Lock()

_get_ts_model() (new)

  • Lazy-loads a second model instance with the forced_aligner attached
  • Guarded by threading.Lock for thread safety
  • Only loaded on the first timestamp request; subsequent requests reuse the cached instance
  • Returns self.model (base model) if no forced_aligner is configured
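
The lazy-load pattern in these bullets can be sketched roughly as follows. This is an illustration under assumptions, not the backend's code: LazyAlignerHolder and the loader callable are hypothetical, with loader standing in for the real from_pretrained call.

```python
import threading
from typing import Callable, Optional


class LazyAlignerHolder:
    """Hypothetical holder mirroring the lazy-load pattern described above."""

    def __init__(self, base_model: object, aligner_name: Optional[str],
                 loader: Callable[[str], object]):
        self.model = base_model
        self._forced_aligner_name = aligner_name
        self._loader = loader            # stand-in for from_pretrained
        self._ts_model = None
        self._ts_lock = threading.Lock()

    def get_ts_model(self) -> object:
        if not self._forced_aligner_name:
            return self.model            # no aligner configured
        with self._ts_lock:
            if self._ts_model is None:   # only the first caller loads
                self._ts_model = self._loader(self._forced_aligner_name)
            return self._ts_model


calls = []


def fake_loader(name: str) -> str:
    calls.append(name)
    return f"model+{name}"


holder = LazyAlignerHolder("base", "aligner", fake_loader)
threads = [threading.Thread(target=holder.get_ts_model) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(calls))  # 1
```

Because the check and the load both happen while the lock is held, eight concurrent callers trigger a single load and the rest reuse the cached instance.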

AudioTranscription

  • Reads request.timestamp_granularities to determine granularity (word vs segment)
  • Selects the appropriate model via _get_ts_model() when timestamps are requested
  • Passes return_time_stamps=True only when a forced_aligner is available
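
A rough sketch of the request handling described in these bullets. The helper names resolve_granularity and transcribe_kwargs are hypothetical, and letting "word" win when both values are present is an assumption, not something the PR states.

```python
from typing import Iterable, Optional


def resolve_granularity(requested: Iterable[str]) -> Optional[str]:
    """Return "word", "segment", or None when no timestamps are wanted."""
    values = {g.strip().lower() for g in requested}
    if "word" in values:        # assumption: word takes precedence
        return "word"
    if "segment" in values:
        return "segment"
    return None


def transcribe_kwargs(requested: Iterable[str], has_aligner: bool) -> dict:
    """Build keyword arguments for the transcription call."""
    want_timestamps = resolve_granularity(requested) is not None
    # Timestamps are only requested from the model when an aligner exists.
    return {"return_time_stamps": want_timestamps and has_aligner}


print(transcribe_kwargs(["word"], has_aligner=True))
print(transcribe_kwargs(["segment"], has_aligner=False))
print(transcribe_kwargs([], has_aligner=True))
```

Only the first call enables return_time_stamps; the other two fall back to plain transcription.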

_build_segments

  • For word granularity: gap-merges words into sentence segments and populates TranscriptWord on each segment
  • For segment granularity: returns sentence-level segments without word children
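
The gap-merge step can be sketched as below. Several parts are assumptions made for illustration: the dataclasses stand in for the backend_pb2 messages, the threshold is passed in rather than computed, words are joined with a plain space instead of the backend's _smart_join, and seconds are converted to nanoseconds with round().

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NS_PER_SECOND = 1_000_000_000


@dataclass
class TranscriptWord:  # hypothetical stand-in for backend_pb2.TranscriptWord
    start: int
    end: int
    text: str


@dataclass
class TranscriptSegment:  # hypothetical stand-in for backend_pb2.TranscriptSegment
    id: int
    start: int
    end: int
    text: str
    words: List[TranscriptWord] = field(default_factory=list)


def to_ns(seconds: float) -> int:
    return round(seconds * NS_PER_SECOND)


def build_segments(words: List[Tuple[float, float, str]], threshold: float,
                   granularity: str) -> List[TranscriptSegment]:
    # Group consecutive words; a silence of at least `threshold` seconds
    # between two words starts a new sentence-level group.
    groups, buf, prev_end = [], [], None
    for start, end, text in words:
        if prev_end is not None and (start - prev_end) >= threshold and buf:
            groups.append(buf)
            buf = []
        buf.append((start, end, text))
        prev_end = end
    if buf:
        groups.append(buf)

    segments = []
    for group in groups:
        seg = TranscriptSegment(
            id=len(segments),
            start=to_ns(group[0][0]),
            end=to_ns(group[-1][1]),
            text=" ".join(t for _, _, t in group),
        )
        if granularity == "word":  # word children only for word granularity
            seg.words = [TranscriptWord(to_ns(s), to_ns(e), t)
                         for s, e, t in group]
        segments.append(seg)
    return segments


sample = [(0.0, 0.5, "Hello"), (0.5, 1.0, "world"), (2.5, 3.0, "Bye")]
segs = build_segments(sample, threshold=1.0, granularity="word")
print([s.text for s in segs])  # ['Hello world', 'Bye']
```

Calling the same function with granularity="segment" yields the same two segments with empty words lists.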

Testing

Tested with qwen3-asr-1.7b + Qwen3-ForcedAligner-0.6B on English audio:

# Segment timestamps
POST /v1/audio/transcriptions
  timestamp_granularities[]=segment
  response_format=verbose_json

-> segments: [{id:0, start:0.08, end:3.2, text:"The quick brown fox..."}, ...]

# Word timestamps
POST /v1/audio/transcriptions
  timestamp_granularities[]=word
  response_format=verbose_json

-> segments: [{start:0.08, end:3.2, text:"The quick brown fox...",
              words: [{start:0.08, end:0.16, text:"The"}, ...]}]

VRAM Impact

| Scenario | Before | After |
|---|---|---|
| No timestamps requested | ~3.5 GB + 1.2 GB (aligner always loaded) | ~3.5 GB (aligner not loaded) |
| Timestamps requested | ~4.7 GB | ~4.7 GB (lazy-loaded on first request) |

Copilot AI review requested due to automatic review settings May 29, 2026 02:29

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces lazy-loading of the forced-aligner model variant to avoid loading timestamp/forced-alignment components unless timestamps are requested.

Changes:

  • Persist base model load parameters and forced-aligner config for later on-demand loading.
  • Add a cached, thread-safe _get_ts_model() to lazy-load a second model instance with forced aligner attached.
  • Change timestamp “word” granularity output to sentence-level segments that optionally include per-word timing details (OpenAI verbose_json-style).

Comment on lines +335 to +341
# Select model: with or without forced aligner
if want_timestamps:
    model = self._get_ts_model()
    has_aligner = self._forced_aligner_name is not None
else:
    model = self.model
    has_aligner = False
Comment thread backend/python/qwen-asr/backend.py Outdated
Comment on lines +152 to +153
if not self._forced_aligner_name:
    return self.model  # no aligner configured — fall back silently
Comment thread backend/python/qwen-asr/backend.py Outdated
Comment on lines +276 to +288
  for ts in time_stamps:
      s, e, t = self._extract_word_info(ts)

-     # Detect sentence boundary via time gap
-     if prev_end is not None and (s - prev_end) >= threshold and buf_text:
-         result.append(backend_pb2.TranscriptSegment(
-             id=len(result),
-             start=int(buf_start * 1_000_000_000),
-             end=int(buf_end * 1_000_000_000),
-             text=self._smart_join(buf_text),
-         ))
-         buf_text = []
-         buf_start = None
-
-     if buf_start is None:
-         buf_start = s
-     buf_text.append(t)
-     buf_end = e
+     if prev_end is not None and (s - prev_end) >= threshold and buf:
+         sentence_groups.append(buf)
+         buf = []
+     buf.append(ts)
      prev_end = e
+ if buf:
+     sentence_groups.append(buf)

- # flush remaining
- if buf_text and buf_start is not None:
-     result.append(backend_pb2.TranscriptSegment(
+ result = []
+ for group in sentence_groups:
+     words_info = [self._extract_word_info(ts) for ts in group]
Comment on lines +161 to +165
print(f"Lazy-loading forced_aligner: {self._forced_aligner_name}", file=sys.stderr)
self._ts_model = Qwen3ASRModel.from_pretrained(
    self.model_path, **load_kwargs
)
print("Forced-aligner model loaded", file=sys.stderr)
…tamps

Previously, the qwen-asr backend had two issues:

1. The forced_aligner model was loaded eagerly in LoadModel, consuming
   extra VRAM even when the client never requests timestamps.  This is
   wasteful since the aligner is only needed for timestamp alignment.

2. When 'word' granularity was requested, words were emitted as separate
   segments rather than populating the 'words' field on sentence-level
   segments — so the Go server's transcriptResultFromProto never saw
   any TranscriptWord entries and the OpenAI-format 'words' array was
   always empty.

Changes:
- LoadModel: do NOT pass forced_aligner to from_pretrained; save the
  model path and load kwargs for later use.
- _get_ts_model (new): lazy-load a second model instance with the
  forced_aligner attached, guarded by a threading.Lock.  Only loaded
  on the first timestamp request; subsequent requests reuse it.
- AudioTranscription: read request.timestamp_granularities to determine
  granularity (word vs segment).  Select the appropriate model via
  _get_ts_model() when timestamps are requested.
- _build_segments: for 'word' granularity, populate TranscriptWord on
  each sentence-level segment (gap-merged); for 'segment' granularity,
  return sentence-level segments without word children.

Tested with qwen3-asr-1.7b + Qwen3-ForcedAligner-0.6B on English audio.
Both segment and word timestamp granularities produce correct output.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>
@fqscfqj fqscfqj force-pushed the fix/qwen-asr-timestamps branch from f80b898 to d37e18d Compare May 29, 2026 02:31
- has_aligner: derive from model identity (model is not self.model)
  instead of checking _forced_aligner_name, avoiding mismatch when
  the name is set but aligner fails to load.
- _get_ts_model: log a warning when timestamps are requested but
  no forced_aligner is configured, making silent fallback explicit.
- _build_segments: extract word info once and reuse, avoiding
  duplicate _extract_word_info calls. Added
  _compute_gap_threshold_from_extracted for the pre-extracted path.

Signed-off-by: fqscfqj <fqscfqj@outlook.com>
Collaborator

@localai-bot localai-bot left a comment

Thanks for the fix — the lean-by-default model split and populating seg.words (matching what the Go side's transcriptResultFromProto reads, and OpenAI's verbose_json shape) are both the right calls. A few things to address, one of which is a blocker.

Comment thread backend/python/qwen-asr/backend.py Outdated
if self._ts_model is not None:
    return self._ts_model
if not self._forced_aligner_name:
    if want_timestamps:
Collaborator

Blocker: NameError here. want_timestamps is a local variable inside AudioTranscription, not a parameter or attribute of self, so it's undefined in _get_ts_model(). This branch is reached exactly when a client requests timestamps on a model with no forced_aligner configured (_get_ts_model() is only called under if want_timestamps:), so instead of the intended warning + graceful fallback to plain text, it raises NameError and the request fails.

Since we already know timestamps were requested if we got here, just drop the guard:

if not self._forced_aligner_name:
    print("WARNING: timestamps requested but no forced_aligner configured; "
          "returning plain text without timestamps", file=sys.stderr)
    return self.model
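
For anyone unfamiliar with this failure mode, a minimal standalone reproduction follows. The Servicer class and its method bodies are hypothetical and only mirror the shape of the problem; they are not the backend's code.

```python
class Servicer:
    def transcribe(self):
        want_timestamps = True  # local to this method only
        if want_timestamps:
            return self.get_ts_model()

    def get_ts_model(self):
        # BUG: `want_timestamps` is neither a parameter, an attribute,
        # nor a global, so this lookup fails when the line executes.
        if want_timestamps:
            print("WARNING: no forced_aligner configured")
        return None


try:
    Servicer().transcribe()
    error = None
except NameError as exc:
    error = str(exc)

print(error)
```

Python resolves names when a line runs, not when the function is defined, so the mistake only surfaces on requests that actually reach this branch.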

Contributor Author

Fixed. Removed the want_timestamps guard entirely — _get_ts_model() now always prints the warning (it's only called from a code path that already checks want_timestamps), and returns None instead of self.model so the caller can distinguish "no aligner configured" from a real model.

Also updated AudioTranscription to handle None:

ts_model = self._get_ts_model()
if ts_model is None:
    model = self.model
    has_aligner = False
    want_timestamps = False
else:
    model = ts_model
    has_aligner = True

if self._forced_aligner_kwargs:
    load_kwargs["forced_aligner_kwargs"] = self._forced_aligner_kwargs
print(f"Lazy-loading forced_aligner: {self._forced_aligner_name}", file=sys.stderr)
self._ts_model = Qwen3ASRModel.from_pretrained(
Collaborator

The VRAM table in the PR description lists the timestamps-requested path as ~4.7 GB both before and after, but this loads a second full Qwen3ASRModel (base ASR weights + aligner) while self.model (the base copy) is never freed. So once timestamps are requested you hold both — roughly self.model (~3.5 GB) + self._ts_model (~4.7 GB) ≈ 8.2 GB, which is actually worse than the old eager behavior for any workload that does request timestamps. The win is real only for pure-transcription workloads.

That may be an acceptable trade-off, but two questions:

  • Does from_pretrained(forced_aligner=...) actually reload the full ASR backbone, or can the aligner be attached to the existing self.model in place? If it can attach, the duplication disappears.
  • If duplication is unavoidable, should self.model be dropped once _ts_model is loaded (under the lock) to avoid double-holding VRAM?

Either way, the description's table should be corrected.

Contributor Author

Good catch on the VRAM table — it was misleading. Fixed:

from_pretrained(forced_aligner=...) does reload the full backbone, so we now del self.model immediately after loading _ts_model to avoid holding both copies:

self._ts_model = Qwen3ASRModel.from_pretrained(self.model_path, **load_kwargs)
if self.model is not None:
    del self.model
    self.model = None

This keeps peak VRAM at ~4.7 GB (single copy) instead of ~8.2 GB (double). Updated VRAM table:

| Scenario | Before (old) | After (this PR) |
|---|---|---|
| No timestamps | ~4.7 GB (aligner always loaded) | ~3.5 GB (aligner not loaded) |
| Timestamps requested | ~4.7 GB | ~4.7 GB (lazy-loaded, base freed) |
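
Once self.model is set to None, later plain-transcription requests need some model to run on. The thread does not show how that is handled, so the following is only one possible approach, sketched under assumptions: ModelSlots, pick, and the loader callable are hypothetical names, with loader standing in for from_pretrained.

```python
import threading
from typing import Callable, Optional


class ModelSlots:
    """Hypothetical holder: only one full model copy stays referenced."""

    def __init__(self, base: object, loader: Callable[[], object]):
        self.model: Optional[object] = base
        self._ts_model: Optional[object] = None
        self._loader = loader              # stand-in for from_pretrained
        self._lock = threading.Lock()

    def get_ts_model(self) -> object:
        with self._lock:
            if self._ts_model is None:
                self._ts_model = self._loader()
                self.model = None          # drop the base reference
            return self._ts_model

    def pick(self, want_timestamps: bool) -> object:
        if want_timestamps:
            return self.get_ts_model()
        with self._lock:
            # After the swap, plain requests reuse the aligner-equipped copy.
            return self.model if self.model is not None else self._ts_model


slots = ModelSlots("base", lambda: "base+aligner")
print(slots.pick(False))  # base
print(slots.pick(True))   # base+aligner
print(slots.pick(False))  # base+aligner
```

Reading both slots under the same lock avoids a plain request observing the moment between loading the new model and dropping the old one.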

# Always compute sentence-level segments via gap merging.
# Extract word info once and reuse throughout.
extracted = [self._extract_word_info(ts) for ts in time_stamps]
threshold = self._compute_gap_threshold_from_extracted(extracted)
Collaborator

Now that _build_segments uses _compute_gap_threshold_from_extracted, the original _compute_gap_threshold(time_stamps) has no remaining callers — it's dead code. Either delete it, or collapse to a single implementation that extracts internally:

@staticmethod
def _compute_gap_threshold(time_stamps):
    return BackendServicer._compute_gap_threshold_from_extracted(
        [BackendServicer._extract_word_info(ts) for ts in time_stamps]
    )

Contributor Author

Done — removed _compute_gap_threshold(time_stamps) entirely. Updated the docstring of the remaining _compute_gap_threshold_from_extracted to be self-contained.

1. Fix NameError: remove 'want_timestamps' reference from _get_ts_model()
   (it was a local variable in AudioTranscription, not accessible here).
   Now returns None when no aligner is configured, caller handles fallback.

2. Fix VRAM duplication: del self.model after _ts_model is loaded so only
   one full model copy is held in memory at a time.

3. Remove dead _compute_gap_threshold(time_stamps) method — all callers
   now use _compute_gap_threshold_from_extracted(). Update its docstring.
@fqscfqj fqscfqj force-pushed the fix/qwen-asr-timestamps branch from fbdc67e to 39933e8 Compare May 30, 2026 01:59