
feat: add Qwen3-ForcedAligner-0.6B CoreML conversion and inference#21

Open
Alex-Wengg wants to merge 3 commits into main from feat/qwen3-forced-aligner-coreml

Conversation

@Alex-Wengg
Contributor

Summary

  • Add a CoreML conversion pipeline and inference script for Qwen3-ForcedAligner-0.6B, a non-autoregressive forced-alignment model that produces per-word timestamps from audio + text
  • Split the audio encoder into a conv frontend + transformer to preserve cross-chunk bidirectional attention (4.4 ms AAS vs 20.7 ms for the monolithic per-chunk approach, a ~5× accuracy improvement)
  • Include conversion scripts, an end-to-end inference pipeline, PyTorch comparison tooling, and the Python environment (pyproject.toml + uv.lock)

Files

| File | Purpose |
| --- | --- |
| convert-coreml.py | CLI to export all 5 CoreML components from HuggingFace weights |
| individual_components.py | PyTorch wrapper modules for tracing (conv, transformer, decoder, etc.) |
| run_coreml_inference.py | End-to-end CoreML inference + parity comparison against PyTorch |
| compare-models.py | Generate PyTorch reference timestamps from LibriSpeech |
| pyproject.toml / uv.lock | Python dependencies |
| README.md | Architecture docs, I/O shapes, inference pipeline |
| problems_encountered.md | Conversion journal with all issues and fixes |

CoreML Components (5 models)

| Component | Input | Output | Precision |
| --- | --- | --- | --- |
| Audio Conv | [1, 128, 100] mel | [1, 13, 1024] conv features | FP16 |
| Audio Transformer | [1, 256, 1024] features | [1, 256, 1024] embeddings | FP32 |
| Token Embedding | [1, seq] int32 token ids | [1, seq, 1024] | FP16 |
| Decoder Prefill | [1, 1024, 1024] + RoPE | [1, 1024, 1024] | FP32 |
| LM Head | [1, seq, 1024] | [1, seq, 5000] timestamps | FP32 |
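To make the encoder-side data flow concrete, here is a shape-only sketch of how per-chunk conv features are assembled into the transformer's fixed-length sequence. `conv_frontend` is a zero-returning stand-in for the Audio Conv CoreML model (the real call would go through coremltools `predict`), and the pad-to-256 step is an assumption about how shorter inputs are handled:

```python
import numpy as np

CHUNK_FRAMES = 100   # mel frames per Audio Conv chunk
CONV_TOKENS = 13     # conv features produced per chunk
SEQ_LEN = 256        # Audio Transformer sequence length
HIDDEN = 1024

def conv_frontend(mel_chunk: np.ndarray) -> np.ndarray:
    """Placeholder for the Audio Conv model: [1, 128, 100] -> [1, 13, 1024]."""
    assert mel_chunk.shape == (1, 128, CHUNK_FRAMES)
    return np.zeros((1, CONV_TOKENS, HIDDEN), dtype=np.float32)

def encode(mel: np.ndarray) -> np.ndarray:
    """Chunk the mel spectrogram, run the conv frontend per chunk,
    then assemble one [1, 256, 1024] sequence for the bidirectional transformer."""
    _, _, total = mel.shape
    feats = []
    for start in range(0, total, CHUNK_FRAMES):
        chunk = mel[:, :, start:start + CHUNK_FRAMES]
        pad = CHUNK_FRAMES - chunk.shape[2]
        if pad:  # zero-pad the final short chunk
            chunk = np.pad(chunk, ((0, 0), (0, 0), (0, pad)))
        feats.append(conv_frontend(chunk))
    x = np.concatenate(feats, axis=1)        # [1, n_chunks * 13, 1024]
    if x.shape[1] < SEQ_LEN:                 # pad or trim to 256 positions
        x = np.pad(x, ((0, 0), (0, SEQ_LEN - x.shape[1]), (0, 0)))
    return x[:, :SEQ_LEN, :]

mel = np.random.rand(1, 128, 730).astype(np.float32)  # ~7.3 s of mel frames
print(encode(mel).shape)  # (1, 256, 1024)
```

Because the full 256-position sequence passes through the transformer in one shot, attention can look across chunk boundaries, which is exactly what the monolithic per-chunk export loses.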

Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 word boundaries)

| Metric | Value |
| --- | --- |
| AAS (mean boundary error) | 4.4 ms |
| Max boundary error | 160 ms |
| % within 20 ms | 95.4% |
| % within 80 ms (1 segment) | 99.1% |
| % within 160 ms (2 segments) | 100.0% |
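The metrics in this table reduce to elementwise absolute differences between two arrays of boundary timestamps. A minimal sketch (the metric names mirror the table; the function itself is illustrative, not the script's actual code):

```python
import numpy as np

def boundary_metrics(pred_ms, ref_ms):
    """Per-word boundary errors between CoreML and PyTorch timestamps (ms)."""
    err = np.abs(np.asarray(pred_ms, dtype=float) - np.asarray(ref_ms, dtype=float))
    return {
        "aas_ms": float(err.mean()),   # mean absolute boundary error
        "max_ms": float(err.max()),
        "pct_within_20ms": float((err <= 20).mean() * 100),
        "pct_within_80ms": float((err <= 80).mean() * 100),
        "pct_within_160ms": float((err <= 160).mean() * 100),
    }

# three hypothetical word boundaries, in milliseconds
print(boundary_metrics([100, 210, 305], [110, 200, 305]))
```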

Test plan

  • `cd models/stt/qwen3-forced-aligner-0.6b/coreml && uv sync`
  • `uv run python convert-coreml.py` — convert all 5 components
  • `uv run python compare-models.py --num-files 3` — generate PyTorch reference timestamps
  • `uv run python run_coreml_inference.py compare` — verify CoreML vs PyTorch parity
  • Verify AAS < 5 ms and > 95% of boundaries within 20 ms

🤖 Generated with Claude Code

Alex-Wengg and others added 3 commits February 15, 2026 04:26
Add CoreML conversion pipeline for Qwen3-ForcedAligner-0.6B, a non-autoregressive
forced alignment model that produces per-word timestamps from audio + text.

The pipeline splits the model into 5 CoreML components:
- Audio conv frontend (per-chunk mel → conv features)
- Audio transformer (cross-chunk bidirectional attention + projection)
- Token embedding (vocab → hidden states)
- Decoder prefill (28-layer Qwen3 decoder, single NAR pass)
- LM head (hidden states → 5000 timestamp bins)

Key design decisions:
- Audio encoder split into conv + transformer to preserve cross-chunk
  attention (monolithic per-chunk approach had 20.7ms AAS vs 4.4ms split)
- MRoPE cos/sin computed outside the model for flexibility
- Last mel chunk trimmed after conv to remove padding artifacts
- Decoder and LM head use FLOAT32 precision to avoid FP16 overflow
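Precomputing the rotary cos/sin tables on the host and feeding them in as model inputs might look like the sketch below. The `head_dim=128` and `theta=1e6` values are assumptions about the Qwen3-0.6B config, not values taken from this PR, and MRoPE additionally splits head dimensions across multiple position-id streams; this shows only the plain 1-D RoPE table:

```python
import numpy as np

def rope_cos_sin(seq_len: int, head_dim: int, theta: float = 1_000_000.0):
    """Precompute RoPE cos/sin tables of shape [seq_len, head_dim]."""
    # one inverse frequency per rotated pair of dimensions
    inv_freq = 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(seq_len), inv_freq)     # [seq, head_dim/2]
    angles = np.concatenate([angles, angles], axis=-1)  # [seq, head_dim]
    return np.cos(angles), np.sin(angles)

cos, sin = rope_cos_sin(1024, 128)
print(cos.shape, sin.shape)  # (1024, 128) (1024, 128)
```

Keeping this outside the traced graph means position handling can change (e.g. clamping, multimodal sections) without re-exporting the decoder.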

Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 boundaries):
- AAS: 4.4ms, within 20ms: 95.4%, within 80ms: 99.1%

Co-Authored-By: Claude <noreply@anthropic.com>

The inference script supports two audio encoder paths with auto-detection.
Split encoder (audio_conv + audio_transformer) preserves cross-chunk attention
for 4.4ms AAS. Monolithic encoder (audio_encoder) is faster but lacks
cross-chunk attention (20.7ms AAS). Added comparison table and updated
architecture, I/O shapes, inference pipeline, conversion, and parity sections.

Document 5 bugs encountered during FluidAudio Swift integration:
MLMultiArray stride issues, encoder 3D shape, Slaney vs HTK mel,
STFT center padding, and MRoPE position clamping.
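The Slaney vs HTK mel mismatch is a classic one: the two scales agree at 0 Hz but diverge elsewhere, so a filterbank built with one convention shifts every mel bin relative to the other. A quick illustration using pure-Python versions of the two standard formulas (Slaney is linear below 1 kHz and logarithmic above, as in librosa's default; HTK uses the 2595·log10 form):

```python
import math

def hz_to_mel_htk(f: float) -> float:
    """HTK mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_mel_slaney(f: float) -> float:
    """Slaney mel scale: linear below 1 kHz, logarithmic above."""
    f_sp = 200.0 / 3.0                      # ~66.67 Hz per mel in the linear region
    if f < 1000.0:
        return f / f_sp
    return 1000.0 / f_sp + math.log(f / 1000.0) / (math.log(6.4) / 27.0)

for f in (500.0, 1000.0, 4000.0):
    print(f, hz_to_mel_htk(f), hz_to_mel_slaney(f))
```

If the Swift mel filterbank uses one convention and the training-time feature extractor used the other, every downstream timestamp inherits the resulting feature mismatch.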
