feat: add Qwen3-ForcedAligner-0.6B CoreML conversion and inference by Alex-Wengg · Pull Request #21 · FluidInference/mobius

Alex-Wengg · 2026-02-15T09:26:54Z

Summary

- Add CoreML conversion pipeline and inference script for Qwen3-ForcedAligner-0.6B, a non-autoregressive forced alignment model that produces per-word timestamps from audio + text.
- Split audio encoder into conv frontend + transformer to preserve cross-chunk bidirectional attention (5x improvement in alignment accuracy vs monolithic per-chunk approach).
- Includes conversion scripts, end-to-end inference pipeline, PyTorch comparison tooling, and Python environment (pyproject.toml + uv.lock).

Files

| File | Purpose |
| --- | --- |
| `convert-coreml.py` | CLI to export all 5 CoreML components from HuggingFace weights |
| `individual_components.py` | PyTorch wrapper modules for tracing (conv, transformer, decoder, etc.) |
| `run_coreml_inference.py` | End-to-end CoreML inference + parity comparison against PyTorch |
| `compare-models.py` | Generate PyTorch reference timestamps from LibriSpeech |
| `pyproject.toml` / `uv.lock` | Python dependencies |
| `README.md` | Architecture docs, I/O shapes, inference pipeline |
| `problems_encountered.md` | Conversion journal with all issues and fixes |

CoreML Components (5 models)

| Component | Input | Output | Precision |
| --- | --- | --- | --- |
| Audio Conv | `[1, 128, 100]` mel | `[1, 13, 1024]` conv features | FP16 |
| Audio Transformer | `[1, 256, 1024]` features | `[1, 256, 1024]` embeddings | FP32 |
| Token Embedding | `[1, seq, int32]` | `[1, seq, 1024]` | FP16 |
| Decoder Prefill | `[1, 1024, 1024]` + RoPE | `[1, 1024, 1024]` | FP32 |
| LM Head | `[1, seq, 1024]` | `[1, seq, 5000]` timestamps | FP32 |
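The LM head row above maps hidden states to 5000 timestamp bins. As a rough illustration of how such logits could be turned into times, here is a minimal sketch. It assumes each bin corresponds to one 80 ms segment (inferred from the "80ms (1 segment)" wording in the parity table) and that timestamps are read by argmax at known sequence positions. The names `decode_timestamps` and `slot_positions` are hypothetical and are not taken from the PR's scripts.

```python
import numpy as np

# ASSUMPTION: one timestamp bin spans one 80 ms segment.
SEGMENT_MS = 80


def decode_timestamps(logits, slot_positions):
    """Pick the most likely bin at each timestamp slot and convert it to ms.

    logits: array of shape [1, seq, 5000] (LM head output).
    slot_positions: sequence indices where a timestamp is expected
                    (hypothetical; the real script may locate them differently).
    """
    logits = np.asarray(logits)
    # Select the requested positions, then take the argmax over the bin axis.
    bins = logits[0, slot_positions, :].argmax(axis=-1)
    return (bins * SEGMENT_MS).astype(np.int64)


if __name__ == "__main__":
    demo = np.zeros((1, 4, 5000), dtype=np.float32)
    demo[0, 1, 3] = 1.0    # slot 1 peaks at bin 3
    demo[0, 3, 10] = 1.0   # slot 3 peaks at bin 10
    print(decode_timestamps(demo, [1, 3]))
```

Under this assumption the 5000 bins would cover 400 seconds of audio, and the smallest nonzero error is one 80 ms segment.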

Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 word boundaries)

| Metric | Value |
| --- | --- |
| AAS (mean boundary error) | 4.4 ms |
| Max boundary error | 160 ms |
| % within 20 ms | 95.4% |
| % within 80 ms (1 segment) | 99.1% |
| % within 160 ms (2 segments) | 100.0% |
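The metrics above can be reproduced from two aligned lists of boundary times. The following is a minimal sketch, not the code in `run_coreml_inference.py`: the function name `boundary_metrics` and the dictionary keys are hypothetical, and thresholds are treated as inclusive, which matches a 160 ms maximum error alongside 100% within 160 ms.

```python
import numpy as np


def boundary_metrics(ref_ms, hyp_ms, thresholds_ms=(20, 80, 160)):
    """Compare reference and hypothesis boundary times given in milliseconds."""
    ref = np.asarray(ref_ms, dtype=np.float64)
    hyp = np.asarray(hyp_ms, dtype=np.float64)
    if ref.shape != hyp.shape:
        raise ValueError("reference and hypothesis must have the same length")
    err = np.abs(ref - hyp)  # absolute error per boundary
    out = {"aas_ms": float(err.mean()), "max_ms": float(err.max())}
    for t in thresholds_ms:
        # Inclusive threshold: an error of exactly t ms counts as "within".
        out[f"pct_within_{t}ms"] = float((err <= t).mean() * 100.0)
    return out


if __name__ == "__main__":
    print(boundary_metrics([0, 100, 200, 300], [0, 110, 200, 460]))
```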

Test plan

- `cd models/stt/qwen3-forced-aligner-0.6b/coreml && uv sync`
- `uv run python convert-coreml.py`: convert all 5 components
- `uv run python compare-models.py --num-files 3`: generate PyTorch reference
- `uv run python run_coreml_inference.py compare`: verify CoreML vs PyTorch parity
- Verify AAS < 5 ms and > 95% within 20 ms

🤖 Generated with Claude Code

Add CoreML conversion pipeline for Qwen3-ForcedAligner-0.6B, a non-autoregressive forced alignment model that produces per-word timestamps from audio + text.

The pipeline splits the model into 5 CoreML components:
- Audio conv frontend (per-chunk mel → conv features)
- Audio transformer (cross-chunk bidirectional attention + projection)
- Token embedding (vocab → hidden states)
- Decoder prefill (28-layer Qwen3 decoder, single NAR pass)
- LM head (hidden states → 5000 timestamp bins)

Key design decisions:
- Audio encoder split into conv + transformer to preserve cross-chunk attention (monolithic per-chunk approach had 20.7ms AAS vs 4.4ms split)
- MRoPE cos/sin computed outside the model for flexibility
- Last mel chunk trimmed after conv to remove padding artifacts
- Decoder and LM head use FLOAT32 precision to avoid FP16 overflow

Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 boundaries):
- AAS: 4.4ms, within 20ms: 95.4%, within 80ms: 99.1%

Co-Authored-By: Claude <noreply@anthropic.com>
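The split-encoder and last-chunk-trimming decisions described in this commit message can be illustrated with a shape-only sketch. It is not the PR's implementation: `assemble_features`, `conv_out_len`, and `run_conv` are hypothetical names. The downsampling rule assumes three stride-2 convolutions with kernel 3 and padding 1, chosen only because it maps 100 mel frames to 13. Zero-padding to the fixed 256-frame transformer input is likewise an assumption.

```python
import numpy as np

CHUNK_FRAMES = 100  # mel frames per conv input (component table)
CONV_FRAMES = 13    # conv output frames per full chunk (component table)
MAX_FRAMES = 256    # transformer input length (component table)
HIDDEN = 1024


def conv_out_len(n_mel):
    """ASSUMPTION: three stride-2 convs (kernel 3, padding 1); 100 -> 13."""
    n = n_mel
    for _ in range(3):
        n = (n - 1) // 2 + 1
    return n


def assemble_features(mel, run_conv):
    """Run the conv frontend per chunk, trim padding, and pad for the transformer.

    mel: [128, T] array. run_conv: callable mapping [1, 128, 100] -> [1, 13, 1024]
    (stands in for the audio conv CoreML model).
    """
    total = mel.shape[1]
    if total == 0:
        raise ValueError("empty mel spectrogram")
    pieces = []
    for start in range(0, total, CHUNK_FRAMES):
        chunk = mel[:, start:start + CHUNK_FRAMES]
        valid = chunk.shape[1]
        if valid < CHUNK_FRAMES:
            # Pad the last chunk so the fixed-shape conv model accepts it.
            chunk = np.pad(chunk, ((0, 0), (0, CHUNK_FRAMES - valid)))
        feats = run_conv(chunk[None])[0]  # [13, 1024]
        # Keep only frames that come from real audio, dropping padding artifacts.
        pieces.append(feats[:conv_out_len(valid)])
    feats = np.concatenate(pieces, axis=0)
    n_valid = feats.shape[0]
    if n_valid > MAX_FRAMES:
        raise ValueError("audio too long for a single transformer pass")
    padded = np.zeros((1, MAX_FRAMES, HIDDEN), dtype=np.float32)
    padded[0, :n_valid] = feats
    return padded, n_valid
```

Because all chunks are concatenated before the transformer runs, attention can span chunk boundaries, which is the property the monolithic per-chunk encoder lacks.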

The inference script supports two audio encoder paths with auto-detection. Split encoder (audio_conv + audio_transformer) preserves cross-chunk attention for 4.4ms AAS. Monolithic encoder (audio_encoder) is faster but lacks cross-chunk attention (20.7ms AAS). Added comparison table and updated architecture, I/O shapes, inference pipeline, conversion, and parity sections.
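One plausible way to auto-detect the encoder path is to check which exported packages are present. The sketch below rests on assumptions: the `.mlpackage` file names are derived from the component names in this commit message, the preference for the split path follows from its better AAS, and `pick_encoder_path` is a hypothetical helper rather than a function in `run_coreml_inference.py`.

```python
from pathlib import Path


def pick_encoder_path(model_dir):
    """Return "split" or "monolithic" depending on which packages exist."""
    root = Path(model_dir)

    def has(name):
        # ASSUMPTION: each component is exported as <name>.mlpackage.
        return (root / f"{name}.mlpackage").exists()

    # Prefer the split encoder, since it preserves cross-chunk attention.
    if has("audio_conv") and has("audio_transformer"):
        return "split"
    if has("audio_encoder"):
        return "monolithic"
    raise FileNotFoundError(f"no audio encoder packages found in {root}")
```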

Document 5 bugs encountered during FluidAudio Swift integration: MLMultiArray stride issues, encoder 3D shape, Slaney vs HTK mel, STFT center padding, and MRoPE position clamping.
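For the Slaney vs HTK item, the two mel scales use different frequency warpings, so a filterbank built with one does not match features computed with the other. The sketch below shows the standard formulas for both. It does not state which scale the model expects, since the commit message does not say, and the function names are illustrative.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK mel scale: logarithmic over the whole frequency range."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def hz_to_mel_slaney(freq_hz):
    """Slaney mel scale: linear below 1 kHz, logarithmic above."""
    f_sp = 200.0 / 3.0                 # Hz per mel in the linear region
    min_log_hz = 1000.0                # start of the logarithmic region
    min_log_mel = min_log_hz / f_sp    # 15 mel at 1 kHz
    logstep = math.log(6.4) / 27.0     # log-region step size
    if freq_hz < min_log_hz:
        return freq_hz / f_sp
    return min_log_mel + math.log(freq_hz / min_log_hz) / logstep


if __name__ == "__main__":
    for f in (500.0, 1000.0, 4000.0, 8000.0):
        print(f, hz_to_mel_htk(f), hz_to_mel_slaney(f))
```

The two scales also use different units (HTK gives roughly 1000 mel at 1 kHz, Slaney gives 15), so comparing the resulting filter center frequencies is a more useful check than comparing raw mel values.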

Alex-Wengg and others added 3 commits February 15, 2026 04:26

docs: add Swift/CoreML integration bugs for ForcedAligner

26e11bd

Alex-Wengg wants to merge 3 commits into main from feat/qwen3-forced-aligner-coreml
