feat(soniox): support stt-rt-v5 with endpoint_sensitivity option #6126
mihafabcic-soniox wants to merge 3 commits into
2000ms is the Soniox API's own default and works well in practice. The previous value of 500ms, the API minimum, is too aggressive and can cause word recognition issues when the model finalizes tokens too early.
  enable_language_identification: bool = True
- max_endpoint_delay_ms: int = 500
+ max_endpoint_delay_ms: int = 2000
🚩 Breaking default change: max_endpoint_delay_ms 500 → 2000
The default max_endpoint_delay_ms changed from 500 to 2000. This is a 4× increase in the maximum endpoint detection delay, meaning existing users who rely on the default will experience noticeably later speech finalization. While this appears intentional for the v5 model, it is a behavioral breaking change for any caller that constructs STTOptions() without explicitly setting this field. The livekit-plugins-inworld plugin references soniox/stt-rt-v4 in comments (livekit-plugins/livekit-plugins-inworld/livekit/plugins/inworld/stt.py:55); that plugin may also need updating if it depends on these defaults.
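To make the concern concrete, here is a minimal sketch of how a caller could keep the previous behavior by pinning the field explicitly. `OptionsSketch` is a hypothetical stand-in, not the plugin's real `STTOptions` class; only the field name and the two default values are taken from this PR.

```python
from dataclasses import dataclass

OLD_DEFAULT_MS = 500   # default before this PR (also the API minimum)
NEW_DEFAULT_MS = 2000  # default after this PR


@dataclass
class OptionsSketch:
    """Hypothetical stand-in for STTOptions after this change."""

    max_endpoint_delay_ms: int = NEW_DEFAULT_MS


# A caller relying on the default silently picks up the new value.
implicit = OptionsSketch()

# A caller that pins the field keeps the old behavior across the upgrade.
pinned = OptionsSketch(max_endpoint_delay_ms=OLD_DEFAULT_MS)
```

Callers that depend on fast finalization would need the explicit form; everyone else inherits the longer delay.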
Updates the LiveKit Soniox plugin for the v5 model.
- Adds endpoint_sensitivity to STTOptions (float | None, range -1.0 to 1.0). Controls how quickly the model commits endpoints. Higher values finalize sooner. Only supported by v5; earlier models reject the field. Skipped on the wire when None so the server uses its default.
- Supports the stt-rt-v5 model.
- max_endpoint_delay_ms raised from 500 (the API minimum) to 2000. The old default was too aggressive on phone-call audio: short pauses between word or digit groups would cause Soniox to finalize a segment too early, before the model had enough context. 2000 matches the Soniox API's own default.
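
The "skipped on the wire when None" behavior can be sketched roughly as below. This is an illustration under stated assumptions, not the plugin's actual code: the `SketchOptions` dataclass and `build_config` helper are hypothetical, the client-side range check is an assumption (the server may be the one enforcing it), and the `stt-rt-v5` default model is assumed for the example. Only the field names, the -1.0 to 1.0 range, and the 2000 default come from the description.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchOptions:
    """Hypothetical stand-in for STTOptions; not the plugin's real class."""

    model: str = "stt-rt-v5"  # assumed default, for illustration only
    max_endpoint_delay_ms: int = 2000
    endpoint_sensitivity: Optional[float] = None

    def __post_init__(self) -> None:
        # Assumed client-side validation of the documented range.
        s = self.endpoint_sensitivity
        if s is not None and not -1.0 <= s <= 1.0:
            raise ValueError("endpoint_sensitivity must be within [-1.0, 1.0]")


def build_config(opts: SketchOptions) -> dict[str, Any]:
    """Build the request payload, omitting endpoint_sensitivity when unset."""
    config: dict[str, Any] = {
        "model": opts.model,
        "max_endpoint_delay_ms": opts.max_endpoint_delay_ms,
    }
    # Leaving the key out lets the server apply its own default, and avoids
    # sending a field that pre-v5 models reject.
    if opts.endpoint_sensitivity is not None:
        config["endpoint_sensitivity"] = opts.endpoint_sensitivity
    return config
```

Omitting the key rather than sending an explicit null keeps the payload valid for earlier models as long as the caller never sets the field.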