fix(sarvam): prevent transcript loss after long agent responses #4798
Nikhils-G wants to merge 4 commits into livekit:main
When the agent produces a long TTS response (10+ seconds), audio chunks pile up faster than they're sent. Once the audio task drains the buffer and finishes, asyncio.wait with FIRST_COMPLETED immediately cancels the message task, but Sarvam is still processing all that buffered audio and hasn't returned the transcript yet.

The result: the user speaks, Sarvam hears it, but the transcript never makes it back. The agent goes silent and can't respond.

This fix gives the message task up to 30 seconds to receive the transcript before giving up, which matches the worst-case processing time observed in production with long audio buffers.
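
To make the race concrete, here is a minimal, self-contained sketch of the grace-period idea. It is not the plugin's code: `wait_with_grace`, its parameters, and the two coroutines passed to it are hypothetical names for illustration, and only the 30-second default comes from the description above.

```python
import asyncio

GRACE_PERIOD_S = 30.0  # default taken from the PR description


async def wait_with_grace(audio_coro, message_coro, grace_s=GRACE_PERIOD_S):
    """Hypothetical helper: run both coroutines, but let the message side
    keep running for up to grace_s seconds after the audio side completes."""
    audio_task = asyncio.create_task(audio_coro)
    message_task = asyncio.create_task(message_coro)

    done, _ = await asyncio.wait(
        {audio_task, message_task}, return_when=asyncio.FIRST_COMPLETED
    )

    # Audio finished first: wait a bounded time for the transcript
    # instead of cancelling the message task straight away.
    if audio_task in done and message_task not in done:
        await asyncio.wait({message_task}, timeout=grace_s)

    # Cancel only what is still running after the grace period.
    for task in (audio_task, message_task):
        if not task.done():
            task.cancel()
    await asyncio.gather(audio_task, message_task, return_exceptions=True)

    if message_task.cancelled() or message_task.exception() is not None:
        return None
    return message_task.result()
```

With a grace period that is long enough, a transcript arriving shortly after the audio side finishes is returned; with a very short one the helper falls back to the old behaviour and the transcript is lost.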
livekit-plugins/livekit-plugins-sarvam/livekit/plugins/sarvam/stt.py
Pull request overview
This PR addresses a concurrency race in the Sarvam streaming STT connection loop where long agent TTS responses can cause buffered audio to be processed late, and the transcript message task gets cancelled prematurely (leading to dropped transcripts and the agent going silent).
Changes:
- Adds a 30s grace period for _message_task to receive the final transcript when _audio_task completes first.
- Refines task cancellation to only cancel tasks still not done after the grace-period logic.
    self._logger.info(
        "Transcript received from Sarvam",
        extra=self._build_log_context(),
    )
If _message_task completes during the extra 30s wait, it won’t be in the original done set, so any exception from _process_messages can be missed. After the wait, explicitly check/propagate self._message_task.exception() (or update the done set) so failures still surface like they do for tasks in done.
Suggested change:

    )
    # Ensure any exception from the message task is propagated
    # by including it in the completed tasks set.
    done = done | {self._message_task}
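
The point of the suggestion can be shown in isolation. The sketch below is an assumption-laden illustration, not the plugin's code: `wait_and_surface` and its parameters are hypothetical. It folds a task that finishes during the second wait back into `done`, so a late failure is re-raised instead of being dropped.

```python
import asyncio


async def wait_and_surface(first_coro, late_coro, grace_s):
    """Hypothetical helper: re-raise failures from a task that completes
    only during the grace-period wait."""
    first_task = asyncio.create_task(first_coro)
    late_task = asyncio.create_task(late_coro)

    done, _ = await asyncio.wait(
        {first_task, late_task}, return_when=asyncio.FIRST_COMPLETED
    )

    if late_task not in done:
        finished, _ = await asyncio.wait({late_task}, timeout=grace_s)
        # Without this line, a task that finished during the second
        # wait would never be inspected for an exception.
        done = done | finished

    for task in (first_task, late_task):
        if not task.done():
            task.cancel()
    await asyncio.gather(first_task, late_task, return_exceptions=True)

    for task in done:
        task.result()  # re-raises the task's exception, if any
    return len(done)
```

If the late task raises while the helper is waiting, the caller sees that exception; if it is still running when the grace period expires, it is cancelled and simply does not appear in `done`.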
    except Exception:
        pass
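The except Exception: pass around the wait silently suppresses any unexpected error from the wait call. On Python versions before 3.8, where asyncio.CancelledError inherits from Exception, it would also swallow cancellation and prevent proper shutdown of _run_connection (from 3.8 onward CancelledError derives from BaseException and is not caught by except Exception). Handle CancelledError explicitly (re-raise), and avoid suppressing unexpected exceptions from the wait call (at least log them).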
Suggested change:

    - except Exception:
    -     pass
    + except asyncio.CancelledError:
    +     # Propagate cancellation so the surrounding coroutine can shut down properly
    +     raise
    + except Exception:
    +     # Log unexpected errors instead of silently swallowing them
    +     self._logger.exception(
    +         "Error while waiting for transcript task",
    +         extra=self._build_log_context(),
    +     )
If the message task fails with an API error while we're waiting for the transcript, that exception was getting silently swallowed since it wasn't in the original done set. Now we check for and re-raise any exception after the 30s wait completes.
Don't swallow asyncio.CancelledError during the transcript grace period — re-raise it so shutdown works correctly on Python 3.10+. Also log unexpected exceptions instead of silently dropping them. Fixed ruff formatting to pass CI.
No point waiting for a transcript if the audio pipeline broke — the server won't have anything to transcribe. Check the audio task for exceptions first and only enter the grace period on clean completion.
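
Roughly, the ordering described in this commit could look like the sketch below. It is an illustration under assumptions, with `run_once` and its parameters being hypothetical names rather than the plugin's actual method: the audio task's outcome is checked first, and the grace period is entered only after a clean completion.

```python
import asyncio


async def run_once(audio_coro, message_coro, grace_s):
    """Hypothetical helper: fail fast on audio errors, otherwise give the
    message task a bounded grace period to deliver the transcript."""
    audio_task = asyncio.create_task(audio_coro)
    message_task = asyncio.create_task(message_coro)
    try:
        done, _ = await asyncio.wait(
            {audio_task, message_task}, return_when=asyncio.FIRST_COMPLETED
        )
        if audio_task in done:
            # Raises right away if the audio pipeline broke, so no time
            # is spent waiting for a transcript that cannot arrive.
            audio_task.result()
            if not message_task.done():
                await asyncio.wait({message_task}, timeout=grace_s)
        if message_task.done():
            return message_task.result()
        return None
    finally:
        for task in (audio_task, message_task):
            if not task.done():
                task.cancel()
        await asyncio.gather(audio_task, message_task, return_exceptions=True)
```

When the audio side fails, the caller gets that error immediately even if a transcript would have arrived within the grace period; when it completes cleanly, the transcript is awaited as before.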