fix(voice): scope playout state per generation step in SpeechHandle by anshulkulhari7 · Pull Request #6107 · livekit/agents

anshulkulhari7 · 2026-06-15T09:59:13Z

Summary

Follow-up to #5594 (the cross-step stale-paused-speech fix).

SpeechHandle is the public whole-turn handle, but a multi-step turn reuses the
same handle across generation steps while some state (playout / pause) is
generation-step local. #5594 resets _paused_speech in _scheduling_task after
the current generation's _wait_for_generation completes, which covers the
cross-step leak. It does not cover the case @jayeshp19 flagged in #5533:

the false-interruption timer fires before the silent tool-call generation
finishes. In that case the runtime can still do a thinking → listening → thinking false-resume during the silent step itself … the pause-on-thinking
path can still record _paused_speech.

This is the "pausing while no audio is actually playing" edge case, and it is what
issue #5545 proposes to fix by scoping playout/pause state per generation step.

The bug (reproduced)

1. The user asks "what's the weather in Tokyo?"
2. The model emits only a function call: a silent tool-call step with no audio.
3. VAD-only speech starts and ends during that silent step, so the pause-on-thinking
   path records _paused_speech and the false-interruption timer is scheduled.
4. The timer fires while the silent step is still active (before it advances to
   the tool reply).

On main, _on_false_interruption resumes and emits
AgentFalseInterruptionEvent(resumed=True) even though no audio ever played for
this step — leaking stale per-step playout state. The added regression test
asserts no such resume is emitted; it fails on main:

>       assert not any(ev.resumed for ev in false_interruption_events)
E       assert not True
tests/test_agent_session.py: AssertionError
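
The timing in the steps above can be sketched with a small asyncio model. This is only an illustration of the race, not the livekit-agents code: ToySession, its methods, and the delays are hypothetical names and values chosen for the example.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Hypothetical stand-in for the runtime; not the livekit-agents API."""

    paused_speech: bool = False
    events: list[bool] = field(default_factory=list)  # emitted `resumed` flags

    def on_user_speech_while_thinking(self) -> None:
        # Pause-on-thinking: recorded even though the step plays no audio.
        self.paused_speech = True

    async def false_interruption_timer(self, delay: float) -> None:
        await asyncio.sleep(delay)
        if self.paused_speech:
            # Pre-fix behaviour in this toy model: resume unconditionally.
            self.paused_speech = False
            self.events.append(True)


async def run_silent_step() -> list[bool]:
    session = ToySession()
    session.on_user_speech_while_thinking()
    timer = asyncio.create_task(session.false_interruption_timer(0.01))
    await asyncio.sleep(0.05)  # the silent tool-call step is still running
    await timer
    return session.events


events = asyncio.run(run_silent_step())
print(events)
```

In this model the timer (0.01 s) fires well before the silent step ends (0.05 s), so a resume is recorded for a step that never played audio, which is the condition the regression test rejects.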

Approach

Mirroring the issue's proposal ("a private generation-step ref … playout callbacks,
pause state, and cleanup timers should only mutate state when their captured
generation ref still matches the handle's current generation"), this scopes playout
state to the current step:

- SpeechHandle._playout_started: set when the current step starts audio playout
  (in the existing _on_first_frame callbacks), reset on every step advance
  (_authorize_generation). It records whether the current step is actually
  playing audio.
- _on_false_interruption now only treats a pause as a resumable false interruption
  when the current step actually started playout. For a silent step it undoes the
  preemptive pause silently and emits no resume event.
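
A minimal sketch of this gating, assuming a heavily simplified model: ToySpeechHandle and ToyActivity are hypothetical classes written for illustration, and only the names _playout_started, _authorize_generation, and _on_first_frame are taken from the description above. The real methods take arguments and do more work than shown here.

```python
from dataclasses import dataclass, field


@dataclass
class ToySpeechHandle:
    """Simplified stand-in for SpeechHandle; only the per-step flag is modelled."""

    _playout_started: bool = False

    def _authorize_generation(self) -> None:
        # Every step advance starts with no audio played yet.
        self._playout_started = False

    def _on_first_frame(self) -> None:
        # The current step produced its first audio frame.
        self._playout_started = True


@dataclass
class ToyActivity:
    """Hypothetical owner of the pause state and the false-interruption timer."""

    handle: ToySpeechHandle
    paused_speech: bool = False
    resumed_events: list[bool] = field(default_factory=list)

    def on_false_interruption(self) -> None:
        if not self.paused_speech:
            return
        self.paused_speech = False  # undo the pause in both cases
        if not self.handle._playout_started:
            return  # silent step: nothing to resume, emit nothing
        self.resumed_events.append(True)


handle = ToySpeechHandle()
activity = ToyActivity(handle)

handle._authorize_generation()  # silent tool-call step, no first frame
activity.paused_speech = True
activity.on_false_interruption()
silent_events = list(activity.resumed_events)

handle._authorize_generation()  # tool-reply step
handle._on_first_frame()        # audio playout starts
activity.paused_speech = True
activity.on_false_interruption()
reply_events = list(activity.resumed_events)
```

In this model the silent step clears the pause without recording a resume, while the tool-reply step, which did start playout, still resumes as before.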

The change is additive and backward-compatible. The existing #5594 regression test
(test_silent_tool_call_pause_state_does_not_leak_into_tool_reply) and the rest of
the interruption suite still pass, since a real tool reply has started playout by the
time its timer fires.

Question for maintainers

@longcw @davidzhao — does this per-step _playout_started gating match the internal
direction you intended for #5545? I kept it minimal (a per-step playout flag + a
single early-return in the false-interruption timer) rather than introducing a
broader public per-step handle. Happy to fold playout-state tracking more fully into
a generation-step ref if you'd prefer that shape.
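
For comparison, the broader generation-step-ref shape could look roughly like the sketch below, where each callback captures the generation it was created for and ignores itself once the handle has advanced. ToyHandle, advance, and make_first_frame_callback are hypothetical names for illustration; nothing here reflects an existing livekit-agents API.

```python
from typing import Callable


class ToyHandle:
    """Hypothetical handle that tags per-step state with a generation counter."""

    def __init__(self) -> None:
        self._generation = 0
        self.playout_started = False

    def advance(self) -> int:
        # Moving to the next step invalidates callbacks from earlier steps.
        self._generation += 1
        self.playout_started = False
        return self._generation

    def make_first_frame_callback(self) -> Callable[[], bool]:
        captured = self._generation  # generation this callback belongs to

        def on_first_frame() -> bool:
            if captured != self._generation:
                return False  # stale callback from an earlier step: no mutation
            self.playout_started = True
            return True

        return on_first_frame


handle = ToyHandle()
handle.advance()
stale_cb = handle.make_first_frame_callback()  # belongs to step 1
handle.advance()
fresh_cb = handle.make_first_frame_callback()  # belongs to step 2

stale_applied = stale_cb()
state_after_stale = handle.playout_started
fresh_applied = fresh_cb()
state_after_fresh = handle.playout_started
```

The trade-off is that every callback, pause record, and cleanup timer has to carry the captured generation, whereas the single per-step flag only touches the first-frame callbacks and the false-interruption timer.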

Verification

- uv run pytest tests/test_agent_session.py --unit: 34 passed (including the new test
  and the #5594 regression test)
- uv run ruff check / ruff format --check: clean on changed files
- uv run mypy -p livekit.agents (strict): Success, no issues

Closes #5545

AI-assisted: implemented with AI assistance; all changes were reviewed, tested, and verified by the author.

A SpeechHandle represents a whole assistant turn, but a multi-step turn (e.g. a silent tool-call step followed by a tool-reply step) reuses the same handle across steps while playout state is generation-step local. When a silent tool-call step records a preemptive pause and the false-interruption timer fires while that step is still active, the runtime resumed and emitted an AgentFalseInterruptionEvent(resumed=True) even though no audio ever played for that step, leaking stale per-step playout state. Track playout per generation step via a _playout_started flag on SpeechHandle that is set when a step starts audio playout and reset on every step advance. The false-interruption resume now only fires when the current generation step actually started playout; otherwise the preemptive pause is undone silently. Closes livekit#5545

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

anshulkulhari7 requested a review from a team as a code owner June 15, 2026 09:59

devin-ai-integration Bot reviewed Jun 15, 2026

anshulkulhari7 wants to merge 1 commit into livekit:main from anshulkulhari7:refactor/5545-playout-pause-per-step
