fix(voice): scope playout state per generation step in SpeechHandle #6107

Open
anshulkulhari7 wants to merge 1 commit into livekit:main from anshulkulhari7:refactor/5545-playout-pause-per-step

Conversation

@anshulkulhari7

Contributor

Summary

Follow-up to #5594 (the cross-step stale-paused-speech fix).

SpeechHandle is the public whole-turn handle, but a multi-step turn reuses the
same handle across generation steps while some state (playout / pause) is
generation-step local. #5594 resets _paused_speech in _scheduling_task after
the current generation's _wait_for_generation completes, which covers the
cross-step leak. It does not cover the case @jayeshp19 flagged in #5533:

the false-interruption timer fires before the silent tool-call generation
finishes. In that case the runtime can still do a thinking → listening →
thinking false-resume during the silent step itself … the pause-on-thinking
path can still record _paused_speech.

This is the "pausing while no audio is actually playing" edge case, and it is what
issue #5545 proposes to fix by scoping playout/pause state per generation step.

The bug (reproduced)

  1. User asks "what's the weather in Tokyo?"
  2. The model emits only a function call — a silent tool-call step, no audio.
  3. VAD-only speech starts and ends during that silent step, so the pause-on-thinking
    path records _paused_speech and the false-interruption timer is scheduled.
  4. The timer fires while the silent step is still active (before it advances to
    the tool reply).

On main, _on_false_interruption resumes and emits
AgentFalseInterruptionEvent(resumed=True) even though no audio ever played for
this step — leaking stale per-step playout state. The added regression test
asserts no such resume is emitted; it fails on main:

>       assert not any(ev.resumed for ev in false_interruption_events)
E       assert not True
tests/test_agent_session.py: AssertionError
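
The failure mode can be illustrated with a toy model. This is a minimal sketch, not the livekit-agents implementation: `ToyHandle`, `ToySession`, `FalseInterruptionEvent`, and the method names are hypothetical stand-ins, and only `_paused_speech` and `resumed` are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FalseInterruptionEvent:
    # Hypothetical stand-in for AgentFalseInterruptionEvent.
    resumed: bool


@dataclass
class ToyHandle:
    # Hypothetical stand-in for a whole-turn handle reused across steps.
    step: int = 0


@dataclass
class ToySession:
    # Models only the pre-fix behaviour described above.
    events: List[FalseInterruptionEvent] = field(default_factory=list)
    _paused_speech: Optional[ToyHandle] = None

    def on_user_speech_while_thinking(self, handle: ToyHandle) -> None:
        # Preemptive pause: recorded even if the step has produced no audio.
        self._paused_speech = handle

    def on_false_interruption_timer(self) -> None:
        # No check that audio ever played for the current step.
        if self._paused_speech is not None:
            self._paused_speech = None
            self.events.append(FalseInterruptionEvent(resumed=True))


def simulate_silent_step() -> List[FalseInterruptionEvent]:
    session = ToySession()
    handle = ToyHandle()  # step 0: function call only, no audio
    session.on_user_speech_while_thinking(handle)
    session.on_false_interruption_timer()  # fires before the step advances
    return session.events
```

In this model `simulate_silent_step()` returns one event with `resumed=True`, which is the condition the regression test's `assert not any(ev.resumed ...)` rejects.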

Approach

Mirroring the issue's proposal ("a private generation-step ref … playout callbacks,
pause state, and cleanup timers should only mutate state when their captured
generation ref still matches the handle's current generation"), this scopes playout
state to the current step:

  • SpeechHandle._playout_started — set when the current step starts audio playout
    (in the existing _on_first_frame callbacks), reset on every step advance
    (_authorize_generation). It records whether the current step is actually
    playing audio.
  • _on_false_interruption now only treats a pause as a resumable false interruption
    when the current step actually started playout. For a silent step it undoes the
    preemptive pause silently and emits no resume event.
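
The two bullets above can be sketched as follows. This is an illustrative model under stated assumptions, not the code in this PR: `ToySpeechHandle`, `ToySession`, `pause`, and `run_step` are hypothetical, and only the names `_playout_started`, `_authorize_generation`, `_on_first_frame`, `_on_false_interruption`, and `_paused_speech` are borrowed from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToySpeechHandle:
    # Hypothetical stand-in; only the flag name mirrors the PR.
    _playout_started: bool = False

    def _authorize_generation(self) -> None:
        # Every step advance starts from "no audio played yet".
        self._playout_started = False

    def _on_first_frame(self) -> None:
        # Called when the current step begins audio playout.
        self._playout_started = True


@dataclass
class ToySession:
    resumed_events: List[bool] = field(default_factory=list)
    _paused_speech: Optional[ToySpeechHandle] = None

    def pause(self, handle: ToySpeechHandle) -> None:
        self._paused_speech = handle

    def _on_false_interruption(self) -> None:
        handle = self._paused_speech
        if handle is None:
            return
        self._paused_speech = None  # the pause is undone in both cases
        if not handle._playout_started:
            return  # silent step: early return, no resume event
        self.resumed_events.append(True)


def run_step(audible: bool) -> List[bool]:
    session = ToySession()
    handle = ToySpeechHandle()
    handle._authorize_generation()    # step begins
    if audible:
        handle._on_first_frame()      # audio actually started
    session.pause(handle)             # preemptive pause on user speech
    session._on_false_interruption()  # timer fires during this step
    return session.resumed_events
```

Here `run_step(False)` yields no resume events while `run_step(True)` yields one, matching the intended split between silent and audible steps.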

The change is additive and backward-compatible. The existing #5594 regression test
(test_silent_tool_call_pause_state_does_not_leak_into_tool_reply) and the rest of
the interruption suite still pass, since a real tool reply has started playout by the
time its timer fires.

Question for maintainers

@longcw @davidzhao — does this per-step _playout_started gating match the internal
direction you intended for #5545? I kept it minimal (a per-step playout flag + a
single early-return in the false-interruption timer) rather than introducing a
broader public per-step handle. Happy to fold playout-state tracking more fully into
a generation-step ref if you'd prefer that shape.
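
For comparison, the broader generation-step-ref shape mentioned above might look roughly like this. It is a sketch of the general captured-reference pattern only; `GenerationStep`, `ToyHandle`, `advance`, and `make_first_frame_callback` are hypothetical names, not existing livekit-agents APIs.

```python
from typing import Callable, Optional


class GenerationStep:
    """Hypothetical per-step token; object identity marks the step."""

    def __init__(self) -> None:
        self.playout_started = False


class ToyHandle:
    """Hypothetical whole-turn handle that owns the current step ref."""

    def __init__(self) -> None:
        self._current: Optional[GenerationStep] = None

    def advance(self) -> GenerationStep:
        # A new step replaces the ref; older callbacks become stale.
        self._current = GenerationStep()
        return self._current

    def make_first_frame_callback(self) -> Callable[[], bool]:
        captured = self._current  # ref captured at registration time

        def on_first_frame() -> bool:
            # Mutate state only if the captured step is still current.
            if captured is None or captured is not self._current:
                return False
            captured.playout_started = True
            return True

        return on_first_frame
```

A callback registered during one step and fired after `advance()` returns `False` and leaves the new step's state untouched, which is the property the issue's proposal asks of playout callbacks, pause state, and cleanup timers.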

Verification

Closes #5545


AI-assisted: implemented with AI assistance; all changes were reviewed, tested, and verified by the author.

A SpeechHandle represents a whole assistant turn, but a multi-step turn
(e.g. a silent tool-call step followed by a tool-reply step) reuses the
same handle across steps while playout state is generation-step local.

When a silent tool-call step records a preemptive pause and the
false-interruption timer fires while that step is still active, the
runtime resumed and emitted an AgentFalseInterruptionEvent(resumed=True)
even though no audio ever played for that step, leaking stale per-step
playout state.

Track playout per generation step via a _playout_started flag on
SpeechHandle that is set when a step starts audio playout and reset on
every step advance. The false-interruption resume now only fires when the
current generation step actually started playout; otherwise the preemptive
pause is undone silently.

Closes livekit#5545

anshulkulhari7 requested a review from a team as a code owner on June 15, 2026 09:59

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Development

Successfully merging this pull request may close these issues.

Track voice playout and pause state by generation step instead of reused SpeechHandle
