echo guard: disable interruptions while AEC warms up #4813
| if self._session._echo_guard_remaining_duration > 0: | ||
| # disable interruption from audio activity while echo guard is active | ||
| return |
🟡 Echo guard blocks interruptions even when agent is not speaking (between turns)
The _interrupt_by_audio_activity check at agent_activity.py:1229 only checks _echo_guard_remaining_duration > 0 without verifying the agent is currently speaking. When the agent transitions from "speaking" to "thinking" (e.g., during a tool call or between LLM response and TTS), _cancel_echo_guard_timer preserves the remaining echo guard duration. This causes _interrupt_by_audio_activity to block genuine user interruptions during non-speaking phases where there is no audio output and thus no echo to guard against.
Root Cause and Impact
The push_audio method at agent_activity.py:788-791 correctly gates audio discarding on agent_state == "speaking", but _interrupt_by_audio_activity at agent_activity.py:1229 does not include this check:
# push_audio - correctly checks agent_state
should_discard = ... or (
    self._session.agent_state == "speaking"
    and self._session._echo_guard_remaining_duration > 0
)

# _interrupt_by_audio_activity - missing agent_state check
def _interrupt_by_audio_activity(self) -> None:
    if self._session._echo_guard_remaining_duration > 0:
        return  # blocks even when agent is in "thinking" or "listening" state

Scenario: With echo_guard_duration=3.0, the agent speaks for 1 second then transitions to "thinking". _cancel_echo_guard_timer (agent_session.py:1217-1227) saves 2.0s of remaining duration. During the thinking phase, the user speaks but _interrupt_by_audio_activity returns early because _echo_guard_remaining_duration is 2.0 > 0, even though there's no audio output to cause echo. This blocks the user from interrupting the agent during non-speaking phases.
Impact: Users cannot interrupt the agent during "thinking" or "listening" states while the echo guard has remaining duration, even though echo is only possible during "speaking" state.
| if self._session._echo_guard_remaining_duration > 0: | |
| # disable interruption from audio activity while echo guard is active | |
| return | |
| if self._session.agent_state == "speaking" and self._session._echo_guard_remaining_duration > 0: | |
| # disable interruption from audio activity while echo guard is active | |
| return | |
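To make the difference between the two guards concrete, here is a minimal, self-contained sketch of the scenario above. `SessionModel` and both helper functions are hypothetical stand-ins written for illustration; they are not the actual `agent_activity.py` code, and only mirror the two conditions quoted in this comment.

```python
from dataclasses import dataclass


@dataclass
class SessionModel:
    """Hypothetical stand-in holding only the two fields the guard reads."""
    agent_state: str
    echo_guard_remaining: float


def blocks_interruption_current(s: SessionModel) -> bool:
    # Condition flagged in the review: only the remaining duration is checked.
    return s.echo_guard_remaining > 0


def blocks_interruption_suggested(s: SessionModel) -> bool:
    # Suggested condition: the guard applies only while the agent is speaking.
    return s.agent_state == "speaking" and s.echo_guard_remaining > 0


# Scenario: echo_guard_duration=3.0, the agent speaks for 1 second,
# then moves to "thinking" with 2.0 seconds of guard left over.
remaining = 3.0 - 1.0
thinking = SessionModel("thinking", remaining)
speaking = SessionModel("speaking", remaining)
```

Under these assumptions, the current condition blocks the interruption in the `thinking` case, while the suggested condition blocks it only in the `speaking` case.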
| but may incur extra compute if the user interrupts or revises mid-utterance. | ||
| Defaults to ``False``. | ||
| echo_guard_duration (float, optional): The duration in seconds that the agent | ||
| will ignore user's audio interruptions after the agent starts speaking. |
Q: Does this apply to both cases when the session starts:
- Agent speaks first
- Agent's first response (the user might speak first)
Yes. It only considers whether the agent is speaking, no matter who speaks first.
| use_tts_aligned_transcript: NotGivenOr[bool] = NOT_GIVEN, | ||
| tts_text_transforms: NotGivenOr[Sequence[TextTransforms] | None] = NOT_GIVEN, | ||
| preemptive_generation: bool = False, | ||
| echo_guard_duration: float | None = None, |
I think we should do it by default.
Maybe we shouldn't even have an option for it; it seems like an issue everybody has.
How long does the AEC usually need to warm up?
When echo_guard_duration is set, it blocks interruptions for a few seconds after the agent starts speaking to allow the client to calibrate AEC. This only blocks the audio input while the agent is speaking and disables interruption from audio input; session.interrupt still works.
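A rough sketch of that behaviour, as I understand it from this thread: `ToySession` and all of its methods are hypothetical and only model the description above (a guard budget that is consumed while the agent speaks, audio-driven interruptions suppressed while the budget lasts, explicit interruption always allowed). It is not the real `AgentSession` implementation.

```python
from typing import Optional


class ToySession:
    """Hypothetical model of the echo guard; not the real AgentSession."""

    def __init__(self, echo_guard_duration: Optional[float] = None) -> None:
        # Remaining guard budget in seconds; 0.0 means the guard is off.
        self.remaining = echo_guard_duration or 0.0
        self.agent_state = "listening"
        self.interrupted = False

    def start_speaking(self) -> None:
        self.agent_state = "speaking"
        self.interrupted = False

    def advance(self, seconds: float) -> None:
        # The budget is only consumed while the agent is speaking.
        if self.agent_state == "speaking":
            self.remaining = max(0.0, self.remaining - seconds)

    def on_audio_activity(self) -> None:
        # Audio-driven interruption is suppressed while the guard is active.
        if self.agent_state == "speaking" and self.remaining > 0:
            return
        self.interrupt()

    def interrupt(self) -> None:
        # Explicit interruption bypasses the guard entirely.
        self.interrupted = True
        self.agent_state = "listening"


guarded = ToySession(echo_guard_duration=3.0)
guarded.start_speaking()
guarded.advance(1.0)
guarded.on_audio_activity()          # ignored: 2.0 seconds of guard remain
ignored_during_guard = not guarded.interrupted
guarded.interrupt()                  # explicit interrupt still goes through
```

Once the budget reaches zero, or when no `echo_guard_duration` is given, `on_audio_activity` in this model interrupts as usual.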