echo guard: disable interruptions while AEC warms up #4813
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -89,6 +89,7 @@ class AgentSessionOptions: | |
| preemptive_generation: bool | ||
| tts_text_transforms: Sequence[TextTransforms] | None | ||
| ivr_detection: bool | ||
| echo_guard_duration: float | None | ||
|
|
||
|
|
||
| Userdata_T = TypeVar("Userdata_T") | ||
|
|
@@ -158,6 +159,7 @@ def __init__( | |
| use_tts_aligned_transcript: NotGivenOr[bool] = NOT_GIVEN, | ||
| tts_text_transforms: NotGivenOr[Sequence[TextTransforms] | None] = NOT_GIVEN, | ||
| preemptive_generation: bool = False, | ||
| echo_guard_duration: float | None = None, | ||
|
Member
I think we should do it by default.
Contributor
Author
How long does the AEC usually need to warm up? |
||
| ivr_detection: bool = False, | ||
| conn_options: NotGivenOr[SessionConnectOptions] = NOT_GIVEN, | ||
| loop: asyncio.AbstractEventLoop | None = None, | ||
|
|
@@ -246,6 +248,10 @@ def __init__( | |
| can reduce response latency by overlapping model inference with user audio, | ||
| but may incur extra compute if the user interrupts or revises mid-utterance. | ||
| Defaults to ``False``. | ||
| echo_guard_duration (float, optional): The duration in seconds that the agent | ||
| will ignore user's audio interruptions after the agent starts speaking. | ||
|
Member
Q: Does this apply to both cases when the session starts:
Contributor
Author
Yes, it only considers the agent speaking, no matter who speaks first. |
||
| This is useful to prevent the agent from being interrupted by echo before AEC is ready. | ||
| Default ``None``. | ||
| ivr_detection (bool): Whether to detect if the agent is interacting with an IVR system. | ||
| Default ``False``. | ||
| conn_options (SessionConnectOptions, optional): Connection options for | ||
|
|
@@ -291,6 +297,7 @@ def __init__( | |
| use_tts_aligned_transcript=use_tts_aligned_transcript | ||
| if is_given(use_tts_aligned_transcript) | ||
| else None, | ||
| echo_guard_duration=echo_guard_duration, | ||
| ) | ||
| self._conn_options = conn_options or SessionConnectOptions() | ||
| self._started = False | ||
|
|
@@ -316,6 +323,11 @@ def __init__( | |
| self._llm_error_counts = 0 | ||
| self._tts_error_counts = 0 | ||
|
|
||
| # echo guard: disable interruptions while AEC warms up | ||
| self._echo_guard_remaining_duration = echo_guard_duration or 0.0 | ||
| self._echo_guard_timer: asyncio.TimerHandle | None = None | ||
| self._echo_guard_speaking_start: float | None = None | ||
|
|
||
| # configurable IO | ||
| self._input = io.AgentInput(self._on_video_input_changed, self._on_audio_input_changed) | ||
| self._output = io.AgentOutput( | ||
|
|
@@ -787,6 +799,8 @@ async def _aclose_impl( | |
|
|
||
| self._closing = True | ||
| self._cancel_user_away_timer() | ||
| self._cancel_echo_guard_timer() | ||
| self._on_echo_guard_expired() # always clear echo guard when closing the session | ||
|
|
||
| if self._activity is not None: | ||
| if not drain: | ||
|
|
@@ -1192,6 +1206,26 @@ def _cancel_user_away_timer(self) -> None: | |
| self._user_away_timer.cancel() | ||
| self._user_away_timer = None | ||
|
|
||
| def _on_echo_guard_expired(self) -> None: | ||
| if self._echo_guard_remaining_duration > 0: | ||
| logger.debug("echo guard expired, re-enabling interruptions") | ||
|
|
||
| self._echo_guard_remaining_duration = 0.0 | ||
| self._echo_guard_timer = None | ||
| self._echo_guard_speaking_start = None | ||
|
|
||
| def _cancel_echo_guard_timer(self) -> None: | ||
| if self._echo_guard_timer is not None: | ||
| self._echo_guard_timer.cancel() | ||
| self._echo_guard_timer = None | ||
|
|
||
| if self._echo_guard_speaking_start is not None: | ||
| elapsed = time.time() - self._echo_guard_speaking_start | ||
| self._echo_guard_remaining_duration = max( | ||
| 0.0, self._echo_guard_remaining_duration - elapsed | ||
| ) | ||
| self._echo_guard_speaking_start = None | ||
|
|
||
| def _update_agent_state( | ||
| self, | ||
| state: AgentState, | ||
|
|
@@ -1223,6 +1257,20 @@ def _update_agent_state( | |
| self._agent_speaking_span.end() | ||
| self._agent_speaking_span = None | ||
|
|
||
| # echo guard: disable interruptions while AEC warms up | ||
| if state == "speaking" and self._echo_guard_remaining_duration > 0: | ||
| self._echo_guard_speaking_start = time.time() | ||
| self._echo_guard_timer = self._loop.call_later( | ||
| self._echo_guard_remaining_duration, self._on_echo_guard_expired | ||
| ) | ||
| logger.debug( | ||
| "echo guard active, disabling interruptions for %.2fs", | ||
| self._echo_guard_remaining_duration, | ||
| ) | ||
|
|
||
| if self._agent_state == "speaking" and state != "speaking": | ||
| self._cancel_echo_guard_timer() | ||
|
|
||
| if state == "listening" and self._user_state == "listening": | ||
| self._set_user_away_timer() | ||
| else: | ||
|
|
||
🟡 Echo guard blocks interruptions even when agent is not speaking (between turns)
The `_interrupt_by_audio_activity` check at `agent_activity.py:1229` only checks `_echo_guard_remaining_duration > 0` without verifying that the agent is currently speaking. When the agent transitions from "speaking" to "thinking" (e.g., during a tool call or between LLM response and TTS), `_cancel_echo_guard_timer` preserves the remaining echo guard duration. This causes `_interrupt_by_audio_activity` to block genuine user interruptions during non-speaking phases where there is no audio output and thus no echo to guard against.

Root Cause and Impact

The `push_audio` method at `agent_activity.py:788-791` correctly gates audio discarding on `agent_state == "speaking"`, but `_interrupt_by_audio_activity` at `agent_activity.py:1229` does not include this check:

Scenario: With `echo_guard_duration=3.0`, the agent speaks for 1 second then transitions to "thinking". `_cancel_echo_guard_timer` (`agent_session.py:1217-1227`) saves 2.0s of remaining duration. During the thinking phase, the user speaks but `_interrupt_by_audio_activity` returns early because `_echo_guard_remaining_duration` is 2.0 > 0, even though there's no audio output to cause echo. This blocks the user from interrupting the agent during non-speaking phases.

Impact: Users cannot interrupt the agent during "thinking" or "listening" states while the echo guard has remaining duration, even though echo is only possible during "speaking" state.
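The scenario above can be reproduced with a small standalone model of the guard bookkeeping. This is only a sketch: `EchoGuardModel` and its method names are hypothetical and not part of the PR, the clock is passed in explicitly instead of using `time.time()` and `call_later`, and the state-gated check is one possible reading of what the comment asks for, not the fix that was merged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EchoGuardModel:
    """Hypothetical stand-in for the echo guard fields added in the diff."""

    remaining: float  # guard budget in seconds, banked between speaking turns
    agent_state: str = "listening"
    speaking_start: Optional[float] = None

    def update_agent_state(self, state: str, now: float) -> None:
        # Leaving "speaking": subtract the time spent speaking and keep the
        # rest of the budget, mirroring _cancel_echo_guard_timer in the diff.
        if self.agent_state == "speaking" and state != "speaking":
            if self.speaking_start is not None:
                elapsed = now - self.speaking_start
                self.remaining = max(0.0, self.remaining - elapsed)
                self.speaking_start = None
        # Entering "speaking" with budget left: start consuming it.
        if state == "speaking" and self.remaining > 0:
            self.speaking_start = now
        self.agent_state = state

    def remaining_at(self, now: float) -> float:
        # While speaking, the budget shrinks with the clock; this replaces
        # the call_later timer that zeroes the budget in the real code.
        if self.speaking_start is None:
            return self.remaining
        return max(0.0, self.remaining - (now - self.speaking_start))

    def blocks_as_reported(self, now: float) -> bool:
        # Behavior described in the comment: only the budget is checked.
        return self.remaining_at(now) > 0

    def blocks_with_state_check(self, now: float) -> bool:
        # Assumed fix: also require that the agent is currently speaking.
        return self.agent_state == "speaking" and self.remaining_at(now) > 0


guard = EchoGuardModel(remaining=3.0)
guard.update_agent_state("speaking", now=0.0)
guard.update_agent_state("thinking", now=1.0)  # spoke for 1s, 2.0s banked
print(guard.remaining)                          # 2.0
print(guard.blocks_as_reported(now=1.5))        # True: user cannot interrupt
print(guard.blocks_with_state_check(now=1.5))   # False: no echo while thinking
```

With the state check in place, the banked 2.0s still applies the next time the agent starts speaking, so the guard keeps covering AEC warm-up without suppressing interruptions between turns.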