
Voice text splitter drops short custom chunks #3363

@Aphroq

Description


Please read this first

  • Have you read the docs? Yes.
  • Have you searched for related issues? Yes. I searched existing issues and PRs for voice text_splitter short chunk and did not find a matching report or fix.

Describe the bug

StreamedAudioResult drops text when a custom TTS text_splitter returns a non-empty chunk shorter than 20 characters. This can silently omit short responses such as "ok" even though the splitter explicitly marked the text as ready for TTS.

The default sentence splitter already applies its own minimum sentence length before returning a chunk. Once a custom splitter returns non-empty text, StreamedAudioResult should treat that as the splitter's decision and send the chunk to the TTS model.
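For context, this is the splitter contract I am relying on, as a minimal standalone sketch: a `text_splitter` takes the buffered text and returns `(ready_for_tts, remaining_buffer)`. The regex and the `min_sentence_length` default are my approximation of the default sentence-based splitter, not the SDK source; the point is that the minimum-length decision already lives inside the splitter.

```python
import re


def sentence_splitter(text: str, min_sentence_length: int = 20) -> tuple[str, str]:
    """Approximation of a sentence-based text_splitter.

    Returns (ready_for_tts, remaining_buffer). Complete sentences are only
    released once their combined length reaches min_sentence_length, so the
    length check happens here, inside the splitter itself.
    """
    # Text without a sentence-ending terminator stays in the buffer.
    sentences = re.findall(r"[^.!?]*[.!?]", text)
    joined = "".join(sentences)
    ready = joined.strip()
    if len(ready) < min_sentence_length:
        return "", text
    return ready, text[len(joined):]
```

A custom splitter that returns non-empty text has therefore already made this call, and a second, hard-coded length check downstream second-guesses it.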

I also checked adjacent voice streaming behavior. The same path needs to emit turn_ended when a custom splitter consumes all buffered text before _turn_done() is called; otherwise a turn can start without a matching turn_ended lifecycle event.
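My guess at the failure pattern, as a standalone sketch with hypothetical names (`MIN_TEXT_LENGTH`, `dispatch_chunks` are illustrative only; I have not confirmed this is the actual SDK code path): a global minimum length applied after the splitter has already marked the text as ready.

```python
MIN_TEXT_LENGTH = 20  # assumed global threshold (hypothetical name)


def dispatch_chunks(buffer: str, text_splitter) -> tuple[list[str], str]:
    """Sketch of the suspected bug: re-checking chunk length after the
    splitter has already decided the text is ready, silently dropping
    short chunks instead of sending them to TTS."""
    ready, remainder = text_splitter(buffer)
    sent: list[str] = []
    if ready and len(ready) >= MIN_TEXT_LENGTH:  # suspected extra guard
        sent.append(ready)
        return sent, remainder
    # Short non-empty chunks fall through here and are never sent.
    return sent, buffer
```

With the `split_immediately` splitter from the repro below, this pattern returns no chunks for `"ok"`, matching the observed empty `tts_texts`.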

Debug information

  • Agents SDK version: 0.17.1, reproduced on main at 92e014a4
  • Python version: Python 3.12.1

Repro steps

Run this minimal script:

import asyncio
from collections.abc import AsyncIterator

import numpy as np

from agents.voice import StreamedAudioResult, TTSModel, TTSModelSettings, VoicePipelineConfig


class RecordingTTS(TTSModel):
    """Fake TTS model that records every text chunk it receives."""

    def __init__(self):
        self.texts = []

    @property
    def model_name(self) -> str:
        return "recording_tts"

    async def run(self, text: str, settings: TTSModelSettings) -> AsyncIterator[bytes]:
        self.texts.append(text)
        yield np.zeros(2, dtype=np.int16).tobytes()


def split_immediately(text: str) -> tuple[str, str]:
    # Custom splitter: treat all buffered text as ready for TTS immediately.
    return text, ""


async def main():
    tts = RecordingTTS()
    result = StreamedAudioResult(
        tts,
        TTSModelSettings(buffer_size=1, text_splitter=split_immediately),
        VoicePipelineConfig(),
    )
    await result._add_text("ok")
    await result._turn_done()
    await result._done()

    events = []
    audio_chunks = 0
    async for event in result.stream():
        if event.type == "voice_stream_event_lifecycle":
            events.append(event.event)
        elif event.type == "voice_stream_event_audio":
            events.append("audio")
            audio_chunks += 1
    print({"tts_texts": tts.texts, "events": events, "audio_chunks": audio_chunks})


asyncio.run(main())

Actual result:

{'tts_texts': [], 'events': ['turn_started', 'session_ended'], 'audio_chunks': 0}

Expected behavior

The non-empty splitter chunk should be sent to TTS even though it is shorter than 20 characters. The run should produce one audio event and a balanced lifecycle sequence, for example:

{'tts_texts': ['ok'], 'events': ['turn_started', 'audio', 'turn_ended', 'session_ended'], 'audio_chunks': 1}
