[Bug] Inconsistent TTS voices (multiple voices) and repetition for the same output in ADK 1.26.0 (Vietnamese) #4649

@thinhemb-knmholdings

Description:
I am experiencing an issue where the TTS output contains two different voices and repeats words or phrases within a single sentence. The model appears to be hallucinating a multi-speaker conversation, or the audio generation itself is glitching.

This issue started occurring after I migrated from the googleapis/python-genai SDK (v1.65.0) to the new google/adk-python (v1.26.0).

Environment:

  • Model: gemini-2.5-flash-native-audio-preview-12-2025
  • Old SDK: python-genai v1.65.0 (Worked fine, single consistent voice)
  • Current SDK: adk-python v1.26.0
  • Language: Vietnamese
  • Use Case: Voice Bot / Virtual Assistant (e.g., Customer Service)
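To confirm the versions listed above in the running environment, a minimal sketch like the following could be used. It assumes the two SDKs are installed under the PyPI distribution names `google-adk` and `google-genai`; the helper name `installed_version` is only illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    # Assumed PyPI distribution names for adk-python and python-genai.
    for name in ("google-adk", "google-genai"):
        print(f"{name}: {installed_version(name)}")
```

Including this output in the report makes it easier to rule out a mismatch between the ADK and the underlying genai SDK it depends on.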

Troubleshooting Steps I've Taken:

  1. Verified Audio Source: I confirmed this is NOT an audio conversion/decoding error on my end. I intercepted and saved the raw audio bytes directly returned by Google's API, and the anomalies (2 voices, repetitions) are present in the raw file.
  2. Prompt Engineering: I tried explicitly instructing the model to use only one voice and act as a single agent, but it didn't solve the issue.
  3. Configuration Check: I adjusted various configurations within the ADK, but the issue persists.
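For step 1, one way to keep the raw bytes inspectable without any conversion is to wrap them in a WAV container unchanged. This is a sketch, not the code used in the report: it assumes the API returns 16-bit mono PCM at 24 kHz, and `save_pcm_as_wav` is a hypothetical helper name. If the actual format differs, adjust the parameters.

```python
import wave


def save_pcm_as_wav(
    pcm: bytes,
    path: str,
    sample_rate: int = 24000,  # assumption: 24 kHz output audio
    channels: int = 1,         # assumption: mono
    sample_width: int = 2,     # assumption: 16-bit samples (2 bytes)
) -> None:
    """Write raw PCM bytes to a WAV file without modifying the samples."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```

Because only a header is added, any second voice or repetition audible in the resulting file is already present in the bytes received from the API.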

Expected Behavior:
For a single generated response, the TTS should output the text using one consistent voice without sudden voice changes or unnatural repetitions.

Actual Behavior:
The audio output shifts between two distinct voices (as if two different people are speaking) and repeats certain parts of the sentence (e.g., saying "Alo" multiple times or repeating the intro). It feels like the model is trying to simulate both sides of a phone call.
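If a text transcript of the generated audio is available, the repetition can be quantified with a small check like the one below. This is only a sketch: it assumes a transcript string has been obtained separately, and `repeated_phrases` is a hypothetical helper. It detects words or short phrases repeated back-to-back; it does not detect the voice change.

```python
import re
import unicodedata


def repeated_phrases(text: str, max_len: int = 4) -> list[tuple[str, int]]:
    """Return (phrase, count) for phrases of up to max_len words repeated back-to-back."""
    # Normalize so Vietnamese diacritics are single code points matched by \w.
    normalized = unicodedata.normalize("NFC", text).lower()
    words = re.findall(r"\w+", normalized)
    found: list[tuple[str, int]] = []
    i = 0
    while i < len(words):
        matched = False
        # Try the shortest phrase first so "alo alo alo" is reported as "alo" x3.
        for n in range(1, max_len + 1):
            chunk = words[i:i + n]
            if len(chunk) < n:
                break
            count = 1
            while words[i + count * n:i + (count + 1) * n] == chunk:
                count += 1
            if count > 1:
                found.append((" ".join(chunk), count))
                i += count * n
                matched = True
                break
        if not matched:
            i += 1
    return found
```

For example, a transcript such as "Alo, alo, alo. Trung tâm hành chính công xin nghe" would be flagged for the repeated "alo", while a clean greeting returns an empty list.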

Steps to Reproduce:
(Please review the attached code snippet and audio file)

  1. Initialize the adk-python client (v1.26.0).
  2. Send a prompt simulating a customer service scenario in Vietnamese (e.g., "Trung tâm hành chính công xin nghe...").
  3. Listen to the raw audio output returned by the model.

Attachments:

Image

Labels: live [Component] (This issue is related to live, voice and video chat)
