[QUESTION] SpeakersResult returns re-encoded speaker_identifiers — how to track changes? #98

@andycop

Context

We're using the RT transcription API with speaker recognition (known speakers via speaker_diarization_config.speakers). We submit known speakers with their stored speaker_identifiers in StartRecognition, and read back speaker_identifiers from the SpeakersResult message at end of session.
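
For reference, a minimal sketch of the message shapes involved. Only `StartRecognition`, `SpeakersResult`, `speaker_diarization_config`, `speakers`, `speaker_identifiers` and `get_speakers` come from the description above; every other key (`message`, `transcription_config`, `language`, `diarization`, `label`, and the top-level `speakers` list in the result) is an assumption made for illustration, and other fields a real session needs are omitted.

```python
import json


def build_start_recognition(known_speakers, language="en"):
    """Build a StartRecognition payload carrying known speakers.

    known_speakers maps a speaker label to its stored speaker_identifiers.
    The overall layout is assumed, not taken from the API reference.
    """
    speakers = [
        {"label": label, "speaker_identifiers": list(ids)}
        for label, ids in known_speakers.items()
    ]
    return {
        "message": "StartRecognition",  # assumed envelope key
        "transcription_config": {  # assumed wrapper
            "language": language,
            "diarization": "speaker",
            "speaker_diarization_config": {
                "speakers": speakers,
                "get_speakers": True,
            },
        },
    }


def parse_speakers_result(raw):
    """Return {label: [identifiers]} for a SpeakersResult, else None."""
    msg = json.loads(raw)
    if msg.get("message") != "SpeakersResult":
        return None
    return {s["label"]: s["speaker_identifiers"] for s in msg.get("speakers", [])}
```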

This is working well for identification, but we've hit a question about managing speaker_identifiers over time.

Observed Behaviour

When we submit known speakers with their stored speaker_identifiers, the SpeakersResult always returns:

  • The same number of speaker_identifiers per speaker as we submitted
  • Different byte values: the first ~97 bytes (apparently a format/header prefix) are identical, but the remaining voice-data bytes differ in every session
  • This happens even for speakers who did not speak during the session

Example: we submit 3 identifiers for Speaker A and 1 for Speaker B. We always get back exactly 3 for Speaker A and 1 for Speaker B, all with modified values.
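
The comparison behind these observations can be sketched roughly as follows. It assumes the identifiers are base64-encoded strings and that returned identifiers can be paired with submitted ones by position; neither is confirmed, and the function names are hypothetical.

```python
import base64


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compare_identifiers(submitted, returned):
    """Compare identifiers pairwise by position (assumed pairing)."""
    report = []
    for sent, got in zip(submitted, returned):
        s = base64.b64decode(sent)  # assumes base64 encoding
        g = base64.b64decode(got)
        report.append(
            {
                "sent_len": len(s),
                "got_len": len(g),
                "common_prefix": common_prefix_len(s, g),
                "identical": s == g,
            }
        )
    return report
```

Run against a stored identifier and its returned counterpart, a check like this is what shows a shared prefix followed by differing bytes.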

The Problem

Because the returned identifiers are always re-encoded, we can't distinguish between:

  1. Unchanged identifiers — voice data that was submitted and passed through (no new audio for this speaker)
  2. Updated identifiers — voice data that was refined with new audio from the session
  3. New identifiers — a genuinely new voice embedding captured from session audio

This makes it impossible to maintain a reliable speaker identifier set over time. Specifically:

  • We can't tell if a returned identifier is "better" than what we sent (should we replace?)
  • We can't detect when a new identifier has been captured vs an existing one re-encoded
  • If a speaker is misidentified and we correct it, we can't safely move identifiers to the correct speaker because we don't know which are real vs re-encoded copies of the wrong speaker's data

Use Case

We store multiple speaker_identifiers per speaker to improve recognition across different contexts (different microphones, in-person vs remote, etc.). When a user corrects a misidentification, we need to know which identifiers to move to the correct speaker profile.
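
A minimal sketch of the bookkeeping this use case implies, with entirely hypothetical names (`StoredIdentifier`, `ProfileStore`, `move_session`). The `origin` and `session_id` fields stand in for exactly the information we cannot currently fill in reliably for returned identifiers, so this illustrates the goal rather than a working solution.

```python
from dataclasses import dataclass, field


@dataclass
class StoredIdentifier:
    value: str       # opaque speaker_identifier as stored
    session_id: str  # session the identifier was captured in
    origin: str      # "enrolled" or "returned" (hypothetical tags)


@dataclass
class SpeakerProfile:
    label: str
    identifiers: list = field(default_factory=list)


class ProfileStore:
    def __init__(self):
        self.profiles = {}

    def add(self, label, value, session_id, origin):
        profile = self.profiles.setdefault(label, SpeakerProfile(label))
        profile.identifiers.append(StoredIdentifier(value, session_id, origin))

    def move_session(self, session_id, wrong_label, right_label):
        """Move identifiers captured in one session to the corrected speaker."""
        src = self.profiles.get(wrong_label)
        if src is None:
            return 0
        keep, moved = [], []
        for ident in src.identifiers:
            captured_here = ident.session_id == session_id and ident.origin == "returned"
            (moved if captured_here else keep).append(ident)
        src.identifiers = keep
        dst = self.profiles.setdefault(right_label, SpeakerProfile(right_label))
        dst.identifiers.extend(moved)
        return len(moved)
```

The move step is only safe if a "returned" identifier really was derived from that session's audio, which is the distinction the questions below ask about.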

Questions

  1. Why are returned identifiers re-encoded? Is there a session-specific salt/nonce in the encoding, or are they genuinely refined each time?
  2. Is there a way to get identifiers returned unchanged so we can track which ones were updated vs passed through?
  3. When a known speaker speaks during a session, does the returned set include any new identifier derived from the session audio, or is it always the same count as submitted?
  4. What is the recommended strategy for maintaining a speaker's identifier set over time — should we replace stored identifiers with returned ones, or keep the originals?

Environment

  • API: Real-time transcription WebSocket API, called directly rather than through the Python SDK
  • Feature: speaker_diarization_config with speakers array and get_speakers: true

Any guidance would be really appreciated — this is blocking our speaker profile management feature.
