[TTS] Allow inference without reference audio #15213

rfejgin · 2025-12-20T01:30:53Z

In some cases we need to run inference on manifests that do not include context audio and/or ground truth audio:

* Text context manifests may not have a context audio included (and it wouldn't be very relevant anyway).
* When generating arbitrary text, the ground truth audio cannot be assumed to be available.

Note that even without the context and ground truth there are useful metrics that can be calculated, like WER, UTMOS, and inference speed.

This PR adapts the inference scripts and the data loader they use to allow the absence of context and/or ground truth audio. When either is missing, the metrics that depend on it are set to 0.0.
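As a rough illustration of the fallback described above, the sketch below computes reference-dependent metrics only when the corresponding audio path is present. The `ManifestEntry` fields, the metric names, and `similarity_fn` are hypothetical stand-ins and are not taken from the NeMo code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ManifestEntry:
    # Hypothetical record layout; the real manifest fields may differ.
    text: str
    audio_filepath: Optional[str] = None          # ground truth audio, may be absent
    context_audio_filepath: Optional[str] = None  # context audio, may be absent


def reference_metrics(
    entry: ManifestEntry,
    generated_path: str,
    similarity_fn: Callable[[str, str], float],
    missing_value: float = 0.0,
) -> Dict[str, float]:
    """Compute reference-dependent metrics, using `missing_value` when a reference is absent."""
    # Start from the fallback so that every key is always present in the result.
    metrics = {
        "similarity_to_ground_truth": missing_value,
        "similarity_to_context": missing_value,
    }
    if entry.audio_filepath is not None:
        metrics["similarity_to_ground_truth"] = similarity_fn(generated_path, entry.audio_filepath)
    if entry.context_audio_filepath is not None:
        metrics["similarity_to_context"] = similarity_fn(generated_path, entry.context_audio_filepath)
    return metrics
```

Metrics that need no reference, such as WER against the input text, UTMOS, and inference speed, would be computed outside this helper regardless of which paths are present.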

Note: this PR replaces an earlier PR that was based off magpietts_2508. We are now merging directly into main instead.

New command line argument: --datasets <dataset1,dataset2,...> where dataset1, dataset2, ... are the names of the datasets to process in the datasets_json_path file. If not specified, all datasets in the datasets_json_path file will be processed. If specified, only the datasets in the list will be processed. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

* Correctly handle comma-separated list of dataset names in the --datasets argument.
* Help text

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
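A minimal sketch of how a comma-separated --datasets value could be parsed and applied as a filter. Only the --datasets flag and the "process everything when unspecified" behavior come from the description above; the helper names, the dictionary layout of the datasets file, and the choice to raise on unknown names are assumptions made for illustration.

```python
import argparse
from typing import Dict, List, Optional


def parse_dataset_names(raw: Optional[str]) -> Optional[List[str]]:
    """Split a comma-separated string into names, trimming whitespace and dropping blanks."""
    if raw is None:
        return None
    return [name.strip() for name in raw.split(",") if name.strip()]


def select_datasets(all_datasets: Dict[str, dict], requested: Optional[List[str]]) -> Dict[str, dict]:
    """Return only the requested datasets, or all of them when nothing was requested."""
    if requested is None:
        return dict(all_datasets)
    # Assumption: an unknown name is treated as an error rather than silently ignored.
    unknown = [name for name in requested if name not in all_datasets]
    if unknown:
        raise ValueError(f"Unknown dataset(s): {unknown}")
    return {name: all_datasets[name] for name in requested}


parser = argparse.ArgumentParser()
parser.add_argument(
    "--datasets",
    type=str,
    default=None,
    help="Comma-separated names of datasets to process; all are processed if omitted.",
)

# Stand-in for the parsed contents of the datasets_json_path file.
available = {"dataset1": {}, "dataset2": {}, "dataset3": {}}
args = parser.parse_args(["--datasets", "dataset1, dataset2"])
selected = select_datasets(available, parse_dataset_names(args.datasets))
```

Trimming each name lets a value such as "dataset1, dataset2" behave the same as "dataset1,dataset2".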

In some cases we need to run inference on manifests that do not include context audio and/or ground truth audio:

* Text context manifests may not have a context audio included (and it wouldn't be very relevant anyway)
* When generating from arbitrary text the ground truth audio cannot be assumed to be available.

Note that even without the context and ground truth there are useful metrics that can be calculated like WER, UTMOS, and inference speed. This commit modified the inference and dependent scripts to allow the absence of context and/or ground truth audio.

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

The removed changes are part of a separate PR: NVIDIA-NeMo#15213

The removed changes are part of a separate PR: NVIDIA-NeMo#15213 Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

The removed changes are included in a separate PR: NVIDIA-NeMo#15213 Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

blisc · 2025-12-30T21:17:14Z

nemo/collections/tts/data/text_to_speech_dataset.py

+        # Assert no more than one of audio or audio_filepath in the batch
+        if 'audio' in batch_dict:
+            assert 'audio_filepath' not in batch_dict

-        # Assert only ONE of context_audio or context_audio_codes in the batch
+        # Assert no more than one of context_audio or context_audio_codes in the batch
        if 'context_audio' in batch_dict:
            assert 'context_audio_codes' not in batch_dict
-        if 'context_audio_codes' in batch_dict:
-            assert 'context_audio' not in batch_dict


The comments no longer match the logic, but the real question is why do we care that the dataset only returns one of audio vs audio_codes? Should we just remove these checks?

I suspect these checks were added because, if both the audio waveform and the audio codes are provided, it's ambiguous which of the two the model should use; we wouldn't want one part of the code choosing to use the audio and another to use the codes. So it simplifies things to know that at most one is present at any time.

About the comments not matching the logic: which part?

I think I was mistaken; the comments do match the logic.

Now that I read this again, shouldn't "# Assert no more than one of audio or audio_filepath in the batch" say "audio or audio_codes" instead?

You are right, good catch!

Fixed.
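The "at most one of" checks discussed in this thread can be expressed with a small helper. This is only a sketch: `assert_at_most_one` is a hypothetical name and does not appear in the diff; the key names are the ones mentioned in the thread.

```python
def assert_at_most_one(batch_dict, *keys):
    """Fail if more than one of `keys` is present in `batch_dict`; zero or one is fine."""
    present = [key for key in keys if key in batch_dict]
    assert len(present) <= 1, f"Expected at most one of {keys}, found {present}"


# A batch may carry a waveform or precomputed codes, or neither, but not both.
batch = {"context_audio_codes": [1, 2, 3]}
assert_at_most_one(batch, "audio", "audio_codes")
assert_at_most_one(batch, "context_audio", "context_audio_codes")
```

Because the condition is symmetric, one check per key pair is enough, which is consistent with the diff dropping the second, mirrored assertion.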

nemo/collections/tts/modules/magpietts_inference/evaluate_generated_audio.py

- default SSIMs to NaN
- Set ground truth audio transcript to None if it is missing and subsequently skip WER and CER calculations.

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
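One possible motivation for a NaN default is that a NaN-aware average can ignore unmeasured utterances, whereas a 0.0 placeholder pulls the average down. The sketch below illustrates that, along with skipping an error-rate calculation when the ground truth transcript is None. The function names and `error_rate_fn` are hypothetical and are not the code in this PR.

```python
import math
from typing import Callable, List, Optional


def nan_mean(values: List[float]) -> float:
    """Average only the measured (non-NaN) entries; return NaN if none were measured."""
    measured = [v for v in values if not math.isnan(v)]
    return sum(measured) / len(measured) if measured else float("nan")


def gt_transcript_error(
    predicted_text: str,
    gt_audio_transcript: Optional[str],
    error_rate_fn: Callable[[str, str], float],
) -> Optional[float]:
    """Return an error rate against the ground truth transcript, or None when it is missing."""
    if gt_audio_transcript is None:
        return None  # skip the calculation instead of scoring against an empty reference
    return error_rate_fn(predicted_text, gt_audio_transcript)


# The second utterance has no reference audio, so its similarity is NaN.
ssims = [0.75, float("nan"), 0.25]
mean_ignoring_missing = nan_mean(ssims)

# Substituting 0.0 for the missing value lowers the average instead.
zero_filled = [0.0 if math.isnan(v) else v for v in ssims]
mean_with_zero_fill = sum(zero_filled) / len(zero_filled)
```

With these inputs the NaN-aware mean covers only the two measured utterances, while the zero-filled mean is divided across all three.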

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

rfejgin added 3 commits December 19, 2025 15:57

Refined datasets filtering in the inference script

10a838b

* Correctly handle comma-separated list of dataset names in the --datasets argument.
* Help text

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

github-actions bot added the TTS label Dec 20, 2025

rfejgin requested review from blisc and subhankar-ghosh and removed request for subhankar-ghosh December 20, 2025 01:31

rfejgin added a commit to rfejgin/NeMo that referenced this pull request Dec 22, 2025

Remove changes not related to this PR

bcc8ad4

The removed changes are part of a separate PR: NVIDIA-NeMo#15213

rfejgin added a commit to rfejgin/NeMo that referenced this pull request Dec 22, 2025

Remove changes not related to this PR

d1d73d9

The removed changes are part of a separate PR: NVIDIA-NeMo#15213 Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

rfejgin force-pushed the magpietts_inference_without_ref_audio branch from bcc8ad4 to d1d73d9 December 22, 2025 19:12

Remove changes not related to this PR

7c4098b

The removed changes are included in a separate PR: NVIDIA-NeMo#15213 Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

rfejgin force-pushed the magpietts_inference_without_ref_audio branch from d1d73d9 to 7c4098b December 22, 2025 19:35

blisc added the Run CICD label Dec 30, 2025

blisc requested changes Dec 30, 2025

Merge remote-tracking branch 'nemo/main' into magpietts_inference_without_ref_audio

683a750

chtruong814 added Run CICD and removed Run CICD labels Jan 5, 2026

chtruong814 had a problem deploying to test January 5, 2026 23:25 — with GitHub Actions Error

Address PR comments

9301912

- default SSIMs to NaN
- Set ground truth audio transcript to None if it is missing and subsequently skip WER and CER calculations.

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

chtruong814 added Run CICD and removed Run CICD labels Jan 6, 2026

chtruong814 temporarily deployed to test January 6, 2026 05:21 — with GitHub Actions Inactive

Correct an assertion per PR comments

1827143

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

chtruong814 added Run CICD and removed Run CICD labels Jan 6, 2026

chtruong814 requested a deployment to test January 6, 2026 21:11 — with GitHub Actions Waiting

blisc approved these changes Jan 6, 2026

rfejgin marked this pull request as draft January 6, 2026 21:45

rfejgin marked this pull request as ready for review January 6, 2026 22:08

rfejgin enabled auto-merge (squash) January 6, 2026 22:09
