Collaborator

@rfejgin rfejgin commented Dec 20, 2025

In some cases we need to run inference on manifests that do not include context audio and/or ground truth audio:

  • Text context manifests may not include context audio (and it wouldn't be very relevant anyway)
  • When generating arbitrary text, the ground truth audio cannot be assumed to be available.

Note that even without the context and ground truth there are useful metrics that can be calculated, like WER, UTMOS, and inference speed.

This PR adapts the inference scripts and the data loader used by them to allow the absence of context and/or ground truth audio. When these are missing, the metrics that depend on them are set to 0.0.
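
As a rough illustration of what tolerating missing audio could look like on the data-loading side, here is a minimal sketch. The manifest keys (`text`, `audio_filepath`, `context_audio_filepath`) and the `load_entry` helper are hypothetical and are not taken from the NeMo code; the sketch only shows optional fields falling back to `None` instead of raising.

```python
import json
from typing import Optional, TypedDict


class Entry(TypedDict):
    text: str
    audio_filepath: Optional[str]          # ground truth audio, may be absent
    context_audio_filepath: Optional[str]  # context audio, may be absent


def load_entry(manifest_line: str) -> Entry:
    """Parse one JSON manifest line, tolerating missing audio fields.

    Hypothetical helper: the key names are assumptions for illustration.
    """
    record = json.loads(manifest_line)
    return Entry(
        text=record["text"],  # the text is still required
        audio_filepath=record.get("audio_filepath"),
        context_audio_filepath=record.get("context_audio_filepath"),
    )
```

Downstream code can then check for `None` before attempting to load audio or compute any metric that needs it.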

Note: this PR replaces an earlier PR that was based off magpietts_2508. We are now merging directly into main instead.

New command line argument: --datasets <dataset1,dataset2,...> where
dataset1, dataset2, ... are the names of the datasets to process in the
datasets_json_path file.

If not specified, all datasets in the datasets_json_path will be processed.
If specified, only the datasets in the list will be processed.

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
* Correctly handle comma-separated list of dataset names in the --datasets argument.
* Help text
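
A minimal sketch of how a comma-separated `--datasets` argument might be parsed and applied. Only the flag name and its "all datasets when unspecified" behaviour come from the description above; `parse_dataset_list`, `select_datasets`, `build_parser`, and the choice to raise on unknown names are assumptions made for illustration.

```python
import argparse
from typing import Dict, List, Optional


def parse_dataset_list(value: str) -> List[str]:
    """Split a comma-separated string into dataset names, dropping blanks."""
    return [name.strip() for name in value.split(",") if name.strip()]


def select_datasets(available: Dict[str, dict], requested: Optional[List[str]]) -> Dict[str, dict]:
    """Return every dataset when none were requested, else only the requested ones."""
    if requested is None:
        return dict(available)
    unknown = [name for name in requested if name not in available]
    if unknown:
        # Assumption: failing loudly on a misspelled name is preferable to silently skipping it.
        raise ValueError(f"Unknown dataset(s): {', '.join(unknown)}")
    return {name: available[name] for name in requested}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--datasets",
        type=parse_dataset_list,
        default=None,
        help="Comma-separated dataset names to process; all datasets if omitted.",
    )
    return parser
```

For example, `build_parser().parse_args(["--datasets", "dataset1,dataset2"]).datasets` yields a list of two names, which `select_datasets` can then use to filter the contents of the JSON file.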

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
In some cases we need to run inference on manifests that do not include context audio
and/or ground truth audio:

* Text context manifests may not have a context audio included (and it wouldn't be very
relevant anyway)
* When generating from arbitrary text the ground truth audio cannot be assumed to be
available. Note that even without the context and ground truth there are useful metrics
that can be calculated like WER, UTMOS, and inference speed.

This commit modifies the inference and dependent scripts to allow the absence of context
and/or ground truth audio.

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
@github-actions github-actions bot added the TTS label Dec 20, 2025
@rfejgin rfejgin requested review from blisc and subhankar-ghosh and removed request for subhankar-ghosh December 20, 2025 01:31
rfejgin added a commit to rfejgin/NeMo that referenced this pull request Dec 22, 2025
The removed changes are part of a separate PR:
NVIDIA-NeMo#15213
rfejgin added a commit to rfejgin/NeMo that referenced this pull request Dec 22, 2025
The removed changes are part of a separate PR:
NVIDIA-NeMo#15213

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
@rfejgin rfejgin force-pushed the magpietts_inference_without_ref_audio branch from bcc8ad4 to d1d73d9 Compare December 22, 2025 19:12
The removed changes are included in a separate PR:
NVIDIA-NeMo#15213

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
@rfejgin rfejgin force-pushed the magpietts_inference_without_ref_audio branch from d1d73d9 to 7c4098b Compare December 22, 2025 19:35
@blisc blisc added the Run CICD label Dec 30, 2025
Comment on lines 767 to 784
# Assert no more than one of audio or audio_filepath in the batch
if 'audio' in batch_dict:
    assert 'audio_filepath' not in batch_dict

# Assert only ONE of context_audio or context_audio_codes in the batch
# Assert no more than one of context_audio or context_audio_codes in the batch
if 'context_audio' in batch_dict:
    assert 'context_audio_codes' not in batch_dict
if 'context_audio_codes' in batch_dict:
    assert 'context_audio' not in batch_dict
Collaborator

The comments no longer match the logic, but the real question is why do we care that the dataset only returns one of audio vs audio_codes? Should we just remove these checks?

Collaborator Author

I suspect that these checks were put there because, if both the audio waveform and the audio codes are provided, it's ambiguous which of the two should be used by the model; we wouldn't want one part of the code choosing to use the audio and another to use the codes. So it simplifies things to know that at most one is present at any time.

About the comments not matching the logic: which part?

Collaborator

About the comments not matching the logic: which part?

I think I was mistaken; the comments do match the logic.

Now that I read this again, shouldn't "# Assert no more than one of audio or audio_filepath in the batch" say "audio or audio_codes" instead?

Collaborator Author

You are right, good catch!

Fixed.
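
The "at most one of" checks discussed in this thread could also be expressed with a small helper, sketched below. `assert_at_most_one` is a hypothetical name and is not part of the PR; the key names in the usage note are the ones mentioned in the thread.

```python
from typing import Any, Mapping


def assert_at_most_one(batch: Mapping[str, Any], *keys: str) -> None:
    """Raise if more than one of `keys` is present in `batch`.

    Zero matching keys is allowed, which is what lets a batch omit the
    audio entirely.
    """
    present = [key for key in keys if key in batch]
    if len(present) > 1:
        raise ValueError(f"Expected at most one of {keys}, found {present}")
```

Usage would look like `assert_at_most_one(batch_dict, 'audio', 'audio_codes')` and `assert_at_most_one(batch_dict, 'context_audio', 'context_audio_codes')`, which keeps each comment and its check in one place.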

- default SSIMs to NaN
- Set ground truth audio transcript to None if it is missing and subsequently skip WER and CER calculations.
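
One way to picture this change: metrics whose inputs are missing are reported as NaN rather than a real-looking number, and averages ignore the NaN entries. The sketch below is an assumption-laden illustration, not the PR's code; `word_error_rate`, `score_utterance`, and `nan_mean` are hypothetical helpers, and reporting a skipped WER as NaN is a choice made here for simplicity.

```python
import math
from typing import Dict, List, Optional


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float("nan")  # undefined for an empty reference
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def score_utterance(text: str, gt_transcript: Optional[str], ssim: Optional[float]) -> Dict[str, float]:
    """Score one utterance, using NaN for anything that cannot be computed."""
    return {
        # Skipped when there is no ground truth transcript.
        "gt_wer": word_error_rate(text, gt_transcript) if gt_transcript is not None else float("nan"),
        # SSIM needs reference audio, so it defaults to NaN when unavailable.
        "ssim": ssim if ssim is not None else float("nan"),
    }


def nan_mean(values: List[float]) -> float:
    """Mean over the non-NaN values; NaN if there are none."""
    finite = [v for v in values if not math.isnan(v)]
    return sum(finite) / len(finite) if finite else float("nan")
```

Compared with defaulting to 0.0, NaN keeps a missing measurement from being mistaken for a genuinely poor (or perfect) score when results are aggregated.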

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
@rfejgin rfejgin marked this pull request as draft January 6, 2026 21:45
@rfejgin rfejgin marked this pull request as ready for review January 6, 2026 22:08
@rfejgin rfejgin enabled auto-merge (squash) January 6, 2026 22:09