[TTS] Allow inference without reference audio #15213
New command line argument: --datasets <dataset1,dataset2,...> where dataset1, dataset2, ... are the names of the datasets to process from the datasets_json_path file. If not specified, all datasets in the datasets_json_path file are processed. If specified, only the listed datasets are processed. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
* Correctly handle comma-separated list of dataset names in the --datasets argument. * Help text Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
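A minimal sketch of how a comma-separated `--datasets` argument could be parsed and applied. This is not the PR's actual code: the helper names `parse_dataset_names` and `select_datasets` are hypothetical, and the sketch assumes the datasets JSON file loads into a dict keyed by dataset name.

```python
import argparse


def parse_dataset_names(value):
    """Split a comma-separated string into stripped, non-empty dataset names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def select_datasets(all_datasets, requested):
    """Return the datasets to process.

    `all_datasets` is assumed to be a dict keyed by dataset name (hypothetical layout).
    `requested` is None (process everything) or a list of names.
    """
    if requested is None:
        return dict(all_datasets)
    unknown = [name for name in requested if name not in all_datasets]
    if unknown:
        raise ValueError(f"Unknown dataset(s): {unknown}")
    return {name: all_datasets[name] for name in requested}


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--datasets",
        type=parse_dataset_names,
        default=None,
        help="Comma-separated dataset names to process; defaults to all datasets.",
    )
    return parser
```

Using `type=parse_dataset_names` keeps the splitting logic in one place, so the rest of the script only ever sees `None` or a list of names.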
In some cases we need to run inference on manifests that do not include context audio and/or ground truth audio:
* Text context manifests may not include context audio (and it wouldn't be very relevant anyway).
* When generating from arbitrary text, the ground truth audio cannot be assumed to be available.
Note that even without the context and ground truth audio there are useful metrics that can be calculated, like WER, UTMOS, and inference speed. This commit modifies the inference and dependent scripts to allow the absence of context and/or ground truth audio. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
The removed changes are part of a separate PR: NVIDIA-NeMo#15213 Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
bcc8ad4 to d1d73d9
The removed changes are included in a separate PR: NVIDIA-NeMo#15213 Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
d1d73d9 to 7c4098b
      # Assert no more than one of audio or audio_filepath in the batch
      if 'audio' in batch_dict:
          assert 'audio_filepath' not in batch_dict

    - # Assert only ONE of context_audio or context_audio_codes in the batch
    + # Assert no more than one of context_audio or context_audio_codes in the batch
      if 'context_audio' in batch_dict:
          assert 'context_audio_codes' not in batch_dict
      if 'context_audio_codes' in batch_dict:
          assert 'context_audio' not in batch_dict
The comments no longer match the logic, but the real question is why do we care that the dataset only returns one of audio vs audio_codes? Should we just remove these checks?
I suspect that these checks were put there because, if both the audio waveform and the audio codes are provided, it's ambiguous which of the two the model should use; we wouldn't want one part of the code choosing to use the audio and another to use the codes. So it simplifies things to know that at most one is present at any time.
About the comments not matching the logic: which part?
> About the comments not matching the logic: which part?
I think I was mistaken; the comments do match the logic.
Now that I read this again, shouldn't `# Assert no more than one of audio or audio_filepath in the batch` be `audio or audio_codes` instead?
You are right, good catch!
Fixed.
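For illustration, the "at most one of" checks discussed in this thread could be expressed with a single helper. This is only a sketch, not code from the PR: `assert_at_most_one` is a hypothetical name, and the key names are taken from the diff and the review comments above.

```python
def assert_at_most_one(batch_dict, *keys):
    """Assert that no more than one of `keys` is present in `batch_dict`.

    Returns the key that is present, or None if none of them are,
    so callers can also handle the "neither provided" case.
    """
    present = [key for key in keys if key in batch_dict]
    assert len(present) <= 1, f"Expected at most one of {keys}, found {present}"
    return present[0] if present else None


# Example: a batch with codes only, and a batch with no context audio at all.
batch_with_codes = {"text": "hello", "context_audio_codes": [1, 2, 3]}
batch_without_context = {"text": "hello"}

context_key = assert_at_most_one(batch_with_codes, "context_audio", "context_audio_codes")
no_context_key = assert_at_most_one(batch_without_context, "context_audio", "context_audio_codes")
```

Because the helper allows zero matches, it fits the goal of this PR, where context audio may be absent entirely, while still rejecting the ambiguous case where both the waveform and the codes are supplied.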
nemo/collections/tts/modules/magpietts_inference/evaluate_generated_audio.py (two outdated, resolved review threads)
- Default SSIMs to NaN.
- Set the ground truth audio transcript to None if it is missing, and subsequently skip the WER and CER calculations.
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
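A rough sketch of the behavior this commit message describes: metrics that depend on missing inputs default to NaN and are skipped rather than computed. It is an assumption-laden illustration, not the code in evaluate_generated_audio.py; the function names, the metric keys (`gt_wer`, `gt_cer`, `ssim`), and the plain edit-distance error rate are all hypothetical stand-ins.

```python
import math
from statistics import fmean


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length; NaN for an empty reference."""
    if not ref_tokens:
        return float("nan")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def score_sample(pred_text, gt_audio_text=None, speaker_similarity=None):
    """Score one sample; metrics whose inputs are missing stay NaN."""
    metrics = {"gt_wer": float("nan"), "gt_cer": float("nan"), "ssim": float("nan")}
    if gt_audio_text is not None:
        # Only computed when a ground truth audio transcript exists.
        metrics["gt_wer"] = error_rate(gt_audio_text.split(), pred_text.split())
        metrics["gt_cer"] = error_rate(list(gt_audio_text), list(pred_text))
    if speaker_similarity is not None:
        metrics["ssim"] = speaker_similarity
    return metrics


def nanmean(values):
    """Mean over the non-NaN values; NaN if every value is NaN."""
    finite = [v for v in values if not math.isnan(v)]
    return fmean(finite) if finite else float("nan")
```

Defaulting to NaN instead of 0.0 keeps "not computed" distinguishable from a genuine score of zero, and a NaN-aware mean keeps skipped samples from skewing the aggregate.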
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>
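In some cases we need to run inference on manifests that do not include context audio and/or ground truth audio:
* Text context manifests may not include context audio (and it wouldn't be very relevant anyway).
* When generating from arbitrary text, the ground truth audio cannot be assumed to be available.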
Note that even without the context and ground truth there are useful metrics that can be calculated, like WER, UTMOS, and inference speed.
This PR adapts the inference scripts and the data loader they use to allow the absence of context and/or ground truth audio. When these are missing, the metrics that depend on them are set to 0.0.
Note: this PR replaces an earlier PR that was based off `magpietts_2508`. We are now merging directly into `main` instead.