canary-1b-v2 enters stale status in streaming mode #15231

@livefantasia

Description

Describe the bug
Following the instructions in https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/streaming_decoding/canary_chunked_and_streaming_decoding.html
to perform inference in streaming mode, no transcription is generated past a certain point (20~40 seconds in) for a 4-minute audio file.
Script as below:
```shell
uv run --with "ml-dtypes==0.5.4" --with "onnx==1.20.0" --with "numpy<2.4" \
    python examples/asr/asr_chunked_inference/aed/speech_to_text_aed_streaming_infer.py \
    pretrained_name="nvidia/canary-1b-v2" \
    audio_dir="inference_input" \
    output_filename="inference_output/submission.json" \
    batch_size=1 \
    num_workers=0 \
    debug_mode=True \
    chunk_secs=4.0 \
    left_context_secs=10.0 \
    right_context_secs=2 \
    decoding.streaming_policy="alignatt" \
    decoding.alignatt_thr=8 \
    decoding.exclude_sink_frames=8 \
    decoding.xatt_scores_layer=-2 \
    decoding.hallucinations_detector=True \
    calculate_latency=False \
    allow_mps=False \
    +prompt.pnc="yes" \
    +prompt.task="asr" \
    +prompt.source_lang="en" \
    +prompt.target_lang="en"
```

Steps/Code to reproduce bug

  1. Clone the NeMo repo
  2. Set up a venv with uv
  3. Put a 4-minute audio file in the input directory
  4. Run the script above for transcription
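For step 3, any 4-minute 16 kHz mono WAV in the input directory will do. A hypothetical helper to synthesize one with the standard library (the file name and the 440 Hz tone are arbitrary choices; real speech is needed to actually observe the missing transcription, a tone only exercises the pipeline):

```python
import math
import wave
from array import array

SAMPLE_RATE = 16000        # matches the model's expected input rate
DURATION_SECS = 4 * 60     # ~4 minutes, the length that triggers the stall

def write_test_wav(path: str) -> None:
    """Write a 4-minute, 16 kHz, mono, 16-bit PCM WAV containing a 440 Hz tone."""
    samples = array("h", (
        int(8000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
        for n in range(SAMPLE_RATE * DURATION_SECS)
    ))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)              # mono
        wf.setsampwidth(2)              # 16-bit PCM
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(samples.tobytes())

write_test_wav("test_4min.wav")
```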

Expected behavior

The full 4-minute audio file should be transcribed end to end, with text emitted for every chunk rather than stopping partway through.

Environment details
Use the standard pyproject.toml in the NeMo repo

Additional context
I tried both the alignatt and waitk policies, with the same result: transcription stops automatically after 20~30 seconds. My own tracing shows the alignatt policy threshold is never met after a certain point.
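For context on that tracing claim: under an AlignAtt-style policy the decoder emits a token only while the most-attended encoder frame stays far enough from the right edge of the available context; if cross-attention sticks to the newest frames, no token ever clears the check and the stream goes stale. A minimal sketch of that gating logic (illustrative only, not NeMo's actual implementation; parameter names mirror the config above):

```python
def alignatt_allows_emission(xatt_scores, alignatt_thr=8, exclude_sink_frames=8):
    """Illustrative AlignAtt-style gate: emit a token only when the
    most-attended encoder frame is at least `alignatt_thr` frames away
    from the right edge of the available context.
    `xatt_scores` is one decoder step's attention over encoder frames."""
    usable = xatt_scores[exclude_sink_frames:]   # drop attention-sink frames
    peak = max(range(len(usable)), key=usable.__getitem__)
    return peak < len(usable) - alignatt_thr

# Attention peaked mid-context: a token may be emitted.
print(alignatt_allows_emission([0.0] * 8 + [0.1, 0.8, 0.1] + [0.0] * 20))  # True
# Attention stuck on the newest frames: the policy waits forever -> stale stream.
print(alignatt_allows_emission([0.0] * 8 + [0.0] * 20 + [0.1, 0.8, 0.1]))  # False
```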
Inference log below:

(TraeAI-5) ~/DevLibrary/NeMo [0] $ ./run_streaming_inference.sh
[NeMo W 2025-12-25 22:11:22 megatron_init:62] Megatron num_microbatches_calculator not found, using Apex version.
W1225 22:11:22.040000 81677 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
OneLogger: Setting error_handling_strategy to DISABLE_QUIETLY_AND_REPORT_METRIC_ERROR for rank (rank=0) with OneLogger disabled. To override: explicitly set error_handling_strategy parameter.
No exporters were provided. This means that no telemetry data will be collected.
[NeMo W 2025-12-25 22:11:22 nemo_logging:364] /Users/xxxx/DevLibrary/NeMo/.venv/lib/python3.11/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(

[NeMo I 2025-12-25 22:11:22 speech_to_text_aed_streaming_infer:168] Hydra config: model_path: null
pretrained_name: nvidia/canary-1b-v2
audio_dir: inference_input
dataset_manifest: null
output_filename: inference_output/submission.json
batch_size: 1
num_workers: 0
random_seed: null
chunk_secs: 4.0
left_context_secs: 10.0
right_context_secs: 2.0
cuda: null
allow_mps: false
compute_dtype: null
matmul_precision: high
audio_type: wav
sort_input_manifest: true
overwrite_transcripts: true
decoding:
streaming_policy: alignatt
alignatt_thr: 8.0
waitk_lagging: 2
exclude_sink_frames: 8
xatt_scores_layer: -2
max_tokens_per_alignatt_step: 30
max_generation_length: 512
use_avgpool_for_alignatt: false
hallucinations_detector: true
calculate_wer: true
calculate_bleu: false
calculate_latency: false
clean_groundtruth_text: false
ignore_capitalization: true
ignore_punctuation: true
langid: en
use_cer: false
presort_manifest: true
return_hypotheses: false
channel_selector: null
gt_text_attr_name: text
gt_lang_attr_name: source_lang
timestamps: false
prompt:
pnc: 'yes'
task: asr
source_lang: en
target_lang: en
debug_mode: true

[NeMo I 2025-12-25 22:11:22 speech_to_text_aed_streaming_infer:195] Inference will be done on device : cpu with compute_dtype: torch.float32
[NeMo I 2025-12-25 22:11:25 mixins:184] Tokenizer CanaryBPETokenizer initialized with 16384 tokens
[NeMo W 2025-12-25 22:11:25 modelPT:188] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
use_lhotse: true
skip_missing_manifest_entries: true
input_cfg: null
tarred_audio_filepaths: null
manifest_filepath: null
sample_rate: 16000
shuffle: true
num_workers: 4
pin_memory: true
prompt_format: canary2
max_duration: 40.0
min_duration: 0.01
text_field: answer
lang_field: target_lang
use_bucketing: true
max_tps: null
bucket_duration_bins: null
bucket_batch_size: null
num_buckets: null
bucket_buffer_size: 20000
shuffle_buffer_size: 10000

[NeMo W 2025-12-25 22:11:25 modelPT:195] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
use_lhotse: true
prompt_format: canary2
manifest_filepath: null
sample_rate: 16000
batch_size: 4
shuffle: true
max_duration: 40.0
min_duration: 0.1
num_workers: 2
pin_memory: true
text_field: answer
lang_field: target_lang

Error getting class at nemo.collections.asr.modules.transformer.get_nemo_transformer: Located non-class of type 'function' while loading 'nemo.collections.asr.modules.transformer.get_nemo_transformer'
[NeMo I 2025-12-25 22:11:31 mixins:184] Tokenizer SentencePieceTokenizer initialized with 16384 tokens
[NeMo W 2025-12-25 22:11:32 modelPT:188] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
use_lhotse: true
skip_missing_manifest_entries: true
input_cfg: null
tarred_audio_filepaths: null
manifest_filepath: null
sample_rate: 16000
shuffle: true
num_workers: 2
pin_memory: true
max_duration: 40.0
min_duration: 0.1
text_field: answer
batch_duration: null
max_tps: null
use_bucketing: true
bucket_duration_bins: null
bucket_batch_size: null
num_buckets: null
bucket_buffer_size: 20000
shuffle_buffer_size: 10000

[NeMo W 2025-12-25 22:11:32 modelPT:195] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
use_lhotse: true
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: false
max_duration: 40.0
min_duration: 0.1
num_workers: 2
pin_memory: true
text_field: answer

[NeMo I 2025-12-25 22:11:36 save_restore_connector:284] Model EncDecCTCModelBPE was successfully restored from /Users/xxxx/.cache/huggingface/hub/models--nvidia--canary-1b-v2/snapshots/87bc52657add533cd0156b3fc1aef027280754bf/canary-1b-v2.nemo.
[NeMo I 2025-12-25 22:11:38 save_restore_connector:284] Model EncDecMultiTaskModel was successfully restored from /Users/xxxx/.cache/huggingface/hub/models--nvidia--canary-1b-v2/snapshots/87bc52657add533cd0156b3fc1aef027280754bf/canary-1b-v2.nemo.
[NeMo I 2025-12-25 22:11:38 aed_multitask_models:292] Changed decoding strategy to
strategy: greedy
compute_hypothesis_token_set: false
preserve_alignments: null
confidence_cfg:
preserve_frame_confidence: false
preserve_token_confidence: false
preserve_word_confidence: false
exclude_blank: true
aggregation: min
tdt_include_duration: false
method_cfg:
name: entropy
entropy_type: tsallis
alpha: 0.33
entropy_norm: exp
temperature: DEPRECATED
compute_langs: false
greedy:
temperature: null
max_generation_delta: -1
preserve_alignments: false
preserve_token_confidence: false
confidence_method_cfg:
name: entropy
entropy_type: tsallis
alpha: 0.33
entropy_norm: exp
temperature: DEPRECATED
n_samples: 1
beam:
beam_size: 1
search_type: default
len_pen: 1.0
max_generation_delta: -1
return_best_hypothesis: true
preserve_alignments: false
ngram_lm_model: null
ngram_lm_alpha: 0.0
boosting_tree:
model_path: null
key_phrases_file: null
key_phrases_list: null
context_score: 1.0
depth_scaling: 1.0
unk_score: 0.0
final_eos_score: 1.0
score_per_phrase: 0.0
source_lang: en
use_triton: true
uniform_weights: false
use_bpe_dropout: false
num_of_transcriptions: 5
bpe_alpha: 0.3
boosting_tree_alpha: 0.0
temperature: 1.0

[NeMo I 2025-12-25 22:11:38 speech_to_text_aed_streaming_infer:275] Corrected contexts (sec): Left 10.00, Chunk 4.00, Right 2.00
[NeMo I 2025-12-25 22:11:38 speech_to_text_aed_streaming_infer:281] Corrected contexts (subsampled encoder frames): Left 125 - Chunk 50 - Right 25
[NeMo I 2025-12-25 22:11:38 speech_to_text_aed_streaming_infer:282] Corrected contexts (in audio samples): Left 160000 - Chunk 64000 - Right 32000
[NeMo I 2025-12-25 22:11:38 speech_to_text_aed_streaming_infer:284] Theoretical latency: 6.00 seconds
0%| | 0/1 [00:00<?, ?it/s][NeMo I 2025-12-25 22:11:40 speech_to_text_aed_streaming_infer:401] Processed chunk 1, current sample position: 96000/3849600
[NeMo I 2025-12-25 22:11:42 speech_to_text_aed_streaming_infer:401] Processed chunk 2, current sample position: 160000/3849600
[NeMo I 2025-12-25 22:11:44 speech_to_text_aed_streaming_infer:401] Processed chunk 3, current sample position: 224000/3849600
[NeMo I 2025-12-25 22:11:46 speech_to_text_aed_streaming_infer:401] Processed chunk 4, current sample position: 288000/3849600
[NeMo I 2025-12-25 22:11:48 speech_to_text_aed_streaming_infer:401] Processed chunk 5, current sample position: 352000/3849600
[NeMo I 2025-12-25 22:11:50 aed_batched_streaming:368] !!! hallucination 'a b a b a b' detected !!!
[NeMo I 2025-12-25 22:11:50 speech_to_text_aed_streaming_infer:401] Processed chunk 6, current sample position: 416000/3849600
[NeMo I 2025-12-25 22:11:52 speech_to_text_aed_streaming_infer:401] Processed chunk 7, current sample position: 480000/3849600
[NeMo I 2025-12-25 22:11:54 speech_to_text_aed_streaming_infer:401] Processed chunk 8, current sample position: 544000/3849600
[NeMo I 2025-12-25 22:11:55 speech_to_text_aed_streaming_infer:401] Processed chunk 9, current sample position: 608000/3849600
[NeMo I 2025-12-25 22:11:57 speech_to_text_aed_streaming_infer:401] Processed chunk 10, current sample position: 672000/3849600
[NeMo I 2025-12-25 22:11:59 speech_to_text_aed_streaming_infer:401] Processed chunk 11, current sample position: 736000/3849600
[NeMo I 2025-12-25 22:12:01 speech_to_text_aed_streaming_infer:401] Processed chunk 12, current sample position: 800000/3849600
[NeMo I 2025-12-25 22:12:02 speech_to_text_aed_streaming_infer:401] Processed chunk 13, current sample position: 864000/3849600
[NeMo I 2025-12-25 22:12:04 speech_to_text_aed_streaming_infer:401] Processed chunk 14, current sample position: 928000/3849600
[NeMo I 2025-12-25 22:12:06 speech_to_text_aed_streaming_infer:401] Processed chunk 15, current sample position: 992000/3849600
[NeMo I 2025-12-25 22:12:08 speech_to_text_aed_streaming_infer:401] Processed chunk 16, current sample position: 1056000/3849600
[NeMo I 2025-12-25 22:12:09 speech_to_text_aed_streaming_infer:401] Processed chunk 17, current sample position: 1120000/3849600
[NeMo I 2025-12-25 22:12:11 speech_to_text_aed_streaming_infer:401] Processed chunk 18, current sample position: 1184000/3849600
[NeMo I 2025-12-25 22:12:13 speech_to_text_aed_streaming_infer:401] Processed chunk 19, current sample position: 1248000/3849600
[NeMo I 2025-12-25 22:12:14 speech_to_text_aed_streaming_infer:401] Processed chunk 20, current sample position: 1312000/3849600
[NeMo I 2025-12-25 22:12:16 speech_to_text_aed_streaming_infer:401] Processed chunk 21, current sample position: 1376000/3849600
[NeMo I 2025-12-25 22:12:18 speech_to_text_aed_streaming_infer:401] Processed chunk 22, current sample position: 1440000/3849600
[NeMo I 2025-12-25 22:12:20 speech_to_text_aed_streaming_infer:401] Processed chunk 23, current sample position: 1504000/3849600
[NeMo I 2025-12-25 22:12:21 speech_to_text_aed_streaming_infer:401] Processed chunk 24, current sample position: 1568000/3849600
[NeMo I 2025-12-25 22:12:23 speech_to_text_aed_streaming_infer:401] Processed chunk 25, current sample position: 1632000/3849600
[NeMo I 2025-12-25 22:12:25 speech_to_text_aed_streaming_infer:401] Processed chunk 26, current sample position: 1696000/3849600
[NeMo I 2025-12-25 22:12:26 speech_to_text_aed_streaming_infer:401] Processed chunk 27, current sample position: 1760000/3849600
[NeMo I 2025-12-25 22:12:28 speech_to_text_aed_streaming_infer:401] Processed chunk 28, current sample position: 1824000/3849600
[NeMo I 2025-12-25 22:12:30 speech_to_text_aed_streaming_infer:401] Processed chunk 29, current sample position: 1888000/3849600
[NeMo I 2025-12-25 22:12:32 speech_to_text_aed_streaming_infer:401] Processed chunk 30, current sample position: 1952000/3849600
[NeMo I 2025-12-25 22:12:33 speech_to_text_aed_streaming_infer:401] Processed chunk 31, current sample position: 2016000/3849600
[NeMo I 2025-12-25 22:12:35 speech_to_text_aed_streaming_infer:401] Processed chunk 32, current sample position: 2080000/3849600
[NeMo I 2025-12-25 22:12:37 speech_to_text_aed_streaming_infer:401] Processed chunk 33, current sample position: 2144000/3849600
[NeMo I 2025-12-25 22:12:38 speech_to_text_aed_streaming_infer:401] Processed chunk 34, current sample position: 2208000/3849600
[NeMo I 2025-12-25 22:12:40 speech_to_text_aed_streaming_infer:401] Processed chunk 35, current sample position: 2272000/3849600
[NeMo I 2025-12-25 22:12:42 speech_to_text_aed_streaming_infer:401] Processed chunk 36, current sample position: 2336000/3849600
[NeMo I 2025-12-25 22:12:44 speech_to_text_aed_streaming_infer:401] Processed chunk 37, current sample position: 2400000/3849600
[NeMo I 2025-12-25 22:12:45 speech_to_text_aed_streaming_infer:401] Processed chunk 38, current sample position: 2464000/3849600
[NeMo I 2025-12-25 22:12:47 speech_to_text_aed_streaming_infer:401] Processed chunk 39, current sample position: 2528000/3849600
[NeMo I 2025-12-25 22:12:49 speech_to_text_aed_streaming_infer:401] Processed chunk 40, current sample position: 2592000/3849600
[NeMo I 2025-12-25 22:12:50 speech_to_text_aed_streaming_infer:401] Processed chunk 41, current sample position: 2656000/3849600
[NeMo I 2025-12-25 22:12:52 speech_to_text_aed_streaming_infer:401] Processed chunk 42, current sample position: 2720000/3849600
[NeMo I 2025-12-25 22:12:54 speech_to_text_aed_streaming_infer:401] Processed chunk 43, current sample position: 2784000/3849600
[NeMo I 2025-12-25 22:12:56 speech_to_text_aed_streaming_infer:401] Processed chunk 44, current sample position: 2848000/3849600
[NeMo I 2025-12-25 22:12:57 speech_to_text_aed_streaming_infer:401] Processed chunk 45, current sample position: 2912000/3849600
[NeMo I 2025-12-25 22:12:59 speech_to_text_aed_streaming_infer:401] Processed chunk 46, current sample position: 2976000/3849600
[NeMo I 2025-12-25 22:13:01 speech_to_text_aed_streaming_infer:401] Processed chunk 47, current sample position: 3040000/3849600
[NeMo I 2025-12-25 22:13:02 speech_to_text_aed_streaming_infer:401] Processed chunk 48, current sample position: 3104000/3849600
[NeMo I 2025-12-25 22:13:04 speech_to_text_aed_streaming_infer:401] Processed chunk 49, current sample position: 3168000/3849600
[NeMo I 2025-12-25 22:13:06 speech_to_text_aed_streaming_infer:401] Processed chunk 50, current sample position: 3232000/3849600
[NeMo I 2025-12-25 22:13:07 speech_to_text_aed_streaming_infer:401] Processed chunk 51, current sample position: 3296000/3849600
[NeMo I 2025-12-25 22:13:09 speech_to_text_aed_streaming_infer:401] Processed chunk 52, current sample position: 3360000/3849600
[NeMo I 2025-12-25 22:13:11 speech_to_text_aed_streaming_infer:401] Processed chunk 53, current sample position: 3424000/3849600
[NeMo I 2025-12-25 22:13:13 speech_to_text_aed_streaming_infer:401] Processed chunk 54, current sample position: 3488000/3849600
[NeMo I 2025-12-25 22:13:14 speech_to_text_aed_streaming_infer:401] Processed chunk 55, current sample position: 3552000/3849600
[NeMo I 2025-12-25 22:13:16 speech_to_text_aed_streaming_infer:401] Processed chunk 56, current sample position: 3616000/3849600
[NeMo I 2025-12-25 22:13:18 speech_to_text_aed_streaming_infer:401] Processed chunk 57, current sample position: 3680000/3849600
[NeMo I 2025-12-25 22:13:19 speech_to_text_aed_streaming_infer:401] Processed chunk 58, current sample position: 3744000/3849600
[NeMo I 2025-12-25 22:13:21 speech_to_text_aed_streaming_infer:401] Processed chunk 59, current sample position: 3808000/3849600
[NeMo I 2025-12-25 22:13:23 speech_to_text_aed_streaming_infer:401] Processed chunk 60, current sample position: 3849600/3849600
[NeMo I 2025-12-25 22:13:23 speech_to_text_aed_streaming_infer:409] i: Absolutely, so many emotions, the full range, excitements, a little bit of surprise, joy. People were totally overwhelmed, guys. I'm truly struggling to find the words to appropriately describe that moment. A, when the white smoke came out, and then B, B, B
100%|███████████████████████████████| 1/1 [01:44<00:00, 104.58s/it]
[NeMo I 2025-12-25 22:13:23 speech_to_text_aed_streaming_infer:427] Finished writing predictions to inference_output/submission.json!
[NeMo I 2025-12-25 22:13:23 eval_utils:190] ground-truth text attribute text is not present in manifest! Cannot calculate WER. Returning!
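The context sizes printed in the log above are self-consistent with 16 kHz audio and an 8x-subsampled encoder (80 ms per encoder frame), so the stall does not look like a chunking-arithmetic problem. A quick check, assuming those rates:

```python
SAMPLE_RATE = 16000   # Hz, the model's input rate
FRAME_SECS = 0.08     # 80 ms per subsampled encoder frame (8x over 10 ms features)

# Reproduce the "Corrected contexts" lines from the log.
for name, secs in [("Left", 10.0), ("Chunk", 4.0), ("Right", 2.0)]:
    frames = int(secs / FRAME_SECS)
    samples = int(secs * SAMPLE_RATE)
    print(f"{name}: {frames} frames, {samples} samples")

# Theoretical latency = chunk + right context, as reported in the log.
print(f"Theoretical latency: {4.0 + 2.0:.2f} seconds")
```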
