canary-1b-v2 enters stale status in streaming mode #15231

@livefantasia

Description

Describe the bug
Following the instructions in https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/streaming_decoding/canary_chunked_and_streaming_decoding.html
to perform inference in streaming mode, no transcription is generated past a certain point (20~40 seconds in) for a 4-minute audio file.
Script as below:
```shell
uv run --with "ml-dtypes==0.5.4" --with "onnx==1.20.0" --with "numpy<2.4" \
    python examples/asr/asr_chunked_inference/aed/speech_to_text_aed_streaming_infer.py \
    pretrained_name="nvidia/canary-1b-v2" \
    audio_dir="inference_input" \
    output_filename="inference_output/submission.json" \
    batch_size=1 \
    num_workers=0 \
    debug_mode=True \
    chunk_secs=4.0 \
    left_context_secs=10.0 \
    right_context_secs=2 \
    decoding.streaming_policy="alignatt" \
    decoding.alignatt_thr=8 \
    decoding.exclude_sink_frames=8 \
    decoding.xatt_scores_layer=-2 \
    decoding.hallucinations_detector=True \
    calculate_latency=False \
    allow_mps=False \
    +prompt.pnc="yes" \
    +prompt.task="asr" \
    +prompt.source_lang="en" \
    +prompt.target_lang="en"
```

Steps/Code to reproduce bug

  1. Clone the NeMo repo
  2. Set up a venv with uv
  3. Put a 4-minute audio file in the input directory
  4. Run the script above for transcription
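For step 3, any 4-minute 16 kHz mono WAV in the input directory will do. A hypothetical helper to synthesize one with the standard library (the file name and the 440 Hz tone are arbitrary choices; real speech is needed to actually observe the missing transcription, a tone only exercises the pipeline):

```python
import math
import wave
from array import array

SAMPLE_RATE = 16000        # matches the model's expected input rate
DURATION_SECS = 4 * 60     # ~4 minutes, the length that triggers the stall

def write_test_wav(path: str) -> None:
    """Write a 4-minute, 16 kHz, mono, 16-bit PCM WAV containing a 440 Hz tone."""
    samples = array("h", (
        int(8000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
        for n in range(SAMPLE_RATE * DURATION_SECS)
    ))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)              # mono
        wf.setsampwidth(2)              # 16-bit PCM
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(samples.tobytes())

write_test_wav("test_4min.wav")
```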

Expected behavior

The full 4-minute audio file should be transcribed end to end, with text emitted for every chunk rather than stopping partway through.

Environment details
Use the standard pyproject.toml in the NeMo repo

Additional context
I tried both the alignatt and waitk policies, with the same result: transcription stops automatically after 20~30 seconds. My own tracing shows the alignatt policy threshold is never met after a certain point.
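For context on that tracing claim: under an AlignAtt-style policy the decoder emits a token only while the most-attended encoder frame stays far enough from the right edge of the available context; if cross-attention sticks to the newest frames, no token ever clears the check and the stream goes stale. A minimal sketch of that gating logic (illustrative only, not NeMo's actual implementation; parameter names mirror the config above):

```python
def alignatt_allows_emission(xatt_scores, alignatt_thr=8, exclude_sink_frames=8):
    """Illustrative AlignAtt-style gate: emit a token only when the
    most-attended encoder frame is at least `alignatt_thr` frames away
    from the right edge of the available context.
    `xatt_scores` is one decoder step's attention over encoder frames."""
    usable = xatt_scores[exclude_sink_frames:]   # drop attention-sink frames
    peak = max(range(len(usable)), key=usable.__getitem__)
    return peak < len(usable) - alignatt_thr

# Attention peaked mid-context: a token may be emitted.
print(alignatt_allows_emission([0.0] * 8 + [0.1, 0.8, 0.1] + [0.0] * 20))  # True
# Attention stuck on the newest frames: the policy waits forever -> stale stream.
print(alignatt_allows_emission([0.0] * 8 + [0.0] * 20 + [0.1, 0.8, 0.1]))  # False
```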
Inference log below:

(TraeAI-5) ~/DevLibrary/NeMo [0] $ ./run_streaming_inference.sh
[NeMo W 2025-12-25 22:11:22 megatron_init:62] Megatron num_microbatches_calculator not found, using Apex version.
W1225 22:11:22.040000 81677 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
OneLogger: Setting error_handling_strategy to DISABLE_QUIETLY_AND_REPORT_METRIC_ERROR for rank (rank=0) with OneLogger disabled. To override: explicitly set error_handling_strategy parameter.
No exporters were provided. This means that no telemetry data will be collected.
[NeMo W 2025-12-25 22:11:22 nemo_logging:364] /Users/xxxx/DevLibrary/NeMo/.venv/lib/python3.11/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(

[NeMo I 2025-12-25 22:11:22 speech_to_text_aed_streaming_infer:168] Hydra config: model_path: null
pretrained_name: nvidia/canary-1b-v2
audio_dir: inference_input
dataset_manifest: null
output_filename: inference_output/submission.json
batch_size: 1
num_workers: 0
random_seed: null
chunk_secs: 4.0
left_context_secs: 10.0
right_context_secs: 2.0
cuda: null
allow_mps: false
compute_dtype: null
matmul_precision: high
audio_type: wav
sort_input_manifest: true
overwrite_transcripts: true
decoding:
streaming_policy: alignatt
alignatt_thr: 8.0
waitk_lagging: 2
exclude_sink_frames: 8
xatt_scores_layer: -2
max_tokens_per_alignatt_step: 30
max_generation_length: 512
use_avgpool_for_alignatt: false
hallucinations_detector: true
calculate_wer: true
calculate_bleu: false
calculate_latency: false
clean_groundtruth_text: false
ignore_capitalization: true
ignore_punctuation: true
langid: en
use_cer: false
presort_manifest: true
return_hypotheses: false
channel_selector: null
gt_text_attr_name: text
gt_lang_attr_name: source_lang
timestamps: false
prompt:
pnc: 'yes'
task: asr
source_lang: en
target_lang: en
debug_mode: true

[NeMo I 2025-12-25 22:11:22 speech_to_text_aed_streaming_infer:195] Inference will be done on device : cpu with compute_dtype: torch.float32
[NeMo I 2025-12-25 22:11:25 mixins:184] Tokenizer CanaryBPETokenizer initialized with 16384 tokens
[NeMo W 2025-12-25 22:11:25 modelPT:188] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
use_lhotse: true
skip_missing_manifest_entries: true
input_cfg: null
tarred_audio_filepaths: null
manifest_filepath: null
sample_rate: 16000
shuffle: true
num_workers: 4
pin_memory: true
prompt_format: canary2
max_duration: 40.0
min_duration: 0.01
text_field: answer
lang_field: target_lang
use_bucketing: true
max_tps: null
bucket_duration_bins: null
bucket_batch_size: null
num_buckets: null
bucket_buffer_size: 20000
shuffle_buffer_size: 10000

[NeMo W 2025-12-25 22:11:25 modelPT:195] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
use_lhotse: true
prompt_format: canary2
manifest_filepath: null
sample_rate: 16000
batch_size: 4
shuffle: true
max_duration: 40.0
min_duration: 0.1
num_workers: 2
pin_memory: true
text_field: answer
lang_field: target_lang

Error getting class at nemo.collections.asr.modules.transformer.get_nemo_transformer: Located non-class of type 'function' while loading 'nemo.collections.asr.modules.transformer.get_nemo_transformer'
[NeMo I 2025-12-25 22:11:31 mixins:184] Tokenizer SentencePieceTokenizer initialized with 16384 tokens
[NeMo W 2025-12-25 22:11:32 modelPT:188] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
use_lhotse: true
skip_missing_manifest_entries: true
input_cfg: null
tarred_audio_filepaths: null
manifest_filepath: null
sample_rate: 16000
shuffle: true
num_workers: 2
pin_memory: true
max_duration: 40.0
min_duration: 0.1
text_field: answer
batch_duration: null
max_tps: null
use_bucketing: true
bucket_duration_bins: null
bucket_batch_size: null
num_buckets: null
bucket_buffer_size: 20000
shuffle_buffer_size: 10000

[NeMo W 2025-12-25 22:11:32 modelPT:195] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
use_lhotse: true
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: false
max_duration: 40.0
min_duration: 0.1
num_workers: 2
pin_memory: true
text_field: answer

[NeMo I 2025-12-25 22:11:36 save_restore_connector:284] Model EncDecCTCModelBPE was successfully restored from /Users/xxxx/.cache/huggingface/hub/models--nvidia--canary-1b-v2/snapshots/87bc52657add533cd0156b3fc1aef027280754bf/canary-1b-v2.nemo.
[NeMo I 2025-12-25 22:11:38 save_restore_connector:284] Model EncDecMultiTaskModel was successfully restored from /Users/xxxx/.cache/huggingface/hub/models--nvidia--canary-1b-v2/snapshots/87bc52657add533cd0156b3fc1aef027280754bf/canary-1b-v2.nemo.
[NeMo I 2025-12-25 22:11:38 aed_multitask_models:292] Changed decoding strategy to
strategy: greedy
compute_hypothesis_token_set: false
preserve_alignments: null
confidence_cfg:
preserve_frame_confidence: false
preserve_token_confidence: false
preserve_word_confidence: false
exclude_blank: true
aggregation: min
tdt_include_duration: false
method_cfg:
name: entropy
entropy_type: tsallis
alpha: 0.33
entropy_norm: exp
temperature: DEPRECATED
compute_langs: false
greedy:
temperature: null
max_generation_delta: -1
preserve_alignments: false
preserve_token_confidence: false
confidence_method_cfg:
name: entropy
entropy_type: tsallis
alpha: 0.33
entropy_norm: exp
temperature: DEPRECATED
n_samples: 1
beam:
beam_size: 1
search_type: default
len_pen: 1.0
max_generation_delta: -1
return_best_hypothesis: true
preserve_alignments: false
ngram_lm_model: null
ngram_lm_alpha: 0.0
boosting_tree:
model_path: null
key_phrases_file: null
key_phrases_list: null
context_score: 1.0
depth_scaling: 1.0
unk_score: 0.0
final_eos_score: 1.0
score_per_phrase: 0.0
source_lang: en
use_triton: true
uniform_weights: false
use_bpe_dropout: false
num_of_transcriptions: 5
bpe_alpha: 0.3
boosting_tree_alpha: 0.0
temperature: 1.0

[NeMo I 2025-12-25 22:11:38 speech_to_text_aed_streaming_infer:275] Corrected contexts (sec): Left 10.00, Chunk 4.00, Right 2.00
[NeMo I 2025-12-25 22:11:38 speech_to_text_aed_streaming_infer:281] Corrected contexts (subsampled encoder frames): Left 125 - Chunk 50 - Right 25
[NeMo I 2025-12-25 22:11:38 speech_to_text_aed_streaming_infer:282] Corrected contexts (in audio samples): Left 160000 - Chunk 64000 - Right 32000
[NeMo I 2025-12-25 22:11:38 speech_to_text_aed_streaming_infer:284] Theoretical latency: 6.00 seconds
0%| | 0/1 [00:00<?, ?it/s][NeMo I 2025-12-25 22:11:40 speech_to_text_aed_streaming_infer:401] Processed chunk 1, current sample position: 96000/3849600
[NeMo I 2025-12-25 22:11:42 speech_to_text_aed_streaming_infer:401] Processed chunk 2, current sample position: 160000/3849600
[NeMo I 2025-12-25 22:11:44 speech_to_text_aed_streaming_infer:401] Processed chunk 3, current sample position: 224000/3849600
[NeMo I 2025-12-25 22:11:46 speech_to_text_aed_streaming_infer:401] Processed chunk 4, current sample position: 288000/3849600
[NeMo I 2025-12-25 22:11:48 speech_to_text_aed_streaming_infer:401] Processed chunk 5, current sample position: 352000/3849600
[NeMo I 2025-12-25 22:11:50 aed_batched_streaming:368] !!! hallucination 'a b a b a b' detected !!!
[NeMo I 2025-12-25 22:11:50 speech_to_text_aed_streaming_infer:401] Processed chunk 6, current sample position: 416000/3849600
[NeMo I 2025-12-25 22:11:52 speech_to_text_aed_streaming_infer:401] Processed chunk 7, current sample position: 480000/3849600
[NeMo I 2025-12-25 22:11:54 speech_to_text_aed_streaming_infer:401] Processed chunk 8, current sample position: 544000/3849600
[NeMo I 2025-12-25 22:11:55 speech_to_text_aed_streaming_infer:401] Processed chunk 9, current sample position: 608000/3849600
[NeMo I 2025-12-25 22:11:57 speech_to_text_aed_streaming_infer:401] Processed chunk 10, current sample position: 672000/3849600
[NeMo I 2025-12-25 22:11:59 speech_to_text_aed_streaming_infer:401] Processed chunk 11, current sample position: 736000/3849600
[NeMo I 2025-12-25 22:12:01 speech_to_text_aed_streaming_infer:401] Processed chunk 12, current sample position: 800000/3849600
[NeMo I 2025-12-25 22:12:02 speech_to_text_aed_streaming_infer:401] Processed chunk 13, current sample position: 864000/3849600
[NeMo I 2025-12-25 22:12:04 speech_to_text_aed_streaming_infer:401] Processed chunk 14, current sample position: 928000/3849600
[NeMo I 2025-12-25 22:12:06 speech_to_text_aed_streaming_infer:401] Processed chunk 15, current sample position: 992000/3849600
[NeMo I 2025-12-25 22:12:08 speech_to_text_aed_streaming_infer:401] Processed chunk 16, current sample position: 1056000/3849600
[NeMo I 2025-12-25 22:12:09 speech_to_text_aed_streaming_infer:401] Processed chunk 17, current sample position: 1120000/3849600
[NeMo I 2025-12-25 22:12:11 speech_to_text_aed_streaming_infer:401] Processed chunk 18, current sample position: 1184000/3849600
[NeMo I 2025-12-25 22:12:13 speech_to_text_aed_streaming_infer:401] Processed chunk 19, current sample position: 1248000/3849600
[NeMo I 2025-12-25 22:12:14 speech_to_text_aed_streaming_infer:401] Processed chunk 20, current sample position: 1312000/3849600
[NeMo I 2025-12-25 22:12:16 speech_to_text_aed_streaming_infer:401] Processed chunk 21, current sample position: 1376000/3849600
[NeMo I 2025-12-25 22:12:18 speech_to_text_aed_streaming_infer:401] Processed chunk 22, current sample position: 1440000/3849600
[NeMo I 2025-12-25 22:12:20 speech_to_text_aed_streaming_infer:401] Processed chunk 23, current sample position: 1504000/3849600
[NeMo I 2025-12-25 22:12:21 speech_to_text_aed_streaming_infer:401] Processed chunk 24, current sample position: 1568000/3849600
[NeMo I 2025-12-25 22:12:23 speech_to_text_aed_streaming_infer:401] Processed chunk 25, current sample position: 1632000/3849600
[NeMo I 2025-12-25 22:12:25 speech_to_text_aed_streaming_infer:401] Processed chunk 26, current sample position: 1696000/3849600
[NeMo I 2025-12-25 22:12:26 speech_to_text_aed_streaming_infer:401] Processed chunk 27, current sample position: 1760000/3849600
[NeMo I 2025-12-25 22:12:28 speech_to_text_aed_streaming_infer:401] Processed chunk 28, current sample position: 1824000/3849600
[NeMo I 2025-12-25 22:12:30 speech_to_text_aed_streaming_infer:401] Processed chunk 29, current sample position: 1888000/3849600
[NeMo I 2025-12-25 22:12:32 speech_to_text_aed_streaming_infer:401] Processed chunk 30, current sample position: 1952000/3849600
[NeMo I 2025-12-25 22:12:33 speech_to_text_aed_streaming_infer:401] Processed chunk 31, current sample position: 2016000/3849600
[NeMo I 2025-12-25 22:12:35 speech_to_text_aed_streaming_infer:401] Processed chunk 32, current sample position: 2080000/3849600
[NeMo I 2025-12-25 22:12:37 speech_to_text_aed_streaming_infer:401] Processed chunk 33, current sample position: 2144000/3849600
[NeMo I 2025-12-25 22:12:38 speech_to_text_aed_streaming_infer:401] Processed chunk 34, current sample position: 2208000/3849600
[NeMo I 2025-12-25 22:12:40 speech_to_text_aed_streaming_infer:401] Processed chunk 35, current sample position: 2272000/3849600
[NeMo I 2025-12-25 22:12:42 speech_to_text_aed_streaming_infer:401] Processed chunk 36, current sample position: 2336000/3849600
[NeMo I 2025-12-25 22:12:44 speech_to_text_aed_streaming_infer:401] Processed chunk 37, current sample position: 2400000/3849600
[NeMo I 2025-12-25 22:12:45 speech_to_text_aed_streaming_infer:401] Processed chunk 38, current sample position: 2464000/3849600
[NeMo I 2025-12-25 22:12:47 speech_to_text_aed_streaming_infer:401] Processed chunk 39, current sample position: 2528000/3849600
[NeMo I 2025-12-25 22:12:49 speech_to_text_aed_streaming_infer:401] Processed chunk 40, current sample position: 2592000/3849600
[NeMo I 2025-12-25 22:12:50 speech_to_text_aed_streaming_infer:401] Processed chunk 41, current sample position: 2656000/3849600
[NeMo I 2025-12-25 22:12:52 speech_to_text_aed_streaming_infer:401] Processed chunk 42, current sample position: 2720000/3849600
[NeMo I 2025-12-25 22:12:54 speech_to_text_aed_streaming_infer:401] Processed chunk 43, current sample position: 2784000/3849600
[NeMo I 2025-12-25 22:12:56 speech_to_text_aed_streaming_infer:401] Processed chunk 44, current sample position: 2848000/3849600
[NeMo I 2025-12-25 22:12:57 speech_to_text_aed_streaming_infer:401] Processed chunk 45, current sample position: 2912000/3849600
[NeMo I 2025-12-25 22:12:59 speech_to_text_aed_streaming_infer:401] Processed chunk 46, current sample position: 2976000/3849600
[NeMo I 2025-12-25 22:13:01 speech_to_text_aed_streaming_infer:401] Processed chunk 47, current sample position: 3040000/3849600
[NeMo I 2025-12-25 22:13:02 speech_to_text_aed_streaming_infer:401] Processed chunk 48, current sample position: 3104000/3849600
[NeMo I 2025-12-25 22:13:04 speech_to_text_aed_streaming_infer:401] Processed chunk 49, current sample position: 3168000/3849600
[NeMo I 2025-12-25 22:13:06 speech_to_text_aed_streaming_infer:401] Processed chunk 50, current sample position: 3232000/3849600
[NeMo I 2025-12-25 22:13:07 speech_to_text_aed_streaming_infer:401] Processed chunk 51, current sample position: 3296000/3849600
[NeMo I 2025-12-25 22:13:09 speech_to_text_aed_streaming_infer:401] Processed chunk 52, current sample position: 3360000/3849600
[NeMo I 2025-12-25 22:13:11 speech_to_text_aed_streaming_infer:401] Processed chunk 53, current sample position: 3424000/3849600
[NeMo I 2025-12-25 22:13:13 speech_to_text_aed_streaming_infer:401] Processed chunk 54, current sample position: 3488000/3849600
[NeMo I 2025-12-25 22:13:14 speech_to_text_aed_streaming_infer:401] Processed chunk 55, current sample position: 3552000/3849600
[NeMo I 2025-12-25 22:13:16 speech_to_text_aed_streaming_infer:401] Processed chunk 56, current sample position: 3616000/3849600
[NeMo I 2025-12-25 22:13:18 speech_to_text_aed_streaming_infer:401] Processed chunk 57, current sample position: 3680000/3849600
[NeMo I 2025-12-25 22:13:19 speech_to_text_aed_streaming_infer:401] Processed chunk 58, current sample position: 3744000/3849600
[NeMo I 2025-12-25 22:13:21 speech_to_text_aed_streaming_infer:401] Processed chunk 59, current sample position: 3808000/3849600
[NeMo I 2025-12-25 22:13:23 speech_to_text_aed_streaming_infer:401] Processed chunk 60, current sample position: 3849600/3849600
[NeMo I 2025-12-25 22:13:23 speech_to_text_aed_streaming_infer:409] i: Absolutely, so many emotions, the full range, excitements, a little bit of surprise, joy. People were totally overwhelmed, guys. I'm truly struggling to find the words to appropriately describe that moment. A, when the white smoke came out, and then B, B, B
100%|███████████████████████████████| 1/1 [01:44<00:00, 104.58s/it]
[NeMo I 2025-12-25 22:13:23 speech_to_text_aed_streaming_infer:427] Finished writing predictions to inference_output/submission.json!
[NeMo I 2025-12-25 22:13:23 eval_utils:190] ground-truth text attribute text is not present in manifest! Cannot calculate WER. Returning!
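The context sizes printed in the log above are self-consistent with 16 kHz audio and an 8x-subsampled encoder (80 ms per encoder frame), so the stall does not look like a chunking-arithmetic problem. A quick check, assuming those rates:

```python
SAMPLE_RATE = 16000   # Hz, the model's input rate
FRAME_SECS = 0.08     # 80 ms per subsampled encoder frame (8x over 10 ms features)

# Reproduce the "Corrected contexts" lines from the log.
for name, secs in [("Left", 10.0), ("Chunk", 4.0), ("Right", 2.0)]:
    frames = int(secs / FRAME_SECS)
    samples = int(secs * SAMPLE_RATE)
    print(f"{name}: {frames} frames, {samples} samples")

# Theoretical latency = chunk + right context, as reported in the log.
print(f"Theoretical latency: {4.0 + 2.0:.2f} seconds")
```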
