
Gate set_detect_anomaly behind STYLETTS2_DETECT_ANOMALY env var#3

Open
shreyaskarnik wants to merge 1 commit into semidark:main from shreyaskarnik:fix/anomaly-detection-env-gate

Conversation

@shreyaskarnik

Summary

train_second.py:10 currently turns on torch.autograd.set_detect_anomaly(True) unconditionally at module import. This PR gates it behind a STYLETTS2_DETECT_ANOMALY=1 env var (default off; opt-in for debugging).

Why

Anomaly detection is a debugging feature, not a production setting. With it on:

  • Backward pass is 5–10× slower (every op gets traced).
  • More memory is held (causes OOMs that wouldn't otherwise happen at the same batch size).
  • On some single-GPU + bf16 setups the first backward pass deadlocks entirely (no error, just stuck). I hit this on a single-A100 SXM 80 GB Stage 2 run — torch 2.6.0 + cu124 + mixed_precision=bf16 + bs=8 + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. Removing the line let training proceed normally.

So the default cost is high (slower, less memory, occasional hang) for the benefit of a feature most users don't actually want during a normal training run.

Change

import os
...
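# Anomaly detection stays off unless explicitly requested via the env var.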
if os.environ.get("STYLETTS2_DETECT_ANOMALY") == "1":
    torch.autograd.set_detect_anomaly(True)

To get the previous behaviour: STYLETTS2_DETECT_ANOMALY=1 accelerate launch ... train_second.py ....

Context

Hit this while training shreyaskarnik/bol-tts-marathi, a Marathi fine-tune of Kokoro-82M built on the semidark/kokoro-deutsch recipe (which uses this fork as a submodule). Worth flagging since anyone running Stage 2 single-GPU on bf16 will likely hit it.

Happy to amend if you'd prefer different env var naming or a different gating mechanism (e.g. CLI flag).

Test plan

  • Run Stage 2 with default env (no var set): set_detect_anomaly is not invoked → faster, no hang.
  • Run with STYLETTS2_DETECT_ANOMALY=1: set_detect_anomaly(True) is invoked → previous behaviour restored.
  • No other code paths are affected (single conditional at module import).
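Not part of the change itself, but the conditional can be sanity-checked in isolation with a few lines like the following (the maybe_enable_anomaly_detection helper is only illustrative; train_second.py simply inlines the if):

import os
import torch

def maybe_enable_anomaly_detection():
    # Mirrors the new conditional: opt-in only when the env var is exactly "1".
    if os.environ.get("STYLETTS2_DETECT_ANOMALY") == "1":
        torch.autograd.set_detect_anomaly(True)

# Default env: the flag stays off.
os.environ.pop("STYLETTS2_DETECT_ANOMALY", None)
maybe_enable_anomaly_detection()
assert not torch.is_anomaly_enabled()

# Opt-in: previous behaviour restored.
os.environ["STYLETTS2_DETECT_ANOMALY"] = "1"
maybe_enable_anomaly_detection()
assert torch.is_anomaly_enabled()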

Currently train_second.py turns torch.autograd.set_detect_anomaly(True) on
unconditionally at module import time. That's a debugging tool, not a
production setting:

  * 5–10× slower backward (anomaly detection traces every op).
  * Materially more memory held (causes OOMs that wouldn't happen otherwise).
  * On single-A100 + bf16, can deadlock the first backward pass entirely
    (no error, just stuck) — this bit us on a Marathi fine-tune Stage 2 run
    until we manually patched the file.

Make it opt-in via an env var. Default off; set STYLETTS2_DETECT_ANOMALY=1
when diagnosing a NaN/inf to get the same behaviour as before.

No behaviour change for users who explicitly want anomaly detection.
Having the diagnostic mode on by default is the regression we're fixing.
