
Conversation


@nil0x9 nil0x9 commented Jan 30, 2026

This PR makes several improvements to internal metrics:

  1. (critical) Fixes an error that occurs when the internal metrics monitor is turned on and chunk loss is enabled. This happens because chunk loss was adapted to use torch.autograd.grad instead of torch.func.grad_and_value, which requires the loss to be computed on tensors that require grad. The original dummy forward in the internal metrics monitor performed the loss calculation in no_grad mode, causing a runtime error. This PR removes the unnecessary loss_ctx to avoid this.
  2. Refactor: removes the global variables from the original implementation, since we found the recompile count is on par when these variables are moved to class attributes.
  3. Adds a unit test for internal metrics so that the aforementioned issue does not go unnoticed.
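The failure mode in item 1 can be sketched as follows. This is a minimal illustration, not the monitor's actual code: `loss_fn`, `params`, and `x` are hypothetical stand-ins for the dummy forward's loss and inputs.

```python
import torch

def loss_fn(params, x):
    # Hypothetical stand-in for the chunk-loss calculation.
    return ((x @ params) ** 2).mean()

params = torch.randn(4, requires_grad=True)
x = torch.randn(8, 4)

# Computing the loss inside no_grad means no autograd graph is built,
# so torch.autograd.grad raises a RuntimeError ("element 0 of tensors
# does not require grad and does not have a grad_fn").
with torch.no_grad():
    loss = loss_fn(params, x)
failed = False
try:
    torch.autograd.grad(loss, params)
except RuntimeError:
    failed = True

# Dropping the no_grad context (the loss_ctx removed by this PR),
# the same call succeeds:
loss = loss_fn(params, x)
(grad,) = torch.autograd.grad(loss, params)
```

In contrast, torch.func.grad_and_value builds its own functional trace of `loss_fn`, which is why the original no_grad dummy forward only broke after the switch to torch.autograd.grad.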

@nil0x9 nil0x9 force-pushed the linty/dev-rm-metric-globals branch from 0168ca8 to 5e3568b on January 30, 2026 18:07
@nil0x9 nil0x9 force-pushed the linty/dev-rm-metric-globals branch from 5e3568b to 9a168a6 on January 31, 2026 06:31