
Conversation


@nil0x9 nil0x9 commented Jan 30, 2026

This PR makes several improvements to internal metrics:

  1. (critical) Fixes an error that occurs when the internal metrics monitor is turned on and chunk loss is enabled. This happens because chunk loss was adapted to use torch.autograd.grad instead of torch.func.grad_and_value, which requires the loss to be computed on tensors that require grad. The original dummy forward in the internal metrics monitor performed the loss calculation in no_grad mode, causing a runtime error. This PR removes the unnecessary loss_ctx to avoid this.
  2. Refactor: removes the global variables from the original implementation, since we found the recompile count is on par when these variables are moved to class attributes.
  3. Adds a unit test for internal metrics so that the aforementioned issue does not go unnoticed.
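The failure mode in item 1 can be sketched as follows. This is a minimal illustration, not the monitor's actual code: `loss_fn`, `params`, and `x` are hypothetical stand-ins for the dummy forward's loss and inputs.

```python
import torch

def loss_fn(params, x):
    # Hypothetical stand-in for the chunk-loss calculation.
    return ((x @ params) ** 2).mean()

params = torch.randn(4, requires_grad=True)
x = torch.randn(8, 4)

# Computing the loss inside no_grad means no autograd graph is built,
# so torch.autograd.grad raises a RuntimeError ("element 0 of tensors
# does not require grad and does not have a grad_fn").
with torch.no_grad():
    loss = loss_fn(params, x)
failed = False
try:
    torch.autograd.grad(loss, params)
except RuntimeError:
    failed = True

# Dropping the no_grad context (the loss_ctx removed by this PR),
# the same call succeeds:
loss = loss_fn(params, x)
(grad,) = torch.autograd.grad(loss, params)
```

In contrast, torch.func.grad_and_value builds its own functional trace of `loss_fn`, which is why the original no_grad dummy forward only broke after the switch to torch.autograd.grad.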

@nil0x9 nil0x9 force-pushed the linty/dev-rm-metric-globals branch from 0168ca8 to 5e3568b on January 30, 2026 18:07
@nil0x9 nil0x9 force-pushed the linty/dev-rm-metric-globals branch from 5e3568b to 9a168a6 on January 31, 2026 06:31