Hey team,
I was diving through the model layer logic in C:\Users\my pc\Desktop\torchtune\torchtune\modules\transformer.py and noticed that the self-attention block relies on sequential, unfused operators for the pre-attention normalization and linear operations.
At scale, this forces the PyTorch autograd engine to materialize intermediate activation tensors back to High Bandwidth Memory (HBM) purely to preserve states for the backward pass, creating an execution bottleneck on memory bandwidth.
I've been working on a hardware-accelerated optimization layer (renorm-native) that resolves this by using custom Triton kernel fusion. It keeps the tensor arrays local to the GPU's SRAM registers during the block pass, skipping HBM transit entirely. In baseline stress tests on A100/H100 nodes, we've seen attention-pass VRAM footprints drop by roughly 35% while yielding up to a 1.68x throughput increase depending on batch sizing.
I've already prototyped a clean drop-in wrapper over an official framework fork (github.com/Tobi-Adesoye/torchtune) to benchmark the performance. Would the maintainers be open to a lightweight PR introducing an optional fused Triton backend path here to help users running massive context windows/batch sizes avoid OOMs?
Happy to drop in a compiled binary for benchmarking if anyone wants to stress-test the throughput stability!
Hey team,
I was diving through the model layer logic in C:\Users\my pc\Desktop\torchtune\torchtune\modules\transformer.py and noticed that the self-attention block relies on sequential, unfused operators for the pre-attention normalization and linear operations.
At scale, this forces the PyTorch autograd engine to materialize intermediate activation tensors back to High Bandwidth Memory (HBM) purely to preserve states for the backward pass, creating an execution bottleneck on memory bandwidth.
I've been working on a hardware-accelerated optimization layer (renorm-native) that resolves this by using custom Triton kernel fusion. It keeps the tensor arrays local to the GPU's SRAM registers during the block pass, skipping HBM transit entirely. In baseline stress tests on A100/H100 nodes, we've seen attention-pass VRAM footprints drop by roughly 35% while yielding up to a 1.68x throughput increase depending on batch sizing.
I've already prototyped a clean drop-in wrapper over an official framework fork (github.com/Tobi-Adesoye/torchtune) to benchmark the performance. Would the maintainers be open to a lightweight PR introducing an optional fused Triton backend path here to help users running massive context windows/batch sizes avoid OOMs?
Happy to drop in a compiled binary for benchmarking if anyone wants to stress-test the throughput stability!