
Matmul: Switch to trailing batch dims and allow mat-vec, vec-mat#132

Merged
maleadt merged 3 commits into JuliaGPU:main from AntonOresten:matmul-shapes
Mar 25, 2026

Conversation

@AntonOresten
Contributor

Closes #115

When operands have different numbers of batch dimensions (e.g.
(M, K, 4) * (K, N, 2, 4)), _matmul pads the shorter batch tuple
with ones to align them before computing the output shape and
creating the zero accumulator. _muladd does the same padding to
reshape operands before broadcasting.

These two functions disagreed on *where* to pad: _matmul inserted
leading ones ((1, 4) for a 1-batch operand against a 2-batch one)
while _muladd appended trailing ones ((4, 1)). This meant the acc
shape from _matmul wouldn't match what _muladd expected, causing
a reshape element-count mismatch at the Tile IR level.

Fix _matmul to use trailing ones, consistent with _muladd.
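The disagreement can be shown in a few lines. This is a plain-Python sketch of the two padding conventions described above, not the actual package code; the helper names pad_leading and pad_trailing are illustrative only.

```python
def pad_leading(batch, n):
    """Old _matmul behavior: prepend ones to align batch dims."""
    return (1,) * (n - len(batch)) + batch

def pad_trailing(batch, n):
    """_muladd behavior, now shared by _matmul: append ones."""
    return batch + (1,) * (n - len(batch))

# Batch dims from the example above: (4,) for A, (2, 4) for B.
a_batch, b_batch = (4,), (2, 4)
n = max(len(a_batch), len(b_batch))

print(pad_leading(a_batch, n))   # (1, 4) -- what _matmul produced
print(pad_trailing(a_batch, n))  # (4, 1) -- what _muladd expected
```

Since the two helpers disagree on the padded tuple, the accumulator shape computed by one does not line up with the reshape performed by the other, which is exactly the mismatch this PR fixes by standardizing on trailing ones.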

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member

@maleadt left a comment


Let's hope the additional permutes here are free?

@maleadt merged commit caf0027 into JuliaGPU:main Mar 25, 2026
9 checks passed
@AntonOresten
Contributor Author

AntonOresten commented Mar 25, 2026

Hopefully 🤞
A month or two ago I defined a batched mul helper for tiles, explicitly writing out the broadcasting and contractions with the old ct.reduce_sum. Surprisingly, 8x8 performed far better than 4x4; I assume the compiler noticed it could start using tensor cores. It's hard to imagine that not being robust to a few extra permutations, since we're already permuting in e.g. reshape.



Development

Successfully merging this pull request may close these issues.

Matmul broadcasting
