[ck_tile][fmha] Gate unpadded qr_async_trload fwd instances on seqlen alignment #3747

Open
goldcoderZ wants to merge 2 commits into ROCm:develop from goldcoderZ:fmha-trload-seqlen-guard

@goldcoderZ

Contributor

Problem

On gfx950, the ck_tile FMHA forward dispatcher can select an unpadded qr_async_trload (transpose-load) instance for tile-unaligned attention shapes. The two unpadded trload variants (spad=f, skpad=f) had scheck/skcheck returning "true" unconditionally, so the no-pad transpose-load kernel had no alignment guard. The kernel then issues unpredicated full-tile loads and stores covering round_up(seqlen, tile) rows, which over-reads Q/K/V and over-writes O/LSE past the end of the buffer. On page-terminal allocations this surfaces as "Memory access fault by GPU node-N ... Reason: Unknown". The regular qr_async path is already safe (it always uses spad=t and gates skpad on seqlen_k); only qr_async_trload lacked the guard.
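To make the size of the overrun concrete, here is a minimal arithmetic sketch of how many rows a full-tile access reaches past the real sequence length. The tile sizes BM0 and BN0 are hypothetical values picked for illustration (the actual bm0/bn0 depend on the selected kernel instance), and the helper names are not part of ck_tile.

```python
def round_up(n: int, tile: int) -> int:
    """Smallest multiple of `tile` that is >= n."""
    return -(-n // tile) * tile


def rows_past_end(seqlen: int, tile: int) -> int:
    """Rows a full-tile access touches beyond `seqlen` when nothing is predicated."""
    return round_up(seqlen, tile) - seqlen


# Hypothetical tile sizes, for illustration only.
BM0, BN0 = 128, 128

# Shape used in the validation run: sq=4, sk=656.
print(rows_past_end(4, BM0))    # 124 rows of Q/O past the end
print(rows_past_end(656, BN0))  # 112 rows of K/V past the end
print(rows_past_end(256, BN0))  # 0: an aligned length has no overrun
```

Whether such an overrun actually faults depends on where the buffer sits relative to a page boundary, which is consistent with the fault showing up only on page-terminal allocations.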

Fix

Gate the unpadded qr_async_trload / qr_async_trload_v3 instances on seqlen alignment, mirroring qr_async, so that they are selected only for aligned shapes. Tile-unaligned shapes fall through to the padded qr_async instance.
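The selection logic can be modeled roughly as follows. This is a simplified sketch, not the generated dispatcher: the Instance class, the select function, the candidate ordering, and the 128x128 tile sizes are all assumptions made for illustration. Only the gating conditions (seqlen_q % bm0 == 0 and seqlen_k % bn0 == 0 on unpadded instances) come from this PR.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Instance:
    """Hypothetical model of one kernel instance in the dispatch chain."""
    pipeline: str
    bm0: int      # tile size along seqlen_q
    bn0: int      # tile size along seqlen_k
    spad: bool    # kernel pads/predicates along seqlen_q
    skpad: bool   # kernel pads/predicates along seqlen_k

    def accepts(self, seqlen_q: int, seqlen_k: int) -> bool:
        # An unpadded dimension is only safe when it is tile-aligned.
        scheck = self.spad or seqlen_q % self.bm0 == 0
        skcheck = self.skpad or seqlen_k % self.bn0 == 0
        return scheck and skcheck


def select(candidates: Sequence[Instance], seqlen_q: int, seqlen_k: int) -> Optional[Instance]:
    """Return the first candidate whose alignment checks pass, else None."""
    for inst in candidates:
        if inst.accepts(seqlen_q, seqlen_k):
            return inst
    return None


# Fast path first, padded fallback second (ordering is an assumption).
CANDIDATES = [
    Instance("qr_async_trload", 128, 128, spad=False, skpad=False),
    Instance("qr_async", 128, 128, spad=True, skpad=True),
]

print(select(CANDIDATES, 4, 656).pipeline)    # qr_async
print(select(CANDIDATES, 256, 512).pipeline)  # qr_async_trload
```

In this model the bug corresponds to accepts() returning True unconditionally for the unpadded instance, which would let the first candidate win for every shape.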

Tradeoff

Tile-unaligned shapes now use the padded qr_async kernel instead of the transpose-load fast path. Only unaligned shapes are affected; aligned shapes are unchanged.

Validation (gfx950 / MI350X)

The regenerated dispatcher diff touches only the unpadded qr_async_trload branches; each gains the conditions seqlen_q % bm0 == 0 and seqlen_k % bn0 == 0. An unaligned shape (bf16, b=5, h=2, sq=4, sk=656, d=128, no bias) selects the padded qr_async kernel, and its output matches the reference attention to max_abs_diff ~1e-3 (bf16).

@goldcoderZ

Contributor Author

cc @poyenc @qianfengz
