
tests: add quantized attention tests (SDPA + eager) with MHA fusion c…#4204

Draft
yizhuoz004 wants to merge 3 commits into pytorch:main from yizhuoz004:hlo-quant-attention-tests

Conversation

@yizhuoz004
Contributor

Description

Adds test_quantized_attention.py covering FP8 and INT8 PTQ via modelopt for both SDPA-based (VanillaAttention, GQAAttention) and hand-rolled eager attention (EagerAttention, mirroring HF ViT) patterns.

Test coverage:
_QuantAttentionMixin (SDPA, IAttentionLayer path):
- test_static: fixed shapes, causal/non-causal, LLM-realistic configs
- test_dynamic_batch / test_dynamic_seq: dynamic dims incl. decode (seq=1)
- test_edge_cases: single head, non-pow2 head_dim, causal prefill
- test_gqa: GQA/MQA with separate Q and KV head counts
- test_mha_kernel_precision: @expectedfailure (#4167) — MHA kernel inputs are Half rather than FP8/INT8; normalization_quantize_scale is not set in torch-trt
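For context on what test_gqa exercises: in GQA/MQA each KV head is shared by a group of query heads, so KV tensors are expanded along the head axis before attention. A minimal numpy sketch of that expansion (the helper name and shapes here are illustrative, not from the test file):

```python
import numpy as np

def repeat_kv(kv: np.ndarray, n_rep: int) -> np.ndarray:
    """Expand KV heads so each one is shared by n_rep query heads.

    kv: (batch, num_kv_heads, seq, head_dim)
    returns: (batch, num_kv_heads * n_rep, seq, head_dim)
    """
    b, h_kv, s, d = kv.shape
    # Insert a group axis, broadcast across it, then fold it into the head axis.
    expanded = np.broadcast_to(kv[:, :, None, :, :], (b, h_kv, n_rep, s, d))
    return expanded.reshape(b, h_kv * n_rep, s, d)

# GQA: 8 query heads sharing 2 KV heads -> each KV head serves 4 query heads.
kv = np.random.randn(1, 2, 5, 16)
assert repeat_kv(kv, 4).shape == (1, 8, 5, 16)
# MQA is the num_kv_heads == 1 special case.
```

Separate Q and KV head counts in the test configs drive exactly this head-grouping path.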

_EagerAttentionMixin (hand-rolled matmul+softmax+matmul):
- test_static: ViT-realistic shapes
- test_dynamic_seq: dynamic seq covering ViT patch-count range
- test_mha_kernel_precision: @expectedfailure (#4200) — TRT fuses into _gemm_mha_v2 but selects a Half tactic; quantizer scales don't reach the fused kernel boundary
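The "hand-rolled eager attention" pattern these tests target is the explicit matmul → scale → softmax → matmul sequence (as opposed to a single SDPA call), which TRT is expected to recognize and fuse into one MHA kernel. A numpy sketch of the op pattern, for illustration only:

```python
import numpy as np

def eager_attention(q, k, v):
    """Hand-rolled attention: matmul -> scale -> softmax -> matmul.

    q, k, v: (batch, heads, seq, head_dim). This explicit op sequence is
    what the fusion pass must pattern-match, since no SDPA op is present.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)     # (b, h, s, s)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v                                      # (b, h, s, head_dim)

q = k = v = np.random.randn(1, 2, 4, 8)
assert eager_attention(q, k, v).shape == (1, 2, 4, 8)
```

Because the softmax rows sum to one, the output is a convex combination of the value rows, which is a quick sanity check for implementations of this pattern.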

MHA fusion check (_assert_mha_fused) matches both _gemm_mha (prefill) and _gemv_mha (decode) kernel prefixes.
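A minimal sketch of what such a prefix-based fusion check can look like (the kernel-name prefixes are taken from the PR text; the helper itself is hypothetical, not the actual test code):

```python
# Fused MHA kernel name prefixes, per the PR description:
# _gemm_mha (prefill) and _gemv_mha (decode).
MHA_KERNEL_PREFIXES = ("_gemm_mha", "_gemv_mha")

def assert_mha_fused(layer_names):
    """Fail unless at least one engine layer name carries a fused-MHA prefix."""
    fused = [n for n in layer_names if n.startswith(MHA_KERNEL_PREFIXES)]
    assert fused, f"no fused MHA kernel found in {layer_names!r}"
    return fused

assert_mha_fused(["reformat_0", "_gemm_mha_v2_kernel"])  # prefill-style name matches
assert_mha_fused(["_gemv_mha_v1_kernel"])                # decode-style name matches
```

Matching on prefixes rather than exact names keeps the check robust across kernel version suffixes (e.g. _v1 vs _v2).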

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

@meta-cla

meta-cla Bot commented Apr 22, 2026

Hi @yizhuoz004!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@github-actions bot added the component: tests (Issues re: Tests) label Apr 22, 2026
@yizhuoz004 force-pushed the hlo-quant-attention-tests branch from 51b4f6e to 023def9 on April 23, 2026 at 00:37
@github-actions bot added labels: component: conversion (Issues re: Conversion stage), component: core (Issues re: The core compiler), component: converters (Issues re: Specific op converters), component: api [Python] (Issues re: Python API), component: dynamo (Issues relating to the `torch.compile` or `torch._dynamo.export` paths) Apr 23, 2026
@yizhuoz004 force-pushed the hlo-quant-attention-tests branch 3 times, most recently from 809fc3a to 9d065d5 on April 23, 2026 at 04:08
@github-actions bot added the component: lowering (Issues re: The lowering / preprocessing passes) label Apr 23, 2026
yizhuoz004 and others added 3 commits April 23, 2026 10:51
Adds tests/py/dynamo/hlo/test_quantized_attention.py covering FP8, INT8,
and NVFP4 bmm quantization via modelopt PTQ.

Test coverage:
- test_static: fixed shapes, causal/non-causal, LLM-realistic configs
  (Qwen2.5, Llama-3.2 style)
- test_dynamic_batch / test_dynamic_seq: dynamic dims including decode
  (seq=1) and prefill
- test_edge_cases: single head, non-pow2 head_dim, large batch+heads,
  causal prefill
- test_gqa: GQA/MQA with separate Q and KV head counts

Precision verification (_assert_quantized): locates the fused MHA kernel
(_gemm_mha_v2 / _gemv_mha_v1) in the serialized TRT engine via
IEngineInspector and asserts that the layers preceding it carry the
CastMulCast QDQ pattern, confirming inputs to the kernel are quantized
rather than full-precision.  Falls back to a broader CastMulCast search
for GQA/MQA configs where TRT decomposes attention into separate matmuls.
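A sketch of how such a check can be expressed over the inspector's parsed layer list (kernel prefixes and the CastMulCast pattern name are from the commit message; the helper and the layer-dict shape are assumptions, not the actual test code):

```python
import json

def assert_quantized(layers):
    """layers: list of per-layer dicts with a "Name" key, as parsed from
    engine-inspector JSON output. Require a CastMulCast QDQ pattern ahead
    of the fused MHA kernel, or anywhere in the engine as a fallback."""
    names = [layer["Name"] for layer in layers]
    mha_idx = next(
        (i for i, n in enumerate(names)
         if n.startswith(("_gemm_mha", "_gemv_mha"))),
        None,
    )
    if mha_idx is not None:
        # Inputs to the fused kernel must be quantized, not full-precision.
        assert any("CastMulCast" in n for n in names[:mha_idx]), \
            "no CastMulCast QDQ pattern precedes the fused MHA kernel"
    else:
        # Fallback for GQA/MQA configs where attention is decomposed into
        # separate matmuls: just require the QDQ pattern somewhere.
        assert any("CastMulCast" in n for n in names), \
            "no CastMulCast QDQ pattern found in the engine"

layers = json.loads('[{"Name": "CastMulCast_q"}, {"Name": "_gemm_mha_v2_kernel"}]')
assert_quantized(layers)  # passes: QDQ precedes the fused kernel
```

Scanning the serialized layer list this way avoids depending on exact layer ordering semantics beyond "QDQ appears before the fused kernel".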
@yizhuoz004 force-pushed the hlo-quant-attention-tests branch from 9d065d5 to 323519e on April 23, 2026 at 18:28
@github-actions bot added the component: build system (Issues re: Build system) label Apr 23, 2026