Added planar types to speed up complex half precision GEMMs #1142

Open
cliffburdick wants to merge 6 commits into main from planar_tensor

@cliffburdick
Collaborator

No description provided.

@copy-pr-bot

copy-pr-bot bot commented Mar 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This PR introduces matxFp16ComplexPlanar and matxBf16ComplexPlanar tag types that allow callers to express at the type level that a complex tensor uses split real/imaginary planes ([real₀…real_{n-1}][imag₀…imag_{n-1}]) rather than interleaved storage. The primary motivation is avoiding the per-call interleaved→planar→interleaved conversion overhead in complex-half GEMM paths: when the user already owns a planar-layout buffer, the conversion steps are simply skipped.
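To make the two layouts concrete, here is a minimal host-side sketch of the interleaved→planar and planar→interleaved conversions that the GEMM path would otherwise perform on every call. The function names `to_planar` and `to_interleaved` are hypothetical, and `float` stands in for the half-precision element type; this is not the MatX implementation.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Interleaved storage: [re0, im0, re1, im1, ...]
// Planar storage:      [re0 ... re(n-1)][im0 ... im(n-1)]
// Hypothetical helper: split an interleaved buffer into two planes.
std::vector<float> to_planar(const std::vector<std::complex<float>>& in) {
  const std::size_t n = in.size();
  std::vector<float> out(2 * n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = in[i].real();      // real plane starts at offset 0
    out[n + i] = in[i].imag();  // imaginary plane starts at offset n
  }
  return out;
}

// Hypothetical helper: merge two planes back into interleaved storage.
std::vector<std::complex<float>> to_interleaved(const std::vector<float>& planar) {
  assert(planar.size() % 2 == 0);
  const std::size_t n = planar.size() / 2;
  std::vector<std::complex<float>> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = std::complex<float>(planar[i], planar[n + i]);
  }
  return out;
}
```

Both passes touch every element once, which is the per-call overhead a caller avoids by keeping the buffer planar from the start.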

Key changes:

  • New planar types (half_complex.h, type_utils_both.h, type_utils.h): thin tag structs inheriting from the interleaved complex types; propagated through all existing type-trait machinery.
  • tensor_impl.h: PlanarComplexProxy return type from mutable operator() for planar tensors, with LoadPlanarComplex/StorePlanarComplex helpers using TotalSize() as the plane offset (safe because contiguity is enforced at construction time).
  • tensor.h: ValidatePlanarLayoutOnCreate_ asserts unit innermost stride and contiguity for every planar tensor at construction/reset, closing the non-contiguous view loophole flagged in the prior review round.
  • set.h: scalar EPT is gated on is_planar_complex_v<T::value_type>, preserving vectorized EPT for all non-planar SetOp assignments.
  • matmul_cuda.h: each of A, B, C is individually tested for planarity; already-planar inputs skip the interleaved→planar conversion step, and ldc is forced to c.Size(RANK-1) for all complex-half C outputs.
  • sparse2dense_cusparse.h / matmul_cusparse.h: hard stride assertions replaced with contiguous-temporary fallback paths.
  • fft_fftw.h: num_threads added to the FFTW plan cache key, fixing a latent cache-collision bug.
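The tag-type and proxy ideas from the list above can be sketched roughly as follows. Everything in the `sketch` namespace is a hypothetical stand-in (a `float`-based complex type instead of the half-precision ones, a 1-D view instead of `tensor_impl.h`); it only illustrates why a mutable accessor has to return a proxy when the real and imaginary parts live in separate planes.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

namespace sketch {

// Stand-in for an interleaved complex value type.
struct HalfComplex {
  float re = 0.f, im = 0.f;
  HalfComplex() = default;
  HalfComplex(float r, float i) : re(r), im(i) {}
};

// Thin tag type: same value semantics, but marks planar storage.
struct HalfComplexPlanar : HalfComplex {
  using HalfComplex::HalfComplex;
};

// Trait that lets other code branch on the storage layout.
template <typename T> struct is_planar_complex : std::false_type {};
template <> struct is_planar_complex<HalfComplexPlanar> : std::true_type {};
template <typename T>
constexpr bool is_planar_complex_v = is_planar_complex<T>::value;

// A planar element is not addressable as one object, so mutable access
// returns a proxy that reads and writes both planes.
class PlanarProxy {
 public:
  PlanarProxy(float* base, std::size_t idx, std::size_t plane_offset)
      : re_(base + idx), im_(base + plane_offset + idx) {}
  operator HalfComplex() const { return HalfComplex(*re_, *im_); }
  PlanarProxy& operator=(const HalfComplex& v) {
    *re_ = v.re;
    *im_ = v.im;
    return *this;
  }

 private:
  float* re_;
  float* im_;
};

// 1-D contiguous planar view; the element count doubles as the plane offset.
class PlanarView {
 public:
  PlanarView(float* data, std::size_t n) : data_(data), n_(n) {}
  PlanarProxy operator()(std::size_t i) {
    assert(i < n_);
    return PlanarProxy(data_, i, n_);
  }

 private:
  float* data_;
  std::size_t n_;
};

}  // namespace sketch
```

A write such as `view(1) = sketch::HalfComplex(2.f, -3.f)` lands in two separate locations, `data[1]` and `data[n + 1]`, which is why a plain reference return type cannot work here.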

Confidence Score: 4/5

Core planar GEMM and tensor machinery is correct; two P1 issues in the cuSPARSE/sparse-matmul fallback paths need review before merge.

The three previously flagged critical issues (SetOp EPT regression, TotalSize non-contiguous offset, c_adj ldc mismatch) are all resolved. Two new P1 findings remain in the cuSPARSE fallbacks: in both files the isSameView write-back guard is dead code that obscures intent. The unconditional scalar EPT in ReshapeOp is a P2 performance concern. All other findings are style/cleanup.

include/matx/transforms/convert/sparse2dense_cusparse.h and include/matx/transforms/matmul/matmul_cusparse.h (dead write-back guards); include/matx/operators/reshape.h (unconditional scalar EPT).
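For context on the resolved TotalSize offset issue, the sketch below shows why using the element count as the plane offset is only valid for contiguous tensors. The helpers `is_contiguous` and `total_size` are hypothetical and assume row-major shapes and strides measured in elements; they are not the ValidatePlanarLayoutOnCreate_ implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of elements described by a shape.
std::size_t total_size(const std::vector<std::size_t>& shape) {
  std::size_t n = 1;
  for (std::size_t s : shape) n *= s;
  return n;
}

// Hypothetical helper: true when the strides describe a dense row-major
// layout, i.e. unit innermost stride and no gaps between rows.
bool is_contiguous(const std::vector<std::size_t>& shape,
                   const std::vector<std::size_t>& strides) {
  assert(shape.size() == strides.size());
  std::size_t expected = 1;
  for (std::size_t d = shape.size(); d-- > 0;) {
    if (strides[d] != expected) return false;
    expected *= shape[d];
  }
  return true;
}
```

A 4x4 view sliced out of a 4x8 parent keeps the parent's row stride of 8, so its own element count (16) no longer matches the distance to the parent's imaginary plane (32). Rejecting such views at construction is what keeps the offset computation safe.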

Important Files Changed

| Filename | Overview |
| --- | --- |
| include/matx/core/half_complex.h | Adds matxFp16ComplexPlanar and matxBf16ComplexPlanar tag types that inherit from the interleaved counterparts; clean, minimal change. |
| include/matx/core/type_utils_both.h | Adds planar types to all relevant type-trait concepts/variables and introduces is_planar_complex_v. |
| include/matx/core/tensor_impl.h | Adds PlanarComplexProxy for non-addressable planar memory, LoadPlanarComplex/StorePlanarComplex helpers, and routes operator() through them; proxy lifetime and EPT forcing look correct but proxy return-type changes are subtle. |
| include/matx/core/tensor.h | Adds ValidatePlanarLayoutOnCreate_ to all constructors and Reset overloads to enforce contiguous unit-stride constraint for planar types at construction time. |
| include/matx/operators/set.h | Gates scalar EPT on planar-complex output type only, preserving vectorized EPT for non-planar SetOp; addresses the previous regression comment. |
| include/matx/transforms/matmul/matmul_cuda.h | Skips interleaved-to-planar conversion for already-planar inputs/outputs; updates ldc to c.Size(RANK-1) for complex-half; addresses previous c_adj pointer mismatch comment. |
| include/matx/operators/reshape.h | Forces scalar EPT for all ReshapeOp instances unconditionally (not scoped to planar types); also adds initializer-list constraint to prevent ambiguous overload resolution. |
| include/matx/transforms/fft/fft_fftw.h | Adds num_threads to FFTW params struct, hash, and equality check; fixes a latent cache key collision when the same FFT shape is called with different thread counts. |
| include/matx/transforms/convert/sparse2dense_cusparse.h | Replaces hard MATX_ASSERT with a contiguous-temporary fallback; the write-back guard !o.isSameView(O) is dead code in practice. |
| include/matx/transforms/matmul/matmul_cusparse.h | Similarly replaces a hard stride assertion with a contiguous-temporary fallback; mirrors the sparse2dense approach with the same dead write-back guard concern. |
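The fft_fftw.h cache-key fix can be illustrated with a small sketch. The types below (`PlanKey`, `PlanCache`, and an integer plan id in place of a real FFTW plan handle) are hypothetical; the point is only that a field left out of the key, its hash, or its equality check lets two different configurations share one cached plan.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>

// Hypothetical cache key; num_threads is part of the plan's identity.
struct PlanKey {
  std::size_t n;
  int direction;
  int num_threads;
};

struct PlanKeyHash {
  std::size_t operator()(const PlanKey& k) const noexcept {
    std::size_t h = std::hash<std::size_t>{}(k.n);
    auto mix = [&h](std::size_t v) {
      h ^= v + 0x9e3779b9u + (h << 6) + (h >> 2);
    };
    mix(std::hash<int>{}(k.direction));
    mix(std::hash<int>{}(k.num_threads));  // omitting this only hurts speed
    return h;
  }
};

struct PlanKeyEq {
  bool operator()(const PlanKey& a, const PlanKey& b) const noexcept {
    // Omitting num_threads here is what would cause a wrong cache hit.
    return a.n == b.n && a.direction == b.direction &&
           a.num_threads == b.num_threads;
  }
};

// Stand-in cache: returns a fresh id the first time a key is seen.
class PlanCache {
 public:
  int get(const PlanKey& k) {
    auto it = plans_.find(k);
    if (it == plans_.end()) it = plans_.emplace(k, next_id_++).first;
    return it->second;
  }
  std::size_t size() const { return plans_.size(); }

 private:
  std::unordered_map<PlanKey, int, PlanKeyHash, PlanKeyEq> plans_;
  int next_id_ = 0;
};
```

With this key, the same FFT shape requested with 1 thread and with 8 threads yields two distinct cache entries instead of reusing the first plan.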

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["User calls matmul(a, b) → c\nwhere types are complex half"] --> B{is_complex_half_v?}
    B -- No --> Z["Normal GEMM path"]
    B -- Yes --> C{a_is_planar?}
    C -- No --> D["Alloc a_hp\nplanar(a) → a_planar\na_adj.Reset(a_planar.Data())"]
    C -- Yes --> E["a_adj unchanged\n(already planar layout)"]
    D --> F{b_is_planar?}
    E --> F
    F -- No --> G["Alloc b_hp\nplanar(b) → b_planar\nb_adj.Reset(b_planar.Data())"]
    F -- Yes --> H["b_adj unchanged"]
    G --> I{c_is_planar?}
    H --> I
    I -- No --> J["Alloc c_hp\nc_adj.Reset(c_planar.Data())"]
    I -- Yes --> K["c_adj.Reset(c.Data())\n(no-op, already correct)"]
    J --> L["cuBLASLt / cuBLAS GEMM\nusing a_adj, b_adj, c_adj\nparams.ldc = c.Size(RANK-1)"]
    K --> L
    L --> M{c_is_planar?}
    M -- No --> N["interleaved(c_planar) → c\n(convert back to user buffer)"]
    M -- Yes --> O["c already holds planar result\nno conversion needed"]
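The per-operand decisions in the flowchart can be mimicked on the host with a small reference GEMM. This is a sketch under stated assumptions: `Matrix`, `Layout`, `split_planes`, `merge_planes`, and `complex_gemm` are hypothetical names, storage is row-major `float`, and a naive triple loop stands in for the cuBLAS call. The function returns how many layout conversions it had to perform.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class Layout { Interleaved, Planar };

struct Matrix {
  std::size_t rows, cols;
  Layout layout;
  std::vector<float> data;  // 2 * rows * cols floats, row-major
};

static std::vector<float> split_planes(const std::vector<float>& x) {
  const std::size_t n = x.size() / 2;
  std::vector<float> out(2 * n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = x[2 * i];
    out[n + i] = x[2 * i + 1];
  }
  return out;
}

static std::vector<float> merge_planes(const std::vector<float>& p) {
  const std::size_t n = p.size() / 2;
  std::vector<float> out(2 * n);
  for (std::size_t i = 0; i < n; ++i) {
    out[2 * i] = p[i];
    out[2 * i + 1] = p[n + i];
  }
  return out;
}

// Computes c = a * b on planar data; returns the number of conversions done.
int complex_gemm(const Matrix& a, const Matrix& b, Matrix& c) {
  assert(a.cols == b.rows && c.rows == a.rows && c.cols == b.cols);
  const std::size_t m = a.rows, k = a.cols, n = b.cols;
  int conversions = 0;
  std::vector<float> a_tmp, b_tmp;
  const std::vector<float>* ap = &a.data;  // already planar: use as-is
  const std::vector<float>* bp = &b.data;
  if (a.layout != Layout::Planar) { a_tmp = split_planes(a.data); ap = &a_tmp; ++conversions; }
  if (b.layout != Layout::Planar) { b_tmp = split_planes(b.data); bp = &b_tmp; ++conversions; }

  std::vector<float> cp(2 * m * n, 0.f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t p = 0; p < k; ++p) {
        const float ar = (*ap)[i * k + p], ai = (*ap)[m * k + i * k + p];
        const float br = (*bp)[p * n + j], bi = (*bp)[k * n + p * n + j];
        cp[i * n + j] += ar * br - ai * bi;          // real plane
        cp[m * n + i * n + j] += ar * bi + ai * br;  // imaginary plane
      }

  if (c.layout == Layout::Planar) {
    c.data = std::move(cp);  // result is already in the caller's layout
  } else {
    c.data = merge_planes(cp);  // convert back for an interleaved output
    ++conversions;
  }
  return conversions;
}
```

With all three operands interleaved the sketch performs three conversions; with all three planar it performs none, which is the saving the planar types are meant to expose.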

Reviews (5): Last reviewed commit: "Fix failing sparse and reshape unit test..."

@cliffburdick
Collaborator Author

/build

