Skip to content

Allow O2K_CONST_VEC to handle some TYP_SIMD32/64#127390

Closed
tannergooding wants to merge 3 commits intodotnet:mainfrom
tannergooding:vec-aprop-const2
Closed

Allow O2K_CONST_VEC to handle some TYP_SIMD32/64#127390
tannergooding wants to merge 3 commits intodotnet:mainfrom
tannergooding:vec-aprop-const2

Conversation

@tannergooding
Copy link
Copy Markdown
Member

Follow up to #127124. It handles certain TYP_SIMD32/64 values by checking if they repeat the same TYP_SIMD16 value across all "lanes". This allows handling many of the most common cases such as Zero, AllBitsSet, Create(cns), etc.

There are some remaining cases like Indices, CreateSequence(...) and others that aren't handled yet and might be good to support in the future, but that can be expanded as needed by having a set of flags for the vector constant case of the union.

Copilot AI review requested due to automatic review settings April 24, 2026 14:55
@github-actions github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 24, 2026
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is a follow-up to the introduction of O2K_CONST_VEC, extending assertion propagation so it can handle common TYP_SIMD32/64 (Vector256/Vector512 on xarch) constants when they are broadcasts of the low TYP_SIMD16 lane, avoiding the need for heap allocation while still covering core constants like Zero, AllBitsSet, and Create(cns).

Changes:

  • Teach optCreateAssertion to accept TYP_SIMD32/64 GT_CNS_VEC constants when they are lane-broadcasts, storing only the low 16 bytes in the assertion payload.
  • Extend optAssertionGenJtrue handling to include Vector256/Vector512 equality/inequality intrinsics on xarch.
  • Teach optConstantAssertionProp to reconstruct TYP_SIMD32/64 constants by broadcasting the stored simd16_t into the upper lanes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/coreclr/jit/compiler.h Updates/clarifies the O2K_CONST_VEC storage comment to document the “store v128, support wider via broadcast” approach.
src/coreclr/jit/assertionprop.cpp Implements broadcast-detection for TYP_SIMD32/64 constant assertions (xarch), expands recognized SIMD equality intrinsics, and reconstructs widened constants during propagation.

Comment thread src/coreclr/jit/assertionprop.cpp Outdated
Comment thread src/coreclr/jit/assertionprop.cpp Outdated
Comment thread src/coreclr/jit/assertionprop.cpp Outdated
// sizes in the future by adding m_encodedVconFlags and tracking
// whether a different heap allocated value was used or other
// special cases like Indices or Sequences
simd16_t m_simdVal;
Copy link
Copy Markdown
Member

@EgorBo EgorBo Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth the complixity to detect SIMD32/SIMD64 that we can fit into simd16_t or we just add simd_t* to this union and allocate it on heap? Less changes + will support all kinds of constants

Copy link
Copy Markdown
Member Author

@tannergooding tannergooding Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume it is and may even be worthwhile to do for VEC_CNS as well. -- This is based on past throughput and allocation considerations we saw for VN and other phases.

Particularly on xarch, we are most typically hitting the TYP_SIMD32 path for hardware and so this would mean a lot more constants are heap allocating when they don't need to be.

We'd have to have a if (size > 16) { speciallyHandleTheHeapAlloc } anyways with such a scenario, so this approach shouldn't be adding extra overhead. Rather, it just optimizes and avoids the allocation for the most common cases. -- and I'd actually expect assertion prop doesn't care about most constants, particularly "arbitrary" ones, so it may be unimportant to ever add the heap allocation support.

We may even want to consider doing the same for GT_CNS_VEC so that we aren't forced to have a large node for all constants and to avoid needing to do repeat checks like "is this a broadcast".

Copy link
Copy Markdown
Member

@EgorBo EgorBo Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding I am not sure I follow. Your PR tries to fit CNS_VEC for SIMD32 and SIMD64 into simd16_t in AssertionDsc2 for things where 16 bytes components are the same. I just don't see why if we can support all kinds of constants naturally, just that SIMD32 and SIMD64 will require a small arena allocation because 99% of assertions are used for non-SIMD stuff and we don't want to increase the struct size for this nieche case (pay to play). Basically, I suggest this change: https://github.com/EgorBo/runtime-1/pull/new/jit/wide-simd-assertions (just vibe-coded, probably can be simplified). I doubt we will see any noticieable regression anywhere (both TP and alloc size).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example, my commit can fold this:

Vector256<long> Test(ref Vector256<long> v)
{
    var a = v;
    if (a == Vector256.Create(1, 2, 3, 4))
        return a + a; // folded into [2,4,6,8]
    return default;
}

while your cannot.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't see why if we can support all kinds of constants naturally, just that SIMD32 and SIMD64 will require a small arena allocation

That would work too. The downside is that it's forcing an allocation for the most common cases when we can trivially avoid it for the majority scenario (most SIMD constants are broadcastable or repeating the V128 value).

Based on what we had seen in VN, there were enough simd32/64 constants that this added up and pessimized throughput. We fixed that by splitting it out into separate maps. We had similarly seen even in GenTreeVecCon and other areas that the cost of the size check and doing different things for each vector size was significantly less than always touching a full cache line.

I would expect that the "optimization" this PR is doing is still beneficial even with your full fix and that we may even want to do something similar for GenTreeVecCon itself to reduce the amount of data that has to be checked for the most common patterns/optimizations.


I'm fine with waiting for the "full fix" to go in and then comparing SPMI metrics or factoring it into this PR so we can compare and choose. Thoughts?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertions are created only for x VectorX_op_Equality cns today - do we really have so many of them to have an impact on tp?

It's small "now" but may matter more as this expands. It can be useful for optimizing out a lot of edge cases for floating-point, conversions, and other scenarios when we know a valid is within a given constraint.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you, https://github.com/EgorBo/runtime-1/pull/new/jit/wide-simd-assertions is less changes, generates -8,949 bytes on Windows-x64 locally

Do you have numbers on the throughput (I'm guessing near 0%) and allocation impact?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feel free to submit it as a PR to grab real numbers, i'm 100% sure TP is 0, alloc rate might be visible (>0) but not big enough to care IMO

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi: this pr produced -3,260 diff on win-x64 and simd_t* is -8,949

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract it to #127401 to see the metrics and get full differences. We can take either or both, depending on what the numbers report.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 24, 2026 15:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Extends JIT assertion propagation for O2K_CONST_VEC so it can cover common TYP_SIMD32/64 constants by treating them as a repeated/broadcast TYP_SIMD16 payload when all 128-bit lanes match. This builds on the existing vector-constant assertion kind to enable more constant propagation opportunities without introducing heap allocation for large vector constants.

Changes:

  • Teach assertion creation (optCreateAssertion) to accept TYP_SIMD32/64 GT_CNS_VEC when the value is a repeated TYP_SIMD16 across all lanes (xarch).
  • Teach constant assertion propagation (optConstantAssertionProp) to materialize TYP_SIMD32/64 constants by broadcasting the stored simd16_t payload (xarch).
  • Enable assertion generation for Vector256/Vector512 equality/inequality intrinsics in optAssertionGenJtrue (xarch).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/coreclr/jit/compiler.h Updates O2K_CONST_VEC storage comment to document the “store v128, support v256/v512 via broadcast” approach.
src/coreclr/jit/assertionprop.cpp Adds lane-equality checks for TYP_SIMD32/64 constants and broadcasts the stored simd16_t when propagating back to TYP_SIMD32/64 uses; expands supported vector equality/inequality intrinsics.

@tannergooding
Copy link
Copy Markdown
Member Author

Closing in favor of #127402

The heap allocations are indeed not meaningful at scale and its better to handle all constants.

@tannergooding tannergooding deleted the vec-aprop-const2 branch April 25, 2026 00:04
tannergooding added a commit that referenced this pull request Apr 25, 2026
This provides O2K_CONST_VEC support for all SIMD constants by heap
allocating for values larger than TYP_SIMD16. It is an alternative to
#127390.

---------

Co-authored-by: EgorBo <egorbo@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants