Allow O2K_CONST_VEC to handle some TYP_SIMD32/64 #127390
tannergooding wants to merge 3 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR is a follow-up to the introduction of `O2K_CONST_VEC`, extending assertion propagation so it can handle common `TYP_SIMD32/64` (`Vector256`/`Vector512` on xarch) constants when they are broadcasts of the low `TYP_SIMD16` lane, avoiding the need for heap allocation while still covering core constants like `Zero`, `AllBitsSet`, and `Create(cns)`.
Changes:
- Teach `optCreateAssertion` to accept `TYP_SIMD32/64` `GT_CNS_VEC` constants when they are lane-broadcasts, storing only the low 16 bytes in the assertion payload.
- Extend `optAssertionGenJtrue` handling to include `Vector256`/`Vector512` equality/inequality intrinsics on xarch.
- Teach `optConstantAssertionProp` to reconstruct `TYP_SIMD32/64` constants by broadcasting the stored `simd16_t` into the upper lanes.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/coreclr/jit/compiler.h | Updates/clarifies the O2K_CONST_VEC storage comment to document the “store v128, support wider via broadcast” approach. |
| src/coreclr/jit/assertionprop.cpp | Implements broadcast-detection for TYP_SIMD32/64 constant assertions (xarch), expands recognized SIMD equality intrinsics, and reconstructs widened constants during propagation. |
// sizes in the future by adding m_encodedVconFlags and tracking
// whether a different heap allocated value was used or other
// special cases like Indices or Sequences
simd16_t m_simdVal;
Do you think it's worth the complexity to detect SIMD32/SIMD64 values that we can fit into `simd16_t`, or should we just add `simd_t*` to this union and allocate it on the heap? Fewer changes + it will support all kinds of constants.
I would assume it is and may even be worthwhile to do for VEC_CNS as well. -- This is based on past throughput and allocation considerations we saw for VN and other phases.
Particularly on xarch, we are most typically hitting the TYP_SIMD32 path for hardware and so this would mean a lot more constants are heap allocating when they don't need to be.
We'd have to have an `if (size > 16) { speciallyHandleTheHeapAlloc }` anyways with such a scenario, so this approach shouldn't be adding extra overhead. Rather, it just optimizes and avoids the allocation for the most common cases. -- and I'd actually expect assertion prop doesn't care about most constants, particularly "arbitrary" ones, so it may be unimportant to ever add the heap allocation support.
We may even want to consider doing the same for GT_CNS_VEC so that we aren't forced to have a large node for all constants and to avoid needing to do repeat checks like "is this a broadcast".
@tannergooding I am not sure I follow. Your PR tries to fit `CNS_VEC` for SIMD32 and SIMD64 into `simd16_t` in `AssertionDsc2` for cases where the 16-byte components are the same. I just don't see why, when we can support all kinds of constants naturally; just that SIMD32 and SIMD64 will require a small arena allocation, because 99% of assertions are used for non-SIMD stuff and we don't want to increase the struct size for this niche case (pay to play). Basically, I suggest this change: https://github.com/EgorBo/runtime-1/pull/new/jit/wide-simd-assertions (just vibe-coded, probably can be simplified). I doubt we will see any noticeable regression anywhere (both TP and alloc size).
Example, my commit can fold this:
Vector256<long> Test(ref Vector256<long> v)
{
    var a = v;
    if (a == Vector256.Create(1, 2, 3, 4))
        return a + a; // folded into [2,4,6,8]
    return default;
}

while yours cannot.
> I just don't see why if we can support all kinds of constants naturally, just that SIMD32 and SIMD64 will require a small arena allocation
That would work too. The downside is that it's forcing an allocation for the most common cases when we can trivially avoid it for the majority scenario (most SIMD constants are broadcastable or repeating the V128 value).
Based on what we had seen in VN, there were enough simd32/64 constants that this added up and pessimized throughput. We fixed that by splitting it out into separate maps. We had similarly seen even in GenTreeVecCon and other areas that the cost of the size check and doing different things for each vector size was significantly less than always touching a full cache line.
I would expect that the "optimization" this PR is doing is still beneficial even with your full fix and that we may even want to do something similar for GenTreeVecCon itself to reduce the amount of data that has to be checked for the most common patterns/optimizations.
I'm fine with waiting for the "full fix" to go in and then comparing SPMI metrics or factoring it into this PR so we can compare and choose. Thoughts?
Assertions are created only for `x VectorX_op_Equality cns` today - do we really have so many of them to have an impact on TP?
It's small "now" but may matter more as this expands. It can be useful for optimizing out a lot of edge cases for floating-point, conversions, and other scenarios when we know a value is within a given constraint.
Up to you; https://github.com/EgorBo/runtime-1/pull/new/jit/wide-simd-assertions involves fewer changes and generates -8,949 bytes on Windows-x64 locally
Do you have numbers on the throughput (I'm guessing near 0%) and allocation impact?
Feel free to submit it as a PR to grab real numbers. I'm 100% sure TP is 0; the alloc rate might be visible (>0) but not big enough to care about, IMO
FYI: this PR produced a -3,260 byte diff on win-x64 and the `simd_t*` approach gives -8,949
Extracted it to #127401 to see the metrics and get full differences. We can take either or both, depending on what the numbers report.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
Extends JIT assertion propagation for O2K_CONST_VEC so it can cover common TYP_SIMD32/64 constants by treating them as a repeated/broadcast TYP_SIMD16 payload when all 128-bit lanes match. This builds on the existing vector-constant assertion kind to enable more constant propagation opportunities without introducing heap allocation for large vector constants.
Changes:
- Teach assertion creation (`optCreateAssertion`) to accept `TYP_SIMD32/64` `GT_CNS_VEC` when the value is a repeated `TYP_SIMD16` across all lanes (xarch).
- Teach constant assertion propagation (`optConstantAssertionProp`) to materialize `TYP_SIMD32/64` constants by broadcasting the stored `simd16_t` payload (xarch).
- Enable assertion generation for `Vector256`/`Vector512` equality/inequality intrinsics in `optAssertionGenJtrue` (xarch).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/coreclr/jit/compiler.h | Updates O2K_CONST_VEC storage comment to document the “store v128, support v256/v512 via broadcast” approach. |
| src/coreclr/jit/assertionprop.cpp | Adds lane-equality checks for TYP_SIMD32/64 constants and broadcasts the stored simd16_t when propagating back to TYP_SIMD32/64 uses; expands supported vector equality/inequality intrinsics. |
Closing in favor of #127402. The heap allocations are indeed not meaningful at scale and it's better to handle all constants.
This provides O2K_CONST_VEC support for all SIMD constants by heap allocating for values larger than TYP_SIMD16. It is an alternative to #127390.

Co-authored-by: EgorBo <egorbo@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Follow up to #127124. It handles certain `TYP_SIMD32/64` values by checking if they repeat the same `TYP_SIMD16` value across all "lanes". This allows handling many of the most common cases such as `Zero`, `AllBitsSet`, `Create(cns)`, etc.

There are some remaining cases like `Indices`, `CreateSequence(...)`, and others that aren't handled yet and might be good to support in the future, but that can be expanded as needed by having a set of flags for the vector constant case of the union.