Description
This is just to report some findings from investigating a failure in the translated LAPACK project's GitHub runners.
The runner architecture is randomly assigned between:
- AMD EPYC 7763 64-Core Processor
- Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
This is generally fine under OPENBLAS_CORETYPE=Haswell + Windows + MSYS2 + CLANG64 for both platforms.
However, both the drot and srot Haswell kernels have scalar tails (lines 79-83 in drot, lines 80-84 in srot) where the computation is left to the compiler, whereas the SkylakeX kernels use masked SIMD operations for their tails.
OpenBLAS/kernel/x86_64/drot_microk_haswell-2.c, lines 79 to 83 in 18638c7:

```c
for (i = tail_index_4; i < n; ++i) {
    FLOAT temp = c * x[i] + s * y[i];
    y[i] = c * y[i] - s * x[i];
    x[i] = temp;
}
```
OpenBLAS/kernel/x86_64/drot_microk_skylakex-2.c, lines 85 to 95 in 18638c7:

```c
if ((n1 & 7) > 0) {
    unsigned char tail_mask8 = (((unsigned char) 0xff) >> (8 - (n1 & 7)));
    __m512d tail_x = _mm512_maskz_loadu_pd(*((__mmask8*) &tail_mask8), &x[tail_index_8]);
    __m512d tail_y = _mm512_maskz_loadu_pd(*((__mmask8*) &tail_mask8), &y[tail_index_8]);
    __m512d temp = _mm512_mul_pd(s_512, tail_y);
    temp = _mm512_fmadd_pd(c_512, tail_x, temp);
    _mm512_mask_storeu_pd(&x[tail_index_8], *((__mmask8*) &tail_mask8), temp);
    temp = _mm512_mul_pd(s_512, tail_x);
    temp = _mm512_fmsub_pd(c_512, tail_y, temp);
    _mm512_mask_storeu_pd(&y[tail_index_8], *((__mmask8*) &tail_mask8), temp);
}
```
So as far as I understand, the scalar `c * x[i] + s * y[i]` leaves the FMA decomposition up to the compiler. GCC could emit any of:

1. `mul(s, y); fmadd(c, x, sy)` — matches the SIMD body (intermediate rounding on s·y)
2. `mul(c, x); fmadd(s, y, cx)` — different (intermediate rounding on c·x instead)
3. `mul(c, x); mul(s, y); add(cx, sy)` — no FMA at all, three roundings

Options 2 and 3 give bit-level results that differ from the SIMD body for the same input values.
Why the element count matters: when dhgeqz calls cblas_drot with job="S" vs job="E", the same physical matrix element sits at a different offset within the call. With job="S" (ifrstm=0, count=n), the element at row ilo is at offset ilo; with job="E" (ifrstm=ilo, count=ihi-ilo+1), the same element is at offset 0. Whether that element falls in the SIMD body or the scalar tail depends on count % 4, and the count differs between the two calls.
SkylakeX is not affected because its tail uses the exact same _mm512_fmadd_pd intrinsics as the main body, just with masked load/store. Every element gets identical arithmetic regardless of where it falls.
Zen/Haswell is fragile because elements near the end of the call spill into a scalar tail whose FMA ordering is chosen by the compiler and can differ from the explicit SIMD intrinsics. So something like the following might work, though my SIMD knowledge is not as up to date as I would like it to be.
```c
// At the top we define these unconditionally
__m256d c_256 = _mm256_set1_pd(c);
__m256d s_256 = _mm256_set1_pd(s);
....
if ((n & 3) > 0) {
    const int64_t mask_v[8] = {-1, -1, -1, -1, 0, 0, 0, 0};
    __m256i tail_mask = _mm256_loadu_si256((__m256i*)&mask_v[4 - (n & 3)]);
    x0 = _mm256_maskload_pd(&x[tail_index_4], tail_mask);
    y0 = _mm256_maskload_pd(&y[tail_index_4], tail_mask);
    t0 = _mm256_mul_pd(s_256, y0);
    t0 = _mm256_fmadd_pd(c_256, x0, t0);
    _mm256_maskstore_pd(&x[tail_index_4], tail_mask, t0);
    t0 = _mm256_mul_pd(s_256, x0);
    t0 = _mm256_fmsub_pd(c_256, y0, t0);
    _mm256_maskstore_pd(&y[tail_index_4], tail_mask, t0);
}
```

I'm a bit parroting what I see in SkylakeX though, so feel free to ignore what I have here.
Alternatively, Clang seems to respect the order it sees (emphasis on seems), but obviously there are no guarantees. Tomorrow something might change and it could start reordering too.