
(S/D)ROT Haswell kernel does not round the tail as its SIMD body. #5658

@ilayn

Description

This is just to report some findings from investigating a failure on the translated LAPACK project's GitHub runners.

The runner architecture is randomly assigned to one of:

  • AMD EPYC 7763 64-Core Processor
  • Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz

This is generally fine under OPENBLAS_CORETYPE=Haswell + Windows + MSYS2 + CLANG64 for both platforms.

However, both the drot and srot Haswell kernels have scalar tails (lines 79-83 in drot, lines 80-84 in srot) where the computation is left to the compiler, while the SkylakeX kernels use masked SIMD operations for their tails. The Haswell scalar tail:

for (i = tail_index_4; i < n; ++i) {
    FLOAT temp = c * x[i] + s * y[i];
    y[i] = c * y[i] - s * x[i];
    x[i] = temp;
}

The SkylakeX masked tail, for comparison:

if ((n1 & 7) > 0) {
    unsigned char tail_mask8 = (((unsigned char) 0xff) >> (8 - (n1 & 7)));
    __m512d tail_x = _mm512_maskz_loadu_pd(*((__mmask8*) &tail_mask8), &x[tail_index_8]);
    __m512d tail_y = _mm512_maskz_loadu_pd(*((__mmask8*) &tail_mask8), &y[tail_index_8]);
    __m512d temp = _mm512_mul_pd(s_512, tail_y);
    temp = _mm512_fmadd_pd(c_512, tail_x, temp);
    _mm512_mask_storeu_pd(&x[tail_index_8], *((__mmask8*) &tail_mask8), temp);
    temp = _mm512_mul_pd(s_512, tail_x);
    temp = _mm512_fmsub_pd(c_512, tail_y, temp);
    _mm512_mask_storeu_pd(&y[tail_index_8], *((__mmask8*) &tail_mask8), temp);
}

So as far as I understand, the scalar c * x[i] + s * y[i] leaves the FMA decomposition up to the compiler. GCC could emit any of:

1. mul(s,y); fmadd(c, x, sy) : matches SIMD (intermediate rounding on s·y)
2. mul(c,x); fmadd(s, y, cx) : different (intermediate rounding on c·x instead)
3. mul(c,x); mul(s,y); add(cx, sy) : no FMA at all, three roundings

Options 2 and 3 give different bit-level results from the SIMD body for the same input values.

The reason the count matters: when dhgeqz calls cblas_drot with job="S" vs job="E", the same physical matrix element sits at a different offset within the call. With job="S" (ifrstm=0, count=n), the element at row ilo is at offset ilo. With job="E" (ifrstm=ilo, count=ihi-ilo+1), the same element is at offset 0. Whether an element falls in the SIMD body or the scalar tail depends on count % 4, and the count differs between the two calls.

SkylakeX is not affected because its tail uses the exact same _mm512_fmadd_pd intrinsics as the main body, just with masked load/store. Every element gets identical arithmetic regardless of where it falls.

The Zen/Haswell kernel is fragile because elements near the end of the call spill into a scalar tail whose FMA ordering is compiler-determined and can differ from the explicit SIMD intrinsics. Something like the following might work, though my SIMD is not as up to date as I would like it to be.

    // At the top we define these unconditionally
    __m256d c_256 = _mm256_set1_pd(c);
    __m256d s_256 = _mm256_set1_pd(s);

    ....

    if ((n & 3) > 0) {
        const int64_t mask_v[8] = {-1,-1,-1,-1, 0,0,0,0};
        __m256i tail_mask = _mm256_loadu_si256((__m256i*)&mask_v[4 - (n & 3)]);

        x0 = _mm256_maskload_pd(&x[tail_index_4], tail_mask);
        y0 = _mm256_maskload_pd(&y[tail_index_4], tail_mask);

        t0 = _mm256_mul_pd(s_256, y0);
        t0 = _mm256_fmadd_pd(c_256, x0, t0);
        _mm256_maskstore_pd(&x[tail_index_4], tail_mask, t0);

        t0 = _mm256_mul_pd(s_256, x0);
        t0 = _mm256_fmsub_pd(c_256, y0, t0);
        _mm256_maskstore_pd(&y[tail_index_4], tail_mask, t0);
    }

I'm a bit parroting what I see in the SkylakeX kernel though, so feel free to ignore what I have here.

Alternatively, Clang seems to preserve the order as written (emphasis on seems), but there are obviously no guarantees; tomorrow something might change and it could start reordering too.
