Description
This is just to report some findings from investigating a failure in the translated LAPACK project's GitHub runners.
The runner architecture is randomly assigned between:
- AMD EPYC 7763 64-Core Processor
- Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
This is generally fine under OPENBLAS_CORETYPE=Haswell + Windows + MSYS2 + CLANG64 for both platforms.
However, both the drot and srot Haswell kernels have scalar tails (lines 79-83 in drot, lines 80-84 in srot) where the computation is left to the compiler, whereas the SkylakeX kernels use masked SIMD operations for their tails.
OpenBLAS/kernel/x86_64/drot_microk_haswell-2.c, lines 79 to 83 in 18638c7:

```c
for (i = tail_index_4; i < n; ++i) {
    FLOAT temp = c * x[i] + s * y[i];
    y[i] = c * y[i] - s * x[i];
    x[i] = temp;
}
```
OpenBLAS/kernel/x86_64/drot_microk_skylakex-2.c, lines 85 to 95 in 18638c7:

```c
if ((n1 & 7) > 0) {
    unsigned char tail_mask8 = (((unsigned char) 0xff) >> (8 - (n1 & 7)));
    __m512d tail_x = _mm512_maskz_loadu_pd(*((__mmask8*) &tail_mask8), &x[tail_index_8]);
    __m512d tail_y = _mm512_maskz_loadu_pd(*((__mmask8*) &tail_mask8), &y[tail_index_8]);
    __m512d temp = _mm512_mul_pd(s_512, tail_y);
    temp = _mm512_fmadd_pd(c_512, tail_x, temp);
    _mm512_mask_storeu_pd(&x[tail_index_8], *((__mmask8*) &tail_mask8), temp);
    temp = _mm512_mul_pd(s_512, tail_x);
    temp = _mm512_fmsub_pd(c_512, tail_y, temp);
    _mm512_mask_storeu_pd(&y[tail_index_8], *((__mmask8*) &tail_mask8), temp);
}
```
So as far as I understand, the scalar `c * x[i] + s * y[i]` leaves the FMA decomposition up to the compiler. GCC could emit any of:

1. `mul(s, y); fmadd(c, x, sy)` — matches the SIMD body (intermediate rounding on s·y)
2. `mul(c, x); fmadd(s, y, cx)` — different (intermediate rounding on c·x instead)
3. `mul(c, x); mul(s, y); add(cx, sy)` — no FMA at all, three roundings

Options 2 and 3 give bit-level results that differ from the SIMD body for the same input values.
Why the element count matters: when dhgeqz calls cblas_drot with job="S" vs job="E", the same physical matrix element sits at a different offset within the call. With job="S" (ifrstm=0, count=n), the element at row ilo is at offset ilo; with job="E" (ifrstm=ilo, count=ihi-ilo+1), the same element is at offset 0. Whether that element falls in the SIMD body or the scalar tail depends on count % 4, and the count differs between the two calls.
SkylakeX is not affected because its tail uses the exact same _mm512_fmadd_pd intrinsics as the main body, just with masked load/store. Every element gets identical arithmetic regardless of where it falls.
Zen/Haswell is fragile because elements near the end of the call spill into a scalar tail whose FMA ordering is chosen by the compiler and can differ from the explicit SIMD intrinsics. So something like the following might work, though my SIMD knowledge is not as up to date as I would like it to be.
```c
// At the top we define these unconditionally
__m256d c_256 = _mm256_set1_pd(c);
__m256d s_256 = _mm256_set1_pd(s);
....
if ((n & 3) > 0) {
    const int64_t mask_v[8] = {-1, -1, -1, -1, 0, 0, 0, 0};
    __m256i tail_mask = _mm256_loadu_si256((__m256i*)&mask_v[4 - (n & 3)]);
    x0 = _mm256_maskload_pd(&x[tail_index_4], tail_mask);
    y0 = _mm256_maskload_pd(&y[tail_index_4], tail_mask);
    t0 = _mm256_mul_pd(s_256, y0);
    t0 = _mm256_fmadd_pd(c_256, x0, t0);
    _mm256_maskstore_pd(&x[tail_index_4], tail_mask, t0);
    t0 = _mm256_mul_pd(s_256, x0);
    t0 = _mm256_fmsub_pd(c_256, y0, t0);
    _mm256_maskstore_pd(&y[tail_index_4], tail_mask, t0);
}
```

I'm a bit parroting what I see in SkylakeX though, so feel free to ignore what I have here.
Alternatively, Clang seems to respect the order it sees (emphasis on seems), but obviously there are no guarantees. Tomorrow something might change and it could start reordering too.