Commit e102d32

Add floating point matrix multiply-accumulate widening intrinsics
Adds intrinsic support for the FMMLA matrix multiply-add widening instructions introduced by the 2024 dpISA:

* FEAT_F8F32MM: Neon/SVE2 FP8 to single-precision
* FEAT_F8F16MM: Neon/SVE2 FP8 to half-precision
* FEAT_SVE_F16F32MM: SVE half-precision to single-precision
1 parent: a4bc412

4 files changed: 67 additions, 0 deletions

main/acle.md

Lines changed: 50 additions & 0 deletions
@@ -466,6 +466,8 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
* Added feature test macro for FEAT_SSVE_FEXPA.
* Added feature test macro for FEAT_CSSC.
* Added support for FEAT_FPRCVT intrinsics and `__ARM_FEATURE_FPRCVT`.
* Added support for modal 8-bit floating-point matrix multiply-accumulate widening intrinsics.
* Added support for 16-bit floating-point matrix multiply-accumulate widening intrinsics.

### References
@@ -2354,6 +2356,26 @@ is hardware support for the SVE forms of these instructions and if the
associated ACLE intrinsics are available. This implies that
`__ARM_FEATURE_MATMUL_INT8` and `__ARM_FEATURE_SVE` are both nonzero.

##### Multiplication of modal 8-bit floating-point matrices

This section is in
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
extended in the future.

`__ARM_FEATURE_F8F16MM` is defined to `1` if there is hardware support
for the NEON and SVE modal 8-bit floating-point matrix multiply-accumulate to half-precision (FEAT_F8F16MM)
instructions and if the associated ACLE intrinsics are available.

`__ARM_FEATURE_F8F32MM` is defined to `1` if there is hardware support
for the NEON and SVE modal 8-bit floating-point matrix multiply-accumulate to single-precision (FEAT_F8F32MM)
instructions and if the associated ACLE intrinsics are available.

##### Multiplication of 16-bit floating-point matrices

`__ARM_FEATURE_SVE_F16F32MM` is defined to `1` if there is hardware support
for the SVE 16-bit floating-point to 32-bit floating-point matrix multiply and add
(FEAT_SVE_F16F32MM) instructions and if the associated ACLE intrinsics are available.

##### Multiplication of 32-bit floating-point matrices

`__ARM_FEATURE_SVE_MATMUL_FP32` is defined to `1` if there is hardware support
@@ -2646,6 +2668,9 @@ be found in [[BA]](#BA).
| [`__ARM_FEATURE_SVE_BITS`](#scalable-vector-extension-sve) | The number of bits in an SVE vector, when known in advance | 256 |
| [`__ARM_FEATURE_SVE_MATMUL_FP32`](#multiplication-of-32-bit-floating-point-matrices) | 32-bit floating-point matrix multiply extension (FEAT_F32MM) | 1 |
| [`__ARM_FEATURE_SVE_MATMUL_FP64`](#multiplication-of-64-bit-floating-point-matrices) | 64-bit floating-point matrix multiply extension (FEAT_F64MM) | 1 |
| [`__ARM_FEATURE_F8F16MM`](#multiplication-of-modal-8-bit-floating-point-matrices) | Modal 8-bit floating-point matrix multiply-accumulate to half-precision extension (FEAT_F8F16MM) | 1 |
| [`__ARM_FEATURE_F8F32MM`](#multiplication-of-modal-8-bit-floating-point-matrices) | Modal 8-bit floating-point matrix multiply-accumulate to single-precision extension (FEAT_F8F32MM) | 1 |
| [`__ARM_FEATURE_SVE_F16F32MM`](#multiplication-of-16-bit-floating-point-matrices) | 16-bit floating-point matrix multiply-accumulate to single-precision extension (FEAT_SVE_F16F32MM) | 1 |
| [`__ARM_FEATURE_SVE_MATMUL_INT8`](#multiplication-of-8-bit-integer-matrices) | SVE support for the integer matrix multiply extension (FEAT_I8MM) | 1 |
| [`__ARM_FEATURE_SVE_PREDICATE_OPERATORS`](#scalable-vector-extension-sve) | Level of support for C and C++ operators on SVE predicate types | 1 |
| [`__ARM_FEATURE_SVE_VECTOR_OPERATORS`](#scalable-vector-extension-sve) | Level of support for C and C++ operators on SVE vector types | 1 |
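
As a minimal usage sketch (illustrative only, not part of the committed text), the new macros are tested at compile time like any other ACLE feature-test macro; which one to check depends on the source and destination element types in use:

```c
/* Sketch: compile-time checks for the new matrix multiply-accumulate
 * widening extensions. */

#if defined(__ARM_FEATURE_F8F16MM)
/* NEON/SVE modal FP8 -> FP16 FMMLA intrinsics are available (FEAT_F8F16MM). */
#endif

#if defined(__ARM_FEATURE_F8F32MM)
/* NEON/SVE modal FP8 -> FP32 FMMLA intrinsics are available (FEAT_F8F32MM). */
#endif

#if defined(__ARM_FEATURE_SVE_F16F32MM)
/* SVE FP16 -> FP32 FMMLA intrinsics are available (FEAT_SVE_F16F32MM). */
#endif
```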
@@ -9383,6 +9408,31 @@ BFloat16 floating-point multiply vectors.
uint64_t imm_idx);
```

### SVE2 floating-point matrix multiply-accumulate instructions

#### FMMLA (widening, FP8 to FP16)

Modal 8-bit floating-point matrix multiply-accumulate to half-precision.
```c
// Only if (__ARM_FEATURE_SVE2 && __ARM_FEATURE_F8F16MM)
svfloat16_t svmmla[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
```

#### FMMLA (widening, FP8 to FP32)

Modal 8-bit floating-point matrix multiply-accumulate to single-precision.
```c
// Only if (__ARM_FEATURE_SVE2 && __ARM_FEATURE_F8F32MM)
svfloat32_t svmmla[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
```

#### FMMLA (widening, FP16 to FP32)

16-bit floating-point matrix multiply-accumulate to single-precision.
```c
// Only if __ARM_FEATURE_SVE_F16F32MM
svfloat32_t svmmla[_f32_f16](svfloat32_t zda, svfloat16_t zn, svfloat16_t zm);
```

### SVE2.1 instruction intrinsics

The specification for SVE2.1 is in
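
To make the new SVE signatures above concrete, here is a small usage sketch (illustrative only: the wrapper function names are invented for the example, and the `fpm_t` value is assumed to have been configured elsewhere with the desired FP8 source formats):

```c
// Sketch: calling the new SVE FMMLA widening intrinsics via their
// explicit (non-overloaded) names.
#include <arm_sve.h>

#if defined(__ARM_FEATURE_SVE2) && defined(__ARM_FEATURE_F8F16MM)
// Modal FP8 -> FP16 matrix multiply-accumulate; fpm selects the 8-bit formats.
svfloat16_t acc_fp8_to_fp16(svfloat16_t acc, svmfloat8_t a, svmfloat8_t b,
                            fpm_t fpm) {
  return svmmla_f16_mf8_fpm(acc, a, b, fpm);
}
#endif

#if defined(__ARM_FEATURE_SVE_F16F32MM)
// FP16 -> FP32 matrix multiply-accumulate; no fpm argument is required.
svfloat32_t acc_fp16_to_fp32(svfloat32_t acc, svfloat16_t a, svfloat16_t b) {
  return svmmla_f32_f16(acc, a, b);
}
#endif
```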

neon_intrinsics/advsimd.md

Lines changed: 11 additions & 0 deletions
@@ -6202,3 +6202,14 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.
| <code>float32x4_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlalltbq_laneq_f32_mf8_fpm" target="_blank">vmlalltbq_laneq_f32_mf8_fpm</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float32x4_t vd,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vn,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vm,<br>&nbsp;&nbsp;&nbsp;&nbsp; const int lane,<br>&nbsp;&nbsp;&nbsp;&nbsp; fpm_t fpm)</code> | `vd -> Vd.4S`<br>`vn -> Vn.16B`<br>`vm -> Vm.B`<br>`0 <= lane <= 15` | `FMLALLTB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` |
| <code>float32x4_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlallttq_lane_f32_mf8_fpm" target="_blank">vmlallttq_lane_f32_mf8_fpm</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float32x4_t vd,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vn,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x8_t vm,<br>&nbsp;&nbsp;&nbsp;&nbsp; const int lane,<br>&nbsp;&nbsp;&nbsp;&nbsp; fpm_t fpm)</code> | `vd -> Vd.4S`<br>`vn -> Vn.16B`<br>`vm -> Vm.B`<br>`0 <= lane <= 7` | `FMLALLTT Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` |
| <code>float32x4_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlallttq_laneq_f32_mf8_fpm" target="_blank">vmlallttq_laneq_f32_mf8_fpm</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float32x4_t vd,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vn,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t vm,<br>&nbsp;&nbsp;&nbsp;&nbsp; const int lane,<br>&nbsp;&nbsp;&nbsp;&nbsp; fpm_t fpm)</code> | `vd -> Vd.4S`<br>`vn -> Vn.16B`<br>`vm -> Vm.B`<br>`0 <= lane <= 15` | `FMLALLTT Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` |

## Matrix multiplication intrinsics from Armv9.6-A

### Vector arithmetic

#### Matrix multiply

| Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures |
|---|---|---|---|---|
| <code>float16x4_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmmlaq_f16_mf8" target="_blank">vmmlaq_f16_mf8</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float16x4_t r,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t a,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t b)</code> | `r -> Vd.4H`<br>`a -> Vn.16B`<br>`b -> Vm.16B` | `FMMLA Vd.4H, Vn.16B, Vm.16B` | `Vd.4H -> result` | `A64` |
| <code>float32x4_t <a href="https://developer.arm.com/architectures/instruction-sets/intrinsics/vmmlaq_f32_mf8" target="_blank">vmmlaq_f32_mf8</a>(<br>&nbsp;&nbsp;&nbsp;&nbsp; float32x4_t r,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t a,<br>&nbsp;&nbsp;&nbsp;&nbsp; mfloat8x16_t b)</code> | `r -> Vd.4S`<br>`a -> Vn.16B`<br>`b -> Vm.16B` | `FMMLA Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` |
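
A corresponding usage sketch for the two new NEON intrinsics (illustrative only; the wrapper names are invented, and the guards use the `__ARM_FEATURE_F8F16MM`/`__ARM_FEATURE_F8F32MM` macros documented in acle.md above):

```c
// Sketch: calling the new NEON FMMLA matrix multiply-accumulate intrinsics.
#include <arm_neon.h>

#if defined(__ARM_FEATURE_F8F16MM)
// Modal FP8 -> FP16: FMMLA Vd.4H, Vn.16B, Vm.16B
float16x4_t mm_acc_f16(float16x4_t r, mfloat8x16_t a, mfloat8x16_t b) {
  return vmmlaq_f16_mf8(r, a, b);
}
#endif

#if defined(__ARM_FEATURE_F8F32MM)
// Modal FP8 -> FP32: FMMLA Vd.4S, Vn.16B, Vm.16B
float32x4_t mm_acc_f32(float32x4_t r, mfloat8x16_t a, mfloat8x16_t b) {
  return vmmlaq_f32_mf8(r, a, b);
}
#endif
```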

tools/intrinsic_db/advsimd.csv

Lines changed: 4 additions & 0 deletions
@@ -4830,3 +4830,7 @@ float32x4_t vmlalltbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x
float32x4_t vmlalltbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLTB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
float32x4_t vmlallttq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLTT Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
float32x4_t vmlallttq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLTT Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64

<SECTION> Matrix multiplication intrinsics from Armv9.6-A
float16x4_t vmmlaq_f16_mf8(float16x4_t r, mfloat8x16_t a, mfloat8x16_t b) r -> Vd.4H;a -> Vn.16B;b -> Vm.16B FMMLA Vd.4H, Vn.16B, Vm.16B Vd.4H -> result A64
float32x4_t vmmlaq_f32_mf8(float32x4_t r, mfloat8x16_t a, mfloat8x16_t b) r -> Vd.4S;a -> Vn.16B;b -> Vm.16B FMMLA Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64

tools/intrinsic_db/advsimd_classification.csv

Lines changed: 2 additions & 0 deletions
@@ -4717,3 +4717,5 @@ vmlalltbq_lane_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and wi
vmlalltbq_laneq_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and widen
vmlallttq_lane_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and widen
vmlallttq_laneq_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and widen
vmmlaq_f16_mf8 Vector arithmetic|Matrix multiply
vmmlaq_f32_mf8 Vector arithmetic|Matrix multiply
