Added ability to accumulate in FP16. Convert BF16 to FP32. For FP16 and BF16 GEMM in RISC-V (BF16 now works for pre-RVA23) #5640

Merged

martin-frbg merged 15 commits into OpenMathLib:develop from ChipKerchner:RVV_Narrow_Accumulate_FP16_GEMM on Feb 20, 2026
Conversation

@ChipKerchner (Contributor) commented Feb 10, 2026

Added the ability to accumulate in FP16 for GEMM; results are widened to FP32 once at the end of the loops.
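In RVV intrinsics the idea looks roughly like this (a minimal, hypothetical sketch assuming the Zvfh extension, with a dot-product-style loop for illustration; not the actual kernel code):

```c
#include <riscv_vector.h>

// Sketch: accumulate in FP16 inside the loop, widen to FP32 once at the end.
void fp16_accumulate_sketch(const _Float16 *a, const _Float16 *b,
                            float *c, size_t k, size_t vl) {
    vfloat16m2_t acc = __riscv_vfmv_v_f_f16m2((_Float16)0.0f, vl);
    for (size_t p = 0; p < k; p++) {
        vfloat16m2_t va = __riscv_vle16_v_f16m2(&a[p * vl], vl);
        vfloat16m2_t vb = __riscv_vle16_v_f16m2(&b[p * vl], vl);
        // FP16 multiply-accumulate; no widening inside the loop.
        acc = __riscv_vfmacc_vv_f16m2(acc, va, vb, vl);
    }
    // Widen once after the loop, then store as FP32.
    vfloat32m4_t wide = __riscv_vfwcvt_f_f_v_f32m4(acc, vl);
    __riscv_vse32_v_f32m4(c, wide, vl);
}
```

Keeping the accumulator in FP16 halves the register width per lane compared with widening on every iteration, which is where the speedup (and the small accuracy loss noted below) comes from.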

Testing LLVM FP16 LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         24948910

Testing LLVM FP16_N LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         18968190

Accumulation results differ by about 4 epsilons compared to the previous (widening) version, but performance is up to 2.7X faster. Note: the BananaPi shows only 1.85X.

BF16 is 1.5X faster and now works on RVA22 systems like the BananaPi.

@ChipKerchner (Contributor, Author) commented Feb 10, 2026

Unfortunately, BF16 only has widening MADD instructions, so the same changes cannot be made for BF16.
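For context, the only BF16 vector multiply-accumulate in RVV is the widening vfwmaccbf16.vv from Zvfbfwma, which always accumulates into FP32 lanes; there is no bf16 x bf16 -> bf16 FMA. A minimal sketch (intrinsic name per the RVV bfloat16 C intrinsics; illustrative only):

```c
#include <riscv_vector.h>

// Sketch: the widening BF16 MADD forces an FP32 accumulator,
// so the FP16 narrow-accumulate trick has no BF16 equivalent.
vfloat32m2_t bf16_widening_madd(vfloat32m2_t acc_f32,
                                vbfloat16m1_t va, vbfloat16m1_t vb,
                                size_t vl) {
    // acc_f32 += widen(va) * widen(vb)
    return __riscv_vfwmaccbf16_vv_f32m2(acc_f32, va, vb, vl);
}
```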

@ChipKerchner (Contributor, Author)

Currently these changes are for VLEN = 256 only.

@ChipKerchner (Contributor, Author)

It now works for VLEN = 128.

@ChipKerchner marked this pull request as draft on February 10, 2026 22:04
@ChipKerchner (Contributor, Author)

The main loop now uses LMUL = 2.
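For illustration, a strip-mined LMUL = 2 loop over FP16 data looks like this (a hypothetical scaling loop assuming Zvfh, not the GEMM kernel itself):

```c
#include <riscv_vector.h>

// Sketch: e16/m2 grouping; with VLEN = 256 each iteration covers up to 32 lanes.
void scale_fp16_m2(_Float16 *x, _Float16 s, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e16m2(n - i);  // vl <= 2 * VLEN / 16
        vfloat16m2_t v = __riscv_vle16_v_f16m2(&x[i], vl);
        v = __riscv_vfmul_vf_f16m2(v, s, vl);
        __riscv_vse16_v_f16m2(&x[i], v, vl);
        i += vl;
    }
}
```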

@ChipKerchner marked this pull request as ready for review on February 11, 2026 00:38
@ChipKerchner (Contributor, Author) commented Feb 11, 2026

Even faster!!!

Testing LLVM FP16_N LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         13400067

@ChipKerchner (Contributor, Author)

Converted the inputs from BF16 to FP32 and used FP32 vector MADDs: 18% faster.
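A sketch of that path (assumes Zvfbfmin for the widening convert; the follow-up below hoists the conversion so each value is converted once rather than per FMA):

```c
#include <riscv_vector.h>

// Sketch: widen BF16 inputs to FP32, then use a plain FP32 multiply-accumulate.
vfloat32m2_t bf16_via_fp32_madd(vfloat32m2_t acc,
                                vbfloat16m1_t va, vbfloat16m1_t vb,
                                size_t vl) {
    vfloat32m2_t fa = __riscv_vfwcvtbf16_f_f_v_f32m2(va, vl);
    vfloat32m2_t fb = __riscv_vfwcvtbf16_f_f_v_f32m2(vb, vl);
    return __riscv_vfmacc_vv_f32m2(acc, fa, fb, vl);
}
```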

@ChipKerchner (Contributor, Author)

Converted the BF16 values once (and vectorized the conversion): 3-4% faster.

@ChipKerchner marked this pull request as draft on February 13, 2026 15:20
@ChipKerchner (Contributor, Author) commented Feb 13, 2026

The latest BF16 version is 48% faster than the current version, and 3% faster than the FP32 version.

Unfortunately, it uses 1.5X more memory.

It would be possible to make this version work on the BananaPi, which doesn't support BF16 vector instructions. The conversions could be done with a uint32 zero-extend and shift, and the remaining BF16 vector code could be fixed to not use vector BF16 MADDs.
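A sketch of that integer trick (no BF16 vector instructions needed, since a BF16 value is just the high 16 bits of the corresponding FP32 bit pattern):

```c
#include <riscv_vector.h>
#include <stdint.h>

// Sketch: BF16 -> FP32 using only integer vector ops.
// Zero-extend the raw 16-bit pattern to 32 bits, shift left by 16,
// and reinterpret the result as FP32.
vfloat32m2_t bf16_to_fp32_shift(const uint16_t *src, size_t vl) {
    vuint16m1_t raw  = __riscv_vle16_v_u16m1(src, vl);
    vuint32m2_t wide = __riscv_vzext_vf2_u32m2(raw, vl);
    wide = __riscv_vsll_vx_u32m2(wide, 16, vl);
    return __riscv_vreinterpret_v_u32m2_f32m2(wide);
}
```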

@ChipKerchner marked this pull request as ready for review on February 13, 2026 17:56
@ChipKerchner changed the title from "Added ability to accumulate in FP16 for GEMM for RISC-V" to "Added ability to accumulate in FP16 and one set of conversions for BF16 for GEMM in RISC-V" on Feb 13, 2026
@ChipKerchner (Contributor, Author)

BF16 GEMM now works on pre-RVA23 systems like the BananaPi.

@ChipKerchner changed the title from "Added ability to accumulate in FP16 and one set of conversions for BF16 for GEMM in RISC-V" to "Added ability to accumulate in FP16. Convert BF16 to FP32. For FP16 and BF16 GEMM in RISC-V (BF16 now works for pre-RVA23)" on Feb 15, 2026
@ChipKerchner (Contributor, Author) commented Feb 19, 2026

This is ready to go unless someone has comments.

@martin-frbg added this to the 0.3.32 milestone on Feb 20, 2026
@martin-frbg merged commit 30cf14c into OpenMathLib:develop on Feb 20, 2026 (100 of 102 checks passed)