Optimized RVV q1_0 dot #31
Conversation
Thanks, that's impressive speed on such a device :) Do people need a special setup to build and run this, or do the llama.cpp build tools work? Would be happy to merge it into our fork; I don't have a similar device to test it myself though. Will review more closely later this week. For some reason I stopped getting email notifications from GitHub.
Pull request overview
Adds a RISC-V RVV-specific implementation for the q1_0 × q8_0 dot product in the CPU backend, continuing the codebase’s architecture-specific quantized dot-product optimizations.
Changes:
- Added two fixed-width RVV kernels for `ggml_vec_dot_q1_0_q8_0`, targeting 128-bit and 256-bit vector configurations.
- Added RVV runtime dispatch in the RISC-V quantized dot-product path (see the sketch below).
- Updated the RISC-V fallback aliasing so this path can call the true generic implementation when needed.
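A minimal sketch of the runtime check such dispatch can rest on — illustrative only, the PR's actual symbols and structure may differ. `__riscv_vsetvlmax_e8m1()` is a standard RVV intrinsic that returns VLEN/8 at SEW=8, LMUL=1:

```c
// Hypothetical probe for VLEN-based kernel selection; not the PR's code.
#include <riscv_vector.h>
#include <stdio.h>

int main(void) {
    // VLMAX at SEW=8, LMUL=1 equals VLEN/8 — the bytes one vector register holds
    size_t vlenb = __riscv_vsetvlmax_e8m1();
    if (vlenb >= 32) {
        puts("VLEN >= 256: select the LMUL=1 kernel"); // 32-byte block fits one register
    } else {
        puts("VLEN == 128: select the LMUL=2 kernel"); // 32-byte block fills a register group
    }
    return 0;
}
```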
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `ggml/src/ggml-cpu/arch/riscv/quants.c` | Adds the new RVV q1_0×q8_0 kernels, helper tables, and runtime dispatch logic. |
| `ggml/src/ggml-cpu/arch-fallback.h` | Removes the RISC-V alias for the q1_0 generic dot product so the arch-specific implementation can fall back correctly. |
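For context on the `arch-fallback.h` change: that header maps kernels lacking an arch-specific implementation onto the generic ones through `#define` aliases. A simplified sketch of the pattern from memory, not the file's exact contents:

```c
// Simplified sketch of the aliasing pattern in ggml/src/ggml-cpu/arch-fallback.h;
// the real file's contents may differ.
#if defined(__riscv)
// Before this PR, RISC-V builds aliased the q1_0 dot product to the generic
// C implementation:
//   #define ggml_vec_dot_q1_0_q8_0 ggml_vec_dot_q1_0_q8_0_generic
// Removing the alias lets arch/riscv/quants.c define ggml_vec_dot_q1_0_q8_0
// itself, while still being able to call ggml_vec_dot_q1_0_q8_0_generic()
// directly for configurations the RVV kernels do not cover.
#endif
```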
I don't actually know if llama.cpp accounts for Zvl64b; it seems to be aimed at embedded or 32-bit cores.
Yeah, Copilot might be confused. Saw a similar PR in mainline llama.cpp: ggml-org#22500
Continuation of #10 for the RISC-V V extension.
Implemented two fixed-VLEN kernels, loosely inspired by the AVX2 implementation.
VLA (vector-length-agnostic) code causes severe overhead, and the task only has two realistic VL configurations (in its simple form).
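To illustrate the overhead argument, a hedged sketch contrasting a VLA loop with a fixed-VLEN/LMUL=2 loop. This is a plain int8 dot product, not the PR's actual q1_0 kernel; it assumes 32-byte quant blocks with `n` a multiple of 32, and builds with something like `-march=rv64gcv`:

```c
// Illustrative int8 dot product, NOT the PR's q1_0 kernel.
// Assumes n is a multiple of 32 (one 32-byte quant block per step).
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// VLA form: vl is renegotiated every iteration, and the loop cannot assume
// how a quant block maps onto a register — one source of the overhead.
int32_t dot_i8_vla(const int8_t *x, const int8_t *y, size_t n) {
    vint32m1_t acc = __riscv_vmv_v_x_i32m1(0, 1);
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e8m1(n - i);
        vint8m1_t  vx = __riscv_vle8_v_i8m1(x + i, vl);
        vint8m1_t  vy = __riscv_vle8_v_i8m1(y + i, vl);
        vint16m2_t p  = __riscv_vwmul_vv_i16m2(vx, vy, vl);  // widen products to i16
        acc = __riscv_vwredsum_vs_i16m2_i32m1(p, acc, vl);   // widening reduce to i32
        i += vl;
    }
    return __riscv_vmv_x_s_i32m1_i32(acc);
}

// Fixed form for VLEN == 128: LMUL=2 makes one 32-byte block exactly fill a
// register group, so vl is the constant 32 on every iteration.
int32_t dot_i8_vl128(const int8_t *x, const int8_t *y, size_t n) {
    vint32m1_t acc = __riscv_vmv_v_x_i32m1(0, 1);
    for (size_t i = 0; i < n; i += 32) {
        vint8m2_t  vx = __riscv_vle8_v_i8m2(x + i, 32);
        vint8m2_t  vy = __riscv_vle8_v_i8m2(y + i, 32);
        vint16m4_t p  = __riscv_vwmul_vv_i16m4(vx, vy, 32);
        acc = __riscv_vwredsum_vs_i16m4_i32m1(p, acc, 32);
    }
    return __riscv_vmv_x_s_i32m1_i32(acc);
}
```

Because the fixed variant hard-codes `vl = 32`, the compiler can hoist the `vsetvli` out of the loop and the block layout is known at compile time.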
Benchmarks were performed with:
- Orange Pi RV2 SBC (Ky X1 / SpacemiT K1), 8 GB
- Armbian Debian trixie rolling release, 6.18.26-current-spacemit kernel
- Built with the official SpacemiT toolchain, but IME wasn't used
Command:
```
llama-bench -m Bonsai-1.7B.gguf -p 64 -n 16 -t 8 -r 3 -fa 1 -mmp 0
```

Perplexity for 5×512 chunks: mean KLD 0.00027, PPL 21.09, same top-p 99.22%
| | pp 64 t/s | tg 16 t/s |
|---|---|---|
| VL128* | | |
| VL256* | | |

\* Forced VLEN 128 kernel with LMUL=2; for VLEN >= 256: LMUL=1.

As always, I would appreciate your feedback.