
Optimized RVV q1_0 dot #31

Open
pl752 wants to merge 3 commits into PrismML-Eng:prism from pl752:perf/q1_0_rvv_dot

Conversation

@pl752 pl752 commented May 4, 2026

Continuation of #10 for the RISC-V V extension.

Implemented two fixed-VLEN kernels, loosely inspired by the AVX2 implementation.
VLA (vector-length-agnostic) code causes severe overhead, and this task only has two realistic VL combinations (in its simple form).

Benchmarks were performed with:
OrangePi RV2 SBC (Ky X1 / SpacemiT K1), 8 GB
Armbian Debian trixie rolling release, kernel 6.18.26-current-spacemit
Built with the official SpacemiT toolchain, but IME wasn't used.
Command: llama-bench -m Bonsai-1.7B.gguf -p 64 -n 16 -t 8 -r 3 -fa 1 -mmp 0
Perplexity over 5x512 chunks: Mean KLD 0.00027, PPL 21.09, Same top-p 99.22%

| Flow   | pp 64 t/s | tg 16 t/s | Speedup     |
|--------|-----------|-----------|-------------|
| Scalar | 1.19      | 0.94      | 1.0x / 1.0x |
| VL128* | 6.08      | 4.65      | 5.1x / 4.9x |
| VL256  | 10.56     | 7.84      | 8.9x / 8.3x |

\* forced VLEN-128 kernel with LMUL=2; for VLEN >= 256: LMUL=1

As always, I would appreciate your feedback

@github-actions bot added the ggml label May 4, 2026
@khosravipasha
Collaborator

Thanks, that's impressive speed on such a device :)

Do people need a special setup to build and run this, or do the standard llama.cpp build tools work?

Would be happy to merge it into our fork; I don't have a similar device to test it myself, though. Will review more closely later this week.

For some reason I stopped getting email notifications from GitHub.

Copilot AI left a comment

Pull request overview

Adds a RISC-V RVV-specific implementation for the q1_0 × q8_0 dot product in the CPU backend, continuing the codebase’s architecture-specific quantized dot-product optimizations.

Changes:

  • Added two fixed-width RVV kernels for ggml_vec_dot_q1_0_q8_0 targeting 128-bit and 256-bit vector configurations.
  • Added RVV runtime dispatch in the RISC-V quantized dot-product path.
  • Updated the RISC-V fallback aliasing so this path can call the true generic implementation when needed.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
ggml/src/ggml-cpu/arch/riscv/quants.c Adds the new RVV q1_0×q8_0 kernels, helper tables, and runtime dispatch logic.
ggml/src/ggml-cpu/arch-fallback.h Removes the RISC-V alias for the q1_0 generic dot product so the arch-specific implementation can fall back correctly.


Comment thread ggml/src/ggml-cpu/arch/riscv/quants.c Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@pl752
Author

pl752 commented May 5, 2026

I don't actually know if llama.cpp accounts for Zvl64b; it seems to be for embedded or 32-bit cores.

@khosravipasha
Collaborator

khosravipasha commented May 5, 2026

Yeah, Copilot might be confused.

Saw a similar PR in main llama.cpp, ggml-org#22500.
Is that related?
