Performance Discrepancy with -ctv q4_0 compared to q8_0 #17493

TkskKurumi · 2025-11-25T12:04:51Z

I found that -ctv q4_0 is much slower than -ctv q8_0 (more than 10x slower on pp512) in the following environment.
OS: Windows 10
llama.cpp build: prebuilt b7154 windows cuda12.4 from GitHub releases page. (llama-b7154-bin-win-cuda-12.4-x64.zip + cudart-llama-bin-win-cuda-12.4-x64.zip)
device: RTX 3090 24GB.
model: https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF/blob/main/Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf
llama-bench results:

model	size	params	backend	ngl	n_ubatch	type_k	type_v	fa	dev	test	t/s
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q8_0	q8_0	1	CUDA1	pp512	2915.41 ± 300.53
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q8_0	q8_0	1	CUDA1	tg128	151.21 ± 0.48
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q8_0	q4_0	1	CUDA1	pp512	164.29 ± 15.78
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q8_0	q4_0	1	CUDA1	tg128	29.07 ± 0.90

launching command line:

.\llama-bench.exe --model "E:\LLM\Qwen3-VL-30B-A3B-Instruct\Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf" `
                   -ctk q8_0 `
                   -ctv q8_0,q4_0 `
                   -fa 1 `
                   -ngl 99 `
                   --threads 8 `
                   --device CUDA1 `
                   -ub 4096 `
                   -b 2048

Is this performance discrepancy expected or not?
I can think of the following possibilities:

The 4-bit quantization introduces additional overhead.
The 8-bit operations are much more optimized at the hardware level.

I would be happy to provide any additional information needed. Thank you for your time and for the continued work on this project.

TkskKurumi · 2025-11-25T12:24:41Z

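Update: when -ctk is also set to q4_0, performance returns to the q8_0/q8_0 level. In the results below, the slowdown appears only when the K and V cache types differ (q8_0/q4_0 or q4_0/q8_0).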

model	size	params	backend	ngl	n_ubatch	type_k	type_v	fa	dev	test	t/s
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q8_0	q8_0	1	CUDA1	pp512	3240.28 ± 142.19
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q8_0	q8_0	1	CUDA1	tg128	150.97 ± 1.08
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q8_0	q4_0	1	CUDA1	pp512	164.68 ± 16.57
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q8_0	q4_0	1	CUDA1	tg128	30.97 ± 0.77
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q4_0	q8_0	1	CUDA1	pp512	172.30 ± 5.63
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q4_0	q8_0	1	CUDA1	tg128	31.18 ± 0.46
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q4_0	q4_0	1	CUDA1	pp512	3480.12 ± 29.99
qwen3vlmoe 30B.A3B Q4_K - Medium	16.49 GiB	30.53 B	CUDA	99	4096	q4_0	q4_0	1	CUDA1	tg128	150.12 ± 0.93
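
To make the pattern in the table above easier to see, here is a small Python sketch that computes how much slower each K/V cache combination is relative to the q8_0/q8_0 baseline. The throughput values are copied from the table; the `results` layout and the `slowdown` helper are just illustrative names, not part of llama.cpp.

```python
# Throughput (tokens/s) copied from the llama-bench table above,
# keyed by (type_k, type_v). The error margins are omitted here.
results = {
    ("q8_0", "q8_0"): {"pp512": 3240.28, "tg128": 150.97},
    ("q8_0", "q4_0"): {"pp512": 164.68, "tg128": 30.97},
    ("q4_0", "q8_0"): {"pp512": 172.30, "tg128": 31.18},
    ("q4_0", "q4_0"): {"pp512": 3480.12, "tg128": 150.12},
}


def slowdown(results, baseline=("q8_0", "q8_0")):
    """Return {(type_k, type_v): {test: baseline_tps / tps}}.

    A value above 1 means the combination is slower than the baseline.
    """
    base = results[baseline]
    return {
        combo: {test: base[test] / tps for test, tps in tests.items()}
        for combo, tests in results.items()
    }


ratios = slowdown(results)
for (type_k, type_v), tests in ratios.items():
    kind = "matched" if type_k == type_v else "mixed"
    print(
        f"{type_k}/{type_v} ({kind}): "
        f"pp512 {tests['pp512']:.1f}x, tg128 {tests['tg128']:.1f}x vs. baseline"
    )
```

With these numbers, both mixed combinations come out roughly 19x slower on pp512 and roughly 5x slower on tg128, while q4_0/q4_0 stays within a few percent of the baseline.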
