/s/bee-rocm/build/bin/llama-server --version
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 20464 MiB):
Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 20464 MiB
version: 9459 (07ac3ce)
built with GNU 16.1.0 for Linux x86_64
1. Baseline – no DFlash
For the baseline, remove the draft model (-md, -ngld), --spec-type dflash, and any other --spec-* DFlash args from the command below.
export GGML_DFLASH_PROFILE=1
/s/bee-rocm/build/bin/llama-server \
  -m /root/.models/dflash/Qwen3.6-27B-Q4_K_M.gguf \
  -md /root/.models/dflash/dflash-draft-3.6-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 \
  --jinja --metrics -ngl all -ngld all \
  --reasoning on -fa on --mlock --no-mmap -np 1 \
  -ctk turbo4 -ctv turbo4 \
  -lv 3 --log-timestamps --log-prefix --log-colors off \
  -c 128000
##result
tokens: 481
wall tokens/s: 37.03
decode tokens/s: 41.01
2. DFlash default
Same command with DFlash enabled, without forcing --spec-draft-n-max 3.
Added:
--spec-type dflash
--spec-dflash-cross-ctx 512
--spec-draft-ngl 999
##result
tokens: 479
wall tokens/s: 38.07
decode tokens/s: 42.30
3. DFlash with n-max 3
Same as #2, plus --spec-draft-n-max 3.
##result
tokens: 481
wall tokens/s: 37.22
decode tokens/s: 41.14
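To make the three runs easier to compare, here is a minimal Python sketch that computes each run's speedup relative to the baseline. The numbers are copied from the result blocks above; the names `RESULTS`, `speedup_pct`, and `summarize` are illustrative only. The `non_decode_s` column assumes that "wall tokens/s" means generated tokens divided by total request time, which the post does not state explicitly.

```python
# Figures copied from the three result blocks above.
RESULTS = {
    "baseline": {"tokens": 481, "wall_tps": 37.03, "decode_tps": 41.01},
    "dflash_default": {"tokens": 479, "wall_tps": 38.07, "decode_tps": 42.30},
    "dflash_n_max_3": {"tokens": 481, "wall_tps": 37.22, "decode_tps": 41.14},
}


def speedup_pct(value: float, reference: float) -> float:
    """Relative change of value over reference, in percent."""
    return (value / reference - 1.0) * 100.0


def summarize(results: dict, reference: str = "baseline") -> dict:
    """Compare every run against the reference run."""
    ref = results[reference]
    rows = {}
    for name, r in results.items():
        # Assumption: wall tokens/s = tokens / total request time,
        # so the gap to decode time approximates non-decode overhead.
        wall_s = r["tokens"] / r["wall_tps"]
        decode_s = r["tokens"] / r["decode_tps"]
        rows[name] = {
            "decode_speedup_pct": speedup_pct(r["decode_tps"], ref["decode_tps"]),
            "wall_speedup_pct": speedup_pct(r["wall_tps"], ref["wall_tps"]),
            "non_decode_s": wall_s - decode_s,
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(RESULTS).items():
        print(
            f"{name:16s} "
            f"decode {row['decode_speedup_pct']:+.2f}%  "
            f"wall {row['wall_speedup_pct']:+.2f}%  "
            f"non-decode {row['non_decode_s']:.2f} s"
        )
```

On these figures, DFlash default is roughly 3% faster than the baseline on decode, and the n-max 3 run is under 1% faster. With only one figure reported per configuration, differences this small may be within run-to-run noise.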
Test: Write a complete Python 3 module implementing a doubly-linked list with the following methods: append, prepend, insert_at, remove_at, find, reverse, to_list, length, is_empty, iter. Include comprehensive docstrings, type hints, and pytest unit tests for every method. Return only the code, no commentary.