profiling radeon 7900xt (20gb vram) #40

@cryptopsy0

/s/bee-rocm/build/bin/llama-server --version

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 20464 MiB):
Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 20464 MiB
version: 9459 (07ac3ce)
built with GNU 16.1.0 for Linux x86_64

1. Baseline – no DFlash

Remove the draft model, --spec-type dflash, and any other --spec-* DFlash args.

export GGML_DFLASH_PROFILE=1
/s/bee-rocm/build/bin/llama-server \
-m /root/.models/dflash/Qwen3.6-27B-Q4_K_M.gguf \
-md /root/.models/dflash/dflash-draft-3.6-q4_k_m.gguf \
--host 0.0.0.0 --port 8080 \
--jinja --metrics -ngl all -ngld all \
--reasoning on -fa on --mlock --no-mmap -np 1 \
-ctk turbo4 -ctv turbo4 \
-lv 3 --log-timestamps --log-prefix --log-colors off \
-c 128000
##result
tokens: 481
wall tokens/s: 37.03
decode tokens/s: 41.01

2. DFlash default

same command, with DFlash enabled, but no forced --spec-draft-n-max 3

added:
--spec-type dflash
--spec-dflash-cross-ctx 512
--spec-draft-ngl 999
##result
tokens: 479
wall tokens/s: 38.07
decode tokens/s: 42.30

3. DFlash with n-max 3

same as #2, but add --spec-draft-n-max 3

##result
tokens: 481
wall tokens/s: 37.22
decode tokens/s: 41.14
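
To make the three results easier to compare, here is a minimal Python sketch that computes the percent change of each DFlash run against the baseline. The numbers are copied from the results above; the names RUNS, pct_change and compare are made up for this illustration and are not part of llama-server.

```python
# Figures copied from the three results above: (wall tokens/s, decode tokens/s).
# RUNS, pct_change and compare are illustrative names, not part of llama-server.
RUNS = {
    "baseline": (37.03, 41.01),
    "dflash default": (38.07, 42.30),
    "dflash n-max 3": (37.22, 41.14),
}


def pct_change(value: float, reference: float) -> float:
    """Relative change of value versus reference, in percent."""
    return (value / reference - 1.0) * 100.0


def compare(runs=RUNS, reference="baseline"):
    """Return {run name: (wall % change, decode % change)} versus the reference run."""
    ref_wall, ref_decode = runs[reference]
    return {
        name: (
            round(pct_change(wall, ref_wall), 2),
            round(pct_change(decode, ref_decode), 2),
        )
        for name, (wall, decode) in runs.items()
        if name != reference
    }


if __name__ == "__main__":
    for name, (wall, decode) in compare().items():
        print(f"{name}: wall {wall:+.2f}%, decode {decode:+.2f}%")
```

On these numbers, DFlash default is about 3% faster than baseline on decode and DFlash with n-max 3 is under 1% faster. Each figure comes from a single run of roughly 480 tokens, so differences this small may be within run-to-run noise.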

Test: Write a complete Python 3 module implementing a doubly-linked list with the following methods: append, prepend, insert_at, remove_at, find, reverse, to_list, length, is_empty, iter. Include comprehensive docstrings, type hints, and pytest unit tests for every method. Return only the code, no commentary.
