Description
When running DMR on a Linux machine with runtime_flags set, the flags are always ignored after a reboot. The configuration also seems to revert to the defaults unexpectedly when docker model configure is run manually. Recreating the containers applies the flags again. Any idea why this is happening and how to handle it?
Docker compose:
models:
  llm:
    model: ai/qwen3-vl:8B
    context_size: 16384
    runtime_flags:
      - "-c"
      - "16384"
      - "-np"
      - "4"
      - "-b"
      - "2048"
      - "--no-mmap"
      - "--flash-attn"
      - "on"
      - "--cache-type-k"
      - "q8_0"
      - "--cache-type-v"
      - "q8_0"
  instant:
    model: ai/qwen3-vl:2B-UD-Q4_K_XL
    context_size: 8192
    runtime_flags:
      - "-c"
      - "8192"
      - "-np"
      - "4"
      - "-b"
      - "2048"
      - "--no-mmap"
      - "--flash-attn"
      - "on"
      - "--cache-type-k"
      - "q8_0"
      - "--cache-type-v"
      - "q8_0"
After a reboot, on the first request (see the last line: no runtime_flags):
time="2026-03-03T10:13:24Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
time="2026-03-03T10:13:24Z" level=info msg="Getting model by reference: sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="llama.cpp args: [-ngl 999 --metrics --model /models/bundles/sha256/a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5/model/model.gguf --host inference-runner-0.sock --mmproj /models/bundles/sha256/a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5/model/model.mmproj]"
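To make the difference between the two runs easy to spot, here is a small sketch (a hypothetical helper, not part of DMR; model paths shortened) that parses the logged "llama.cpp args: [...]" line and reports which of the configured runtime_flags are missing:

```python
import re

# Expected runtime_flags for the llm model (taken from the compose file above).
EXPECTED_FLAGS = ["-c", "-np", "-b", "--no-mmap", "--flash-attn",
                  "--cache-type-k", "--cache-type-v"]

def missing_flags(log_line: str) -> list[str]:
    """Extract the [...] args from a 'llama.cpp args: [...]' log line
    and return the expected flags that are absent."""
    match = re.search(r"llama\.cpp args: \[(.*)\]", log_line)
    if not match:
        raise ValueError("no 'llama.cpp args' found in line")
    args = match.group(1).split()
    return [flag for flag in EXPECTED_FLAGS if flag not in args]

# Args line after reboot (paths shortened): runtime_flags are gone.
after_reboot = ("llama.cpp args: [-ngl 999 --metrics --model "
                "/models/model.gguf --host inference-runner-0.sock]")
# Args line after --force-recreate: runtime_flags present.
after_recreate = ("llama.cpp args: [-ngl 999 --metrics --model "
                  "/models/model.gguf -c 8192 -np 4 -b 2048 --no-mmap "
                  "--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0]")

print(missing_flags(after_reboot))    # every expected flag is missing
print(missing_flags(after_recreate))  # []
```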
After a --force-recreate of the containers (see last line, runtime_flags applied):
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Configuring llama.cpp runner for sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d"
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Configuring llama.cpp runner for sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5"
time="2026-03-03T10:19:40Z" level=info msg="Loading llama.cpp backend runner with model sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d in completion mode"
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
time="2026-03-03T10:19:40Z" level=info msg="llama.cpp args: [-ngl 999 --metrics --model /models/bundles/sha256/50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d/model/model.gguf --host inference-runner-0.sock --ctx-size 8192 -c 8192 -np 4 -b 2048 --no-mmap --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --mmproj /models/bundles/sha256/50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d/model/model.mmproj]"
I've also tried using cronjobs to keep the models loaded with the config, but after a reboot the flags are ignored as well and the context size reverts to the default of 4096 tokens, causing issues with larger requests:
*/5 * * * * docker model configure --context-size 16384 ai/qwen3-vl:8B -- -c 16384 -np 4 -b 2048 --no-mmap --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 && docker model run --detach ai/qwen3-vl:8B 2>&1
*/5 * * * * docker model configure --context-size 16384 ai/qwen3-vl:2B-UD-Q4_K_XL -- -c 8192 -np 4 -b 2048 --no-mmap --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 && docker model run --detach ai/qwen3-vl:2B-UD-Q4_K_XL 2>&1
It also seems to reset to 4096 tokens when the cronjob runs again later on; at least, that's how I interpret n_ctx_slot = 4096:
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="srv params_from_: Chat format: Hermes 2 Pro" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.583 (> 0.100 thold), f_keep = 0.003" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist " component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot launch_slot_: id 3 | task 4254 | processing task, is_child = 0" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot update_slots: id 3 | task 4254 | new prompt, n_ctx_slot = 4096, n_keep = 0, task.n_tokens = 12" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot update_slots: id 3 | task 4254 | n_tokens = 7, memory_seq_rm [7, end)" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot update_slots: id 3 | task 4254 | prompt processing progress, n_tokens = 12, batch.n_tokens = 5, progress = 1.000000" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot update_slots: id 3 | task 4254 | prompt done, n_tokens = 12, batch.n_tokens = 5" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot init_sampler: id 3 | task 4254 | init sampler, took 0.00 ms, tokens: text = 12, total = 12" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Configuration for llama.cpp runner for modelID sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d unchanged"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d in completion mode"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Configuration for llama.cpp runner for modelID sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 unchanged"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d in completion mode"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
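To double-check that reading, a quick sketch (hypothetical parsing, only to confirm my interpretation of the log) that pulls n_ctx_slot out of the slot log line and compares it against the configured context size:

```python
import re

def ctx_slot(log_line: str) -> int:
    """Return the n_ctx_slot value reported in a llama.cpp slot log line."""
    match = re.search(r"n_ctx_slot = (\d+)", log_line)
    if not match:
        raise ValueError("no n_ctx_slot in line")
    return int(match.group(1))

line = ("slot update_slots: id 3 | task 4254 | new prompt, "
        "n_ctx_slot = 4096, n_keep = 0, task.n_tokens = 12")

configured = 16384  # context_size for ai/qwen3-vl:8B in the compose file
print(ctx_slot(line))               # 4096
print(ctx_slot(line) < configured)  # True: the slot is on the default, not the config
```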
Versions:
$ docker version
Client: Docker Engine - Community
 Version:           29.2.1
 API version:       1.53
 Go version:        go1.25.6
 Git commit:        a5c7197
 Built:             Mon Feb 2 17:17:26 2026
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          29.2.1
  API version:      1.53 (minimum version 1.44)
  Go version:       go1.25.6
  Git commit:       6bc6209
  Built:            Mon Feb 2 17:17:26 2026
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v2.2.1
  GitCommit:        dea7da592f5d1d2b7755e3a161be07f43fad8f75
 runc:
  Version:          1.3.4
  GitCommit:        v1.3.4-0-gd6d73eb8
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ docker model version
Client:
 Version: v1.1.5
 OS/Arch: linux/amd64
Server:
 Version: v1.1.0
 Engine:  Docker Engine