Runtime_flags are ignored on Linux in certain cases #726

@janick-nl

Description

When running DMR on a Linux machine with runtime_flags, the flags are always ignored after a reboot. The configuration also seems to revert unexpectedly to the defaults when docker model configure is run manually. Recreating the containers applies the flags again. Any idea why this is happening and how to handle it?

Docker compose:

models:
  llm:
    model: ai/qwen3-vl:8B
    context_size: 16384
    runtime_flags:
      - "-c"
      - "16384"
      - "-np"
      - "4"
      - "-b"
      - "2048"
      - "--no-mmap"
      - "--flash-attn"
      - "on"
      - "--cache-type-k"
      - "q8_0"
      - "--cache-type-v"
      - "q8_0"
  instant:
    model: ai/qwen3-vl:2B-UD-Q4_K_XL
    context_size: 8192
    runtime_flags:
      - "-c"
      - "8192"
      - "-np"
      - "4"
      - "-b"
      - "2048"
      - "--no-mmap"
      - "--flash-attn"
      - "on"
      - "--cache-type-k"
      - "q8_0"
      - "--cache-type-v"
      - "q8_0"
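A quick way to tell whether these flags actually reached the runner is to inspect the "llama.cpp args" log line. The sketch below (the sample line is shortened from the logs in this report; in practice ARGS_LINE would come from something like `docker logs <dmr-container> 2>&1 | grep 'llama.cpp args' | tail -n1`, where the container name is an assumption) checks each expected flag:

```shell
# Sample "llama.cpp args" line, shortened from the DMR logs in this report.
ARGS_LINE='llama.cpp args: [-ngl 999 --metrics -c 16384 -np 4 -b 2048 --no-mmap --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0]'

# Report whether each runtime flag made it into the runner invocation.
for flag in --no-mmap --flash-attn --cache-type-k --cache-type-v; do
  if printf '%s\n' "$ARGS_LINE" | grep -q -- "$flag"; then
    echo "present: $flag"
  else
    echo "MISSING: $flag"
  fi
done
```

On the post-reboot log line shown further down, every one of these flags is missing, which is the symptom being reported.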

After a reboot, on the first request (note the last line: no runtime_flags):

time="2026-03-03T10:13:24Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
time="2026-03-03T10:13:24Z" level=info msg="Getting model by reference: sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:13:24Z" level=info msg="llama.cpp args: [-ngl 999 --metrics --model /models/bundles/sha256/a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5/model/model.gguf --host inference-runner-0.sock --mmproj /models/bundles/sha256/a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5/model/model.mmproj]"

After a --force-recreate of the containers (note the last line: runtime_flags applied):

time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Configuring llama.cpp runner for sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d"
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Configuring llama.cpp runner for sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5"
time="2026-03-03T10:19:40Z" level=info msg="Loading llama.cpp backend runner with model sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d in completion mode"
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Listing available models" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Successfully listed models, count: 2" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:19:40Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
time="2026-03-03T10:19:40Z" level=info msg="llama.cpp args: [-ngl 999 --metrics --model /models/bundles/sha256/50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d/model/model.gguf --host inference-runner-0.sock --ctx-size 8192 -c 8192 -np 4 -b 2048 --no-mmap --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --mmproj /models/bundles/sha256/50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d/model/model.mmproj]"

I've also tried using cron jobs to keep the models loaded with the right config, but after a reboot these are ignored as well and the context size reverts to the default of 4096, causing failures on larger requests:

*/5 * * * * docker model configure --context-size 16384 ai/qwen3-vl:8B -- -c 16384 -np 4 -b 2048 --no-mmap --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 && docker model run --detach ai/qwen3-vl:8B 2>&1
*/5 * * * * docker model configure --context-size 16384 ai/qwen3-vl:2B-UD-Q4_K_XL -- -c 8192 -np 4 -b 2048 --no-mmap --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 && docker model run --detach ai/qwen3-vl:2B-UD-Q4_K_XL 2>&1
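As a workaround sketch (assumptions: `docker model configure` and `docker model run` behave as in the cron lines above, and dockerd is up roughly a minute after boot, which the crude `sleep 60` guards), re-applying the configuration once at boot via cron's `@reboot` directive, rather than every five minutes, might avoid the periodic reset while still surviving reboots:

```shell
# /etc/cron.d/dmr-reapply (hypothetical file): re-apply model config once after boot.
# "sleep 60" is a crude guard to let dockerd come up first.
@reboot root sleep 60 && docker model configure --context-size 16384 ai/qwen3-vl:8B -- -c 16384 -np 4 -b 2048 --no-mmap --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 && docker model run --detach ai/qwen3-vl:8B
@reboot root sleep 60 && docker model configure --context-size 8192 ai/qwen3-vl:2B-UD-Q4_K_XL -- -c 8192 -np 4 -b 2048 --no-mmap --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 && docker model run --detach ai/qwen3-vl:2B-UD-Q4_K_XL
```

This only works around the symptom; it does not explain why the persisted configuration is lost in the first place.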

It also seems to reset to 4096 tokens when the cron job runs later on; at least, that's how I interpret n_ctx_slot = 4096:

time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="srv  params_from_: Chat format: Hermes 2 Pro" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.583 (> 0.100 thold), f_keep = 0.003" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist " component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot launch_slot_: id  3 | task 4254 | processing task, is_child = 0" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot update_slots: id  3 | task 4254 | new prompt, n_ctx_slot = 4096, n_keep = 0, task.n_tokens = 12" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot update_slots: id  3 | task 4254 | n_tokens = 7, memory_seq_rm [7, end)" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot update_slots: id  3 | task 4254 | prompt processing progress, n_tokens = 12, batch.n_tokens = 5, progress = 1.000000" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot update_slots: id  3 | task 4254 | prompt done, n_tokens = 12, batch.n_tokens = 5" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="slot init_sampler: id  3 | task 4254 | init sampler, took 0.00 ms, tokens: text = 12, total = 12" component=llama.cpp
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Configuration for llama.cpp runner for modelID sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d unchanged"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d in completion mode"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Configuration for llama.cpp runner for modelID sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 unchanged"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:2B-UD-Q4_K_XL" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:50f70f7f0ca537b2ca4843bae5456bb4b6f9d9d58d4b357faf1a3ad9b574888d in completion mode"
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Getting model by reference: ai/qwen3-vl:8B" component=model-manager
time="2026-03-03T10:40:01Z" level=info msg="Loading llama.cpp backend runner with model sha256:a18971a77b8fda79c555f603c4f94ca0183cb40499f336dc2825659504e29fc5 in completion mode"
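That interpretation can be checked mechanically by extracting the per-slot context size from the log stream. A minimal sketch (the sample line is copied from the logs above; in practice LOG_LINE would come from something like `docker logs <dmr-container> 2>&1 | grep n_ctx_slot`, where the container name is an assumption):

```shell
# Sample slot line, taken verbatim from the logs above.
LOG_LINE='slot update_slots: id  3 | task 4254 | new prompt, n_ctx_slot = 4096, n_keep = 0, task.n_tokens = 12'

# Pull out the numeric value after "n_ctx_slot = ".
n_ctx=$(printf '%s\n' "$LOG_LINE" | sed -n 's/.*n_ctx_slot = \([0-9]*\).*/\1/p')
echo "n_ctx_slot=$n_ctx"   # 4096, i.e. the default rather than the configured 16384
```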

Versions:

$ docker version
Client: Docker Engine - Community
 Version:           29.2.1
 API version:       1.53
 Go version:        go1.25.6
 Git commit:        a5c7197
 Built:             Mon Feb  2 17:17:26 2026
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          29.2.1
  API version:      1.53 (minimum version 1.44)
  Go version:       go1.25.6
  Git commit:       6bc6209
  Built:            Mon Feb  2 17:17:26 2026
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v2.2.1
  GitCommit:        dea7da592f5d1d2b7755e3a161be07f43fad8f75
 runc:
  Version:          1.3.4
  GitCommit:        v1.3.4-0-gd6d73eb8
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ docker model version
Client:
 Version:    v1.1.5
 OS/Arch:    linux/amd64

Server:
 Version:    v1.1.0
 Engine:     Docker Engine
