Eval bug: Segfault at slot initialization with CUDA on SM75 (Turing) — ngl > 0 #41

@Bino5150

Name and Version

./llama-server --version
version: 9867 (e727109)
built with GNU 14.2.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server --model model.gguf --port 8081 -ngl 32 -c 8192 --flash-attn on -ctk turbo4 -ctv turbo3 -t 6

Problem description & steps to reproduce

CPU: Intel i7
GPU: NVIDIA Quadro T1000 Max-Q (SM75 / Turing 4GB)
CUDA: 12.8

The server segfaults immediately after logging "initializing slots" whenever any layers are offloaded to CUDA (-ngl > 0) on SM75 (Turing) hardware. A CPU-only run (-ngl 0) loads and serves correctly.

Segfault occurs at slot initialization with any -ngl > 0
Reproducible with both MTP and non-MTP models
Occurs with and without --spec-type mtp
Occurs with and without --no-warmup
The warning "fused Gated Delta Net (chunked) not supported, set to disabled" appears consistently before the crash
Model: Qwopus3.5-4B-v3-MTP Q5_K_M GGUF
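
To make the CPU-only versus offloaded comparison easy to repeat, here is a minimal bash sketch. The helper names `classify_status` and `run_case`, the 60-second limit, and the log file names are assumptions for illustration and are not part of llama.cpp. The server command line is the one from this report. The sketch relies on the shell convention that a process killed by SIGSEGV is reported as exit status 139 (128 + 11), and on GNU `timeout` returning 124 when the time limit is reached, which here would mean the server was still running and had not crashed.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map a shell exit status to a short label.
classify_status() {
  local status=$1
  if [ "$status" -eq 0 ]; then
    echo "ok"
  elif [ "$status" -eq 124 ]; then
    echo "timeout"    # GNU timeout hit its limit: the server was still alive
  elif [ "$status" -eq 139 ]; then
    echo "segfault"   # 128 + 11 (SIGSEGV)
  else
    echo "exit $status"
  fi
}

# Hypothetical driver: run the reported command with one -ngl value,
# save its output to a per-case log, and print how the run ended.
run_case() {
  local ngl=$1
  timeout 60 ./llama-server --model model.gguf --port 8081 \
    -ngl "$ngl" -c 8192 --flash-attn on -ctk turbo4 -ctv turbo3 -t 6 \
    >"ngl_${ngl}.log" 2>&1
  local status=$?
  echo "-ngl $ngl: $(classify_status "$status")"
}
```

Running `run_case 0` followed by `run_case 32` from the build directory should, if the behavior matches this report, print "timeout" for the first case and "segfault" for the second, with the full server output saved in `ngl_0.log` and `ngl_32.log`.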

First Bad Commit

No response

Relevant log output

Here are the last log lines before the segfault:
W sched_reserve: layer 0 is assigned to device CPU but the fused Gated Delta Net tensor is assigned to device CUDA0
W sched_reserve: fused Gated Delta Net (chunked) not supported, set to disabled
I srv load_model: initializing slots, n_slots = 4
[segfault]
