Why don't we need a ggml_cont for v in llm_graph_context::build_attn_mha?
#14351
Unanswered
AgainstEntropy asked this question in Q&A
Hi community,
I'm going over llama.cpp's implementation of attention and the KV cache, and I noticed something potentially inefficient when `v_trans = true`.

In `llm_graph_context::build_attn` (llama.cpp/src/llama-graph.cpp, lines 1217 to 1238 in ce82bd0), the V cache (`layers[ikv].v`) is first updated by copying `v_cur` into a transposed view of the cache via `llama_kv_cache_unified::cpy_v` (llama.cpp/src/llama-kv-cache-unified.cpp, lines 779 to 784 in ce82bd0). This ensures the first axis (`n_tokens`) is contiguous in memory, but the other axes are not.
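Roughly, that copy path looks like the following (my paraphrase of the permalinked code, not a verbatim quote; `head_cur` is my name for the insert position in the cache):

```cpp
// paraphrased sketch of the v_trans branch of llama_kv_cache_unified::cpy_v;
// v is the cache tensor, v_cur holds the batch's new values
ggml_tensor * v_view = ggml_view_2d(ctx, v,
        n_tokens, n_embd_v_gqa,
        n_ctx    * ggml_element_size(v),   // nb[1]: one full cache row per channel
        head_cur * ggml_element_size(v));  // byte offset of the first destination cell

v_cur = ggml_transpose(ctx, v_cur);        // [n_embd_v_gqa, n_tokens] -> [n_tokens, n_embd_v_gqa]
return ggml_cpy(ctx, v_cur, v_view);       // tokens land contiguously along axis 0
```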
Then, when the cache tensor is fetched via `llama_kv_cache_unified::get_v` and passed to `build_attn_mha` (llama.cpp/src/llama-kv-cache-unified.cpp, lines 741 to 746 in ce82bd0), the returned view has

`ne = [n_tokens, n_head_kv, n_embd_head_v, 1]` and
`nb = [e, e * n_ctx * n_embd_head_v, e * n_ctx, e * n_ctx * n_embd_head_v]`

(with `e = element_size(v)`), since the underlying storage has shape `[n_embd_v_gqa, n_ctx]` (llama.cpp/src/llama-kv-cache-unified.cpp, line 96 in ce82bd0; llama.cpp/src/llama-model.cpp, lines 13805 to 13816 in 9eaa51e).
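For concreteness (my toy numbers, f32 so `e = 4`): with `n_ctx = 8`, `n_head_kv = 2`, `n_embd_head_v = 3`, and `n_tokens = 4`, the view is `ne = [4, 2, 3, 1]` with `nb = [4, 4 * 8 * 3 = 96, 4 * 8 = 32, 96]`. Note `nb[1] > nb[2]`, which is exactly what marks the transposed layout.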
Inside `build_attn_mha`, `v` is then permuted with `ggml_permute(0, 2, 1, 3)` (llama.cpp/src/llama-graph.cpp, line 1024 in ce82bd0), yielding

`ne = [n_tokens, n_embd_head_v, n_head_kv, 1]` and
`nb = [e, e * n_ctx, e * n_ctx * n_embd_head_v, e * n_ctx * n_embd_head_v]`,

and this `v` is used in `ggml_mul_mat(ctx0, v, kq)`.

Here's my concern: after the permutation, `v` is not fully contiguous, especially along the 2nd and 3rd axes. This likely leads to a non-contiguous `mul_mat`, which can hurt performance.
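Here is a standalone sketch (toy sizes from the worked example above, my own naming, assuming I've read the strides correctly) that rebuilds the `get_v`-style view, applies the same permute, and prints the resulting strides:

```cpp
#include <cstdio>
#include "ggml.h"

int main() {
    // toy sizes from the worked example above
    const int64_t n_ctx         = 8;
    const int64_t n_tokens      = 4;
    const int64_t n_head_kv     = 2;
    const int64_t n_embd_head_v = 3;
    const int64_t n_embd_v_gqa  = n_head_kv * n_embd_head_v;

    struct ggml_init_params params = { 16 * 1024 * 1024, nullptr, false };
    struct ggml_context * ctx = ggml_init(params);

    // underlying cache storage: [n_embd_v_gqa, n_ctx]
    struct ggml_tensor * v_cache = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd_v_gqa, n_ctx);
    const size_t e = ggml_element_size(v_cache); // 4 for f32

    // the get_v-style view: ne = [n_tokens, n_head_kv, n_embd_head_v], nb[1] > nb[2]
    struct ggml_tensor * v = ggml_view_3d(ctx, v_cache,
            n_tokens, n_head_kv, n_embd_head_v,
            e * n_ctx * n_embd_head_v,  // nb[1]
            e * n_ctx,                  // nb[2]
            0);

    // the permute from build_attn_mha: swaps axes 1 and 2 (strides swap too)
    struct ggml_tensor * v_perm = ggml_permute(ctx, v, 0, 2, 1, 3);

    printf("view:    nb = [%zu, %zu, %zu]\n", v->nb[0], v->nb[1], v->nb[2]);
    printf("permute: nb = [%zu, %zu, %zu], contiguous = %d\n",
           v_perm->nb[0], v_perm->nb[1], v_perm->nb[2], (int) ggml_is_contiguous(v_perm));
    // expected: view nb = [4, 96, 32]; permuted nb = [4, 32, 96], contiguous = 0,
    // since a contiguous layout would need nb[1] = e * n_tokens = 16, not 32

    ggml_free(ctx);
    return 0;
}
```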
Would adding a `ggml_cont(v)` before the `ggml_mul_mat` improve performance? It would make all axes contiguous:

`ne = [n_tokens, n_embd_head_v, n_head_kv, 1]` and
`nb = [e, e * n_tokens, e * n_tokens * n_embd_head_v, e * n_tokens * n_embd_head_v]`.

While `ggml_cont` introduces an extra copy, I suspect its cost is less than that of an inefficient `mul_mat`. Or is `v` already made contiguous elsewhere before the matmul?

Am I missing something here? I'd appreciate any insights or corrections. Thanks in advance!
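Concretely, the change I'm wondering about would be something like this (untested sketch, surrounding code abbreviated; `kqv` is my name for the result):

```cpp
// untested sketch: make v contiguous before the matmul with kq
v = ggml_permute(ctx0, v, 0, 2, 1, 3);  // existing permute
v = ggml_cont(ctx0, v);                 // proposed: materialize a contiguous copy
// ... kq is scaled/masked/softmaxed as before ...
ggml_tensor * kqv = ggml_mul_mat(ctx0, v, kq);
```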