Since LLaDA2 uses the KV cache outside its diffusion blocks in the transformers implementation, I implemented that in my PR behind a diffusion CLI arg toggle, but I was advised that it's not recommended to add optimizations in a support-only PR and to do them later in a separate optimization PR. So now I have a problem: I don't want a weird CLI arg just for LLaDA2, which would be needed to add proper masking with KV cache handling to the diffusion CLI itself, and that isn't great either. I then tried to implement it inside llada2.cpp, but ran into KV cache allocation crashes there. So I looked for another way to do the proper masking and found that I can do it in llama-kv-cache.cpp, under void llama_kv_cache::set_input_kq_mask(ggml_tensor * dst, const llama_ubatch * ubatch, bool causal_attn, int32_t block_length) const, since other models might need the same logic in the future. It would be gated by the block size, so only diffusion models would hit that code path:
if (block_length > 0) {
    const int32_t block_p0 = p0 / block_length;
    const int32_t block_p1 = p1 / block_length;

    // mask if key is in a future block
    if (block_p0 > block_p1) {
        continue;
    }
}
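To make the intent of the check concrete, here is a minimal, self-contained sketch of the block-causal visibility rule it encodes, assuming (as in the snippet) that p0 is the key position in the cache and p1 is the query token position. key_visible and the demo in main are hypothetical helpers for illustration only; they are not the actual set_input_kq_mask body, just the mask pattern it would produce.

#include <cstdint>
#include <cstdio>

// Hypothetical helper, not llama.cpp code: block-causal visibility as the
// snippet above encodes it. p0 = key position in the cache, p1 = query
// position; block_length = 0 means "no block gating" (non-diffusion models).
static bool key_visible(int32_t p0, int32_t p1, int32_t block_length) {
    if (block_length <= 0) {
        return true;                                 // existing behavior, no gating
    }
    const int32_t block_p0 = p0 / block_length;      // block of the key
    const int32_t block_p1 = p1 / block_length;      // block of the query
    return block_p0 <= block_p1;                     // keys in future blocks are masked
}

int main() {
    const int32_t block_length = 4;
    const int32_t n_tokens     = 8;
    // '.' = attend, 'x' = masked; rows are query positions, columns are key positions
    for (int32_t p1 = 0; p1 < n_tokens; ++p1) {
        for (int32_t p0 = 0; p0 < n_tokens; ++p0) {
            putchar(key_visible(p0, p1, block_length) ? '.' : 'x');
        }
        putchar('\n');
    }
    return 0;
}

With block_length = 4, each query attends bidirectionally within its own block and to everything in earlier blocks, while keys in future blocks stay masked, which is the pattern a block-diffusion model like LLaDA2 needs when it caches already-generated blocks.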
Now my question is basically whether that is a good addition for future models, or whether it might cause issues with future block-based diffusion models that use the KV cache like LLaDA2 but rely on something else. That way I could use the KV cache later without issues (;
Also, should I open another PR for that extra support?