[KVCache] DSA for v1 cache manager #7787
Conversation
Thanks for your contribution!
Pull request overview
This PR pushes the per-layer KV cache allocation logic down into AttentionBackend (via the new create_kv_cache interface), so that the CacheController in cache_manager/v1 is only responsible for role-to-storage-name mapping, registration, and the optional set_data_ipc pin. This reduces the controller's coupling to the different attention variants (GQA/MLA/DSA).
Changes:
- AttentionBackend gains pin_kv_cache_for_cudagraph and a default create_kv_cache(...) (GQA/MHA: key/value, plus scales for fp8).
- The MLA/DSA backends override create_kv_cache: MLA allocates key only; DSA returns key + indexer (uint8).
- CacheController.initialize_kv_cache / initialize_mtp_kv_cache now call attn_backend.create_kv_cache per layer, and add an "indexer" role storage-name mapping plus cudagraph pin logic.
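To make the allocation contract described in this list concrete, here is a minimal sketch. It is illustrative only: the class names, constructor arguments, and tensor shapes below are assumptions, not the actual FastDeploy signatures.

```python
# Sketch of the per-layer allocation contract (not the real FastDeploy code).
# Backends return {(role, layer_idx): tensor}; the controller only maps roles
# (e.g. "key", "value", "indexer") to storage names and registers the tensors.
from typing import Dict, Tuple

import paddle


class SketchGQABackend:
    """Default GQA/MHA layout: one key and one value tensor per layer."""

    def __init__(self, num_blocks, block_size, kv_heads, head_dim):
        self.kv_shape = [num_blocks, kv_heads, block_size, head_dim]

    def create_kv_cache(self, num_layers, layer_offset=0, cache_dtype="bfloat16"):
        caches: Dict[Tuple[str, int], paddle.Tensor] = {}
        for layer_idx in range(layer_offset, layer_offset + num_layers):
            caches[("key", layer_idx)] = paddle.zeros(self.kv_shape, dtype=cache_dtype)
            caches[("value", layer_idx)] = paddle.zeros(self.kv_shape, dtype=cache_dtype)
        return caches


class SketchDSABackend(SketchGQABackend):
    """DSA layout: uint8 key cache plus a uint8 indexer cache, no value tensor."""

    def __init__(self, num_blocks, block_size, kv_heads, head_dim, indexer_dim):
        super().__init__(num_blocks, block_size, kv_heads, head_dim)
        self.indexer_shape = [num_blocks, block_size, indexer_dim]

    def create_kv_cache(self, num_layers, layer_offset=0, cache_dtype="bfloat16"):
        # cache_dtype is ignored: DSA stores packed fp8 values + scales as uint8.
        caches = {}
        for layer_idx in range(layer_offset, layer_offset + num_layers):
            caches[("key", layer_idx)] = paddle.zeros(self.kv_shape, dtype="uint8")
            caches[("indexer", layer_idx)] = paddle.zeros(self.indexer_shape, dtype="uint8")
        return caches
```

With this contract the controller never needs variant-specific branches: it only iterates the returned (role, layer_idx) keys.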
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/attention/base_attention_backend.py | Adds a generic KV cache allocation entry point and a cudagraph pin flag to the attention backend. |
| fastdeploy/model_executor/layers/attention/mla_attention_backend.py | MLA backend overrides KV cache allocation: allocates only the compressed latent key cache and requires pinning. |
| fastdeploy/model_executor/layers/attention/dsa_attention_backend.py | DSA backend overrides KV cache allocation: allocates uint8 key + uint8 indexer and requires pinning. |
| fastdeploy/cache_manager/v1/cache_controller.py | Controller refactored to role registration/name mapping + optional pinning; the main model and MTP share the same allocation path. |
CI report generated from the code below (refreshed every 30 minutes):
1. Task overview
2. Task status summary
   2.1 Required tasks: 6/10 passed
   2.2 Optional tasks: 28/32 passed
3. Failure details (required only)
   Approval (code style / approval gate, confidence: high)
   Suggested fix: please contact xyxinyang or zyyzghb to complete the PR approval. Link: view logs
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7787 +/- ##
==========================================
Coverage ? 72.10%
==========================================
Files ? 398
Lines ? 55976
Branches ? 8749
==========================================
Hits ? 40364
Misses ? 12844
Partials ? 2768
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
fastdeploy/model_executor/layers/attention/base_attention_backend.py:137
The docstring of create_host_kv_cache says it returns an empty dict when host alloc is unavailable, but the implementation raises RuntimeError when cuda_host_alloc is None. This makes it hard for callers (such as CacheController) to implement the documented degradation path. Either return {} as documented and let the caller skip swap space initialization, or fix the docstring and have the caller explicitly catch the exception.
Returns:
Dict keyed by ``(role, layer_idx)``. Empty dict if host alloc is
unavailable on the current platform.
"""
if cuda_host_alloc is None:
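If the documented behavior is kept, the body could instead end roughly like this (a sketch of the "return an empty dict" option, not the committed code; logger/self are taken from the excerpt above):

```python
# Sketch: follow the docstring — return an empty dict when pinned host
# allocation is unavailable, and let the caller skip swap space setup.
if cuda_host_alloc is None:
    logger.warning(
        f"[create_host_kv_cache][{type(self).__name__}] cuda_host_alloc is not "
        "available on this platform; skipping host cache allocation"
    )
    return {}
```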
fastdeploy/cache_manager/v1/cache_controller.py:544
initialize_host_cache currently only catches NotImplementedError. But the default AttentionBackend.create_host_kv_cache() raises RuntimeError when cuda_host_alloc is None (and some backends may likewise raise RuntimeError), so enabling swap space fails initialization outright. Consider also catching RuntimeError here (and TypeError/AttributeError if needed) and skipping host cache initialization with a warning, so that platforms without pinned host alloc can still run in a degraded mode.
try:
host_caches = attn_backend.create_host_kv_cache(
num_layers=num_layers,
num_blocks=num_host_blocks,
cache_item_bytes=cache_item_bytes,
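A sketch of the broadened catch described above; the surrounding variables are taken from the excerpt, and the exact placement in cache_controller.py is assumed:

```python
# Sketch: catch RuntimeError alongside NotImplementedError so platforms
# without pinned host allocation degrade to "no swap space" instead of failing.
try:
    host_caches = attn_backend.create_host_kv_cache(
        num_layers=num_layers,
        num_blocks=num_host_blocks,
        cache_item_bytes=cache_item_bytes,
    )
except (NotImplementedError, RuntimeError) as e:
    logger.warning(
        f"[CacheController] Host kv cache offload not supported by "
        f"{type(attn_backend).__name__}: {e}. Skipping swap space setup."
    )
    return
```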
class AttentionBackend(ABC):
    """The base class of attention backends"""

    @abstractmethod
    def init_attention_metadata(self, forward_meta: ForwardMeta):
        """Initialize the forward metadata."""
        raise NotImplementedError
caches = attn_backend.create_kv_cache(
    num_layers=self._num_layers,
    num_blocks=num_gpu_blocks,
    cache_dtype=cache_dtype,
    kv_cache_quant_type=kv_cache_quant_type,
)
return caches

def create_host_kv_cache(
Please also support a free_host_kv_cache method, and move the controller-side implementation down into the backend here.
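Something along these lines might work as the backend-level default (a sketch only; cuda_host_free is assumed here as the counterpart to cuda_host_alloc, and the real op in fastdeploy.cache_manager.ops may have a different name or signature):

```python
# Sketch of a backend-level default for freeing pinned host buffers.
class _SketchBackend:
    def free_host_kv_cache(self, host_caches: dict) -> None:
        """Release pinned host buffers returned by create_host_kv_cache."""
        try:
            from fastdeploy.cache_manager.ops import cuda_host_free  # assumed op
        except Exception:
            cuda_host_free = None
        if cuda_host_free is None or not host_caches:
            return
        for (_role, _layer_idx), ptr in host_caches.items():
            cuda_host_free(ptr)
        host_caches.clear()
```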
Returns:
    Dict keyed by ``(role, layer_idx)``. Empty dict if host alloc is
    unavailable on the current platform.
"""
if cuda_host_alloc is None:
    raise RuntimeError(
        f"[create_host_kv_cache][{type(self).__name__}] cuda_host_alloc " "is not available on this platform"
    )
DSA cache: uint8 key cache + uint8 indexer cache (no separate value, no scales).

`cache_dtype` is ignored; DSA always stores packed fp8+scales as uint8.
`kv_cache_quant_type` is coerced to "uint8" internally.
"""
key_shape, _, indexer_shape = self.get_kv_cache_shape(max_num_blocks=num_blocks, kv_cache_quant_type="uint8")
logger.info(
    f"[create_kv_cache][DSA] num_layers={num_layers} layer_offset={layer_offset} "
    f"key_shape={key_shape} indexer_shape={indexer_shape} dtype=uint8"
)
caches = {}
for layer_idx in range(layer_offset, layer_offset + num_layers):
    caches[("key", layer_idx)] = paddle.full(shape=key_shape, fill_value=0, dtype="uint8")
    caches[("indexer", layer_idx)] = paddle.full(shape=indexer_shape, fill_value=0, dtype="uint8")
# fp8 scales use float32 (4 bytes), shape [num_blocks, k1, k2].
scale_elems = key_shape[1] * key_shape[2] if is_fp8 else 0
scale_bytes = num_blocks * 4 * scale_elems if is_fp8 else 0
raise RuntimeError(
    f"[create_host_kv_cache][{type(self).__name__}] cuda_host_alloc " "is not available on this platform"
)
if kv_cache_quant_type == "block_wise_fp8":
    caches[("key_scale", layer_idx)] = paddle.zeros([1], dtype="float32")
    if resolved_val_shape is not None:
        caches[("value_scale", layer_idx)] = paddle.zeros([1], dtype="float32")
return caches
Dict keyed by ``(role, layer_idx)``. Empty dict if host alloc is
unavailable on the current platform.
# fp8 scales use float32 (4 bytes), shape [num_blocks, k1, k2].
scale_elems = key_shape[1] * key_shape[2] if is_fp8 else 0
scale_bytes = num_blocks * 4 * scale_elems if is_fp8 else 0

logger.info(
    f"[create_host_kv_cache][{type(self).__name__}] num_layers={num_layers} "
    f"layer_offset={layer_offset} num_blocks={num_blocks} "
    f"key_bytes_per_layer={key_bytes} value_bytes_per_layer={value_bytes} "
    f"scale_bytes_per_layer={scale_bytes} kv_cache_quant_type={kv_cache_quant_type}"
name->ptr bookkeeping; the backend reference is captured at
``initialize_host_cache`` time.
caches = self.attn_backend.create_kv_cache(
    num_layers=self._num_layers,
    num_blocks=num_gpu_blocks,
    cache_dtype=cache_dtype,
    kv_cache_quant_type=kv_cache_quant_type,
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-13 20:31:34
📋 Review Summary
PR overview: moves the per-layer KV cache allocation logic from CacheController down into each AttentionBackend, adds DSA layout support (key + indexer, uint8), and removes the existing MLACacheController/DSACacheController subclasses.
Scope of changes: fastdeploy/cache_manager/v1/, fastdeploy/model_executor/layers/attention/
Impact tags: [KVCache] [OP]
📝 PR Convention Check
Both the title format and the description template structure are compliant. The ## Accuracy Tests section notes that "MLA / GQA model verification is still pending"; please add the corresponding accuracy data (or explain why it is missing) before merging. The "Add unit tests" checklist item is unchecked although test files were updated; please tick it.
Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | base_attention_backend.py:139 | create_host_kv_cache raises RuntimeError, but the caller only catches NotImplementedError, so initialization on non-GPU platforms crashes |
| 🟡 Suggestion | mla_attention_backend.py:658 | Checking `if cuda_host_alloc is None` right after a direct import is dead code and offers no platform protection |
| 🟡 Suggestion | cache_controller.py:554 | After the early return on NotImplementedError, attributes such as _host_key_cache_shape are never assigned, so downstream access raises AttributeError |
| 🟡 Suggestion | test_cache_controller.py:763 | make_mock_attn_backend only mocks create_kv_cache, not create_host_kv_cache, so the host cache test path may be masked |
Overall Assessment
The refactoring approach is clear: pushing variant-specific allocation logic down into the backends effectively reduces CacheController's complexity. However, one P0 exception-type mismatch must be fixed: on non-GPU platforms create_host_kv_cache raises RuntimeError rather than NotImplementedError, so the caller's exception handling never takes effect. There are also two P1 risks (a dead-code guard and uninitialized attributes) and a test coverage gap that should be addressed together.
unavailable on the current platform.
"""
if cuda_host_alloc is None:
    raise RuntimeError(
🔴 Bug: RuntimeError does not match the NotImplementedError expected by the caller
cache_controller.initialize_host_cache only uses except NotImplementedError; when cuda_host_alloc is None (non-GPU platforms), the RuntimeError raised here propagates upward and crashes initialization.
Suggestion: change the RuntimeError to NotImplementedError, or have the caller catch except (NotImplementedError, RuntimeError).
from fastdeploy.cache_manager.ops import cuda_host_alloc

if cuda_host_alloc is None:
    raise RuntimeError("[create_host_kv_cache][MLA] cuda_host_alloc is not available")
🟡 Suggestion: the None check can never trigger (dead code)
Here cuda_host_alloc is imported directly via from ... import cuda_host_alloc; a direct import never yields None (an ImportError is raised if the module is missing), so the if cuda_host_alloc is None check is always False and provides no protection.
Suggestion: wrap the import in a module-level try/except instead (matching base_attention_backend.py):
try:
    from fastdeploy.cache_manager.ops import cuda_host_alloc
except Exception:
    cuda_host_alloc = None

# then check inside the function body
if cuda_host_alloc is None:
    raise NotImplementedError(...)

logger.warning(
| f"[CacheController] Host kv cache offload not supported by " | ||
| f"{type(attn_backend).__name__}: {e}. Skipping swap space setup." | ||
| ) |
🟡 Suggestion: after the early return, attributes such as _host_key_cache_shape are never assigned
For DSA, create_host_kv_cache raises NotImplementedError and this branch returns early, but _host_key_cache_shape, _host_value_cache_shape, _host_cache_scale_shape and _num_host_blocks are only assigned later. If downstream code (such as transfer_manager) accesses these attributes, it raises AttributeError.
Suggestion: initialize these attributes to None in __init__ so the object always carries the complete attribute set.
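A minimal sketch of that defaulting (attribute names are taken from this comment; the real __init__ carries much more state):

```python
# Sketch: default the host-cache bookkeeping attributes in __init__ so an
# early return from initialize_host_cache never leaves the controller half-built.
class CacheController:
    def __init__(self, *args, **kwargs):
        # ... existing initialization ...
        self._host_key_cache_shape = None
        self._host_value_cache_shape = None
        self._host_cache_scale_shape = None
        self._num_host_blocks = None
```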
    return caches

backend.create_kv_cache.side_effect = fake_create_kv_cache
return backend
🟡 Suggestion: make_mock_attn_backend does not mock create_host_kv_cache
Only create_kv_cache is mocked; create_host_kv_cache is auto-generated by MagicMock and returns a MagicMock object. cache_controller.initialize_host_cache then calls .items() on it and iterates an incorrect result, which can mask real host cache initialization problems.
Suggested addition:
backend.create_host_kv_cache.return_value = {}  # or return {(role, layer_idx): ptr} as needed
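For context, a fuller version of the fixture with both hooks stubbed might look like this (the fake_create_kv_cache body here is assumed, not copied from the test file):

```python
# Sketch: stub both allocation hooks so host-cache code paths see a real dict
# ({(role, layer_idx): ...}) instead of an auto-generated MagicMock.
from unittest.mock import MagicMock


def make_mock_attn_backend():
    backend = MagicMock()

    def fake_create_kv_cache(num_layers, num_blocks, **kwargs):
        return {("key", i): object() for i in range(num_layers)}

    backend.create_kv_cache.side_effect = fake_create_kv_cache
    # Empty dict => CacheController.initialize_host_cache skips swap space setup.
    backend.create_host_kv_cache.return_value = {}
    return backend
```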
Motivation
Move the per-layer KV cache allocation logic from CacheController down into AttentionBackend, making CacheController variant-agnostic. Add DSA (DeepSeek V3.2-Exp-BF16) cache layout support (key uint8 + indexer uint8), and provide an extensible basis for adding attention variants later (no CacheController changes required).
Modifications
- base_attention_backend.py: add a default create_kv_cache() implementation (GQA/MHA key + value, with block_wise_fp8 scale support); add default create_host_kv_cache() and free_host_kv_cache() implementations
- dsa_attention_backend.py: override create_kv_cache() to return {"key": uint8, "indexer": uint8}; override create_host_kv_cache() to raise NotImplementedError (host cache offload not supported yet)
- mla_attention_backend.py: override create_kv_cache() to return {"key": tensor}; override create_host_kv_cache() to allocate only the key buffer
- cache_controller.py: rewrite initialize_kv_cache / initialize_mtp_kv_cache to allocate uniformly through attn_backend.create_kv_cache(); add _format_cache_name(); rewrite initialize_host_cache and _free_host_cache to delegate to the backend; remove MLACacheController, DSACacheController, and create_cache_controller() (see the sketch after this list)
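A rough sketch of the resulting controller flow. The helper names, naming scheme, and registration container below are illustrative assumptions, not the actual implementation:

```python
# Sketch of the variant-agnostic allocation path in CacheController: every
# backend returns {(role, layer_idx): tensor}; the controller only names,
# registers, and (optionally) pins the tensors for cudagraph.
from typing import Dict


class SketchController:
    def __init__(self, attn_backend, num_layers: int, rank: int = 0):
        self.attn_backend = attn_backend
        self._num_layers = num_layers
        self.rank = rank
        self.cache_tensors: Dict[str, object] = {}  # assumed registry

    def _format_cache_name(self, role: str, layer_idx: int) -> str:
        # Hypothetical naming scheme; the real _format_cache_name may differ.
        return f"{role}_caches_{layer_idx}_rank{self.rank}"

    def initialize_kv_cache(self, num_gpu_blocks, cache_dtype, kv_cache_quant_type=None):
        caches = self.attn_backend.create_kv_cache(
            num_layers=self._num_layers,
            num_blocks=num_gpu_blocks,
            cache_dtype=cache_dtype,
            kv_cache_quant_type=kv_cache_quant_type,
        )
        for (role, layer_idx), tensor in caches.items():
            name = self._format_cache_name(role, layer_idx)
            self.cache_tensors[name] = tensor
            # Optionally pin/share the tensor for cudagraph here
            # (e.g. via set_data_ipc) when the backend requests it.
```

This pairs with the backend sketches shown earlier in the Copilot overview: any new attention variant only needs to return its own role set from create_kv_cache.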
Usage or Command
N/A
Accuracy Tests
DSA (DeepSeek V3.2-Exp-BF16) end-to-end /v1/chat/completions requests verified. MLA / GQA model verification still pending.
Checklist
- PR tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR targets the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.