
[Draft] Add NIXL transfer release cancellation hook #13495

Draft
yifjiang wants to merge 4 commits into NVIDIA:main from yifjiang:codex/nixl-transfer-cancel-release

Conversation

@yifjiang
Contributor

Summary

This draft uses the same intended base/merge point as #13439: 4e69c14f732a6e6afce4f71616db5b5cd2b10530.

It keeps the conservative fail-closed request lifetime hardening from #13439, then adds the cancellation primitive we discussed: when TRT-LLM owns a NIXL transfer handle, cancellation should release that handle through NIXL instead of only timing out at the TRT-LLM layer.

The important semantic boundary is intentional: release() means the backend accepted release of the transfer handle. It is not treated as proof that remote KV memory is quiesced and immediately safe to recycle, especially for UCX-backed one-sided RMA paths.
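To make that boundary concrete, here is a minimal Python sketch of the intended bookkeeping (all names here are hypothetical illustrations, not the actual TRT-LLM API): release acceptance and memory quiescence are tracked as two separate facts, and only the latter allows recycling.

```python
class KvBlockPool:
    """Toy KV pool: blocks are recycled only once quiescence is proven."""

    def __init__(self):
        self.quarantined = set()
        self.free = set()

    def on_release_accepted(self, block_id):
        # release() succeeded: the backend accepted release of the handle,
        # but a remote one-sided RMA write may still be in flight, so the
        # block is quarantined rather than freed.
        self.quarantined.add(block_id)

    def on_quiescence_proven(self, block_id):
        # Only an explicit quiescence signal moves a block to the free list.
        self.quarantined.discard(block_id)
        self.free.add(block_id)


pool = KvBlockPool()
pool.on_release_accepted(7)
assert 7 in pool.quarantined and 7 not in pool.free
pool.on_quiescence_proven(7)
assert 7 in pool.free
```

The point of the two-step transition is that a successful `release()` alone never promotes a block to `free`.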

What Changed

  • Added TransferStatus::release() to the C++ transfer status interface.
  • Implemented NixlTransferStatus::release() by calling nixlAgent::releaseXferReq() and clearing the handle after success.
  • Made NixlTransferStatus release outstanding handles in its destructor as a final cleanup guard.
  • Changed the sender-side transfer wait loop to poll in bounded intervals, observe getTransferTerminate(), and call status->release() before surfacing cancellation.
  • Changed sync and ready notification waits to return whether the expected notification actually arrived.
  • Made receive paths fail when termination stops the notification wait instead of continuing as though synchronization succeeded.
  • Exposed release() through the nanobind and Python transfer-status wrappers.
  • Updated the disaggregated KV hardening notes to document what is now implemented and what remains intentionally fail-closed.

Cancellation Model

```mermaid
sequenceDiagram
    participant Py as Python executor
    participant Conn as AgentConnection
    participant Status as TransferStatus
    participant Nixl as NIXL agent
    participant UCX as UCX backend

    Py->>Conn: request cancellation sets transfer terminate
    Conn->>Status: wait with bounded polling
    Conn->>Conn: observe transfer terminate
    Conn->>Status: release
    Status->>Nixl: releaseXferReq
    Nixl->>UCX: backend release or cancel request
    UCX-->>Nixl: release accepted or failed
    Nixl-->>Status: status
    Status-->>Conn: release result
    Conn-->>Py: cancellation is surfaced
```

Before

```mermaid
sequenceDiagram
    participant Py as Python executor
    participant Conn as AgentConnection
    participant Status as TransferStatus
    participant Nixl as NIXL agent

    Py->>Conn: transfer starts
    Conn->>Status: wait until complete
    Py->>Conn: cancellation requested
    Conn->>Status: still waiting
    Status-->>Conn: no TRT level release hook
    Conn-->>Py: cancellation depends on outer timeout or failure path
    Conn->>Nixl: transfer handle may remain live until natural completion or object cleanup
```

After

```mermaid
sequenceDiagram
    participant Py as Python executor
    participant Conn as AgentConnection
    participant Status as TransferStatus
    participant Nixl as NIXL agent

    Py->>Conn: transfer starts
    Conn->>Status: wait with short poll interval
    Py->>Conn: cancellation requested
    Conn->>Status: release
    Status->>Nixl: releaseXferReq
    Nixl-->>Status: accepted or failed
    Conn-->>Py: cancellation is reported
```

Safety Notes

This PR does not claim that receiver-side cancellation can abort an already-issued remote sender RMA. The receiver often cannot prove that the sender is no longer writing into the target KV blocks.

For that reason, the branch preserves the conservative behavior from #13439: ambiguous in-flight generation receive failures should remain fail-closed rather than freeing and reusing KV memory as if backend quiescence had been proven.
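A minimal Python sketch of that fail-closed rule (hypothetical names for illustration; the real policy lives in the C++/Python disaggregation layer):

```python
class Pool:
    """Toy KV pool with free and quarantined block sets."""

    def __init__(self):
        self.free = set()
        self.quarantined = set()


def handle_receive_failure(block_ids, quiescence_proven, pool):
    """On an ambiguous in-flight receive failure, keep KV blocks out of
    circulation unless backend quiescence has actually been proven."""
    if quiescence_proven:
        for b in block_ids:
            pool.free.add(b)
    else:
        # Fail closed: quarantine rather than reuse memory that a remote
        # RMA write might still target.
        for b in block_ids:
            pool.quarantined.add(b)


pool = Pool()
handle_receive_failure([1, 2], quiescence_proven=False, pool=pool)
assert pool.quarantined == {1, 2} and not pool.free
```

The default branch is the quarantine branch; freeing requires an affirmative quiescence proof, never its absence.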

Validation

  • `git diff --check`
  • `PYTHONPYCACHEPREFIX=/tmp/trtllm-cancel-pycache python3 -m py_compile tensorrt_llm/_torch/disaggregation/base/agent.py tensorrt_llm/_torch/disaggregation/nixl/_agent_cpp.py tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py`

Not yet run: full C++ build or TRT-LLM runtime tests.

@svc-trtllm-gh-bot added the Community want to contribute (PRs initiated from Community) label on Apr 27, 2026
