[Draft] Add NIXL transfer release cancellation hook #13495
yifjiang wants to merge 4 commits into NVIDIA:main from
Summary
This draft uses the same intended base/merge point as #13439: `4e69c14f732a6e6afce4f71616db5b5cd2b10530`.
It keeps the conservative fail-closed request lifetime hardening from #13439, then adds the cancellation primitive we discussed: when TRT-LLM owns a NIXL transfer handle, cancellation should release that handle through NIXL instead of only timing out at the TRT-LLM layer.
The important semantic boundary is intentional:
`release()` means the backend accepted release of the transfer handle. It is not treated as proof that remote KV memory is quiesced and immediately safe to recycle, especially for UCX-backed one-sided RMA paths.
What Changed
- Add `TransferStatus::release()` to the C++ transfer status interface.
- Implement `NixlTransferStatus::release()` by calling `nixlAgent::releaseXferReq()` and clearing the handle after success.
- Make `NixlTransferStatus` release outstanding handles in its destructor as a final cleanup guard.
- […] `getTransferTerminate()`, and call `status->release()` before surfacing cancellation.
- Expose `release()` through the nanobind and Python transfer-status wrappers.
Cancellation Model
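A minimal Python sketch of the cancellation model, assuming a bounded-polling wait that checks a terminate flag and calls `release()` before reporting cancellation. `FakeTransferStatus`, `WaitResult`, and `wait_or_cancel` are hypothetical names for illustration only; they are not the TRT-LLM or NIXL API, and the real wait loop in this PR may be structured differently.

```python
import threading
from enum import Enum


class WaitResult(Enum):
    COMPLETED = "completed"
    CANCELLED = "cancelled"


class FakeTransferStatus:
    """Hypothetical stand-in for a transfer status object (not the TRT-LLM class)."""

    def __init__(self, completes_after_polls=None):
        # None models a transfer that never completes on its own.
        self._polls_left = completes_after_polls
        self.released = False

    def is_completed(self):
        if self._polls_left is None:
            return False
        if self._polls_left > 0:
            self._polls_left -= 1
        return self._polls_left == 0

    def release(self):
        # Models "the backend accepted release of the handle".
        # It says nothing about whether remote memory is quiesced.
        self.released = True
        return True


def wait_or_cancel(status, terminate, poll_interval_s=0.001):
    """Poll in short bounded steps; on termination, release before reporting."""
    while True:
        if status.is_completed():
            return WaitResult.COMPLETED, None
        if terminate.is_set():
            release_accepted = status.release()
            return WaitResult.CANCELLED, release_accepted
        # Event.wait returns early if the flag is set during the interval.
        terminate.wait(poll_interval_s)
```

The caller receives the release result alongside the cancellation outcome, so a rejected release can be logged or handled separately from the cancellation itself.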

    sequenceDiagram
        participant Py as Python executor
        participant Conn as AgentConnection
        participant Status as TransferStatus
        participant Nixl as NIXL agent
        participant UCX as UCX backend
        Py->>Conn: request cancellation sets transfer terminate
        Conn->>Status: wait with bounded polling
        Conn->>Conn: observe transfer terminate
        Conn->>Status: release
        Status->>Nixl: releaseXferReq
        Nixl->>UCX: backend release or cancel request
        UCX-->>Nixl: release accepted or failed
        Nixl-->>Status: status
        Status-->>Conn: release result
        Conn-->>Py: cancellation is surfaced

Before

    sequenceDiagram
        participant Py as Python executor
        participant Conn as AgentConnection
        participant Status as TransferStatus
        participant Nixl as NIXL agent
        Py->>Conn: transfer starts
        Conn->>Status: wait until complete
        Py->>Conn: cancellation requested
        Conn->>Status: still waiting
        Status-->>Conn: no TRT level release hook
        Conn-->>Py: cancellation depends on outer timeout or failure path
        Conn->>Nixl: transfer handle may remain live until natural completion or object cleanup

After

    sequenceDiagram
        participant Py as Python executor
        participant Conn as AgentConnection
        participant Status as TransferStatus
        participant Nixl as NIXL agent
        Py->>Conn: transfer starts
        Conn->>Status: wait with short poll interval
        Py->>Conn: cancellation requested
        Conn->>Status: release
        Status->>Nixl: releaseXferReq
        Nixl-->>Status: accepted or failed
        Conn-->>Py: cancellation is reported

Safety Notes
This PR does not claim that receiver-side cancellation can abort an already-issued remote sender RMA. The receiver often cannot prove that the sender is no longer writing into the target KV blocks.
For that reason, the branch preserves the conservative behavior from #13439: ambiguous in-flight generation receive failures should remain fail-closed rather than freeing and reusing KV memory as if backend quiescence had been proven.
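One way to read the fail-closed rule above is as a block-retirement policy: an accepted `release()` is never enough on its own to return KV blocks to the free list. The sketch below illustrates that reading in Python. `KvBlockPool`, `ReceiveOutcome`, and the quarantine set are hypothetical illustrations, not the data structures used in TRT-LLM or in #13439.

```python
from enum import Enum


class ReceiveOutcome(Enum):
    COMPLETED = "completed"            # transfer finished normally
    AMBIGUOUS_IN_FLIGHT = "ambiguous"  # cancelled or failed while a remote write may be live


class KvBlockPool:
    """Hypothetical KV block pool with a fail-closed retirement path."""

    def __init__(self, block_ids):
        self.free = set(block_ids)
        self.in_use = set()
        self.quarantined = set()

    def allocate(self, n):
        if n > len(self.free):
            raise RuntimeError("not enough free KV blocks")
        blocks = sorted(self.free)[:n]
        self.free -= set(blocks)
        self.in_use |= set(blocks)
        return blocks

    def retire_blocks(self, blocks, outcome, release_accepted):
        blocks = set(blocks)
        self.in_use -= blocks
        if outcome is ReceiveOutcome.AMBIGUOUS_IN_FLIGHT:
            # Fail closed: release_accepted is deliberately ignored here,
            # because an accepted release does not prove the remote sender
            # has stopped writing into these blocks.
            self.quarantined |= blocks
        else:
            self.free |= blocks
```

In this sketch, quarantined blocks stay out of circulation until some separate mechanism (not shown, and not claimed by this PR) establishes that the backend is quiescent.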
Validation
- `git diff --check`
- `PYTHONPYCACHEPREFIX=/tmp/trtllm-cancel-pycache python3 -m py_compile tensorrt_llm/_torch/disaggregation/base/agent.py tensorrt_llm/_torch/disaggregation/nixl/_agent_cpp.py tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py`
Not yet run: full C++ build or TRT-LLM runtime tests.