
[Draft] Add NIXL transfer release cancellation hook #13495

Draft
yifjiang wants to merge 4 commits into NVIDIA:main from yifjiang:codex/nixl-transfer-cancel-release

Conversation

@yifjiang
Contributor

Summary

This draft uses the same intended base/merge point as #13439: 4e69c14f732a6e6afce4f71616db5b5cd2b10530.

It keeps the conservative fail-closed request lifetime hardening from #13439, then adds the cancellation primitive we discussed: when TRT-LLM owns a NIXL transfer handle, cancellation should release that handle through NIXL instead of only timing out at the TRT-LLM layer.

The important semantic boundary is intentional: release() means the backend accepted release of the transfer handle. It is not treated as proof that remote KV memory is quiesced and immediately safe to recycle, especially for UCX-backed one-sided RMA paths.
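To make that boundary concrete, here is a minimal Python sketch of the intended bookkeeping (all names here are hypothetical illustrations, not the actual TRT-LLM API): release acceptance and memory quiescence are tracked as two separate facts, and only the latter allows recycling.

```python
class KvBlockPool:
    """Toy KV pool: blocks are recycled only once quiescence is proven."""

    def __init__(self):
        self.quarantined = set()
        self.free = set()

    def on_release_accepted(self, block_id):
        # release() succeeded: the backend accepted release of the handle,
        # but a remote one-sided RMA write may still be in flight, so the
        # block is quarantined rather than freed.
        self.quarantined.add(block_id)

    def on_quiescence_proven(self, block_id):
        # Only an explicit quiescence signal moves a block to the free list.
        self.quarantined.discard(block_id)
        self.free.add(block_id)


pool = KvBlockPool()
pool.on_release_accepted(7)
assert 7 in pool.quarantined and 7 not in pool.free
pool.on_quiescence_proven(7)
assert 7 in pool.free
```

The point of the two-step transition is that a successful `release()` alone never promotes a block to `free`.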

What Changed

  • Added TransferStatus::release() to the C++ transfer status interface.
  • Implemented NixlTransferStatus::release() by calling nixlAgent::releaseXferReq() and clearing the handle after success.
  • Made NixlTransferStatus release outstanding handles in its destructor as a final cleanup guard.
  • Changed the sender-side transfer wait loop to poll in bounded intervals, observe getTransferTerminate(), and call status->release() before surfacing cancellation.
  • Changed sync and ready notification waits to return whether the expected notification actually arrived.
  • Made receive paths fail when termination stops the notification wait instead of continuing as though synchronization succeeded.
  • Exposed release() through the nanobind and Python transfer-status wrappers.
  • Updated the disaggregated KV hardening notes to document what is now implemented and what remains intentionally fail-closed.

Cancellation Model

```mermaid
sequenceDiagram
    participant Py as Python executor
    participant Conn as AgentConnection
    participant Status as TransferStatus
    participant Nixl as NIXL agent
    participant UCX as UCX backend

    Py->>Conn: request cancellation sets transfer terminate
    Conn->>Status: wait with bounded polling
    Conn->>Conn: observe transfer terminate
    Conn->>Status: release
    Status->>Nixl: releaseXferReq
    Nixl->>UCX: backend release or cancel request
    UCX-->>Nixl: release accepted or failed
    Nixl-->>Status: status
    Status-->>Conn: release result
    Conn-->>Py: cancellation is surfaced
```

Before

```mermaid
sequenceDiagram
    participant Py as Python executor
    participant Conn as AgentConnection
    participant Status as TransferStatus
    participant Nixl as NIXL agent

    Py->>Conn: transfer starts
    Conn->>Status: wait until complete
    Py->>Conn: cancellation requested
    Conn->>Status: still waiting
    Status-->>Conn: no TRT level release hook
    Conn-->>Py: cancellation depends on outer timeout or failure path
    Conn->>Nixl: transfer handle may remain live until natural completion or object cleanup
```

After

```mermaid
sequenceDiagram
    participant Py as Python executor
    participant Conn as AgentConnection
    participant Status as TransferStatus
    participant Nixl as NIXL agent

    Py->>Conn: transfer starts
    Conn->>Status: wait with short poll interval
    Py->>Conn: cancellation requested
    Conn->>Status: release
    Status->>Nixl: releaseXferReq
    Nixl-->>Status: accepted or failed
    Conn-->>Py: cancellation is reported
```

Safety Notes

This PR does not claim that receiver-side cancellation can abort an already-issued remote sender RMA. The receiver often cannot prove that the sender is no longer writing into the target KV blocks.

For that reason, the branch preserves the conservative behavior from #13439: ambiguous in-flight generation receive failures should remain fail-closed rather than freeing and reusing KV memory as if backend quiescence had been proven.
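A minimal Python sketch of that fail-closed rule (hypothetical names for illustration; the real policy lives in the C++/Python disaggregation layer):

```python
class Pool:
    """Toy KV pool with free and quarantined block sets."""

    def __init__(self):
        self.free = set()
        self.quarantined = set()


def handle_receive_failure(block_ids, quiescence_proven, pool):
    """On an ambiguous in-flight receive failure, keep KV blocks out of
    circulation unless backend quiescence has actually been proven."""
    if quiescence_proven:
        for b in block_ids:
            pool.free.add(b)
    else:
        # Fail closed: quarantine rather than reuse memory that a remote
        # RMA write might still target.
        for b in block_ids:
            pool.quarantined.add(b)


pool = Pool()
handle_receive_failure([1, 2], quiescence_proven=False, pool=pool)
assert pool.quarantined == {1, 2} and not pool.free
```

The default branch is the quarantine branch; freeing requires an affirmative quiescence proof, never its absence.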

Validation

  • `git diff --check`
  • `PYTHONPYCACHEPREFIX=/tmp/trtllm-cancel-pycache python3 -m py_compile tensorrt_llm/_torch/disaggregation/base/agent.py tensorrt_llm/_torch/disaggregation/nixl/_agent_cpp.py tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py`

Not yet run: full C++ build or TRT-LLM runtime tests.

@svc-trtllm-gh-bot added the Community want to contribute (PRs initiated from Community) label on Apr 27, 2026
