Skip to content

[https://nvbugs/5973199][fix] Fall back to SCM_RIGHTS for MNNVL fd exchange when pidfd_getfd is blocked#13473

Open
tensorrt-cicd wants to merge 1 commit intoNVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug5973199
Open

[https://nvbugs/5973199][fix] Fall back to SCM_RIGHTS for MNNVL fd exchange when pidfd_getfd is blocked#13473
tensorrt-cicd wants to merge 1 commit intoNVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug5973199

Conversation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

@tensorrt-cicd tensorrt-cicd commented Apr 26, 2026

Summary

  • Root cause: The MNNVL memory setup uses pidfd_getfd to duplicate file descriptors across peer processes for multi-GPU memory sharing. On certain container environments (e.g., 4xB200 DGX), seccomp profiles block the pidfd_getfd syscall, causing it to fail with EPERM even on a self-pidfd probe. This caused an Executor worker error during AutoDeploy NemotronSuperV3 nvfp4 accuracy tests.
  • Fix: Extracted the fd duplication logic into a new _dup_remote_fds classmethod that probes pidfd_getfd availability at runtime by attempting a self-pidfd operation. When the syscall is blocked, it falls back to exchanging file descriptors over Unix-domain sockets using SCM_RIGHTS ancillary messages, which is a well-established POSIX mechanism not subject to seccomp restrictions. This avoids the need to modify container security policies while maintaining the fast pidfd_getfd path when available.
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of multi-process memory operations through enhanced resource management with automatic system compatibility fallback.

…change when pidfd_getfd is blocked

The pidfd_getfd syscall may be blocked by the container's seccomp
profile, causing MNNVL memory initialization to fail with EPERM.
This broke the NemotronSuperV3 NVFP4 accuracy test on B200 GPUs
when running with attention DP + expert parallelism.

Extract the fd-duplication logic into _dup_remote_fds which probes
pidfd_getfd on startup and, when blocked, transparently falls back
to exchanging POSIX file descriptors over AF_UNIX sockets using
SCM_RIGHTS ancillary data — a mechanism that requires no special
container capabilities.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 26, 2026

📝 Walkthrough

Walkthrough

Introduces a shared helper method _dup_remote_fds that duplicates remote POSIX file descriptors between MPI ranks with capability-aware control flow: probes pidfd_getfd availability, then executes either a fast path using pidfd syscalls or a fallback Unix-socket FD exchange. The existing open_mnnvl_memory logic is refactored to delegate remote FD duplication to this helper.

Changes

Cohort / File(s) Summary
Remote FD Duplication Helper
tensorrt_llm/_mnnvl_utils.py
Added _dup_remote_fds class method implementing capability detection (probes pidfd_getfd support) with branching logic: fast path using pidfd syscalls for peer handle resolution, or fallback to Unix-socket SCM_RIGHTS-based FD exchange. Refactored open_mnnvl_memory to invoke the new helper instead of inline pidfd-based duplication.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding a fallback to SCM_RIGHTS when pidfd_getfd is blocked in MNNVL fd exchange.
Description check ✅ Passed The PR description covers the root cause, the solution, and test plans. However, it lacks specific details on which tests were run and does not address all template sections comprehensively.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_mnnvl_utils.py`:
- Around line 249-251: Rename the local variable named anc to avoid the
codespell false positive: change the tuple assignment from "msg, anc, _, _ =
sock.recvmsg(...)" to use a clearer name like "ancillary_data" and update its
usage in the following loop (e.g., "for lvl, tp, data in ancillary_data:") so
the code behavior is unchanged but the identifier avoids the CI spellcheck
failure; ensure any other references in the surrounding scope are updated to the
new name.
- Around line 184-296: _dup_remote_fds leaks OS FDs: when using the pidfd_getfd
fast path the pidfds opened in the first loop and any intermediate FDs must be
closed after duplicating (append actual duplicated descriptors to remote_fds and
close pidfds), and in the SCM_RIGHTS fallback ensure the received peer_fds are
recorded into remote_fds (not just fds) and close any temporary local
socket-owned FDs (e.g., local_fd copies, server/client sockets) when no longer
needed; update the logic in _dup_remote_fds to append duplicated FDs to
remote_fds for both branches, close all pidfds in pidfds after use, and ensure
any temporary/duplicate FDs created during SCM_RIGHTS exchange are closed to
avoid leaking until EMFILE.
- Around line 256-267: The current code constructs a predictable Unix-socket
path in sock_path ("/tmp/.mnnvl_fd_{os.getpid()}") and binds a server socket
there (server.bind(sock_path)) which is raceable; replace that pattern by
creating a private, mode 0o700 temporary directory (e.g., via tempfile.mkdtemp)
or use an abstract AF_UNIX address (prefix '\0') to generate a randomized,
non-guessable socket path, update sock_path to point into that secure directory
or the abstract name, ensure you still handle os.unlink cleanup (or remove the
temp dir) and keep the existing server.listen(comm_size) and
comm.allgather(sock_path) usage so the barrier logic remains unchanged.
- Line 221: The loop pairing pidfds and all_handles_data uses zip(pidfds,
all_handles_data) which can silently truncate mismatched lengths; change the
iteration in the for loop that unpacks pidfd and fd (referencing the variables
pidfds and all_handles_data in this module/function) to call zip with
strict=True so a length mismatch raises immediately (use zip(..., strict=True)
in the for loop header).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cd961077-fbd5-4192-8069-696de306a154

📥 Commits

Reviewing files that changed from the base of the PR and between eeba2eb and 4e72b71.

📒 Files selected for processing (1)
  • tensorrt_llm/_mnnvl_utils.py

Comment on lines +184 to +296
def _dup_remote_fds(cls, comm, all_handles_data, pidfds, remote_fds):
"""Duplicate POSIX file-descriptor handles from peer processes.

Tries pidfd_getfd first; falls back to Unix-socket SCM_RIGHTS when
the syscall is blocked (e.g. by a container seccomp profile).
"""
import socket
import struct

libc = ctypes.CDLL(None, use_errno=True)
syscall = libc.syscall
syscall.restype = ctypes.c_long
SYS_pidfd_open = 434
SYS_pidfd_getfd = 438

# Probe whether pidfd_getfd is usable (may be blocked by seccomp).
self_pidfd = syscall(SYS_pidfd_open, os.getpid(), 0)
use_pidfd = False
if self_pidfd >= 0:
probe = syscall(SYS_pidfd_getfd, self_pidfd, self_pidfd, 0)
if probe >= 0:
os.close(probe)
use_pidfd = True
os.close(self_pidfd)

if use_pidfd:
# ---- fast path: pidfd_getfd available ----
all_pids = comm.allgather(os.getpid())
for pid in all_pids:
pidfd = syscall(SYS_pidfd_open, pid, 0)
if pidfd < 0:
err = ctypes.get_errno()
raise RuntimeError(
f"pidfd_open({pid}) failed with errno {err}: {os.strerror(err)}"
)
pidfds.append(pidfd)

for pidfd, fd in zip(pidfds, all_handles_data):
remote_fd = syscall(SYS_pidfd_getfd, pidfd, fd, 0)
if remote_fd < 0:
err = ctypes.get_errno()
raise RuntimeError(
f"pidfd_getfd(pidfd={pidfd}, fd={fd}) failed with errno {err}: "
f"{os.strerror(err)}"
)
remote_fds.append(remote_fd)
return remote_fds

# ---- fallback: exchange fds via Unix-socket SCM_RIGHTS ----
logger.info("pidfd_getfd unavailable; using Unix-socket SCM_RIGHTS for fd exchange")
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()
local_fd = all_handles_data[comm_rank]

fds = [None] * comm_size
fds[comm_rank] = local_fd

def _send_fd(sock, rank, fd):
sock.sendmsg(
[struct.pack("i", rank)],
[(socket.SOL_SOCKET, socket.SCM_RIGHTS, struct.pack("i", fd))],
)

def _recv_fd(sock):
cmsg_space = socket.CMSG_SPACE(struct.calcsize("i"))
msg, anc, _, _ = sock.recvmsg(4, cmsg_space)
peer_rank = struct.unpack("i", msg[:4])[0]
for lvl, tp, data in anc:
if lvl == socket.SOL_SOCKET and tp == socket.SCM_RIGHTS:
return peer_rank, struct.unpack("i", data[:4])[0]
raise RuntimeError("No SCM_RIGHTS ancillary data received")

sock_path = f"/tmp/.mnnvl_fd_{os.getpid()}"
try:
os.unlink(sock_path)
except FileNotFoundError:
pass

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)
server.listen(comm_size)
# allgather paths acts as a barrier: all servers are listening before
# any rank starts connecting.
all_paths = comm.allgather(sock_path)

try:
# Connect to all lower-ranked peers (they are the servers).
for r in range(comm_rank):
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(all_paths[r])
try:
_send_fd(client, comm_rank, local_fd)
_, fds[r] = _recv_fd(client)
finally:
client.close()

# Accept from all higher-ranked peers.
for _ in range(comm_size - 1 - comm_rank):
conn, _ = server.accept()
try:
peer_rank, peer_fd = _recv_fd(conn)
fds[peer_rank] = peer_fd
_send_fd(conn, comm_rank, local_fd)
finally:
conn.close()
finally:
server.close()
try:
os.unlink(sock_path)
except OSError:
pass

return fds
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Close duplicated OS handles on the success path.

_dup_remote_fds() opens pidfds/shareable FDs in both branches, but the caller only closes them on exceptions. In the SCM_RIGHTS path the received FDs are not even recorded in remote_fds, so every successful allocation leaks host FDs until the process eventually hits EMFILE.

Also applies to: 343-348

🧰 Tools
🪛 GitHub Actions: Release Checks

[error] 249-251: pre-commit hook 'codespell' failed (exit code 65). Reported spelling issues: 'anc ==> and' at lines 249 and 251.

🪛 Ruff (0.15.11)

[warning] 221-221: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


[error] 256-256: Probable insecure usage of temporary file or directory: "/tmp/.mnnvl_fd_"

(S108)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_mnnvl_utils.py` around lines 184 - 296, _dup_remote_fds leaks
OS FDs: when using the pidfd_getfd fast path the pidfds opened in the first loop
and any intermediate FDs must be closed after duplicating (append actual
duplicated descriptors to remote_fds and close pidfds), and in the SCM_RIGHTS
fallback ensure the received peer_fds are recorded into remote_fds (not just
fds) and close any temporary local socket-owned FDs (e.g., local_fd copies,
server/client sockets) when no longer needed; update the logic in
_dup_remote_fds to append duplicated FDs to remote_fds for both branches, close
all pidfds in pidfds after use, and ensure any temporary/duplicate FDs created
during SCM_RIGHTS exchange are closed to avoid leaking until EMFILE.

)
pidfds.append(pidfd)

for pidfd, fd in zip(pidfds, all_handles_data):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
sed -n '218,223p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 284


🏁 Script executed:

cd tensorrt_llm && python -m ruff check _mnnvl_utils.py --select B905 2>&1 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 771


🏁 Script executed:

find . -maxdepth 2 -name "ruff.toml" -o -name "pyproject.toml" -o -name ".ruff.toml" | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 127


🏁 Script executed:

sed -n '180,225p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1816


🏁 Script executed:

sed -n '184,190p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 389


🏁 Script executed:

rg "_dup_remote_fds" tensorrt_llm/ -B 2 -A 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 733


🏁 Script executed:

sed -n '145,170p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1209


🏁 Script executed:

sed -n '100,140p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1704


🏁 Script executed:

sed -n '240,280p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1652


🏁 Script executed:

sed -n '155,168p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 696


🏁 Script executed:

sed -n '312,340p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1372


🏁 Script executed:

sed -n '340,360p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 962


Add strict=True to zip() call on line 221.

The zip(pidfds, all_handles_data) call should enforce length matching with strict=True, as both sequences are derived from comm.allgather() and must always have equal length. Plain zip() will silently truncate on mismatch, masking a critical bug. Ruff already flags this with B905.

🧰 Tools
🪛 Ruff (0.15.11)

[warning] 221-221: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_mnnvl_utils.py` at line 221, The loop pairing pidfds and
all_handles_data uses zip(pidfds, all_handles_data) which can silently truncate
mismatched lengths; change the iteration in the for loop that unpacks pidfd and
fd (referencing the variables pidfds and all_handles_data in this
module/function) to call zip with strict=True so a length mismatch raises
immediately (use zip(..., strict=True) in the for loop header).

Comment on lines +249 to +251
msg, anc, _, _ = sock.recvmsg(4, cmsg_space)
peer_rank = struct.unpack("i", msg[:4])[0]
for lvl, tp, data in anc:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Rename anc to unblock codespell.

CI is already failing on this local name. Renaming it to something like ancillary_data avoids the false positive without changing behavior.

🧰 Tools
🪛 GitHub Actions: Release Checks

[error] 249-251: pre-commit hook 'codespell' failed (exit code 65). Reported spelling issues: 'anc ==> and' at lines 249 and 251.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_mnnvl_utils.py` around lines 249 - 251, Rename the local
variable named anc to avoid the codespell false positive: change the tuple
assignment from "msg, anc, _, _ = sock.recvmsg(...)" to use a clearer name like
"ancillary_data" and update its usage in the following loop (e.g., "for lvl, tp,
data in ancillary_data:") so the code behavior is unchanged but the identifier
avoids the CI spellcheck failure; ensure any other references in the surrounding
scope are updated to the new name.

Comment on lines +256 to +267
sock_path = f"/tmp/.mnnvl_fd_{os.getpid()}"
try:
os.unlink(sock_path)
except FileNotFoundError:
pass

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)
server.listen(comm_size)
# allgather paths acts as a barrier: all servers are listening before
# any rank starts connecting.
all_paths = comm.allgather(sock_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid a predictable Unix-socket path in /tmp.

/tmp/.mnnvl_fd_{pid} is guessable and shared. Another local process can race this path and interfere with the FD exchange. Please bind inside a private 0700 temp directory or use a randomized/abstract AF_UNIX address instead.

🧰 Tools
🪛 Ruff (0.15.11)

[error] 256-256: Probable insecure usage of temporary file or directory: "/tmp/.mnnvl_fd_"

(S108)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_mnnvl_utils.py` around lines 256 - 267, The current code
constructs a predictable Unix-socket path in sock_path
("/tmp/.mnnvl_fd_{os.getpid()}") and binds a server socket there
(server.bind(sock_path)) which is raceable; replace that pattern by creating a
private, mode 0o700 temporary directory (e.g., via tempfile.mkdtemp) or use an
abstract AF_UNIX address (prefix '\0') to generate a randomized, non-guessable
socket path, update sock_path to point into that secure directory or the
abstract name, ensure you still handle os.unlink cleanup (or remove the temp
dir) and keep the existing server.listen(comm_size) and
comm.allgather(sock_path) usage so the barrier logic remains unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant