[https://nvbugs/5973199][fix] Fall back to SCM_RIGHTS for MNNVL fd exchange when pidfd_getfd is blocked#13473
Conversation
…change when pidfd_getfd is blocked The pidfd_getfd syscall may be blocked by the container's seccomp profile, causing MNNVL memory initialization to fail with EPERM. This broke the NemotronSuperV3 NVFP4 accuracy test on B200 GPUs when running with attention DP + expert parallelism. Extract the fd-duplication logic into _dup_remote_fds which probes pidfd_getfd on startup and, when blocked, transparently falls back to exchanging POSIX file descriptors over AF_UNIX sockets using SCM_RIGHTS ancillary data — a mechanism that requires no special container capabilities. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
📝 WalkthroughWalkthroughIntroduces a shared helper method Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_mnnvl_utils.py`:
- Around line 249-251: Rename the local variable named anc to avoid the
codespell false positive: change the tuple assignment from "msg, anc, _, _ =
sock.recvmsg(...)" to use a clearer name like "ancillary_data" and update its
usage in the following loop (e.g., "for lvl, tp, data in ancillary_data:") so
the code behavior is unchanged but the identifier avoids the CI spellcheck
failure; ensure any other references in the surrounding scope are updated to the
new name.
- Around line 184-296: _dup_remote_fds leaks OS FDs: when using the pidfd_getfd
fast path the pidfds opened in the first loop and any intermediate FDs must be
closed after duplicating (append actual duplicated descriptors to remote_fds and
close pidfds), and in the SCM_RIGHTS fallback ensure the received peer_fds are
recorded into remote_fds (not just fds) and close any temporary local
socket-owned FDs (e.g., local_fd copies, server/client sockets) when no longer
needed; update the logic in _dup_remote_fds to append duplicated FDs to
remote_fds for both branches, close all pidfds in pidfds after use, and ensure
any temporary/duplicate FDs created during SCM_RIGHTS exchange are closed to
avoid leaking until EMFILE.
- Around line 256-267: The current code constructs a predictable Unix-socket
path in sock_path ("/tmp/.mnnvl_fd_{os.getpid()}") and binds a server socket
there (server.bind(sock_path)) which is raceable; replace that pattern by
creating a private, mode 0o700 temporary directory (e.g., via tempfile.mkdtemp)
or use an abstract AF_UNIX address (prefix '\0') to generate a randomized,
non-guessable socket path, update sock_path to point into that secure directory
or the abstract name, ensure you still handle os.unlink cleanup (or remove the
temp dir) and keep the existing server.listen(comm_size) and
comm.allgather(sock_path) usage so the barrier logic remains unchanged.
- Line 221: The loop pairing pidfds and all_handles_data uses zip(pidfds,
all_handles_data) which can silently truncate mismatched lengths; change the
iteration in the for loop that unpacks pidfd and fd (referencing the variables
pidfds and all_handles_data in this module/function) to call zip with
strict=True so a length mismatch raises immediately (use zip(..., strict=True)
in the for loop header).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cd961077-fbd5-4192-8069-696de306a154
📒 Files selected for processing (1)
tensorrt_llm/_mnnvl_utils.py
| def _dup_remote_fds(cls, comm, all_handles_data, pidfds, remote_fds): | ||
| """Duplicate POSIX file-descriptor handles from peer processes. | ||
|
|
||
| Tries pidfd_getfd first; falls back to Unix-socket SCM_RIGHTS when | ||
| the syscall is blocked (e.g. by a container seccomp profile). | ||
| """ | ||
| import socket | ||
| import struct | ||
|
|
||
| libc = ctypes.CDLL(None, use_errno=True) | ||
| syscall = libc.syscall | ||
| syscall.restype = ctypes.c_long | ||
| SYS_pidfd_open = 434 | ||
| SYS_pidfd_getfd = 438 | ||
|
|
||
| # Probe whether pidfd_getfd is usable (may be blocked by seccomp). | ||
| self_pidfd = syscall(SYS_pidfd_open, os.getpid(), 0) | ||
| use_pidfd = False | ||
| if self_pidfd >= 0: | ||
| probe = syscall(SYS_pidfd_getfd, self_pidfd, self_pidfd, 0) | ||
| if probe >= 0: | ||
| os.close(probe) | ||
| use_pidfd = True | ||
| os.close(self_pidfd) | ||
|
|
||
| if use_pidfd: | ||
| # ---- fast path: pidfd_getfd available ---- | ||
| all_pids = comm.allgather(os.getpid()) | ||
| for pid in all_pids: | ||
| pidfd = syscall(SYS_pidfd_open, pid, 0) | ||
| if pidfd < 0: | ||
| err = ctypes.get_errno() | ||
| raise RuntimeError( | ||
| f"pidfd_open({pid}) failed with errno {err}: {os.strerror(err)}" | ||
| ) | ||
| pidfds.append(pidfd) | ||
|
|
||
| for pidfd, fd in zip(pidfds, all_handles_data): | ||
| remote_fd = syscall(SYS_pidfd_getfd, pidfd, fd, 0) | ||
| if remote_fd < 0: | ||
| err = ctypes.get_errno() | ||
| raise RuntimeError( | ||
| f"pidfd_getfd(pidfd={pidfd}, fd={fd}) failed with errno {err}: " | ||
| f"{os.strerror(err)}" | ||
| ) | ||
| remote_fds.append(remote_fd) | ||
| return remote_fds | ||
|
|
||
| # ---- fallback: exchange fds via Unix-socket SCM_RIGHTS ---- | ||
| logger.info("pidfd_getfd unavailable; using Unix-socket SCM_RIGHTS for fd exchange") | ||
| comm_rank = comm.Get_rank() | ||
| comm_size = comm.Get_size() | ||
| local_fd = all_handles_data[comm_rank] | ||
|
|
||
| fds = [None] * comm_size | ||
| fds[comm_rank] = local_fd | ||
|
|
||
| def _send_fd(sock, rank, fd): | ||
| sock.sendmsg( | ||
| [struct.pack("i", rank)], | ||
| [(socket.SOL_SOCKET, socket.SCM_RIGHTS, struct.pack("i", fd))], | ||
| ) | ||
|
|
||
| def _recv_fd(sock): | ||
| cmsg_space = socket.CMSG_SPACE(struct.calcsize("i")) | ||
| msg, anc, _, _ = sock.recvmsg(4, cmsg_space) | ||
| peer_rank = struct.unpack("i", msg[:4])[0] | ||
| for lvl, tp, data in anc: | ||
| if lvl == socket.SOL_SOCKET and tp == socket.SCM_RIGHTS: | ||
| return peer_rank, struct.unpack("i", data[:4])[0] | ||
| raise RuntimeError("No SCM_RIGHTS ancillary data received") | ||
|
|
||
| sock_path = f"/tmp/.mnnvl_fd_{os.getpid()}" | ||
| try: | ||
| os.unlink(sock_path) | ||
| except FileNotFoundError: | ||
| pass | ||
|
|
||
| server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) | ||
| server.bind(sock_path) | ||
| server.listen(comm_size) | ||
| # allgather paths acts as a barrier: all servers are listening before | ||
| # any rank starts connecting. | ||
| all_paths = comm.allgather(sock_path) | ||
|
|
||
| try: | ||
| # Connect to all lower-ranked peers (they are the servers). | ||
| for r in range(comm_rank): | ||
| client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) | ||
| client.connect(all_paths[r]) | ||
| try: | ||
| _send_fd(client, comm_rank, local_fd) | ||
| _, fds[r] = _recv_fd(client) | ||
| finally: | ||
| client.close() | ||
|
|
||
| # Accept from all higher-ranked peers. | ||
| for _ in range(comm_size - 1 - comm_rank): | ||
| conn, _ = server.accept() | ||
| try: | ||
| peer_rank, peer_fd = _recv_fd(conn) | ||
| fds[peer_rank] = peer_fd | ||
| _send_fd(conn, comm_rank, local_fd) | ||
| finally: | ||
| conn.close() | ||
| finally: | ||
| server.close() | ||
| try: | ||
| os.unlink(sock_path) | ||
| except OSError: | ||
| pass | ||
|
|
||
| return fds |
There was a problem hiding this comment.
Close duplicated OS handles on the success path.
_dup_remote_fds() opens pidfds/shareable FDs in both branches, but the caller only closes them on exceptions. In the SCM_RIGHTS path the received FDs are not even recorded in remote_fds, so every successful allocation leaks host FDs until the process eventually hits EMFILE.
Also applies to: 343-348
🧰 Tools
🪛 GitHub Actions: Release Checks
[error] 249-251: pre-commit hook 'codespell' failed (exit code 65). Reported spelling issues: 'anc ==> and' at lines 249 and 251.
🪛 Ruff (0.15.11)
[warning] 221-221: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
[error] 256-256: Probable insecure usage of temporary file or directory: "/tmp/.mnnvl_fd_"
(S108)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_mnnvl_utils.py` around lines 184 - 296, _dup_remote_fds leaks
OS FDs: when using the pidfd_getfd fast path the pidfds opened in the first loop
and any intermediate FDs must be closed after duplicating (append actual
duplicated descriptors to remote_fds and close pidfds), and in the SCM_RIGHTS
fallback ensure the received peer_fds are recorded into remote_fds (not just
fds) and close any temporary local socket-owned FDs (e.g., local_fd copies,
server/client sockets) when no longer needed; update the logic in
_dup_remote_fds to append duplicated FDs to remote_fds for both branches, close
all pidfds in pidfds after use, and ensure any temporary/duplicate FDs created
during SCM_RIGHTS exchange are closed to avoid leaking until EMFILE.
| ) | ||
| pidfds.append(pidfd) | ||
|
|
||
| for pidfd, fd in zip(pidfds, all_handles_data): |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
sed -n '218,223p' tensorrt_llm/_mnnvl_utils.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 284
🏁 Script executed:
cd tensorrt_llm && python -m ruff check _mnnvl_utils.py --select B905 2>&1 | head -50Repository: NVIDIA/TensorRT-LLM
Length of output: 771
🏁 Script executed:
find . -maxdepth 2 -name "ruff.toml" -o -name "pyproject.toml" -o -name ".ruff.toml" | head -5Repository: NVIDIA/TensorRT-LLM
Length of output: 127
🏁 Script executed:
sed -n '180,225p' tensorrt_llm/_mnnvl_utils.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1816
🏁 Script executed:
sed -n '184,190p' tensorrt_llm/_mnnvl_utils.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 389
🏁 Script executed:
rg "_dup_remote_fds" tensorrt_llm/ -B 2 -A 2Repository: NVIDIA/TensorRT-LLM
Length of output: 733
🏁 Script executed:
sed -n '145,170p' tensorrt_llm/_mnnvl_utils.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1209
🏁 Script executed:
sed -n '100,140p' tensorrt_llm/_mnnvl_utils.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1704
🏁 Script executed:
sed -n '240,280p' tensorrt_llm/_mnnvl_utils.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1652
🏁 Script executed:
sed -n '155,168p' tensorrt_llm/_mnnvl_utils.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 696
🏁 Script executed:
sed -n '312,340p' tensorrt_llm/_mnnvl_utils.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1372
🏁 Script executed:
sed -n '340,360p' tensorrt_llm/_mnnvl_utils.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 962
Add strict=True to zip() call on line 221.
The zip(pidfds, all_handles_data) call should enforce length matching with strict=True, as both sequences are derived from comm.allgather() and must always have equal length. Plain zip() will silently truncate on mismatch, masking a critical bug. Ruff already flags this with B905.
🧰 Tools
🪛 Ruff (0.15.11)
[warning] 221-221: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_mnnvl_utils.py` at line 221, The loop pairing pidfds and
all_handles_data uses zip(pidfds, all_handles_data) which can silently truncate
mismatched lengths; change the iteration in the for loop that unpacks pidfd and
fd (referencing the variables pidfds and all_handles_data in this
module/function) to call zip with strict=True so a length mismatch raises
immediately (use zip(..., strict=True) in the for loop header).
| msg, anc, _, _ = sock.recvmsg(4, cmsg_space) | ||
| peer_rank = struct.unpack("i", msg[:4])[0] | ||
| for lvl, tp, data in anc: |
There was a problem hiding this comment.
Rename anc to unblock codespell.
CI is already failing on this local name. Renaming it to something like ancillary_data avoids the false positive without changing behavior.
🧰 Tools
🪛 GitHub Actions: Release Checks
[error] 249-251: pre-commit hook 'codespell' failed (exit code 65). Reported spelling issues: 'anc ==> and' at lines 249 and 251.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_mnnvl_utils.py` around lines 249 - 251, Rename the local
variable named anc to avoid the codespell false positive: change the tuple
assignment from "msg, anc, _, _ = sock.recvmsg(...)" to use a clearer name like
"ancillary_data" and update its usage in the following loop (e.g., "for lvl, tp,
data in ancillary_data:") so the code behavior is unchanged but the identifier
avoids the CI spellcheck failure; ensure any other references in the surrounding
scope are updated to the new name.
| sock_path = f"/tmp/.mnnvl_fd_{os.getpid()}" | ||
| try: | ||
| os.unlink(sock_path) | ||
| except FileNotFoundError: | ||
| pass | ||
|
|
||
| server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) | ||
| server.bind(sock_path) | ||
| server.listen(comm_size) | ||
| # allgather paths acts as a barrier: all servers are listening before | ||
| # any rank starts connecting. | ||
| all_paths = comm.allgather(sock_path) |
There was a problem hiding this comment.
Avoid a predictable Unix-socket path in /tmp.
/tmp/.mnnvl_fd_{pid} is guessable and shared. Another local process can race this path and interfere with the FD exchange. Please bind inside a private 0700 temp directory or use a randomized/abstract AF_UNIX address instead.
🧰 Tools
🪛 Ruff (0.15.11)
[error] 256-256: Probable insecure usage of temporary file or directory: "/tmp/.mnnvl_fd_"
(S108)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_mnnvl_utils.py` around lines 256 - 267, The current code
constructs a predictable Unix-socket path in sock_path
("/tmp/.mnnvl_fd_{os.getpid()}") and binds a server socket there
(server.bind(sock_path)) which is raceable; replace that pattern by creating a
private, mode 0o700 temporary directory (e.g., via tempfile.mkdtemp) or use an
abstract AF_UNIX address (prefix '\0') to generate a randomized, non-guessable
socket path, update sock_path to point into that secure directory or the
abstract name, ensure you still handle os.unlink cleanup (or remove the temp
dir) and keep the existing server.listen(comm_size) and
comm.allgather(sock_path) usage so the barrier logic remains unchanged.
Summary
pidfd_getfdto duplicate file descriptors across peer processes for multi-GPU memory sharing. On certain container environments (e.g., 4xB200 DGX), seccomp profiles block thepidfd_getfdsyscall, causing it to fail with EPERM even on a self-pidfd probe. This caused an Executor worker error during AutoDeploy NemotronSuperV3 nvfp4 accuracy tests._dup_remote_fdsclassmethod that probespidfd_getfdavailability at runtime by attempting a self-pidfd operation. When the syscall is blocked, it falls back to exchanging file descriptors over Unix-domain sockets usingSCM_RIGHTSancillary messages, which is a well-established POSIX mechanism not subject to seccomp restrictions. This avoids the need to modify container security policies while maintaining the fastpidfd_getfdpath when available.Test plan
Links
Summary by CodeRabbit