[https://nvbugs/5973199][fix] Fall back to SCM_RIGHTS for MNNVL fd exchange when pidfd_getfd is blocked by tensorrt-cicd · Pull Request #13473 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-04-26T10:57:07Z

Summary

Root cause: The MNNVL memory setup uses pidfd_getfd to duplicate file descriptors across peer processes for multi-GPU memory sharing. On certain container environments (e.g., 4xB200 DGX), seccomp profiles block the pidfd_getfd syscall, causing it to fail with EPERM even on a self-pidfd probe. This caused an Executor worker error during AutoDeploy NemotronSuperV3 nvfp4 accuracy tests.
Fix: Extracted the fd duplication logic into a new _dup_remote_fds classmethod that probes pidfd_getfd availability at runtime by attempting a self-pidfd operation. When the syscall is blocked, it falls back to exchanging file descriptors over Unix-domain sockets using SCM_RIGHTS ancillary messages, which is a well-established POSIX mechanism not subject to seccomp restrictions. This avoids the need to modify container security policies while maintaining the fast pidfd_getfd path when available.
Automated fix generated by repair-bot

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/5973199

Summary by CodeRabbit

Bug Fixes
- Improved reliability of multi-process memory operations through enhanced resource management with automatic system compatibility fallback.

…change when pidfd_getfd is blocked The pidfd_getfd syscall may be blocked by the container's seccomp profile, causing MNNVL memory initialization to fail with EPERM. This broke the NemotronSuperV3 NVFP4 accuracy test on B200 GPUs when running with attention DP + expert parallelism. Extract the fd-duplication logic into _dup_remote_fds which probes pidfd_getfd on startup and, when blocked, transparently falls back to exchanging POSIX file descriptors over AF_UNIX sockets using SCM_RIGHTS ancillary data — a mechanism that requires no special container capabilities. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

coderabbitai · 2026-04-26T11:04:20Z

📝 Walkthrough

Walkthrough

Introduces a shared helper method _dup_remote_fds that duplicates remote POSIX file descriptors between MPI ranks with capability-aware control flow: probes pidfd_getfd availability, then executes either a fast path using pidfd syscalls or a fallback Unix-socket FD exchange. The existing open_mnnvl_memory logic is refactored to delegate remote FD duplication to this helper.

Changes

Cohort / File(s)	Summary
Remote FD Duplication Helper `tensorrt_llm/_mnnvl_utils.py`	Added `_dup_remote_fds` class method implementing capability detection (probes `pidfd_getfd` support) with branching logic: fast path using pidfd syscalls for peer handle resolution, or fallback to Unix-socket `SCM_RIGHTS`-based FD exchange. Refactored `open_mnnvl_memory` to invoke the new helper instead of inline pidfd-based duplication.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding a fallback to SCM_RIGHTS when pidfd_getfd is blocked in MNNVL fd exchange.
Description check	✅ Passed	The PR description covers the root cause, the solution, and test plans. However, it lacks specific details on which tests were run and does not address all template sections comprehensively.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_mnnvl_utils.py`:
- Around line 249-251: Rename the local variable named anc to avoid the
codespell false positive: change the tuple assignment from "msg, anc, _, _ =
sock.recvmsg(...)" to use a clearer name like "ancillary_data" and update its
usage in the following loop (e.g., "for lvl, tp, data in ancillary_data:") so
the code behavior is unchanged but the identifier avoids the CI spellcheck
failure; ensure any other references in the surrounding scope are updated to the
new name.
- Around line 184-296: _dup_remote_fds leaks OS FDs: when using the pidfd_getfd
fast path the pidfds opened in the first loop and any intermediate FDs must be
closed after duplicating (append actual duplicated descriptors to remote_fds and
close pidfds), and in the SCM_RIGHTS fallback ensure the received peer_fds are
recorded into remote_fds (not just fds) and close any temporary local
socket-owned FDs (e.g., local_fd copies, server/client sockets) when no longer
needed; update the logic in _dup_remote_fds to append duplicated FDs to
remote_fds for both branches, close all pidfds in pidfds after use, and ensure
any temporary/duplicate FDs created during SCM_RIGHTS exchange are closed to
avoid leaking until EMFILE.
- Around line 256-267: The current code constructs a predictable Unix-socket
path in sock_path ("/tmp/.mnnvl_fd_{os.getpid()}") and binds a server socket
there (server.bind(sock_path)) which is raceable; replace that pattern by
creating a private, mode 0o700 temporary directory (e.g., via tempfile.mkdtemp)
or use an abstract AF_UNIX address (prefix '\0') to generate a randomized,
non-guessable socket path, update sock_path to point into that secure directory
or the abstract name, ensure you still handle os.unlink cleanup (or remove the
temp dir) and keep the existing server.listen(comm_size) and
comm.allgather(sock_path) usage so the barrier logic remains unchanged.
- Line 221: The loop pairing pidfds and all_handles_data uses zip(pidfds,
all_handles_data) which can silently truncate mismatched lengths; change the
iteration in the for loop that unpacks pidfd and fd (referencing the variables
pidfds and all_handles_data in this module/function) to call zip with
strict=True so a length mismatch raises immediately (use zip(..., strict=True)
in the for loop header).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cd961077-fbd5-4192-8069-696de306a154

📥 Commits

Reviewing files that changed from the base of the PR and between eeba2eb and 4e72b71.

📒 Files selected for processing (1)

tensorrt_llm/_mnnvl_utils.py

coderabbitai · 2026-04-26T11:04:23Z

+    def _dup_remote_fds(cls, comm, all_handles_data, pidfds, remote_fds):
+        """Duplicate POSIX file-descriptor handles from peer processes.
+
+        Tries pidfd_getfd first; falls back to Unix-socket SCM_RIGHTS when
+        the syscall is blocked (e.g. by a container seccomp profile).
+        """
+        import socket
+        import struct
+
+        libc = ctypes.CDLL(None, use_errno=True)
+        syscall = libc.syscall
+        syscall.restype = ctypes.c_long
+        SYS_pidfd_open = 434
+        SYS_pidfd_getfd = 438
+
+        # Probe whether pidfd_getfd is usable (may be blocked by seccomp).
+        self_pidfd = syscall(SYS_pidfd_open, os.getpid(), 0)
+        use_pidfd = False
+        if self_pidfd >= 0:
+            probe = syscall(SYS_pidfd_getfd, self_pidfd, self_pidfd, 0)
+            if probe >= 0:
+                os.close(probe)
+                use_pidfd = True
+            os.close(self_pidfd)
+
+        if use_pidfd:
+            # ---- fast path: pidfd_getfd available ----
+            all_pids = comm.allgather(os.getpid())
+            for pid in all_pids:
+                pidfd = syscall(SYS_pidfd_open, pid, 0)
+                if pidfd < 0:
+                    err = ctypes.get_errno()
+                    raise RuntimeError(
+                        f"pidfd_open({pid}) failed with errno {err}: {os.strerror(err)}"
+                    )
+                pidfds.append(pidfd)
+
+            for pidfd, fd in zip(pidfds, all_handles_data):
+                remote_fd = syscall(SYS_pidfd_getfd, pidfd, fd, 0)
+                if remote_fd < 0:
+                    err = ctypes.get_errno()
+                    raise RuntimeError(
+                        f"pidfd_getfd(pidfd={pidfd}, fd={fd}) failed with errno {err}: "
+                        f"{os.strerror(err)}"
+                    )
+                remote_fds.append(remote_fd)
+            return remote_fds
+
+        # ---- fallback: exchange fds via Unix-socket SCM_RIGHTS ----
+        logger.info("pidfd_getfd unavailable; using Unix-socket SCM_RIGHTS for fd exchange")
+        comm_rank = comm.Get_rank()
+        comm_size = comm.Get_size()
+        local_fd = all_handles_data[comm_rank]
+
+        fds = [None] * comm_size
+        fds[comm_rank] = local_fd
+
+        def _send_fd(sock, rank, fd):
+            sock.sendmsg(
+                [struct.pack("i", rank)],
+                [(socket.SOL_SOCKET, socket.SCM_RIGHTS, struct.pack("i", fd))],
+            )
+
+        def _recv_fd(sock):
+            cmsg_space = socket.CMSG_SPACE(struct.calcsize("i"))
+            msg, anc, _, _ = sock.recvmsg(4, cmsg_space)
+            peer_rank = struct.unpack("i", msg[:4])[0]
+            for lvl, tp, data in anc:
+                if lvl == socket.SOL_SOCKET and tp == socket.SCM_RIGHTS:
+                    return peer_rank, struct.unpack("i", data[:4])[0]
+            raise RuntimeError("No SCM_RIGHTS ancillary data received")
+
+        sock_path = f"/tmp/.mnnvl_fd_{os.getpid()}"
+        try:
+            os.unlink(sock_path)
+        except FileNotFoundError:
+            pass
+
+        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        server.bind(sock_path)
+        server.listen(comm_size)
+        # allgather paths acts as a barrier: all servers are listening before
+        # any rank starts connecting.
+        all_paths = comm.allgather(sock_path)
+
+        try:
+            # Connect to all lower-ranked peers (they are the servers).
+            for r in range(comm_rank):
+                client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+                client.connect(all_paths[r])
+                try:
+                    _send_fd(client, comm_rank, local_fd)
+                    _, fds[r] = _recv_fd(client)
+                finally:
+                    client.close()
+
+            # Accept from all higher-ranked peers.
+            for _ in range(comm_size - 1 - comm_rank):
+                conn, _ = server.accept()
+                try:
+                    peer_rank, peer_fd = _recv_fd(conn)
+                    fds[peer_rank] = peer_fd
+                    _send_fd(conn, comm_rank, local_fd)
+                finally:
+                    conn.close()
+        finally:
+            server.close()
+            try:
+                os.unlink(sock_path)
+            except OSError:
+                pass
+
+        return fds


⚠️ Potential issue | 🟠 Major

Close duplicated OS handles on the success path.

_dup_remote_fds() opens pidfds/shareable FDs in both branches, but the caller only closes them on exceptions. In the SCM_RIGHTS path the received FDs are not even recorded in remote_fds, so every successful allocation leaks host FDs until the process eventually hits EMFILE.

Also applies to: 343-348

🧰 Tools

🪛 GitHub Actions: Release Checks

[error] 249-251: pre-commit hook 'codespell' failed (exit code 65). Reported spelling issues: 'anc ==> and' at lines 249 and 251.

🪛 Ruff (0.15.11)

[warning] 221-221: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

[error] 256-256: Probable insecure usage of temporary file or directory: "/tmp/.mnnvl_fd_"

(S108)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_mnnvl_utils.py` around lines 184 - 296, _dup_remote_fds leaks OS FDs: when using the pidfd_getfd fast path the pidfds opened in the first loop and any intermediate FDs must be closed after duplicating (append actual duplicated descriptors to remote_fds and close pidfds), and in the SCM_RIGHTS fallback ensure the received peer_fds are recorded into remote_fds (not just fds) and close any temporary local socket-owned FDs (e.g., local_fd copies, server/client sockets) when no longer needed; update the logic in _dup_remote_fds to append duplicated FDs to remote_fds for both branches, close all pidfds in pidfds after use, and ensure any temporary/duplicate FDs created during SCM_RIGHTS exchange are closed to avoid leaking until EMFILE.

coderabbitai · 2026-04-26T11:04:23Z

+                    )
+                pidfds.append(pidfd)
+
+            for pidfd, fd in zip(pidfds, all_handles_data):


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash sed -n '218,223p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 284

🏁 Script executed:

cd tensorrt_llm && python -m ruff check _mnnvl_utils.py --select B905 2>&1 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 771

🏁 Script executed:

find . -maxdepth 2 -name "ruff.toml" -o -name "pyproject.toml" -o -name ".ruff.toml" | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 127

🏁 Script executed:

sed -n '180,225p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1816

🏁 Script executed:

sed -n '184,190p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 389

🏁 Script executed:

rg "_dup_remote_fds" tensorrt_llm/ -B 2 -A 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 733

🏁 Script executed:

sed -n '145,170p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1209

🏁 Script executed:

sed -n '100,140p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1704

🏁 Script executed:

sed -n '240,280p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1652

🏁 Script executed:

sed -n '155,168p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 696

🏁 Script executed:

sed -n '312,340p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1372

🏁 Script executed:

sed -n '340,360p' tensorrt_llm/_mnnvl_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 962

Add strict=True to zip() call on line 221.

The zip(pidfds, all_handles_data) call should enforce length matching with strict=True, as both sequences are derived from comm.allgather() and must always have equal length. Plain zip() will silently truncate on mismatch, masking a critical bug. Ruff already flags this with B905.

🧰 Tools

🪛 Ruff (0.15.11)

[warning] 221-221: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_mnnvl_utils.py` at line 221, The loop pairing pidfds and all_handles_data uses zip(pidfds, all_handles_data) which can silently truncate mismatched lengths; change the iteration in the for loop that unpacks pidfd and fd (referencing the variables pidfds and all_handles_data in this module/function) to call zip with strict=True so a length mismatch raises immediately (use zip(..., strict=True) in the for loop header).

coderabbitai · 2026-04-26T11:04:23Z

+            msg, anc, _, _ = sock.recvmsg(4, cmsg_space)
+            peer_rank = struct.unpack("i", msg[:4])[0]
+            for lvl, tp, data in anc:


⚠️ Potential issue | 🟡 Minor

Rename anc to unblock codespell.

CI is already failing on this local name. Renaming it to something like ancillary_data avoids the false positive without changing behavior.

🧰 Tools

🪛 GitHub Actions: Release Checks

[error] 249-251: pre-commit hook 'codespell' failed (exit code 65). Reported spelling issues: 'anc ==> and' at lines 249 and 251.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_mnnvl_utils.py` around lines 249 - 251, Rename the local variable named anc to avoid the codespell false positive: change the tuple assignment from "msg, anc, _, _ = sock.recvmsg(...)" to use a clearer name like "ancillary_data" and update its usage in the following loop (e.g., "for lvl, tp, data in ancillary_data:") so the code behavior is unchanged but the identifier avoids the CI spellcheck failure; ensure any other references in the surrounding scope are updated to the new name.

coderabbitai · 2026-04-26T11:04:23Z

+        sock_path = f"/tmp/.mnnvl_fd_{os.getpid()}"
+        try:
+            os.unlink(sock_path)
+        except FileNotFoundError:
+            pass
+
+        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        server.bind(sock_path)
+        server.listen(comm_size)
+        # allgather paths acts as a barrier: all servers are listening before
+        # any rank starts connecting.
+        all_paths = comm.allgather(sock_path)


⚠️ Potential issue | 🟠 Major

Avoid a predictable Unix-socket path in /tmp.

/tmp/.mnnvl_fd_{pid} is guessable and shared. Another local process can race this path and interfere with the FD exchange. Please bind inside a private 0700 temp directory or use a randomized/abstract AF_UNIX address instead.

🧰 Tools

🪛 Ruff (0.15.11)

[error] 256-256: Probable insecure usage of temporary file or directory: "/tmp/.mnnvl_fd_"

(S108)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_mnnvl_utils.py` around lines 256 - 267, The current code constructs a predictable Unix-socket path in sock_path ("/tmp/.mnnvl_fd_{os.getpid()}") and binds a server socket there (server.bind(sock_path)) which is raceable; replace that pattern by creating a private, mode 0o700 temporary directory (e.g., via tempfile.mkdtemp) or use an abstract AF_UNIX address (prefix '\0') to generate a randomized, non-guessable socket path, update sock_path to point into that secure directory or the abstract name, ensure you still handle os.unlink cleanup (or remove the temp dir) and keep the existing server.listen(comm_size) and comm.allgather(sock_path) usage so the barrier logic remains unchanged.

github-actions Bot assigned tensorrt-cicd Apr 26, 2026

coderabbitai Bot reviewed Apr 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[https://nvbugs/5973199][fix] Fall back to SCM_RIGHTS for MNNVL fd exchange when pidfd_getfd is blocked#13473

[https://nvbugs/5973199][fix] Fall back to SCM_RIGHTS for MNNVL fd exchange when pidfd_getfd is blocked#13473
tensorrt-cicd wants to merge 1 commit intoNVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug5973199

tensorrt-cicd commented Apr 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 26, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 26, 2026

Uh oh!

coderabbitai Bot Apr 26, 2026

Uh oh!

coderabbitai Bot Apr 26, 2026

Uh oh!

coderabbitai Bot Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tensorrt-cicd commented Apr 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Links

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 26, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tensorrt-cicd commented Apr 26, 2026 •

edited by coderabbitai Bot

Loading