Description
I am building a Kata Containers environment on a host with 8 GPUs. When a pod is created with 1 to 5 GPUs, nvidia-container-cli runs normally, but it fails when 6 GPUs are used. After tracing the failure, the main cause appears to be that the code calls ns_enter, which switches into the rootfs of the container, and mount_procfs then cannot find the corresponding directory. The environment version information is as follows:
nvidia-container-toolkit version
root@ubuntu-dev:/# dpkg -l | grep nvidia-container-toolkit
ii nvidia-container-toolkit 1.14.3-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.14.3-1 amd64 NVIDIA Container Toolkit Base
The nvidia-container-cli log output is as follows:
`
-- WARNING, the following logs are for debugging purposes only --
I0111 10:30:12.940268 149 nvc.c:376] initializing library context (version=1.14.1, build=1eb5a30a6ad0415550a9df632ac8832bf7e2bbba)
I0111 10:30:12.940332 149 nvc.c:350] using root /
I0111 10:30:12.940334 149 nvc.c:351] using ldcache /etc/ld.so.cache
I0111 10:30:12.940336 149 nvc.c:352] using unprivileged user 65534:65534
I0111 10:30:12.940347 149 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0111 10:30:12.940484 149 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0111 10:30:12.943220 179 nvc.c:278] loading kernel module nvidia
I0111 10:30:12.943338 179 nvc.c:282] running mknod for /dev/nvidiactl
I0111 10:30:12.943362 179 nvc.c:286] running mknod for /dev/nvidia0
I0111 10:30:12.943374 179 nvc.c:286] running mknod for /dev/nvidia1
I0111 10:30:12.943382 179 nvc.c:286] running mknod for /dev/nvidia2
I0111 10:30:12.943390 179 nvc.c:286] running mknod for /dev/nvidia3
I0111 10:30:12.943399 179 nvc.c:286] running mknod for /dev/nvidia4
I0111 10:30:12.943407 179 nvc.c:286] running mknod for /dev/nvidia5
I0111 10:30:12.943415 179 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0111 10:30:12.947452 179 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0111 10:30:12.947504 179 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0111 10:30:12.949693 179 nvc.c:296] loading kernel module nvidia_uvm
I0111 10:30:12.949702 179 nvc.c:300] running mknod for /dev/nvidia-uvm
I0111 10:30:12.949735 179 nvc.c:305] loading kernel module nvidia_modeset
I0111 10:30:12.955489 179 nvc.c:309] running mknod for /dev/nvidia-modeset
I0111 10:30:12.955701 183 rpc.c:71] starting driver rpc service
I0111 10:30:19.067427 229 rpc.c:71] starting nvcgo rpc service
I0111 10:30:19.072896 149 nvc_container.c:246] configuring container with 'compute utility supervised'
I0111 10:30:19.077084 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libcuda.so.545.23.08
I0111 10:30:19.077456 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libcudadebugger.so.545.23.08
I0111 10:30:19.077803 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libnvidia-nvvm.so.545.23.08
I0111 10:30:19.078148 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libnvidia-ptxjitcompiler.so.545.23.08
I0111 10:30:19.079973 149 nvc_container.c:268] setting pid to 147
I0111 10:30:19.080003 149 nvc_container.c:269] setting rootfs to /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs
I0111 10:30:19.080011 149 nvc_container.c:270] setting owner to 0:0
I0111 10:30:19.080018 149 nvc_container.c:271] setting bins directory to /usr/bin
I0111 10:30:19.080038 149 nvc_container.c:272] setting libs directory to /usr/lib/x86_64-linux-gnu
I0111 10:30:19.080045 149 nvc_container.c:273] setting libs32 directory to /usr/lib/i386-linux-gnu
I0111 10:30:19.080052 149 nvc_container.c:274] setting cudart directory to /usr/local/cuda
I0111 10:30:19.080058 149 nvc_container.c:275] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0111 10:30:19.080064 149 nvc_container.c:276] setting mount namespace to /proc/147/ns/mnt
I0111 10:30:19.080070 149 nvc_container.c:278] detected cgroupv1
I0111 10:30:19.080077 149 nvc_container.c:279] setting devices cgroup to /sys/fs/cgroup/devices/663e4a00_6864_4e19_8a5f_15d850583969/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0
I0111 10:30:19.080134 149 nvc_info.c:798] requesting driver information with ''
I0111 10:30:19.083496 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.535.146.02
I0111 10:30:19.083715 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.535.146.02
I0111 10:30:19.083826 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.535.146.02
I0111 10:30:19.083920 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.535.146.02
I0111 10:30:19.084025 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.535.146.02
I0111 10:30:19.084159 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.535.146.02
I0111 10:30:19.084254 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.535.146.02
I0111 10:30:19.084412 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.146.02
I0111 10:30:19.084542 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.535.146.02
I0111 10:30:19.084657 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.535.146.02
I0111 10:30:19.084783 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.535.146.02
I0111 10:30:19.084883 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.535.146.02
I0111 10:30:19.084963 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.535.146.02
I0111 10:30:19.085071 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.535.146.02
I0111 10:30:19.085101 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.535.146.02
W0111 10:30:19.085145 149 nvc_info.c:402] missing library libnvidia-nscq.so
W0111 10:30:19.085150 149 nvc_info.c:402] missing library libnvidia-gpucomp.so
W0111 10:30:19.085152 149 nvc_info.c:402] missing library libnvidia-fatbinaryloader.so
W0111 10:30:19.085155 149 nvc_info.c:402] missing library libnvidia-compiler.so
W0111 10:30:19.085158 149 nvc_info.c:402] missing library libnvidia-ngx.so
W0111 10:30:19.085160 149 nvc_info.c:402] missing library libnvidia-eglcore.so
W0111 10:30:19.085163 149 nvc_info.c:402] missing library libnvidia-glcore.so
W0111 10:30:19.085165 149 nvc_info.c:402] missing library libnvidia-tls.so
W0111 10:30:19.085167 149 nvc_info.c:402] missing library libnvidia-glsi.so
W0111 10:30:19.085170 149 nvc_info.c:402] missing library libnvidia-ifr.so
W0111 10:30:19.085172 149 nvc_info.c:402] missing library libnvidia-rtcore.so
W0111 10:30:19.085175 149 nvc_info.c:402] missing library libnvoptix.so
W0111 10:30:19.085177 149 nvc_info.c:402] missing library libGLX_nvidia.so
W0111 10:30:19.085180 149 nvc_info.c:402] missing library libEGL_nvidia.so
W0111 10:30:19.085182 149 nvc_info.c:402] missing library libGLESv2_nvidia.so
W0111 10:30:19.085184 149 nvc_info.c:402] missing library libGLESv1_CM_nvidia.so
W0111 10:30:19.085187 149 nvc_info.c:402] missing library libnvidia-glvkspirv.so
W0111 10:30:19.085189 149 nvc_info.c:402] missing library libnvidia-cbl.so
W0111 10:30:19.085192 149 nvc_info.c:406] missing compat32 library libnvidia-ml.so
W0111 10:30:19.085194 149 nvc_info.c:406] missing compat32 library libnvidia-cfg.so
W0111 10:30:19.085197 149 nvc_info.c:406] missing compat32 library libnvidia-nscq.so
W0111 10:30:19.085199 149 nvc_info.c:406] missing compat32 library libcuda.so
W0111 10:30:19.085202 149 nvc_info.c:406] missing compat32 library libcudadebugger.so
W0111 10:30:19.085204 149 nvc_info.c:406] missing compat32 library libnvidia-opencl.so
W0111 10:30:19.085206 149 nvc_info.c:406] missing compat32 library libnvidia-gpucomp.so
W0111 10:30:19.085209 149 nvc_info.c:406] missing compat32 library libnvidia-ptxjitcompiler.so
W0111 10:30:19.085211 149 nvc_info.c:406] missing compat32 library libnvidia-fatbinaryloader.so
W0111 10:30:19.085214 149 nvc_info.c:406] missing compat32 library libnvidia-allocator.so
W0111 10:30:19.085216 149 nvc_info.c:406] missing compat32 library libnvidia-compiler.so
W0111 10:30:19.085219 149 nvc_info.c:406] missing compat32 library libnvidia-pkcs11.so
W0111 10:30:19.085221 149 nvc_info.c:406] missing compat32 library libnvidia-pkcs11-openssl3.so
W0111 10:30:19.085230 149 nvc_info.c:406] missing compat32 library libnvidia-nvvm.so
W0111 10:30:19.085233 149 nvc_info.c:406] missing compat32 library libnvidia-ngx.so
W0111 10:30:19.085235 149 nvc_info.c:406] missing compat32 library libvdpau_nvidia.so
W0111 10:30:19.085238 149 nvc_info.c:406] missing compat32 library libnvidia-encode.so
W0111 10:30:19.085241 149 nvc_info.c:406] missing compat32 library libnvidia-opticalflow.so
W0111 10:30:19.085243 149 nvc_info.c:406] missing compat32 library libnvcuvid.so
W0111 10:30:19.085246 149 nvc_info.c:406] missing compat32 library libnvidia-eglcore.so
W0111 10:30:19.085249 149 nvc_info.c:406] missing compat32 library libnvidia-glcore.so
W0111 10:30:19.085251 149 nvc_info.c:406] missing compat32 library libnvidia-tls.so
W0111 10:30:19.085254 149 nvc_info.c:406] missing compat32 library libnvidia-glsi.so
W0111 10:30:19.085256 149 nvc_info.c:406] missing compat32 library libnvidia-fbc.so
W0111 10:30:19.085259 149 nvc_info.c:406] missing compat32 library libnvidia-ifr.so
W0111 10:30:19.085262 149 nvc_info.c:406] missing compat32 library libnvidia-rtcore.so
W0111 10:30:19.085264 149 nvc_info.c:406] missing compat32 library libnvoptix.so
W0111 10:30:19.085267 149 nvc_info.c:406] missing compat32 library libGLX_nvidia.so
W0111 10:30:19.085270 149 nvc_info.c:406] missing compat32 library libEGL_nvidia.so
W0111 10:30:19.085272 149 nvc_info.c:406] missing compat32 library libGLESv2_nvidia.so
W0111 10:30:19.085275 149 nvc_info.c:406] missing compat32 library libGLESv1_CM_nvidia.so
W0111 10:30:19.085277 149 nvc_info.c:406] missing compat32 library libnvidia-glvkspirv.so
W0111 10:30:19.085280 149 nvc_info.c:406] missing compat32 library libnvidia-cbl.so
I0111 10:30:19.085495 149 nvc_info.c:302] selecting /usr/bin/nvidia-smi
I0111 10:30:19.085511 149 nvc_info.c:302] selecting /usr/bin/nvidia-debugdump
I0111 10:30:19.085525 149 nvc_info.c:302] selecting /usr/bin/nvidia-persistenced
I0111 10:30:19.085549 149 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-control
I0111 10:30:19.085563 149 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-server
W0111 10:30:19.085591 149 nvc_info.c:428] missing binary nv-fabricmanager
I0111 10:30:19.085667 149 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.146.02/gsp_ga10x.bin
I0111 10:30:19.085671 149 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.146.02/gsp_tu10x.bin
I0111 10:30:19.085688 149 nvc_info.c:561] listing device /dev/nvidiactl
I0111 10:30:19.085691 149 nvc_info.c:561] listing device /dev/nvidia-uvm
I0111 10:30:19.085693 149 nvc_info.c:561] listing device /dev/nvidia-uvm-tools
I0111 10:30:19.085696 149 nvc_info.c:561] listing device /dev/nvidia-modeset
W0111 10:30:19.085712 149 nvc_info.c:352] missing ipc path /var/run/nvidia-persistenced/socket
W0111 10:30:19.085724 149 nvc_info.c:352] missing ipc path /var/run/nvidia-fabricmanager/socket
W0111 10:30:19.085735 149 nvc_info.c:352] missing ipc path /tmp/nvidia-mps
I0111 10:30:19.085739 149 nvc_info.c:854] requesting device information with ''
I0111 10:30:19.093013 149 nvc_info.c:745] listing device /dev/nvidia0 (GPU-15dd6db0-ca52-31f5-3daf-2019882683b0 at 00000000:02:00.0)
I0111 10:30:19.101035 149 nvc_info.c:745] listing device /dev/nvidia1 (GPU-2f4bb339-fc05-e25d-512c-05c7eefd99e1 at 00000000:04:00.0)
I0111 10:30:19.109848 149 nvc_info.c:745] listing device /dev/nvidia2 (GPU-23e7b85e-0792-0721-9d22-ac2a0e9bac2b at 00000000:06:00.0)
I0111 10:30:19.118514 149 nvc_info.c:745] listing device /dev/nvidia3 (GPU-d78a3ff1-47b6-80be-95a5-ffe77853335f at 00000000:08:00.0)
I0111 10:30:19.126998 149 nvc_info.c:745] listing device /dev/nvidia4 (GPU-3dbafd65-2a6e-1445-9569-a70fa746902d at 00000000:0a:00.0)
I0111 10:30:19.135642 149 nvc_info.c:745] listing device /dev/nvidia5 (GPU-d9bc47e7-fd6d-f1dc-ff31-d9675bb73087 at 00000000:0c:00.0)
`
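The log records the container PID (147), the rootfs path, and the mount namespace path (/proc/147/ns/mnt). As a rough diagnostic sketch, not part of libnvidia-container, the following C helpers read the symlinks under /proc/<pid> so that the mount namespace of the container process can be compared with that of the calling process. The names `read_proc_link` and `same_mnt_ns` are hypothetical and introduced here only for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the target of the symlink /proc/<pid>/<entry> (for example "ns/mnt",
 * "root" or "cwd") into buf. Returns 0 on success, -1 on error.
 * Hypothetical helper for illustration only.
 */
static int
read_proc_link(pid_t pid, const char *entry, char *buf, size_t len)
{
        char path[64];
        ssize_t n;
        int rv;

        assert(entry != NULL && buf != NULL && len > 1);
        rv = snprintf(path, sizeof(path), "/proc/%ld/%s", (long)pid, entry);
        if (rv < 0 || (size_t)rv >= sizeof(path))
                return (-1);
        n = readlink(path, buf, len - 1);
        if (n < 0)
                return (-1);
        buf[n] = '\0'; /* readlink does not NUL-terminate */
        return (0);
}

/*
 * Returns 1 if both processes are in the same mount namespace, 0 if they
 * are not, and -1 if either namespace link could not be read.
 */
static int
same_mnt_ns(pid_t a, pid_t b)
{
        char na[64], nb[64];

        if (read_proc_link(a, "ns/mnt", na, sizeof(na)) < 0 ||
            read_proc_link(b, "ns/mnt", nb, sizeof(nb)) < 0)
                return (-1);
        return (strcmp(na, nb) == 0);
}
```

Run inside the guest with sufficient privileges, `same_mnt_ns(getpid(), 147)` would indicate whether the caller already shares the mount namespace of PID 147, and `read_proc_link(147, "root", ...)` would show where that process's root directory points. Comparing these values between a working 5-GPU pod and a failing 6-GPU pod might help narrow down what differs.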
The relevant nvidia-container-cli code is in src/nvc_mount.c:
`
int
nvc_driver_mount(struct nvc_context *ctx, const struct nvc_container *cnt, const struct nvc_driver_info *info)
{
        const char **mnt, **ptr, **tmp;
        size_t nmnt;
        int rv = -1;

        if (validate_context(ctx) < 0)
                return (-1);
        if (validate_args(ctx, cnt != NULL && info != NULL) < 0)
                return (-1);

        /* Enter the mount namespace of the container. */
        if (ns_enter(&ctx->err, cnt->mnt_ns, CLONE_NEWNS) < 0)
                return (-1);

        nmnt = 2 + info->nbins + info->nlibs + cnt->nlibs + info->nlibs32 + info->nipcs + info->ndevs + info->nfirmwares;
        mnt = ptr = (const char **)array_new(&ctx->err, nmnt);
        if (mnt == NULL)
                goto fail;

        /* Procfs mount */
        if (ctx->dxcore.initialized)
                log_warn("skipping procfs mount on WSL");
        else if ((*ptr++ = mount_procfs(&ctx->err, ctx->cfg.root, cnt)) == NULL)
                goto fail;
        /* ... (rest of the function omitted) */
`
After tracing, the problem appears to occur in the ns_enter function. In the normal case during testing, the file system paths inside the virtual machine are still visible after ns_enter is called, so the mount succeeds. [The mount path is: /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/proc/driver/nvidia]. In the abnormal case, with 6 GPUs, calling ns_enter lands directly in the rootfs of the container as the root directory, so when mounting [/run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/proc/driver/nvidia] the path cannot be found. I don't understand why this happens, and I don't know how to solve it.
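To make the two cases easier to reproduce, here is a minimal C sketch, assuming Linux and enough privileges to call setns(2). It forks a child, joins the mount namespace of a given PID, and checks whether a directory is visible there. As far as I understand, joining a mount namespace also resets the caller's root directory to the root of that namespace, which would match the behavior described above. The helper names `join_rootfs`, `dir_exists`, and `dir_exists_in_mnt_ns` are hypothetical and are not the functions libnvidia-container uses.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Build "<rootfs><suffix>" into buf, dropping trailing slashes from rootfs.
 * suffix is assumed to start with '/'. Returns 0, or -1 if buf is too small.
 */
static int
join_rootfs(char *buf, size_t len, const char *rootfs, const char *suffix)
{
        size_t rl;
        int n;

        assert(buf != NULL && rootfs != NULL && suffix != NULL);
        rl = strlen(rootfs);
        while (rl > 0 && rootfs[rl - 1] == '/')
                rl--;
        n = snprintf(buf, len, "%.*s%s", (int)rl, rootfs, suffix);
        return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/* Returns 1 if path is a directory visible to the calling process, else 0. */
static int
dir_exists(const char *path)
{
        struct stat st;

        return (stat(path, &st) == 0 && S_ISDIR(st.st_mode));
}

/*
 * Join the mount namespace of pid in a forked child and test path there.
 * Returns 1 if the directory exists, 0 if it does not, and -1 if the
 * namespace could not be entered (for example without CAP_SYS_ADMIN).
 */
static int
dir_exists_in_mnt_ns(pid_t pid, const char *path)
{
        char ns[64];
        int fd, status;
        pid_t child;

        snprintf(ns, sizeof(ns), "/proc/%ld/ns/mnt", (long)pid);
        if ((child = fork()) < 0)
                return (-1);
        if (child == 0) {
                /* Child: the parent's namespace is left untouched. */
                if ((fd = open(ns, O_RDONLY | O_CLOEXEC)) < 0 ||
                    setns(fd, CLONE_NEWNS) < 0)
                        _exit(2);
                _exit(dir_exists(path) ? 0 : 1);
        }
        if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
                return (-1);
        switch (WEXITSTATUS(status)) {
        case 0:
                return (1);
        case 1:
                return (0);
        default:
                return (-1);
        }
}
```

With the values from the log, one could build the mount target with `join_rootfs(buf, sizeof(buf), rootfs, "/proc")` and call `dir_exists_in_mnt_ns(147, buf)`, then compare the result with `dir_exists_in_mnt_ns(147, "/proc")`. If only the second call succeeds, the root of that mount namespace is already the container rootfs, so a path prefixed with the rootfs cannot resolve.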