Configuring MIG in B200 causes 'context deadline exceeded' in some Pods #2155

@FermiX9

First of all, I’m not sure whether this is actually a bug or if I might be missing something.

In our environment, the NVIDIA driver is installed directly on the node, and MIG is managed by BCM rather than by the GPU Operator.
When MIG is configured with the 1g.23gb profile on B200 GPUs, nvidia-smi queries become significantly slower. As a result, operator-related Pods such as gpu-feature-discovery enter an Error state with a 'context deadline exceeded' message.

My assumption is that the root cause is the slow execution of the nvidia-smi command. The Pod likely reaches its timeout before the command finishes and returns the GPU status.
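To make that assumption concrete, here is a minimal sketch of what a deadline around a slow command looks like. It uses the coreutils `timeout` command; the helper name `run_with_deadline` and the 30-second value in the usage comment are hypothetical and are not the operator's actual timeout. Whether the Pods call nvidia-smi directly or query the driver some other way is not established in this report.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a wall-clock deadline, roughly
# what a client with a context deadline does. Not taken from the GPU Operator.
run_with_deadline() {
  deadline=$1
  shift
  # GNU timeout exits with status 124 when the deadline expires.
  timeout "$deadline" "$@" >/dev/null 2>&1
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "deadline exceeded"
  elif [ "$rc" -eq 0 ]; then
    echo "ok"
  else
    echo "failed ($rc)"
  fi
}

# Usage on the node (30 s is an assumed value):
#   run_with_deadline 30 nvidia-smi
```

With a response time like the 44 seconds measured further down, any deadline shorter than that would expire before the command returns.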

However, I have been able to apply this MIG configuration successfully by following this process:

  • Stop the kubelet and containerd services on the node.
  • Apply a MIG configuration with fewer partitions.
  • Start the kubelet and containerd services again and wait until the GPU Operator Pods reach the Running state in Kubernetes.
  • Stop the kubelet and containerd services on the node.
  • Apply the 1g.23gb MIG configuration.
  • Restart the kubelet and containerd services.
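The steps above can be sketched as a script. This is only an outline under several assumptions: MIG is applied through BCM and that command is not shown in this report, so `apply_mig_config` is a hypothetical placeholder; the `gpu-operator` namespace, the `app=gpu-feature-discovery` label, and the 600-second wait are assumed values; and the services are started again before waiting for the Pods, since they cannot reach Running while kubelet and containerd are stopped. By default the script only prints the commands.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical placeholder: the BCM command that applies a MIG layout is
# not given in this report, so this only prints a reminder.
apply_mig_config() {
  echo "# apply MIG configuration '$1' via BCM (command not shown here)"
}

staged_mig_apply() {
  final_profile=$1
  interim_profile=$2

  run systemctl stop kubelet containerd
  apply_mig_config "$interim_profile"
  # Assumed step: bring the services back so the Pods can start.
  run systemctl start containerd kubelet
  # Namespace, label and timeout are assumptions about the cluster.
  run kubectl wait --for=condition=Ready pod \
    -l app=gpu-feature-discovery -n gpu-operator --timeout=600s

  run systemctl stop kubelet containerd
  apply_mig_config "$final_profile"
  run systemctl start containerd kubelet
}

# Usage (dry run): staged_mig_apply 1g.23gb fewer-partitions
```

Keeping the dry-run default makes it possible to review the sequence on a node before stopping any service.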

It appears that once the Pods have previously reached the Running state, they are able to retrieve the nvidia-smi information successfully after the restart, and they no longer fail during initialization.

UPDATE:
I also tried profile number 12 (3 x 2g.45gb + 1 x 1g.23gb), following the official documentation: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-mig-profiles.html#b200-mig-profiles

It seems that when all the Pods start at the same time, it triggers a storm of queries to the driver, resulting in long nvidia-smi response times and causing the Pods to fail to start.
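One way to check this hypothesis without Kubernetes would be to launch several queries at once from a shell and compare the elapsed time with a single run. The sketch below is an assumption about how such a test could look; the helper name `storm` and the count of 16 are hypothetical, and `nvidia-smi -L` is used in the usage comment only because it is a cheap listing query.

```shell
#!/bin/sh
# Hypothetical helper: start N copies of a command in parallel, wait for
# all of them, and print the elapsed wall-clock time in whole seconds.
storm() {
  n=$1
  shift
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 &
    i=$((i + 1))
  done
  wait
  end=$(date +%s)
  echo $((end - start))
}

# Usage on the node (16 is an arbitrary count):
#   storm 1 nvidia-smi -L    # baseline
#   storm 16 nvidia-smi -L   # concurrent queries
```

If the concurrent run takes much longer than the baseline while containerd is stopped, that would point at contention in the driver rather than at the Pods themselves.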

The following logs show the difference in nvidia-smi execution time depending on whether the containerd service is running on the node.

containerd service running and Pods trying to start:

root@node03:~# time nvidia-smi
Tue Feb 24 12:13:39 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA B200                    On  |   00000000:18:00.0 Off |                   On |
| N/A   42C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA B200                    On  |   00000000:29:00.0 Off |                   On |
| N/A   42C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA B200                    On  |   00000000:3A:00.0 Off |                   On |
| N/A   41C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA B200                    On  |   00000000:4B:00.0 Off |                   On |
| N/A   42C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA B200                    On  |   00000000:9A:00.0 Off |                   On |
| N/A   43C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA B200                    On  |   00000000:AA:00.0 Off |                   On |
| N/A   42C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA B200                    On  |   00000000:BA:00.0 Off |                   On |
| N/A   42C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA B200                    On  |   00000000:CA:00.0 Off |                   On |
| N/A   42C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |              Shared Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                Shared BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    5   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    6   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    9   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    5   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    6   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    9   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  6    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  6    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  6    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  6   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  7    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  7    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  7    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  7   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

real    0m44.299s
user    0m0.275s
sys     0m15.130s

containerd stopped:

root@node03:~# time nvidia-smi
Tue Feb 24 12:07:30 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA B200                    On  |   00000000:18:00.0 Off |                   On |
| N/A   43C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA B200                    On  |   00000000:29:00.0 Off |                   On |
| N/A   43C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA B200                    On  |   00000000:3A:00.0 Off |                   On |
| N/A   43C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA B200                    On  |   00000000:4B:00.0 Off |                   On |
| N/A   43C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA B200                    On  |   00000000:9A:00.0 Off |                   On |
| N/A   44C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA B200                    On  |   00000000:AA:00.0 Off |                   On |
| N/A   43C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA B200                    On  |   00000000:BA:00.0 Off |                   On |
| N/A   43C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA B200                    On  |   00000000:CA:00.0 Off |                   On |
| N/A   44C    P0            N/A  / 1000W |     107MiB / 183359MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |              Shared Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                Shared BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    5   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    6   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    9   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    5   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    6   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    9   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  6    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  6    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  6    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  6   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  7    3   0   0  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  7    4   0   1  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  7    5   0   2  |              31MiB / 45312MiB    | 36      0 |  4   0    2    0    2 |
|                  |               0MiB / 19750MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  7   13   0   3  |              16MiB / 20992MiB    | 18      0 |  2   0    1    0    1 |
|                  |               0MiB /  9875MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

real    0m5.356s
user    0m0.045s
sys     0m5.231s
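The two `time` results above can be compared with a little arithmetic. The sketch below uses only the values from the logs; the helper names are hypothetical. It converts the `real` values to seconds, computes the slowdown, and computes how much of each run was neither user nor sys CPU time.

```shell
#!/bin/sh
# Convert a `time` value such as 0m44.299s to seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Wall-clock time not accounted for by user or sys CPU time.
off_cpu() {
  awk -v r="$1" -v u="$2" -v s="$3" 'BEGIN { printf "%.3f\n", r - u - s }'
}

# Ratio of two durations, rounded to two decimals.
slowdown() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

slow_real=$(to_seconds 0m44.299s)   # containerd running
fast_real=$(to_seconds 0m5.356s)    # containerd stopped

echo "slowdown: $(slowdown "$slow_real" "$fast_real")x"
echo "off-CPU, containerd running: $(off_cpu "$slow_real" 0.275 15.130) s"
echo "off-CPU, containerd stopped: $(off_cpu "$fast_real" 0.045 5.231) s"
```

The run with containerd active is about 8.27 times slower, and roughly 28.9 of its 44.3 seconds are spent outside user and sys CPU time, compared with about 0.08 seconds in the other run. That is consistent with nvidia-smi waiting on something, for example a contended driver, although these numbers alone do not prove it.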

Labels: question (categorizes issue or PR as a support question)