Enable MIG Mode in DGX Station A100

The following procedure describes how to enable Multi-Instance GPU (MIG) mode.

  1. By default, MIG mode is not enabled on the DGX Station A100.

    For example, when you run nvidia-smi, the output shows that MIG mode is disabled:

    $ nvidia-smi -i 0
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  A100-SXM4-40GB      Off  | 00000000:36:00.0 Off |                    0 |
    | N/A   29C    P0    62W / 400W |      0MiB / 40537MiB |      6%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    
  2. To enable MIG mode on one or more GPUs, run the nvidia-smi -i <GPU IDs> -mig 1 command.

  3. Select the GPUs by using comma-separated GPU indexes, PCI Bus Ids, or UUIDs.

    Keep the following in mind:

    • If you do not specify a GPU ID, MIG mode is applied to all GPUs on the system. The following example enables MIG mode on GPU 0 only and then confirms the new setting:

      $ sudo nvidia-smi -i 0 -mig 1
      Enabled MIG Mode for GPU 00000000:36:00.0
      All done.
      
      $ nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv
      pci.bus_id, mig.mode.current
      00000000:36:00.0, Enabled
      
    • If you are using MIG in a VM with GPU passthrough, you might need to reboot the VM before the GPU enters MIG mode.

      For security reasons, some hypervisors do not allow a GPU reset. Here is an example:

      $ sudo nvidia-smi -i 0 -mig 1
      Warning: MIG mode is in pending enable state for GPU 00000000:00:03.0:Not Supported
      Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:00:03.0
      All done.
      
      $ sudo nvidia-smi --gpu-reset
      Resetting GPU 00000000:00:03.0 is not supported.
      
    • If you have agents on the system, such as monitoring agents that use the GPU, you might not be able to initiate a GPU reset.

      On DGX systems, for example, you might encounter the following message:

      $ sudo nvidia-smi -i 0 -mig 1
      Warning: MIG mode is in pending enable state for GPU 00000000:07:00.0:In use by another client
      00000000:07:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using the device and retry the command or reboot the system to make MIG mode effective.
      All done.
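
    The pending states shown in the warnings above can also be detected from a query rather than by reading the warning text. The following is a minimal sketch, not part of the original procedure: mig_state is a hypothetical helper name, and it assumes the mig.mode.pending query field is available alongside the mig.mode.current field used earlier.

    ```shell
    # Hypothetical helper: classify one CSV line produced by
    #   nvidia-smi --query-gpu=pci.bus_id,mig.mode.current,mig.mode.pending --format=csv,noheader
    mig_state() {
      # Drop the spaces that follow each comma in the CSV output.
      line=$(printf '%s' "$1" | tr -d ' ')
      bus=${line%%,*}        # first field: PCI bus ID
      rest=${line#*,}
      current=${rest%%,*}    # second field: current MIG mode
      pending=${rest#*,}     # third field: pending MIG mode
      if [ "$current" = "$pending" ]; then
        printf '%s: MIG mode %s\n' "$bus" "$current"
      else
        printf '%s: MIG mode %s, pending %s (GPU reset or reboot needed)\n' \
          "$bus" "$current" "$pending"
      fi
    }

    # Example use on a live system (commented out so the sketch has no side effects):
    # nvidia-smi --query-gpu=pci.bus_id,mig.mode.current,mig.mode.pending --format=csv,noheader |
    #   while IFS= read -r l; do mig_state "$l"; done
    ```

    When the current and pending values differ, the change has not yet taken effect, which matches the "pending enable state" warnings in the examples above.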
      
  4. Stop the nvsm, dcgm, and gdm3 services, enable MIG mode on the desired GPU, and then restart the services that you stopped. The following commands show the stop and enable steps:

    $ sudo systemctl stop nvsm
    $ sudo systemctl stop dcgm
    $ sudo systemctl stop gdm3
    $ sudo nvidia-smi -i 0 -mig 1
    Enabled MIG Mode for GPU 00000000:07:00.0
    All done.
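
    The listing above stops the services and enables MIG mode but does not show the restart. A minimal sketch of the full sequence follows. The function name and the RUN variable are hypothetical, and starting the services in the reverse of the stop order is an assumption rather than a documented requirement. Stopping gdm3 ends any graphical session, so this is best run from a text console or over SSH.

    ```shell
    # RUN is a hypothetical knob: the default runs each command through sudo,
    # while RUN=echo gives a dry run that only prints the commands.
    RUN=${RUN:-sudo}

    enable_mig_with_services_paused() {
      gpu=$1
      # Stop the services that hold the GPU open.
      for svc in nvsm dcgm gdm3; do
        $RUN systemctl stop "$svc"
      done
      # Enable MIG mode and remember whether the command succeeded.
      $RUN nvidia-smi -i "$gpu" -mig 1
      status=$?
      # Start the services again (reverse order is an assumption),
      # whether or not the enable step succeeded.
      for svc in gdm3 dcgm nvsm; do
        $RUN systemctl start "$svc"
      done
      return $status
    }

    # Example: preview the commands for GPU 0 without changing anything.
    # RUN=echo; enable_mig_with_services_paused 0
    ```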
    

    The examples in this section run with superuser privileges. After the DGX Station A100 has been configured in MIG mode, non-root users can also manage instances if you grant them read access to the mig/config capability. Refer to Device Notes for more information.

    The following listing shows the default permissions on the mig capability files:

    $ ls -l /proc/driver/nvidia/capabilities/*
    /proc/driver/nvidia/capabilities/mig:
    total 0
    -r-------- 1 root root 0 May 24 16:10 config
    -r--r--r-- 1 root root 0 May 24 16:10 monitor
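
A user can check whether the read access described above is in place before attempting to manage instances. The following is a minimal sketch, assuming the /proc-based capability file from the listing above. The function name can_manage_mig is hypothetical, and the optional argument exists only so that the check can be tried against another file.

```shell
# Hypothetical helper: report whether the current user can read the
# mig/config capability file. The default path is taken from the listing above.
can_manage_mig() {
  cap_file=${1:-/proc/driver/nvidia/capabilities/mig/config}
  if [ -r "$cap_file" ]; then
    echo "can manage MIG instances"
  else
    echo "no read access to $cap_file"
    return 1
  fi
}

# Example: run as the non-root user whose access you want to verify.
# can_manage_mig
```

With the default permissions shown above, this check is expected to succeed only for root.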
    

To ensure that the MIG instances are available in your containers, restart nv-docker-gpus and docker:

$ sudo systemctl restart nv-docker-gpus
$ sudo systemctl restart docker
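
After the restart, one way to confirm what is visible is to count the MIG devices in the output of nvidia-smi -L, either on the host or inside a container. The following is a minimal sketch. The function name is hypothetical, and it assumes that each MIG device appears on its own indented line beginning with "MIG", which can vary between driver versions.

```shell
# Hypothetical helper: count MIG device lines in `nvidia-smi -L` output
# read from standard input. Assumes lines such as
#   "  MIG 3g.20gb Device 0: (UUID: ...)"
count_mig_devices() {
  # grep -c prints the number of matching lines; it prints 0 and
  # returns a non-zero status when nothing matches.
  grep -c '^[[:space:]]*MIG '
}

# Example use on a live system:
# nvidia-smi -L | count_mig_devices
```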