Multi-Instance GPU

Multi-Instance GPU (MIG) is a new capability of the NVIDIA A100 GPU. MIG uses spatial partitioning to divide the physical resources of an A100 GPU into up to seven independent GPU instances. These instances run simultaneously, each with its own memory, cache, and streaming multiprocessors (SMs). MIG enables the A100 GPU to deliver guaranteed quality of service at up to 7X higher utilization than non-MIG-enabled GPUs.

MIG enables the following:

  • GPU memory isolation among parallel GPU workloads.

  • Physical allocation of resources used by parallel GPU workloads.

MIG instances are managed through the NVIDIA Management Library (NVML) APIs or through the NVML command-line utility, nvidia-smi. Enabling MIG requires a GPU reset, so system services that manage the GPUs, such as NVSM and DCGM, should be stopped before MIG is enabled.
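Before changing anything, it can be useful to see which GPUs already have MIG mode on. The following is a minimal sketch, assuming the installed driver exposes the mig.mode.current and mig.mode.pending query fields in nvidia-smi. The helper name mig_not_enabled is hypothetical and is not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Print the index of every GPU whose current MIG mode is not "Enabled".
# Reads one line per GPU on stdin, in the form "index, current_mode, pending_mode"
# (the CSV layout requested from nvidia-smi below).
mig_not_enabled() {
    awk -F', *' '$2 != "Enabled" { print $1 }'
}

# Query the driver only when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending \
               --format=csv,noheader | mig_not_enabled
fi
```

An empty result means every GPU the driver reports already has MIG mode enabled.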

To enable MIG on all eight GPUs in the system, perform the following steps.

  1. Stop the NVSM and DCGM services.

    $ sudo systemctl stop nvsm dcgm

  2. Enable MIG on all eight GPUs.

    $ sudo nvidia-smi -mig 1

    If other running services prevent the GPUs from being reset, reboot the system instead and skip the next step.

  3. Restart the DCGM and NVSM services.

    $ sudo systemctl start dcgm nvsm
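Step 2 notes that the GPU reset can be blocked by other services. One way to check for that afterwards, sketched here on the assumption that nvidia-smi reports the mig.mode.current and mig.mode.pending query fields, is to list every GPU whose pending mode differs from its current mode. Those GPUs still need a reset or a reboot before the change takes effect. The function name mig_reset_pending is made up for this example.

```shell
#!/usr/bin/env bash
# List GPUs whose requested (pending) MIG mode has not taken effect yet.
# Reads one line per GPU on stdin: "index, current_mode, pending_mode".
mig_reset_pending() {
    awk -F', *' '$2 != $3 { print "GPU " $1 ": current=" $2 ", pending=" $3 }'
}

# Query the driver only when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending \
               --format=csv,noheader | mig_reset_pending
fi
```

No output means the current and pending modes agree on every GPU.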

To use MIG, see the MIG User Guide. It describes key MIG concepts and deployment considerations in more detail, and explains how to create MIG instances and how to run Docker containers using MIG.