Advanced GPU configuration (Optional)

GPU Partitioning

Compute workloads can benefit from using separate GPU partitions. The flexibility of GPU partitioning allows a single GPU to be shared by small, medium, and large workloads. GPU partitions can be a valid option for executing Deep Learning workloads, such as training and inference workflows that use smaller datasets; because these workloads are highly dependent on the size of the data and model, users may need to decrease batch sizes.

The following graphic illustrates a GPU partitioning use case in which multiple users share a single A100 (40GB). In this multi-tenant use case, a single A100 can serve multiple workloads, such as Deep Learning training, fine-tuning, inference, Jupyter Notebooks, profiling, and debugging.

../_images/dg-gpu-part-01.png

GPUs are partitioned using one of two NVIDIA technologies: NVIDIA AI Enterprise software (temporal) partitioning or Multi-Instance GPU (MIG) spatial partitioning. Please refer to the GPU Partitioning technical brief to understand the differences.

NVIDIA AI Enterprise Software Partitioning

With NVIDIA AI Enterprise software partitioning, profiles assign a custom amount of dedicated GPU memory to each user. The NVIDIA AI Enterprise Host Software sets the amount of memory to meet the specific needs of each user's workflow. Every virtual machine has dedicated GPU memory and must be assigned accordingly, ensuring it has the resources needed to handle the expected compute load.

NVIDIA AI Enterprise Host Software allows up to eight users to share each physical GPU by assigning the GPU's resources to virtual machines using a balanced approach. Depending on the number of GPUs on each card, multiple user types can be assigned.

Profiles for NVIDIA AI Enterprise

Profiles offer a flexible deployment option for virtual GPUs, varying in GPU memory size. The division of GPU memory defines the number of vGPUs that are possible per GPU.

C-series vGPU types are optimized for compute-intensive workloads. As a result, they support only a single display head at a maximum resolution of 4096×2160 and do not provide NVIDIA RTX graphics acceleration.

It is essential to consider which vGPU profile will be used within a deployment, since this ultimately determines how many vGPU-backed VMs can be deployed. All VMs using a shared GPU must be assigned the same fractional vGPU profile; you cannot mix vGPU profiles on a single GPU using NVIDIA AI Enterprise software.

In the image below, the right side illustrates valid configurations in green, where VMs share a single GPU resource (GPU 1) on a T4 GPU and all VMs are assigned homogeneous profiles, such as 8GB, 4GB, or 16GB C profiles. Since there are two GPUs installed in the server, the other T4 (GPU 0) can be partitioned/fractionalized differently than GPU 1. An invalid configuration is shown in red, where a single GPU is shared using both 8C and 4C profiles. Heterogeneous profiles are not supported on vGPU, and such VMs will not successfully power on.
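The homogeneity rule can be expressed as a simple check. The sketch below is illustrative only: the helper function and the profile names used as inputs (T4-8C, T4-4C) are hypothetical stand-ins for real vGPU profile assignments.

```shell
# Hypothetical check: every vGPU profile assigned to one physical GPU
# must be identical, or the VMs will not power on.
check_profiles() {
  first=$1
  for p in "$@"; do
    if [ "$p" != "$first" ]; then
      echo "invalid: mixed profiles on one GPU"
      return 0
    fi
  done
  echo "valid: all VMs use the $first profile"
}

check_profiles T4-8C T4-8C T4-8C   # homogeneous assignment
check_profiles T4-8C T4-4C         # heterogeneous assignment
```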

../_images/dg-gpu-part-02.png

Scheduling Policies

NVIDIA AI Enterprise provides three GPU scheduling options to accommodate a variety of customer QoS requirements. However, since AI Enterprise workloads are typically long-running operations, it is recommended to implement the Fixed Share or Equal Share scheduler for optimal performance.

  • Fixed share scheduling always guarantees the same dedicated quality of service: each vGPU sharing a physical GPU receives a fixed, equal share of that GPU's performance, regardless of the load on the other vGPUs.

  • Equal share scheduling provides an equal share of GPU resources to each running VM. As vGPUs are added or removed, the share of GPU processing cycles allocated to each vGPU changes accordingly, so performance increases when utilization is low and decreases when utilization is high.

  • Best effort scheduling uses a round-robin scheduling algorithm that shares GPU resources based on actual demand, resulting in optimal utilization of resources. Because it makes the best use of the GPU during idle and partially utilized times, it delivers consistent performance at a higher scale with optimized user density and good QoS, thereby reducing the TCO per user.

Additional information regarding GPU scheduling can be found here.

RmPVMRL Registry Key

The RmPVMRL registry key sets the scheduling policy for NVIDIA vGPUs.

Note

You can change the vGPU scheduling policy only on GPUs based on the Pascal, Volta, Turing, and Ampere architectures.

Type

Dword

Contents

Value            Meaning
0x00 (default)   Best effort scheduler
0x01             Equal share scheduler with the default time slice length
0x00TT0001       Equal share scheduler with a user-defined time slice length TT
0x11             Fixed share scheduler with the default time slice length
0x00TT0011       Fixed share scheduler with a user-defined time slice length TT

The default time slice length depends on the maximum number of vGPUs per physical GPU allowed for the vGPU type.

Maximum Number of vGPUs   Default Time Slice Length
Less than or equal to 8   2 ms
Greater than 8            1 ms

TT

  • Two hexadecimal digits in the range 01 to 1E set the time slice length in milliseconds (ms) for the equal share and fixed share schedulers. The minimum length is 1 ms, and the maximum length is 30 ms.

  • If TT is 00, the length is set to the default length for the vGPU type.

  • If TT is greater than 1E, the length is set to 30 ms.
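Putting the encoding together: the scheduler selector occupies the low bits (0x01 for equal share, 0x11 for fixed share) and TT occupies bits 16-23. A minimal sketch of composing a value:

```shell
# Compose an RmPVMRL value: scheduler selector in the low bits,
# user-defined time slice TT (in ms) in bits 16-23.
sched=0x01   # equal share scheduler
tt_ms=3      # 3 ms time slice
value=$(( (tt_ms << 16) | sched ))
printf 'RmPVMRL=0x%08X\n' "$value"   # prints RmPVMRL=0x00030001
```

This matches the 0x00030001 example that follows.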

Examples

This example sets the vGPU scheduler to equal share scheduler with the default time slice length.

RmPVMRL=0x01

This example sets the vGPU scheduler to equal share scheduler with a time slice that is 3 ms long.

RmPVMRL=0x00030001

This example sets the vGPU scheduler to a fixed share scheduler with the default time slice length.

RmPVMRL=0x11

This example sets the vGPU scheduler to a fixed share scheduler with a time slice 24 (0x18) ms long.

RmPVMRL=0x00180011

Changing the vGPU Scheduling Policy for All GPUs

Perform this task in your hypervisor command shell.

  1. Open a command shell as the root user on your hypervisor host machine. On all supported hypervisors, you can use a secure shell (SSH) for this purpose.

  2. Set the RmPVMRL registry key to the value for the GPU scheduling policy you need. In the VMware vSphere SSH CLI, use the esxcli set command.

    # esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=value"
    

    Note

    Where value is the value that sets the vGPU scheduling policy you want, for example:

    • 0x01 - Equal Share Scheduler with the default time slice length

    • 0x00030001 - Equal Share Scheduler with a time slice of 3 ms

    • 0x11 - Fixed Share Scheduler with the default time slice length

    • 0x00180011 - Fixed Share Scheduler with a time slice of 24 ms (0x18)

    The default time slice length depends on the maximum number of vGPUs per physical GPU allowed for the vGPU type.

    Maximum Number of vGPUs   Default Time Slice Length
    Less than or equal to 8   2 ms
    Greater than 8            1 ms

  3. Reboot your hypervisor host machine.

Changing the vGPU Scheduling Policy for Select GPUs

Perform this task in your hypervisor command shell:

  1. Open a command shell as the root user on your hypervisor host machine. On all supported hypervisors, you can use a secure shell (SSH) for this purpose.

  2. Use the lspci command to obtain the PCI domain and bus/device/function (BDF) of each GPU for which you want to change the scheduling behavior.

    • Pipe the output of lspci to the grep command to display information only for NVIDIA GPUs.

      # lspci | grep NVIDIA
      

      The NVIDIA GPUs listed in this example have the PCI domain 0000 and BDFs 85:00.0 and 86:00.0.

      0000:85:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [M60] (rev a1)
      0000:86:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [M60] (rev a1)
      
  3. Use the module parameter NVreg_RegistryDwordsPerDevice to set the pci and RmPVMRL registry keys for each GPU.

    • Use the esxcli set command:

      # esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwordsPerDevice=pci=pci-domain:pci-bdf;RmPVMRL=value [;pci=pci-domain:pci-bdf;RmPVMRL=value...]"
      

      For each GPU, provide the following information:

      • pci-domain

        • The PCI domain of the GPU.

      • pci-bdf

        • The PCI device BDF of the GPU.

      • value

        • 0x01 - Sets the vGPU scheduling policy to Equal Share Scheduler with the default time slice length.

        • 0x00030001 - Sets the vGPU scheduling policy to Equal Share Scheduler with a time slice that is 3 ms long.

        • 0x11 - Sets the vGPU scheduling policy to Fixed Share Scheduler with the default time slice length.

        • 0x00180011 - Sets the vGPU scheduling policy to Fixed Share Scheduler with a time slice that is 24 ms (0x18) long.

      • For all supported values, see RmPVMRL Registry Key.

  4. Reboot your hypervisor host machine.
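The per-device parameter string in step 3 can be assembled from the lspci output shown in step 2; the following is a sketch, using the illustrative BDFs from this section and the fixed share value 0x00180011:

```shell
# Build an NVreg_RegistryDwordsPerDevice string from lspci-style output.
# The sample lines mirror the lspci example in step 2.
lspci_out='0000:85:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [M60] (rev a1)
0000:86:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [M60] (rev a1)'
value=0x00180011   # fixed share scheduler, 24 ms time slice
params=$(printf '%s\n' "$lspci_out" \
  | awk -v v="$value" '/NVIDIA/ {printf "pci=%s;RmPVMRL=%s;", $1, v}')
echo "NVreg_RegistryDwordsPerDevice=${params%;}"
```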

Restoring Default vGPU Scheduler Settings

Perform this task in your hypervisor command shell.

  1. Open a command shell as the root user on your hypervisor host machine. On all supported hypervisors, you can use a secure shell (SSH) for this purpose.

  2. Unset the RmPVMRL registry key by setting the module parameter to an empty string.

    # esxcli system module parameters set -m nvidia -p "module-parameter="
    

    module-parameter

    The module parameter to set, which depends on whether the scheduling behavior was changed for all GPUs or select GPUs:

    • For all GPUs, set the NVreg_RegistryDwords module parameter.

    • For select GPUs, set the NVreg_RegistryDwordsPerDevice module parameter.

    For example, to restore default vGPU scheduler settings after they were changed for all GPUs, enter this command:

    # esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords="
    
  3. Reboot your hypervisor host machine.

NVIDIA Multi-Instance GPU Configuration for vSphere

The NVIDIA A100 Tensor Core GPU is based upon the NVIDIA Ampere architecture and accelerates compute workloads such as AI, data analytics, and HPC in the data center. MIG support on vGPUs began with the NVIDIA AI Enterprise Software 12 release and gives users the flexibility to use the NVIDIA A100 in MIG mode or non-MIG mode. When the NVIDIA A100 is in non-MIG mode, NVIDIA vGPU software uses temporal partitioning and GPU time slice scheduling. MIG mode spatially partitions the GPU hardware so that each MIG instance is fully isolated, with its own streaming multiprocessors (SMs), high-bandwidth memory, and cache. MIG can also partition the available GPU compute resources.

../_images/dg-gpu-part-03.png

Each instance’s processors have separate and isolated paths through the entire memory system. The on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address buses are assigned uniquely to an individual instance. This ensures that a particular user’s workload can run with predictable throughput and latency, using the same L2 cache allocation and DRAM bandwidth, even if other tasks thrash their caches or saturate their DRAM interfaces.

A single NVIDIA A100-40GB has eight usable GPU memory slices, each with 5 GB of memory, but only seven usable SM slices. There are seven SM slices rather than eight because some SMs cover operational overhead when MIG mode is enabled. MIG mode is configured (or reconfigured) using nvidia-smi and offers profiles that you can choose to meet the needs of HPC, deep learning, or accelerated computing workloads.

In summary, MIG spatially partitions the NVIDIA GPU into separate GPU instances and, for compute workloads, provides reduced latency compared with vGPU temporal partitioning. The following table summarizes the similarities and differences between A100 MIG capabilities and NVIDIA AI Enterprise software, while also highlighting the additional flexibility available when they are combined.

                                   A100 MIG-Enabled (40GB)   A100 MIG-Disabled (40GB)
                                   Virtual GPU Types         Virtual GPU Types
GPU Partitioning                   Spatial (hardware)        Temporal (software)
Number of Partitions               7                         10
Compute Resources                  Dedicated                 Shared
Compute Instance Partitioning      Yes                       No
Address Space Isolation            Yes                       Yes
Fault Tolerance                    Yes (highest quality)     Yes
Low Latency Response               Yes (highest quality)     Yes
NVLink Support                     No                        Yes
Multi-Tenant                       Yes                       Yes
GPUDirect RDMA                     Yes (GPU instances)       Yes
Heterogeneous Profiles             Yes                       No
Management - Requires Super User   Yes                       No

One of the features introduced to vGPUs when VMs are using MIG-backed virtual GPUs is the ability to have differently sized (heterogeneous) partitioned GPU instances. The following table illustrates the 18 possible size combinations when NVIDIA A100 has MIG mode enabled.

../_images/dg-gpu-part-04.png

Note

When using vCS and MIG mode is enabled, the NVIDIA AI Enterprise software recognizes the MIG-backed vGPU resource as if it were a 1:1 or full GPU profile.

NVIDIA AI Enterprise software supports MIG only with Linux guest operating systems. To support GPU instances with NVIDIA vGPU, a GPU must be configured with MIG mode enabled. For more information, see Configuring a GPU for MIG-Backed vGPUs in the Virtual GPU Software Documentation. For general information about the MIG feature, see NVIDIA Multi-Instance GPU User Guide.

Terminology

GPU Context

A GPU context is analogous to a CPU process. It encapsulates all of the resources necessary to execute operations on the GPU, including distinct address space, memory allocations, etc. In addition, a GPU context has the following properties:

  • Fault isolation

  • Individual scheduling

  • Distinct address space

GPU Engine

A GPU engine executes work on the GPU. The most commonly used engine is the Compute/Graphics engine, which executes the compute instructions. Other engines include the copy engine (CE), responsible for performing DMAs, NVDEC for video decoding, etc. Each engine can be scheduled independently and can execute work for different GPU contexts.

GPU Memory Slice

A GPU memory slice is the smallest fraction of the A100 GPUs memory, including the corresponding memory controllers and cache. A GPU memory slice is roughly one-eighth of the total GPU memory resources, including capacity and bandwidth.

GPU SM Slice

A GPU SM slice is the smallest fraction of the SMs on the A100 GPU. A GPU SM slice is roughly one-seventh of the total number of SMs available in the GPU when configured in MIG mode.

GPU Slice

A GPU slice is the smallest fraction of the A100 GPU that combines a single GPU memory slice and a single GPU SM slice.

GPU Instance

A GPU instance (GI) is a combination of GPU slices and GPU engines (DMAs, NVDECs, etc.). Anything within a GPU instance always shares all the GPU memory slices and other GPU engines, but its SM slices can be further subdivided into compute instances (CIs). A GPU instance provides memory QoS. Each GPU slice includes dedicated GPU memory resources that limit both the available capacity and bandwidth and provide memory QoS. Thus, each GPU memory slice gets one-eighth of the total GPU memory resources, and each GPU SM slice gets one-seventh of the total number of SMs.

Compute Instance

A GPU instance can be subdivided into multiple compute instances. A compute instance (CI) contains a subset of the parent GPU instance’s SM slices and other GPU engines (DMAs, NVDECs, etc.). The CIs share memory and engines.

The number of slices that a GI (GPU Instance) can be created with is not arbitrary. The NVIDIA driver APIs provide several “GPU Instance Profiles,” and users can create GIs by specifying one of these profiles.

Multiple GIs can be created from a mix and match of these profiles on a given GPU, so long as enough slices are available to satisfy the request.
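As a rough illustration of the slice budget (the driver's actual placement rules are stricter than a simple sum), a requested mix of profiles can only fit if the SM slices it consumes total at most seven:

```shell
# Rough SM-slice budget check for A100 GPU Instance profiles.
# Note: the real driver enforces additional placement constraints
# beyond this simple sum.
fits() {
  used=0
  for p in "$@"; do
    case $p in
      1g.5gb)  used=$((used + 1)) ;;
      2g.10gb) used=$((used + 2)) ;;
      3g.20gb) used=$((used + 3)) ;;
      4g.20gb) used=$((used + 4)) ;;
      7g.40gb) used=$((used + 7)) ;;
    esac
  done
  if [ "$used" -le 7 ]; then
    echo "fits ($used/7 SM slices)"
  else
    echo "does not fit ($used/7 SM slices)"
  fi
}

fits 3g.20gb 2g.10gb 1g.5gb   # 3+2+1 slices
fits 4g.20gb 4g.20gb          # 4+4 slices: over budget
```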

Profile Name   Fraction of Memory   Fraction of SMs   Hardware Units   Number of Instances Available
MIG 1g.5gb     1/8                  1/7               0 NVDECs         7
MIG 2g.10gb    2/8                  2/7               1 NVDEC          3
MIG 3g.20gb    4/8                  3/7               2 NVDECs         2
MIG 4g.20gb    4/8                  4/7               2 NVDECs         1
MIG 7g.40gb    Full                 7/7               5 NVDECs         1

MIG Prerequisites

The following prerequisites apply when using A100 in MIG mode.

  • Supported only on NVIDIA A100 products and associated systems using A100 (See the NVIDIA-Certified Systems.)

  • Requires CUDA 11 and NVIDIA AI Enterprise driver 450.73 or greater

  • Requires CUDA 11 supported Linux operating system distribution

  • VMware vSphere 7 Update 2

  • SR-IOV is enabled in BIOS Settings

MIG is managed programmatically using NVIDIA Management Library (NVML) APIs or its command-line interface, nvidia-smi. Note that, for brevity, some of the nvidia-smi output in the following examples may be cropped to show the relevant sections of interest. For more information on the MIG commands, see the nvidia-smi man page or enter the command:

nvidia-smi mig --help

Enable MIG Mode

To support GPU instances with NVIDIA vGPU, a GPU must be configured with MIG mode enabled, and GPU instances must be created and configured on the physical GPU before the VM is configured. Optionally, you can create compute instances within the GPU instances. If you do not create compute instances within the GPU instances, they can be added later for individual vGPUs from within the guest VMs.

Ensure that the following prerequisites are met:

  • The NVIDIA AI Enterprise Host Software is installed on the hypervisor host.

  • General Prerequisites and MIG Prerequisites are met.

  • You have root user privileges on your hypervisor host machine.

  • The GPU is not configured for Passthrough in vCenter.

  • The GPU is not being used by any other processes, such as CUDA applications, monitoring applications, or the nvidia-smi command.

  1. Open a command shell as the root user on your hypervisor host machine. You can use a secure shell (SSH) for this purpose.

  2. Use the nvidia-smi command to determine whether MIG mode is enabled. By default, MIG mode is disabled. This example shows that MIG mode is disabled on GPU 0.

    $ nvidia-smi -i 0
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.63    Driver Version: 470.63       CUDA Version: N/A         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  A100-SXM4-40GB      Off  | 00000000:36:00.0 Off |                    0 |
    | N/A   29C    P0    62W / 400W |      0MiB / 40537MiB |      6%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    
  3. If MIG mode is disabled, enable it.

    $ nvidia-smi -i [gpu-ids] -mig 1
    

    gpu-ids

    A comma-separated list of GPU indexes, PCI bus IDs, or UUIDs that specifies the GPUs on which you want to enable MIG mode. If gpu-ids is omitted, MIG mode is enabled on all GPUs in the system.

    This example enables MIG mode on GPU 0.

    $ nvidia-smi -i 0 -mig 1
    Enabled MIG Mode for GPU 00000000:36:00.0
    All done.
    

    Note

    If another process uses the GPU, nvidia-smi fails and displays a warning message that MIG mode for the GPU is in the pending enable state. In this situation, stop all processes that are using the GPU and retry the command.

  4. Next, reset the GPU.

    nvidia-smi -i 0 --gpu-reset
    
  5. Query the GPUs on which you enabled MIG mode to confirm that MIG mode is enabled.

    This example queries GPU 0 for the PCI bus ID and MIG mode in comma-separated values (CSV) format.

    $ nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv
    pci.bus_id, mig.mode.current
    00000000:36:00.0, Enabled
    

    Note

    VMware vSphere creates the GPU instances automatically when a VM is assigned.
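The confirmation step can be scripted by parsing the CSV output; the following is a sketch against the sample output shown above:

```shell
# Parse the CSV confirmation query (sample output from the step above)
# and report the current MIG mode.
csv='pci.bus_id, mig.mode.current
00000000:36:00.0, Enabled'
mode=$(printf '%s\n' "$csv" | awk -F', ' 'NR==2 {print $2}')
echo "MIG mode: $mode"
```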

VM Configuration with a GPU Instance

To enable vGPU support and attach a GPU Instance for your virtual machine, you must edit the virtual machine settings.

  1. Confirm that the VM is powered down.

  2. Click the VM in the inventory window. Right-click the VM and select Edit Settings:

    ../_images/dg-gpu-part-05.png
  3. Click the Add New Device button. A dropdown menu of device types is displayed. Select PCI Device:

    ../_images/dg-gpu-part-07.png
  4. Select PCI Device to continue. The new PCI device shows that an NVIDIA vGPU device has been added.

    ../_images/dg-gpu-part-08.png
  5. Expand the New PCI device, click the GPU Profile dropdown, and select the appropriate configuration.

    ../_images/dg-gpu-part-09.png

    Note

    A standard compute profile consists of three components, for example grid_a100-40c. A MIG-specific vGPU profile has an extra digit that maps to the fraction shown in the output of nvidia-smi mig -lgip, for example grid_a100-4-20c.

  6. Click OK to complete the configuration.

  7. Boot the VM with the MIG GPU Instance.

    Important

    nvidia-smi does not support showing GPU utilization for MIG instances. However, it is supported with vGPU partitioning.
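The profile naming described in the note for step 5 can be decoded mechanically; the following is a sketch (the parsing itself is illustrative, while the grid_a100-<slices>-<memory>c pattern is from the note):

```shell
# Decode a MIG-backed vGPU profile name of the form
# grid_a100-<SM slices>-<memory>c (pattern from the note above).
profile=grid_a100-4-20c
IFS=- read -r board slices mem <<EOF
$profile
EOF
echo "board: $board, SM slices: $slices, memory: ${mem%c} GB"
```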

Adding Compute Instances from Within the Guest VM

You can add the compute instances for an individual vGPU from within the guest VM. If you want to replace the compute instances created when the GPU was configured for MIG-backed vGPUs, you can delete them before adding the compute instances from within the guest VM. Ensure that the following prerequisites are met:

  • You have root user privileges on the guest VM.

  • The GPU instance is not being used by any other processes, such as CUDA applications, monitoring applications, or the nvidia-smi command.

  1. List the compute instances that can be created in a guest VM command shell:

    $ nvidia-smi mig -lcip
    
    ../_images/dg-gpu-part-10.png
  2. Create the compute instances that you need within each GPU instance.

    $ nvidia-smi mig -cci -gi <gpu-instance-ids>
    

    Where <gpu-instance-ids> is a comma-separated list of GPU instance IDs that specifies the GPU instances within which you want to create the compute instances. For example, to create a compute instance with profile #2 (3g.20gb) on GPU instance 0:

    $ nvidia-smi mig -cci 2 -gi 0
    

    If you want to create multiple compute instances and run apps in parallel, see the user guide for more complex scenarios.

Important

To avoid an inconsistent state between a guest VM and the hypervisor host, do not create compute instances from the hypervisor on a GPU instance on which an active guest VM is running. Instead, create the compute instances from within the guest VM as explained in NVIDIA vGPU Software: Modifying a MIG-Backed vGPUs Configuration.

Optional: Update Containers for a MIG-Enabled vGPU

To run containers on a MIG-enabled vGPU, you need to update the nvidia-docker2 package. Follow the instructions here.

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
    && distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list \
    && sudo apt-get update

$ sudo apt-get install -y nvidia-docker2 \
    && sudo systemctl restart docker

To test the installation of the NVIDIA Container Toolkit on the VM, execute the command below:

sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-GPU-786035d5-1e85-11b2-9fec-ac9c9a792daf/0/0 nvidia/cuda nvidia-smi

Note

NVIDIA_VISIBLE_DEVICES supports two formats to specify MIG devices:

  • MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>

  • <GPUDeviceIndex>:<MIGDeviceIndex>
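For the first format, the identifier can be assembled from the GPU UUID and the instance IDs; the following is a sketch using the illustrative UUID from the docker command above:

```shell
# Assemble a MIG device identifier for NVIDIA_VISIBLE_DEVICES
# (format: MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>).
gpu_uuid=GPU-786035d5-1e85-11b2-9fec-ac9c9a792daf
gi_id=0   # GPU instance ID
ci_id=0   # compute instance ID
echo "MIG-${gpu_uuid}/${gi_id}/${ci_id}"
```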

Destroying GPU Instances

When a VM is unassigned from a GPU instance, the instance should be destroyed automatically; however, it may sometimes be necessary to destroy the GPU instance manually using nvidia-smi.

The following example shows how the CIs and GIs created in the previous examples can be destroyed.

$ sudo nvidia-smi mig -dci -ci 0,1,2 -gi 1
Successfully destroyed compute instance ID  0 from GPU  0 GPU instance ID  1
Successfully destroyed compute instance ID  1 from GPU  0 GPU instance ID  1
Successfully destroyed compute instance ID  2 from GPU  0 GPU instance ID  1

Verify that the MIG devices have now been torn down:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |                      | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

GPU Aggregation for NVIDIA AI Enterprise

NVIDIA AI Enterprise supports GPU aggregation, where a VM can access more than one GPU, as required by certain compute-intensive workloads. NVIDIA AI Enterprise also supports peer-to-peer computing. The following sections describe both technologies and how to deploy GPU aggregation within VMware ESXi.

Page Retirement and ECC

NVIDIA AI Enterprise supports ECC and dynamic page retirement. This feature "retires" bad frame buffer memory cells by retiring the page the cell belongs to. Dynamic page retirement happens automatically for cells that are degrading in quality. This feature can improve the longevity of an otherwise good board and is thus a vital resiliency feature on supported products, especially in HPC and enterprise environments. Pages may only be retired while ECC is enabled; however, once a page has been retired, it remains permanently blacklisted even if ECC is later disabled. Refer to the NVIDIA Developer Zone page retirement documentation for more information.

These page retirement and ECC features are offered on all GPUs that are supported on NVIDIA AI Enterprise.

Disabling and Enabling ECC Memory

Specific GPUs that support NVIDIA AI Enterprise software support error-correcting code (ECC) memory. ECC memory improves data integrity by detecting and handling double-bit errors. However, not all GPUs, vGPU types, and hypervisor software versions support ECC memory with NVIDIA vGPU.

On GPUs that support ECC memory with NVIDIA vGPU, ECC memory is supported with C-series vGPUs, but not with A-series and B-series vGPUs. On physical GPUs that do not have HBM2 memory, enabling ECC reduces the amount of frame buffer usable by vGPUs; all types of vGPU are affected, not just vGPUs that support ECC memory.

The effects of enabling ECC memory on a physical GPU are as follows:

  • ECC memory is exposed as a feature on all supported vGPUs on the physical GPU.

  • In VMs that support ECC memory, ECC memory is enabled, with the option to disable ECC in the VM.

  • ECC memory can be enabled or disabled for individual VMs. Enabling or disabling ECC memory in a VM does not affect the amount of frame buffer usable by vGPUs.

This task depends on whether you are changing ECC memory settings for a physical GPU or a vGPU.

  • For a physical GPU, perform this task from the hypervisor host.

  • For a vGPU, perform this task from the VM to which the vGPU is assigned.

    Note

    ECC memory must be enabled on the physical GPU on which the vGPUs reside.

Disabling ECC Memory

Before you begin, ensure that NVIDIA AI Enterprise Host Software is installed on your hypervisor. If you are changing ECC memory settings for a vGPU, ensure that the NVIDIA AI Enterprise software graphics driver is installed in the VM to which the vGPU is assigned. Please refer to Creating Your First NVIDIA AI Enterprise VM if you do not already have a VM to work with.

  1. Use nvidia-smi to list the status of all physical GPUs or vGPUs and check for ECC noted as enabled.

    # nvidia-smi -q
    
    ==============NVSMI LOG==============
    
    Timestamp                           : Mon Jul 13 18:36:45 2020
    Driver Version                      : 450.55
    
    Attached GPUs                       : 1
    GPU 0000:02:00.0
    
    [...]
    
        Ecc Mode
            Current                     : Enabled
            Pending                     : Enabled
    
    [...]
    
  2. Change the ECC status to off for each GPU for which ECC is enabled.

    • If you want to change the ECC status to off for all GPUs on your host machine or vGPUs assigned to the VM, run this command:

      # nvidia-smi -e 0
      
    • If you want to change the ECC status to off for a specific GPU or vGPU, run this command:

      # nvidia-smi -i id -e 0
      

    id is the index of the GPU or vGPU as reported by nvidia-smi. This example disables ECC for the GPU with index 0000:02:00.0.

    # nvidia-smi -i 0000:02:00.0 -e 0
    
  3. Reboot the host or restart the VM.

  4. Confirm that ECC is now disabled for the GPU or vGPU.

    # nvidia-smi -q
    
    ==============NVSMI LOG==============
    
    Timestamp                           : Mon Jul 13 18:37:53 2020
    Driver Version                      : 450.55
    
    Attached GPUs                       : 1
    GPU 0000:02:00.0
    [...]
    
        Ecc Mode
            Current                     : Disabled
            Pending                     : Disabled
    
    [...]
    

Enabling ECC Memory

If ECC memory is suitable for your workloads and is supported by your hypervisor software and GPUs, but is disabled on your GPUs or vGPUs, enable it. This task depends on whether you are changing ECC memory settings for a physical GPU or a vGPU.

  • For a physical GPU, perform this task from the hypervisor host.

  • For a vGPU, perform this task from the VM to which the vGPU is assigned.

    Note

    ECC memory must be enabled on the physical GPU on which the vGPUs reside.

Before you begin, ensure that NVIDIA AI Enterprise Host Software is installed on your hypervisor. If you are changing ECC memory settings for a vGPU, ensure that the NVIDIA vGPU software graphics driver is installed in the VM to which the vGPU is assigned.

  1. Use nvidia-smi to list all physical GPUs’ or vGPUs’ status and check for ECC noted as disabled.

    # nvidia-smi -q
    
    ==============NVSMI LOG==============
    
    Timestamp                           : Mon Jul 13 18:36:45 2020
    Driver Version                      : 450.55
    
    Attached GPUs                       : 1
    GPU 0000:02:00.0
    
    [...]
    
        Ecc Mode
            Current                     : Disabled
            Pending                     : Disabled
    
    [...]
    
  2. Change the ECC status to on for each GPU or vGPU for which ECC is disabled.

    • If you want to change the ECC status to on for all GPUs on your host machine or vGPUs assigned to the VM, run this command:

      # nvidia-smi -e 1
      
    • If you want to change the ECC status to on for a specific GPU or vGPU, run this command:

      # nvidia-smi -i id -e 1
      

      id is the index of the GPU or vGPU as reported by nvidia-smi.

      This example enables ECC for the GPU with index 0000:02:00.0.

      # nvidia-smi -i 0000:02:00.0 -e 1
      
  3. Reboot the host or restart the VM.

  4. Confirm that ECC is now enabled for the GPU or vGPU.

    # nvidia-smi -q
    
    ==============NVSMI LOG==============
    
    Timestamp                           : Mon Jul 13 18:37:53 2020
    Driver Version                      : 450.55
    
    Attached GPUs                       : 1
    GPU 0000:02:00.0
    [...]
    
        Ecc Mode
            Current                     : Enabled
            Pending                     : Enabled
    
    [...]