DGX Cloud Admission Controller Installation Guide

The DGX Cloud Admission Controller microservice is a Kubernetes admission webhook that optimizes multi-node AI workloads for enhanced networking performance on cloud service providers. By installing the NeMo microservices with DGX Cloud Admission Controller, you can leverage advanced networking technologies including:

  • Elastic Fabric Adapter (EFA) on Amazon Web Services (AWS).

  • Remote Direct Memory Access (RDMA) on Microsoft Azure and Oracle Cloud Infrastructure (OCI).

  • TCP Offload (TCPXO) on Google Cloud Platform (GCP).

This optimization matters most for high-performance computing (HPC) tasks, especially distributed training, which depends on high-throughput, low-latency communication between nodes.

Cloud-specific Setup and Values Files

Choose the cloud provider you want to deploy to and follow the corresponding instructions below.

Deploy DGX Cloud Admission Controller on AWS to optimize networking performance using Elastic Fabric Adapter (EFA).

  1. Install Kyverno, a required dependency for the admission controller for AWS.

    helm upgrade -i kyverno kyverno/kyverno -n kyverno --create-namespace --version 3.1.5
    
  2. Download the values-dgxc-aws.yaml file. See Values for Amazon Web Services (AWS) for more details on the parameters.

    Configure the following parameters in the values file based on your AWS infrastructure:

    • Set gpuAllocatable to the number of GPUs per node.

    • Set efaAllocatable to the number of EFA NICs per node.

  3. Install the DGX Cloud Admission Controller.

    helm -n dgxc-admission-controller \
       install dgxc-admission-controller \
       dgxc-admission-controller-$DCXC_CHART_VERSION.tgz \
       --set dgxcController.image=nvidia/nemo-microservices/dgxc-admission-controller:$DCXC_CHART_VERSION \
       -f values-dgxc-aws.yaml
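
To choose values for gpuAllocatable and efaAllocatable in step 2, it can help to read what each node actually advertises. The sketch below assumes the NVIDIA device plugin exposes GPUs as `nvidia.com/gpu` and the AWS EFA device plugin exposes NICs as `vpc.amazonaws.com/efa`. Neither resource name comes from this guide, so adjust them if your cluster differs. The helper names are illustrative only.

```shell
#!/bin/sh
# Assumed extended-resource names (not taken from this guide).
GPU_RESOURCE='nvidia.com/gpu'
EFA_RESOURCE='vpc.amazonaws.com/efa'

# Escape dots so kubectl treats a resource name as a single map key.
escape_dots() {
  printf '%s' "$1" | sed 's/\./\\./g'
}

# Build the custom-columns spec: node name, allocatable GPUs, allocatable EFA NICs.
allocatable_columns() {
  printf 'NODE:.metadata.name,GPU:.status.allocatable.%s,EFA:.status.allocatable.%s\n' \
    "$(escape_dots "$GPU_RESOURCE")" "$(escape_dots "$EFA_RESOURCE")"
}

# Query the cluster only when kubectl is installed; ignore connection errors.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes --request-timeout=10s -o custom-columns="$(allocatable_columns)" || true
fi
```

Nodes that do not advertise a resource show `<none>` in that column, which is a quick hint that the corresponding device plugin is not running there.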
    

Deploy DGX Cloud Admission Controller on Azure to optimize networking performance using Remote Direct Memory Access (RDMA).

  1. Install Kyverno, a required dependency for the admission controller for Azure.

    helm upgrade -i kyverno kyverno/kyverno -n kyverno --create-namespace --version 3.1.5
    
  2. Download the values-dgxc-azure.yaml file. See Values for Azure for more details on the parameters.

    Configure the following parameters in the values file based on your Azure infrastructure:

    • Set rdmaResourcePerGpu to the number of NICs per GPU.

    • Set networkAttachmentDefinition to your network attachment definition name.

    • Set rdmaResourceName to the network resource name.

  3. Install the DGX Cloud Admission Controller.

    helm -n dgxc-admission-controller \
       install dgxc-admission-controller \
       dgxc-admission-controller-$DCXC_CHART_VERSION.tgz \
       --set dgxcController.image=nvidia/nemo-microservices/dgxc-admission-controller:$DCXC_CHART_VERSION \
       -f values-dgxc-azure.yaml
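
Step 1 refers to the chart as `kyverno/kyverno`, which presumes that a Helm repository named `kyverno` is already configured. A minimal sketch of that prerequisite follows, assuming the upstream Kyverno chart repository URL and a 300-second readiness timeout, neither of which is stated in this guide. The `run` and `kyverno_prereqs` helpers are hypothetical names. The script prints the commands by default; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Print commands by default; export DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

kyverno_prereqs() {
  # Register the repository that the kyverno/kyverno chart reference resolves against.
  # Assumption: upstream Kyverno chart repository URL.
  run helm repo add kyverno https://kyverno.github.io/kyverno/
  run helm repo update
  # Same install command as step 1.
  run helm upgrade -i kyverno kyverno/kyverno -n kyverno --create-namespace --version 3.1.5
  # Wait until every Deployment in the namespace reports Available.
  # Assumption: 300 seconds is long enough for the cluster.
  run kubectl -n kyverno wait --for=condition=Available deployment --all --timeout=300s
}

kyverno_prereqs
```

Waiting for the Kyverno Deployments before installing the admission controller avoids a race in which the controller's policies are submitted before Kyverno is ready to accept them.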
    

Deploy DGX Cloud Admission Controller on Oracle Cloud Infrastructure (OCI) to optimize networking performance using RDMA over Converged Ethernet (RoCE).

  1. Download the values-dgxc-oci.yaml file. See Values for Oracle Cloud Infrastructure (OCI) for more details on the parameters.

    Configure the following parameters in the values file based on your OCI infrastructure:

    • Set rdmaResourcePerGpu to the number of NICs per GPU.

    • Set networkAttachmentDefinition to your network attachment definition name.

    • Set rdmaResourceName to the network resource name.

  2. Install the DGX Cloud Admission Controller.

    helm -n dgxc-admission-controller \
       install dgxc-admission-controller \
       dgxc-admission-controller-$DCXC_CHART_VERSION.tgz \
       --set dgxcController.image=nvidia/nemo-microservices/dgxc-admission-controller:$DCXC_CHART_VERSION \
       -f values-dgxc-oci.yaml
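
rdmaResourcePerGpu in step 1 is a ratio: the RDMA NICs on a node divided by the GPUs on that node. The helper below is a minimal sketch of that arithmetic. The function name `rdma_per_gpu` is hypothetical, and the counts in the usage line are placeholders, not measured values for any OCI shape.

```shell
#!/bin/sh
# rdma_per_gpu NIC_COUNT GPU_COUNT
# Prints NICs per GPU, or fails when the counts do not divide evenly.
rdma_per_gpu() {
  nics="$1"
  gpus="$2"
  if [ "$gpus" -le 0 ] || [ $((nics % gpus)) -ne 0 ]; then
    echo "cannot derive a whole number of NICs per GPU from $nics NICs and $gpus GPUs" >&2
    return 1
  fi
  echo $((nics / gpus))
}

# Hypothetical node with 16 RDMA NICs and 8 GPUs.
rdma_per_gpu 16 8
```

Failing on an uneven split is a deliberate choice here: a fractional result usually means one of the two counts was read from the wrong node or resource.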
    

Deploy DGX Cloud Admission Controller on Google Cloud Platform (GCP) to optimize networking performance using TCP Offload (TCPXO).

  1. Download the values-dgxc-gcp.yaml file. See Values for Google Cloud Platform (GCP) for more details on the parameters.

  2. Install the DGX Cloud Admission Controller.

    helm -n dgxc-admission-controller \
       install dgxc-admission-controller \
       dgxc-admission-controller-$DCXC_CHART_VERSION.tgz \
       --set dgxcController.image=nvidia/nemo-microservices/dgxc-admission-controller:$DCXC_CHART_VERSION \
       -f values-dgxc-gcp.yaml
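
The install commands for the four providers differ only in the values file. The sketch below captures that pattern in one function and checks that `DCXC_CHART_VERSION` is set before anything is built. The function name `dgxc_install_cmd` and the version `0.0.0` are placeholders for illustration. The function only prints the command, so it can be reviewed before it is run.

```shell
#!/bin/sh
# dgxc_install_cmd PROVIDER
# Prints the helm install command from this guide for aws, azure, oci, or gcp.
dgxc_install_cmd() {
  provider="$1"
  case "$provider" in
    aws|azure|oci|gcp) ;;
    *) echo "unknown provider: $provider" >&2; return 1 ;;
  esac
  # Stop with a clear message when the version variable is unset or empty.
  if [ -z "${DCXC_CHART_VERSION:-}" ]; then
    echo "DCXC_CHART_VERSION is not set" >&2
    return 1
  fi
  v="$DCXC_CHART_VERSION"
  printf 'helm -n dgxc-admission-controller install dgxc-admission-controller dgxc-admission-controller-%s.tgz --set dgxcController.image=nvidia/nemo-microservices/dgxc-admission-controller:%s -f values-dgxc-%s.yaml\n' \
    "$v" "$v" "$provider"
}

# Placeholder version, used only when the variable is not already set.
DCXC_CHART_VERSION="${DCXC_CHART_VERSION:-0.0.0}"
dgxc_install_cmd gcp
```

An empty version variable would otherwise expand silently into a chart file name such as `dgxc-admission-controller-.tgz`, which Helm reports only as a missing file.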
    

Support Matrix

| Cloud Provider | High-Performance Networking | Details | Tested Environment |
|---|---|---|---|
| AWS | Elastic Fabric Adapter (EFA) | Requires a custom AMI for optimal performance. Supported through Kyverno. | p5.48xlarge |
| Azure | InfiniBand (RDMA) | - | Standard_ND96amsr_A100_v4 with NVIDIA-A100-SXM4-80GB |
| GCP | TCP-X, TCP-XO | - | a3-megagpu-8g with NVIDIA H100 80GB MEGA |
| OCI | RDMA (RoCE) | - | BM.GPU.A100 |