DGX Cloud Admission Controller Installation Guide
The DGX Cloud Admission Controller microservice is a Kubernetes admission webhook that optimizes multi-node AI workloads for enhanced networking performance on cloud service providers. By installing the NeMo microservices with DGX Cloud Admission Controller, you can leverage advanced networking technologies including:
Elastic Fabric Adapter (EFA) on Amazon Web Services (AWS).
Remote Direct Memory Access (RDMA) on Microsoft Azure and Oracle Cloud Infrastructure (OCI).
TCP Offload (TCPXO) on Google Cloud Platform (GCP).
This optimization is crucial for high-performance computing (HPC) tasks, especially in distributed training environments that require high throughput and low-latency communication.
Cloud-specific Setup and Values Files
Choose the cloud provider you want to deploy to and follow the corresponding instructions.
Deploy DGX Cloud Admission Controller on AWS to optimize networking performance using Elastic Fabric Adapter (EFA).
Install Kyverno, a required dependency for the admission controller on AWS.
helm upgrade -i kyverno kyverno/kyverno -n kyverno --create-namespace --version 3.1.5
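The chart reference kyverno/kyverno in the command above only resolves once a Helm repository named kyverno is registered locally. The following is a minimal sketch, assuming the repository is not yet configured and that the upstream Kyverno chart repository at https://kyverno.github.io/kyverno/ is the intended source (this guide does not name the source). The function name add_kyverno_repo is illustrative.

```shell
# Hypothetical helper: register the upstream Kyverno chart repository under the
# alias "kyverno" so that the chart reference kyverno/kyverno can be resolved.
add_kyverno_repo() {
  # Assumption: the upstream Kyverno repository is the intended chart source.
  helm repo add kyverno https://kyverno.github.io/kyverno/ || return 1
  # Refresh the local index so that the pinned chart version can be found.
  helm repo update
}
```

Run add_kyverno_repo once before the helm upgrade -i command.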
Download the values-dgxc-aws.yaml file. See Values for Amazon Web Services (AWS) for more details on the parameters.
Configure the following parameters in the values file based on your AWS infrastructure:
Set gpuAllocatable to the number of GPUs per node.
Set efaAllocatable to the number of EFA NICs per node.
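Before installing, it can help to confirm that the edited values file still contains the parameters listed above. The following is a minimal sketch. The helper name check_values_keys is hypothetical, and it deliberately makes no assumption about how the keys are nested inside the file: it only looks for each name as a YAML key on some uncommented line.

```shell
# Hypothetical helper: report which of the given parameter names do not appear
# as a key anywhere in a values file. Nesting of the keys is not assumed.
check_values_keys() {
  values_file="$1"
  shift
  missing=0
  for key in "$@"; do
    # Match the key at the start of a line, ignoring leading indentation.
    if ! grep -Eq "^[[:space:]]*${key}:" "$values_file"; then
      echo "missing: ${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_values_keys values-dgxc-aws.yaml gpuAllocatable efaAllocatable prints nothing and returns 0 when both keys are present. The check does not validate the values themselves.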
Install the DGX Cloud Admission Controller.
helm -n dgxc-admission-controller \
  install dgxc-admission-controller \
  dgxc-admission-controller-$DCXC_CHART_VERSION.tgz \
  --set dgxcController.image=nvidia/nemo-microservices/dgxc-admission-controller:$DCXC_CHART_VERSION \
  -f values-dgxc-aws.yaml
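The install command expands $DCXC_CHART_VERSION in both the chart archive name and the image tag, and it reads the chart archive and values file from the current directory. The following is a minimal pre-flight sketch that checks those three inputs before Helm is invoked. The function name preflight_install is hypothetical, and the guide does not say where the chart archive comes from, so the check only tests that the file exists.

```shell
# Hypothetical pre-flight check for the install command: the chart version
# variable must be set, and the chart archive and the values file must both
# exist in the current directory.
preflight_install() {
  values_file="$1"
  if [ -z "${DCXC_CHART_VERSION:-}" ]; then
    echo "DCXC_CHART_VERSION is not set" >&2
    return 1
  fi
  chart="dgxc-admission-controller-${DCXC_CHART_VERSION}.tgz"
  for f in "$chart" "$values_file"; do
    if [ ! -f "$f" ]; then
      echo "file not found: $f" >&2
      return 1
    fi
  done
  echo "ready: $chart with $values_file"
}
```

Calling preflight_install values-dgxc-aws.yaml returns a non-zero status and prints the reason to standard error if anything is missing.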
Deploy DGX Cloud Admission Controller on Azure to optimize networking performance using Remote Direct Memory Access (RDMA).
Install Kyverno, a required dependency for the admission controller for Azure.
helm upgrade -i kyverno kyverno/kyverno -n kyverno --create-namespace --version 3.1.5
Download the values-dgxc-azure.yaml file. See Values for Azure for more details on the parameters.
Configure the following parameters in the values file based on your Azure infrastructure:
Set rdmaResourcePerGpu to the number of NICs per GPU.
Set networkAttachmentDefinition to your network attachment definition name.
Set rdmaResourceName to the network resource name.
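If the values for networkAttachmentDefinition and rdmaResourceName are not known in advance, they can usually be read from the running cluster. The following is a minimal sketch, assuming that network attachments are Multus NetworkAttachmentDefinition objects (API group k8s.cni.cncf.io) and that an RDMA device plugin advertises its resource on the GPU nodes. Neither assumption is stated in this guide, and the function name show_rdma_inputs is hypothetical.

```shell
# Hypothetical helper: list candidate values for networkAttachmentDefinition
# and rdmaResourceName from a running cluster.
show_rdma_inputs() {
  node="$1"
  # Assumption: network attachments are Multus NetworkAttachmentDefinition objects.
  kubectl get network-attachment-definitions.k8s.cni.cncf.io --all-namespaces
  # Allocatable resources on the node. An RDMA device plugin, if installed,
  # advertises its resource name here alongside nvidia.com/gpu.
  kubectl get node "$node" -o jsonpath='{.status.allocatable}'
}
```

Pass the name of one GPU node, as shown by kubectl get nodes. The same lookup applies wherever these three parameters are configured.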
Install the DGX Cloud Admission Controller.
helm -n dgxc-admission-controller \
  install dgxc-admission-controller \
  dgxc-admission-controller-$DCXC_CHART_VERSION.tgz \
  --set dgxcController.image=nvidia/nemo-microservices/dgxc-admission-controller:$DCXC_CHART_VERSION \
  -f values-dgxc-azure.yaml
Deploy DGX Cloud Admission Controller on Oracle Cloud Infrastructure (OCI) to optimize networking performance using RDMA over Converged Ethernet (RoCE).
Download the values-dgxc-oci.yaml file. See Values for Oracle Cloud Infrastructure (OCI) for more details on the parameters.
Configure the following parameters in the values file based on your OCI infrastructure:
Set rdmaResourcePerGpu to the number of NICs per GPU.
Set networkAttachmentDefinition to your network attachment definition name.
Set rdmaResourceName to the network resource name.
Install the DGX Cloud Admission Controller.
helm -n dgxc-admission-controller \
  install dgxc-admission-controller \
  dgxc-admission-controller-$DCXC_CHART_VERSION.tgz \
  --set dgxcController.image=nvidia/nemo-microservices/dgxc-admission-controller:$DCXC_CHART_VERSION \
  -f values-dgxc-oci.yaml
Deploy DGX Cloud Admission Controller on Google Cloud Platform (GCP) to optimize networking performance using TCP Offload (TCPXO).
Download the values-dgxc-gcp.yaml file. See Values for Google Cloud Platform (GCP) for more details on the parameters.
Install the DGX Cloud Admission Controller.
helm -n dgxc-admission-controller \
  install dgxc-admission-controller \
  dgxc-admission-controller-$DCXC_CHART_VERSION.tgz \
  --set dgxcController.image=nvidia/nemo-microservices/dgxc-admission-controller:$DCXC_CHART_VERSION \
  -f values-dgxc-gcp.yaml
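After the install command completes on any of the four providers, a quick check can confirm that the release deployed and that the webhook is registered. The following is a minimal sketch using the release name and namespace from the install commands in this guide. The function name verify_admission_controller is hypothetical, and because the guide does not say whether the webhook is mutating or validating, both kinds are listed.

```shell
# Hypothetical post-install check, usable on any of the four providers.
verify_admission_controller() {
  ns="dgxc-admission-controller"
  # Release status as recorded by Helm.
  helm status dgxc-admission-controller -n "$ns" || return 1
  # Pods created by the release; they should reach the Running state.
  kubectl get pods -n "$ns" || return 1
  # Admission webhook registrations of both kinds in the cluster.
  kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
}
```

The output is informational only. Which webhook configuration name belongs to the controller depends on the chart and is not specified here.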
Support Matrix
Cloud Provider | High-Performance Networking | Details | Tested Environment |
---|---|---|---|
AWS | Elastic Fabric Adapter (EFA) | Needs a custom AMI for optimal performance. Supported through Kyverno. | p5.48xlarge |
Azure | InfiniBand (RDMA) | - | Standard_ND96amsr_A100_v4 with NVIDIA-A100-SXM4-80GB |
GCP | TCP-X, TCP-XO | - | a3-megagpu-8g with NVIDIA H100 80GB MEGA |
OCI | RDMA (RoCE) | - | BM.GPU.A100 |
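To compare a cluster against the tested environments in the table, the node instance type and GPU model can be read from node labels. The following is a minimal sketch, assuming that nodes carry the well-known node.kubernetes.io/instance-type label and that NVIDIA GPU Feature Discovery is installed and sets nvidia.com/gpu.product. If either label is absent, its column is empty. The function name show_node_hardware is hypothetical.

```shell
# Hypothetical helper: print each node with its instance type and GPU product
# labels as extra columns, for comparison with the support matrix.
show_node_hardware() {
  # Assumption: GPU Feature Discovery populates nvidia.com/gpu.product.
  kubectl get nodes -L node.kubernetes.io/instance-type,nvidia.com/gpu.product
}
```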