AWS-Based DGX Cloud Create Cluster Configuration#

This section provides specific details about configurations or customizations available in AWS-based DGX Cloud Create clusters.

Amazon Elastic Fabric Adapters#

DGX Cloud Create clusters on Amazon EKS provide Elastic Fabric Adapter (EFA) networking to enable high-speed distributed computing. DGXC customers can use this fabric to enable GPUDirect RDMA, NCCL, and MPI for distributed workloads.

While many container images built for distributed computing already bundle tools like MPI and NCCL, workloads that want to take advantage of EFA must use the Amazon-provided EFA/OFI/NCCL/MPI stack. To streamline the use of EFA, DGXC provides the Amazon stack, environment variables, and EFA devices automatically to pods launched as distributed MPIJob or PyTorchJob by mutating their pod definitions. This is referred to as Auto-Mounted EFA.

Taking Advantage of Auto-Mounted EFA#

The Auto-Mounted EFA feature consists of the following (a quick verification sketch follows this list):

  • A volume mount at /opt/amazon-efa-ofi.

  • Added container resource requests of vpc.amazonaws.com/efa: "32" and hugepages-2Mi: 5Gi.

  • LD_LIBRARY_PATH prefixed with various directories in /opt/amazon-efa-ofi.

  • Environment variables OPAL_PREFIX, NVIDIA_GDRCOPY, and FI_EFA_USE_DEVICE_RDMA are set.
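
As a quick sanity check, these mutations can be inspected from inside a running distributed-job pod. The sketch below assumes the Auto-Mounted EFA paths and variables listed above; the pod name is hypothetical:

kubectl exec pytorch-job-worker-0 -- sh -c '
  ls /opt/amazon-efa-ofi                # the mounted Amazon EFA/OFI/NCCL/MPI stack
  echo "$LD_LIBRARY_PATH"               # should be prefixed with /opt/amazon-efa-ofi directories
  env | grep -E "OPAL_PREFIX|NVIDIA_GDRCOPY|FI_EFA_USE_DEVICE_RDMA"
  ls /dev/infiniband                    # EFA devices backing the vpc.amazonaws.com/efa requests
'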

Some workloads will still require small modifications to take advantage of the mounted Amazon EFA stack:

  • Container images must be built against a C library (glibc) version 2.34 or later (generally, images based on at least Ubuntu 21.10 or ubi9).

  • If the distributed job is a PyTorchJob, generally no modifications are required. The Amazon stack will be found due to the value of LD_LIBRARY_PATH.

  • If the distributed job is an MPIJob, and the container entry point or scripting calls tools like mpirun, these need to be modified to call the provided /opt/amazon-efa-ofi/openmpi/bin/mpirun instead.

  • It may also be necessary to pass along key environment variables from the launcher to the worker nodes, such as /opt/amazon-efa-ofi/openmpi/bin/mpirun -x LD_LIBRARY_PATH -x OPAL_PREFIX -x FI_EFA_USE_DEVICE_RDMA (see the example after this list).
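
For example, a modified MPIJob launcher invocation might look like the following. This is a sketch only: the rank count and the training entry point are hypothetical placeholders.

# Call the Amazon-provided mpirun and forward the key environment variables to the workers.
/opt/amazon-efa-ofi/openmpi/bin/mpirun \
    -np 16 \
    -x LD_LIBRARY_PATH \
    -x OPAL_PREFIX \
    -x FI_EFA_USE_DEVICE_RDMA \
    python /workspace/train.py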

Troubleshooting#

User wants to confirm that EFA is being used:

This can usually be done by setting the environment variable NCCL_DEBUG=INFO for workloads that use NCCL. When the workload sends messages, it should print log lines like:

NCCL INFO Channel 13/0 : 2[2] -> 10[2] [send] via NET/AWS Libfabric/2/GDRDMA
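
For example, with NCCL_DEBUG=INFO set on the workload, the transport can be confirmed from the pod logs. The pod name below is hypothetical:

kubectl logs pytorch-job-worker-0 | grep "via NET/AWS Libfabric"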

User pod fails to start, and the logs may display the following:

mpirun: /lib/x86_64-linux-gnu/libc.so.6: version 'GLIBC_2.34' not found
(required by /opt/amazon-efa-ofi/openmpi/lib/libopen-rte.so.40)

This means the container image’s C toolchain is too old to take advantage of Auto-Mounted EFA, or the wrong mpirun binary is being called. To remediate:

  • Update the image or entrypoint script, which may involve tracking down base image updates.

  • Alternatively, compile and bundle the Amazon EFA stack into the image manually and opt out of Auto-Mounted EFA.

User is running distributed PyTorch and sees os.fork() throw an error:

OSError: Cannot allocate memory

Set the environment variable FI_EFA_USE_HUGE_PAGE=0. This causes a small performance hit but prevents fork failures due to the OS running out of huge pages. More info is available at the AWS OFI NCCL repository.
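
One way to apply this is to export the variable in the container entrypoint or launch script before the training command runs. This is a sketch; the training command is a hypothetical placeholder:

# Disable huge-page allocations in the EFA provider to avoid fork failures.
export FI_EFA_USE_HUGE_PAGE=0
python /workspace/train.py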

User wants to confirm the C toolchain version in their image:

This can be done by locating and running libc.so.6 inside the image. Example:

$ docker run -it --entrypoint bash redhat/ubi9

[root@54fbddf47d15 /]# find / -name libc.so.6
/usr/lib64/libc.so.6

[root@54fbddf47d15 /]# /usr/lib64/libc.so.6
GNU C Library (GNU libc) stable release version 2.34.

EFA is observed in the launcher pod even when the worker GPU request is fewer than 8:

Due to an implementation detail, the DGX Cloud Create infrastructure that automatically mutates a pod for EFA cannot distinguish whether the launcher pod is part of a distributed training job whose workers have requested a full node (8 GPUs) or not. In general, EFA is not supported for partial-node workloads (requests of fewer than 8 GPUs). See the Opt-Out section next for details on completely disabling any EFA injection.

Opt-Out#

In some cases, users may want to disable Auto-Mounted EFA entirely. This is useful when:

  • The container image’s C toolchain is incompatible.

  • The image bundles its own build of Amazon EFA.

  • The user wants to opt out of EFA for any reason.

This is achieved by adding an annotation to workloads: disable-auto-efa: "true". How to add this annotation depends on how the workload is submitted:

  • NVIDIA Run:ai CLI

    runai submit-dist  --annotation "disable-auto-efa=true"
    
  • NVIDIA Run:ai UI

    When submitting a new distributed training workload, find the General section and select + Annotation.

  • YAML Files

    Add disable-auto-efa: "true" to the metadata.annotations section of any pod specs (a verification sketch follows this list).
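
Whichever method is used, the opt-out annotation can be verified on a running pod. This is a sketch; the pod name is hypothetical:

kubectl get pod mpijob-launcher-0 -o jsonpath="{.metadata.annotations['disable-auto-efa']}"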

(Optional) Private Access#

AWS Transit Gateway#

Introduction#

NVIDIA Run:ai Private Access enables access to the NVIDIA DGX Cloud Create deployment without traversing the public internet. In this configuration, user access to the dedicated cluster and the NVIDIA Run:ai control plane is routed through a private connection that never leaves the AWS network.

Assumptions#

This document assumes that the following is already configured and accessible.

Getting Started#

Share your existing Transit Gateway with the NVIDIA-managed AWS account where your cluster is deployed, using AWS Resource Access Manager (RAM):

  • Enable resource sharing within AWS Organizations

  • Create a resource share

Create an AWS PrivateLink connection to route traffic from the customer VPC to the NVIDIA Run:ai control plane over the AWS network:

  • Create a VPC endpoint

  • Create Route 53 Hosted Zone for run.ai

  • Create DNS records for *.run.ai

(Example) AWS Transit Gateway Private Access Deployment#

(Diagram: DGX Cloud Create private access deployment using AWS Transit Gateway and PrivateLink)

Prior to starting the AWS Transit Gateway configuration process, make sure you have all the required information and assets. If necessary, please contact the NVIDIA Technical Account Management team:

Information about the pre-provisioned NVIDIA Run:ai on AWS cluster:

  • AWS Account identifier ($DGXC_ACCOUNT_ID, 12-digit number)

  • AWS Region (e.g. us-east-1)

  • Subnet CIDRs used for the VPC in which the cluster is located ($RAI_CIDR e.g. 10.0.3.0/24, and 10.0.4.0/24)

Information on NVIDIA Run:ai access:

  • Researcher configuration file (aka kubeconfig)

In addition, you will need the following information about the AWS account from which you will create the Private Access connection. If necessary, consult the AWS console for that account, or contact your network administrator for assistance:

  • AWS VPC ($VPC_ID)

  • AWS VPC CIDR ($VPC_CIDR); must not overlap with the VPC CIDR of the DGX Cloud Create cluster ($RAI_CIDR)

  • Private subnet with outbound internet connectivity ($SUBNET_ID)

  • AWS Region ($REGION e.g. us-east-1)

  1. Create Transit Gateway

    aws ec2 create-transit-gateway \
        --description "Run-ai Transit Gateway" \
        --amazon-side-asn 65000 \
        --options AutoAcceptSharedAttachments=enable \
        --tag-specifications "ResourceType=transit-gateway,Tags=[{Key=Name,Value=dgxc-runai-tgw}]"
    

    This command will output the TransitGatewayId and TransitGatewayArn ($TGW_ID, $TGW_ARN).
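
    Optionally, the new Transit Gateway's ID and ARN can be captured for later steps. This is a sketch that looks the gateway up by the Name tag applied above:

    TGW_ID=$(aws ec2 describe-transit-gateways \
        --filters "Name=tag:Name,Values=dgxc-runai-tgw" \
        --query 'TransitGateways[0].TransitGatewayId' --output text)
    TGW_ARN=$(aws ec2 describe-transit-gateways \
        --filters "Name=tag:Name,Values=dgxc-runai-tgw" \
        --query 'TransitGateways[0].TransitGatewayArn' --output text)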

  2. Create Transit Gateway VPC Attachment

    Using TransitGatewayId and VpcId, create the attachment:

    aws ec2 create-transit-gateway-vpc-attachment \
        --transit-gateway-id $TGW_ID \
        --vpc-id $VPC_ID \
        --subnet-ids $SUBNET_ID \
        --tag-specifications "ResourceType=transit-gateway-attachment,Tags=[{Key=Name,Value=dgxc-runai-tgw-att}]"
    

    This command will output the TransitGatewayAttachmentId ($TGW_ATT_ID).
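
    Optionally, capture the attachment ID for later use. This is a sketch that looks the attachment up by its Name tag:

    TGW_ATT_ID=$(aws ec2 describe-transit-gateway-vpc-attachments \
        --filters "Name=tag:Name,Values=dgxc-runai-tgw-att" \
        --query 'TransitGatewayVpcAttachments[0].TransitGatewayAttachmentId' --output text)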

  3. Create Transit Gateway Route Table

    Create a route table for the Transit Gateway:

    aws ec2 create-transit-gateway-route-table \
        --transit-gateway-id $TGW_ID \
        --tag-specifications "ResourceType=transit-gateway-route-table,Tags=[{Key=Name,Value=dgxc-runai-tgw-rt}]"
    
  4. Create Transit Gateway Route

    After obtaining the TransitGatewayRouteTableId ($TGW_RT_ID), create a default route:

    aws ec2 create-transit-gateway-route \
        --destination-cidr-block "0.0.0.0/0" \
        --transit-gateway-route-table-id $TGW_RT_ID \
        --transit-gateway-attachment-id $TGW_ATT_ID
    
  5. Update VPC Route Tables

    For each VPC route table that needs to reach the NVIDIA Run:ai VPC (referred to here as $VPC_RT_ID; note this is the VPC route table, not the Transit Gateway route table), add a route to the NVIDIA Run:ai VPC CIDR via the Transit Gateway:

    aws ec2 create-route \
        --route-table-id $VPC_RT_ID \
        --destination-cidr-block $RAI_CIDR \
        --transit-gateway-id $TGW_ID
    
  6. Create RAM Resource Share

    Create a Resource Access Manager (RAM) share:

    aws ram create-resource-share \
        --name dgxc-runai-tgw-ram \
        --allow-external-principals \
        --tags "Key=Name,Value=dgxc-runai-tgw-ram"
    

    This command will output the resource share ARN ($RAM_SHARE_ARN).
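
    Optionally, capture the resource share ARN. This is a sketch that looks the share up by name:

    RAM_SHARE_ARN=$(aws ram get-resource-shares \
        --resource-owner SELF \
        --name dgxc-runai-tgw-ram \
        --query 'resourceShares[0].resourceShareArn' --output text)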

  7. Associate Transit Gateway with RAM Resource Share

    aws ram associate-resource-share \
        --resource-share-arn $RAM_SHARE_ARN \
        --resource-arns $TGW_ARN
    
  8. Associate Principal with RAM Resource Share

    To associate a principal (here, the NVIDIA-managed AWS account, $DGXC_ACCOUNT_ID):

    aws ram associate-resource-share \
        --resource-share-arn $RAM_SHARE_ARN \
        --principals $DGXC_ACCOUNT_ID
    
  9. Create Security Group for Private Link

    Create a security group for the PrivateLink endpoint:

    aws ec2 create-security-group \
        --group-name dgxc-runai-pl \
        --description "Private Link Security Group" \
        --vpc-id $VPC_ID
    

    This command will return the security group ID ($PL_SG_ID). Now, add inbound and outbound rules:

    aws ec2 authorize-security-group-ingress \
        --group-id $PL_SG_ID \
        --protocol tcp \
        --port 443 \
        --cidr $VPC_CIDR
    
    aws ec2 authorize-security-group-egress \
        --group-id $PL_SG_ID \
        --protocol "-1" \
        --cidr "0.0.0.0/0"
    
  10. Create VPC Endpoint

    Create the VPC endpoint, using the endpoint service name for the NVIDIA Run:ai control plane (referred to here as $PL_SERVICE_NAME; contact the NVIDIA Technical Account Management team if you do not have this value):

    aws ec2 create-vpc-endpoint \
        --vpc-id $VPC_ID \
        --service-name $PL_SERVICE_NAME \
        --vpc-endpoint-type Interface \
        --subnet-ids $SUBNET_ID \
        --security-group-ids $PL_SG_ID \
        --tag-specifications "ResourceType=vpc-endpoint,Tags=[{Key=Name,Value=dgxc-runai-pl}]"
    

    This command will return the endpoint's DNS name ($VPC_PT_DNS).
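
    Optionally, capture the endpoint DNS name for the Route 53 record created below. This is a sketch that looks the endpoint up by its Name tag:

    VPC_PT_DNS=$(aws ec2 describe-vpc-endpoints \
        --filters "Name=tag:Name,Values=dgxc-runai-pl" \
        --query 'VpcEndpoints[0].DnsEntries[0].DnsName' --output text)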

  11. Create Route 53 Hosted Zone

    Create the private hosted zone (create-hosted-zone does not accept tags; if needed, tag the zone afterward with route53 change-tags-for-resource):

    aws route53 create-hosted-zone \
        --name "run.ai" \
        --vpc VPCRegion=$REGION,VPCId=$VPC_ID \
        --caller-reference dgxc-runai-dns-req1
    

    This command will return the hosted zone ID ($DNS_ZONE_ID).
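
    Optionally, capture the hosted zone ID. This is a sketch; list-hosted-zones-by-name returns the ID with a /hostedzone/ prefix, which is stripped here:

    DNS_ZONE_ID=$(aws route53 list-hosted-zones-by-name \
        --dns-name run.ai \
        --query 'HostedZones[0].Id' --output text | sed 's|/hostedzone/||')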

  12. Create Route 53 Record Set

    Create the DNS record:

    aws route53 change-resource-record-sets \
        --hosted-zone-id $DNS_ZONE_ID \
        --change-batch '{
            "Changes": [{
                "Action": "CREATE",
                "ResourceRecordSet": {
                    "Name": "*.run.ai",
                    "Type": "CNAME",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": "'"$VPC_PT_DNS"'"}]
                }
            }]
        }'
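
    To verify the record, resolve a name under the wildcard from a host inside the VPC; it should return the VPC endpoint DNS name. This is a sketch, and the tenant hostname is a hypothetical placeholder:

    dig +short mytenant.run.ai CNAME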