AWS-Based Run:ai on DGX Cloud Cluster Configuration#

This section provides specific details about configurations or customizations available in the AWS-based Run:ai on DGX Cloud clusters.

Amazon Elastic Fabric Adapters#

Run:ai on DGX Cloud clusters in Amazon EKS provide Elastic Fabric Adapters (EFA) to enable high-speed distributed computing. Run:ai on DGX Cloud customers can use this fabric to enable GPU Direct RDMA, NCCL, and MPI for distributed workloads.

While many container images built for distributed computing already bundle tools like MPI and NCCL, workloads that want to take advantage of EFA must use the Amazon-provided EFA/OFI/NCCL/MPI stack. To streamline the use of EFA, Run:ai on DGX Cloud provides instances that have EFA devices pre-configured.

Researchers can use the Extended Resources feature to request EFA and Hugepages for their workloads. For example:

  • hugepages-2Mi: 5Gi

  • vpc.amazonaws.com/efa: "32"
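Under the hood, these extended resources become Kubernetes container resource limits. A minimal sketch of the corresponding stanza (a config fragment for illustration only; the values mirror the examples above) looks like:

```yaml
# Illustrative container "resources" stanza for an EFA-enabled workload.
# In Run:ai these values are entered through the Extended Resources feature.
resources:
  limits:
    hugepages-2Mi: 5Gi
    vpc.amazonaws.com/efa: "32"
```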

Relevant environment variables for using EFA with MPI and NCCL include:

  • FI_EFA_USE_DEVICE_RDMA=1

  • FI_PROVIDER=efa

  • FI_EFA_USE_HUGE_PAGE=0

Setting FI_EFA_USE_HUGE_PAGE=0 may be necessary when running large multi-node PyTorch-based data loading jobs to prevent huge page resource exhaustion, though it incurs a slight performance penalty.
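As a sketch, these variables can be exported in the container's entrypoint or set through the workload's environment configuration (values mirror the list above; disabling huge pages is optional, as noted):

```shell
# Illustrative EFA environment for an NCCL/MPI workload.
export FI_PROVIDER=efa            # select the EFA libfabric provider
export FI_EFA_USE_DEVICE_RDMA=1   # enable GPU Direct RDMA
export FI_EFA_USE_HUGE_PAGE=0     # optional: avoid huge page exhaustion (slight perf cost)

# Verify the variables are visible to the workload:
env | grep '^FI_' | sort
```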

An example of launching a multi-node MPI NCCL test can be found in the Onboarding User Steps.

Troubleshooting#

User wants to confirm that EFA is being used:

This can usually be done by setting the environment variable NCCL_DEBUG=INFO for workloads that use NCCL. When the workload sends messages, it should print log lines like:

NCCL INFO Channel 13/0 : 2[2] -> 10[2] [send] via NET/AWS Libfabric/2/GDRDMA

User pod fails to start, logs may display the following:

mpirun: /lib/x86_64-linux-gnu/libc.so.6: version 'GLIBC_2.34' not found
(required by /opt/amazon-efa-ofi/openmpi/lib/libopen-rte.so.40)

This means the container image’s glibc is too old for the Amazon-provided EFA stack, or the wrong mpirun binary is being invoked. To remediate:

  • Update the image or entrypoint script, which may involve tracking down base image updates.

  • Alternatively, compile and bundle the Amazon EFA stack into the image manually.
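When deciding between the two remediations, it helps to know which glibc version a binary actually requires. A hypothetical check (using /bin/ls as a stand-in, since the EFA mpirun path may not exist locally):

```shell
# Print the highest GLIBC symbol version a binary references; this is the
# minimum glibc it needs at run time. Inside the image, substitute the EFA
# mpirun path (e.g. /opt/amazon-efa-ofi/openmpi/bin/mpirun) for /bin/ls.
grep -aoh 'GLIBC_[0-9.]*' /bin/ls | sort -Vu | tail -1
```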

User is running distributed PyTorch and os.fork() raises an error:

OSError: Cannot allocate memory

Set the environment variable FI_EFA_USE_HUGE_PAGE=0. This causes a small performance hit but prevents fork failures due to the OS running out of huge pages. More info is available at the AWS OFI NCCL repository.

User wants to confirm the C toolchain version in their image:

This can be done by locating and executing libc.so.6 inside the image, which prints its version. Example:

$ docker run -it --entrypoint bash redhat/ubi9

[root@54fbddf47d15 /]# find / -name libc.so.6
/usr/lib64/libc.so.6

[root@54fbddf47d15 /]# /usr/lib64/libc.so.6
GNU C Library (GNU libc) stable release version 2.34.

(Optional) Private Access#

AWS Transit Gateway#

Introduction#

NVIDIA Run:ai Private Access enables access to the NVIDIA Run:ai on DGX Cloud deployment without traversing the public internet. In this configuration, user access to the dedicated cluster and to the NVIDIA Run:ai control plane is routed through a private connection that never leaves the AWS network.

Assumptions#

This document assumes that the AWS networking prerequisites listed in the example deployment section below are already configured and accessible.

Getting Started#

Share your existing Transit Gateway with the NVIDIA-managed AWS account where your cluster is deployed, using AWS Resource Access Manager (RAM):

  • Enable resource sharing within AWS Organizations

  • Create a resource share

Create an AWS PrivateLink connection to route traffic from the customer VPC to the NVIDIA Run:ai control plane over the AWS network:

  • Create a VPC endpoint

  • Create Route 53 Hosted Zone for run.ai

  • Create DNS records for *.run.ai

(Example) AWS Transit Gateway Private Access Deployment#

_images/dgx-cloud-diagram_aws_tgw_private_link.png

Prior to starting the AWS Transit Gateway configuration process, make sure you have all the required information and assets. If necessary, contact the NVIDIA Technical Account Management team:

Information about the pre-provisioned NVIDIA Run:ai on AWS cluster:

  • AWS Account identifier ($DGXC_ACCOUNT_ID, 12-digit number)

  • AWS Region (e.g. us-east-1)

  • Subnet CIDRs used for the VPC in which the cluster is located ($RAI_CIDR e.g. 10.0.3.0/24, and 10.0.4.0/24)

Information on NVIDIA Run:ai access:

  • Researcher configuration file (aka kubeconfig)

In addition, you will need the following information about the AWS account from which you will create the Private Access connection. If necessary, consult the AWS console for that account, or contact your network administrator for assistance:

  • AWS VPC ($VPC_ID)

  • AWS VPC CIDR ($VPC_CIDR, e.g. 10.1.0.0/16); this must not overlap with the subnet CIDRs of the Run:ai on DGX Cloud cluster

  • Private subnet with outbound internet connectivity ($SUBNET_ID)

  • AWS Region ($REGION e.g. us-east-1)

  1. Create Transit Gateway

    aws ec2 create-transit-gateway \
        --description "Run-ai Transit Gateway" \
        --amazon-side-asn 65000 \
        --options AutoAcceptSharedAttachments=enable \
        --tag-specifications "ResourceType=transit-gateway,Tags=[{Key=Name,Value=dgxc-runai-tgw}]"
    

    This command will output the TransitGatewayId and TransitGatewayARN ($TGW_ID, $TGW_ARN)
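Rather than copying identifiers out of the JSON by hand, they can be captured into shell variables with the CLI's --query (JMESPath) option, shown here in a comment; the sed line below demonstrates the same extraction offline against an abbreviated sample response:

```shell
# Live usage (assumes a configured aws CLI):
#   TGW_ID=$(aws ec2 create-transit-gateway ... \
#       --query 'TransitGateway.TransitGatewayId' --output text)

# Offline illustration of the extraction against a sample response:
response='{"TransitGateway": {"TransitGatewayId": "tgw-0abc123def456", "State": "pending"}}'
TGW_ID=$(printf '%s' "$response" | sed -n 's/.*"TransitGatewayId": "\([^"]*\)".*/\1/p')
echo "$TGW_ID"   # tgw-0abc123def456
```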

  2. Create Transit Gateway VPC Attachment

    Using TransitGatewayId and VpcId, create the attachment:

    aws ec2 create-transit-gateway-vpc-attachment \
        --transit-gateway-id $TGW_ID \
        --vpc-id $VPC_ID \
        --subnet-ids $SUBNET_ID \
        --tag-specifications "ResourceType=transit-gateway-attachment,Tags=[{Key=Name,Value=dgxc-runai-tgw-att}]"
    

    This command will output the TransitGatewayAttachmentId ($TGW_ATT_ID)

  3. Create Transit Gateway Route Table

    Create a route table for the Transit Gateway:

    aws ec2 create-transit-gateway-route-table \
        --transit-gateway-id $TGW_ID \
        --tag-specifications "ResourceType=transit-gateway-route-table,Tags=[{Key=Name,Value=dgxc-runai-tgw-rt}]"
    
  4. Create Transit Gateway Route

    After obtaining the TransitGatewayRouteTableId ($TGW_RT_ID), create a default route:

    aws ec2 create-transit-gateway-route \
        --destination-cidr-block "0.0.0.0/0" \
        --transit-gateway-route-table-id $TGW_RT_ID \
        --transit-gateway-attachment-id $TGW_ATT_ID
    
  5. Update VPC Route Tables

    For VPCs that need to reach the NVIDIA Run:ai VPC, add a route to the NVIDIA Run:ai VPC CIDR in their route tables. Note that aws ec2 create-route expects a VPC route table ID ($VPC_RT_ID, the route table of your VPC), not the Transit Gateway route table ID created above:

    aws ec2 create-route \
        --route-table-id $VPC_RT_ID \
        --destination-cidr-block $RAI_CIDR \
        --transit-gateway-id $TGW_ID
    
  6. Create RAM Resource Share

    Create a Resource Access Manager (RAM) share:

    aws ram create-resource-share \
        --name dgxc-runai-tgw-ram \
        --allow-external-principals \
        --tags "Key=Name,Value=dgxc-runai-tgw-ram"
    

    This command will output the $RAM_SHARE_ARN

  7. Associate Transit Gateway with RAM Resource Share

    aws ram associate-resource-share \
        --resource-share-arn $RAM_SHARE_ARN \
        --resource-arns $TGW_ARN
    
  8. Associate Principal with RAM Resource Share

    To associate a principal (another AWS account or organization):

    aws ram associate-resource-share \
        --resource-share-arn $RAM_SHARE_ARN \
        --principals $DGXC_ACCOUNT_ID
    
  9. Create Security Group for Private Link

    For creating a Security Group:

    aws ec2 create-security-group \
        --group-name dgxc-runai-pl \
        --description "Private Link Security Group" \
        --vpc-id $VPC_ID
    

    This command will return $PL_SG_ID. Now, add inbound and outbound rules:

    aws ec2 authorize-security-group-ingress \
        --group-id $PL_SG_ID \
        --protocol tcp \
        --port 443 \
        --cidr $VPC_CIDR
    
    aws ec2 authorize-security-group-egress \
        --group-id $PL_SG_ID \
        --protocol "-1" \
        --cidr "0.0.0.0/0"
    
  10. Create VPC Endpoint

    For creating the VPC endpoint, use the PrivateLink service name provided by the NVIDIA team for your deployment ($PL_SERVICE_NAME):

    aws ec2 create-vpc-endpoint \
        --vpc-id $VPC_ID \
        --service-name $PL_SERVICE_NAME \
        --vpc-endpoint-type Interface \
        --subnet-ids $SUBNET_ID \
        --security-group-ids $PL_SG_ID \
        --tag-specifications "ResourceType=vpc-endpoint,Tags=[{Key=Name,Value=dgxc-runai-pl}]"
    

    This command will return the endpoint’s DNS name ($VPC_PT_DNS)

  11. Create Route 53 Hosted Zone

    For Route 53 private hosted zone creation (note that aws route53 create-hosted-zone does not accept tags directly; the zone can be tagged afterward with aws route53 change-tags-for-resource):

    aws route53 create-hosted-zone \
        --name "run.ai" \
        --vpc VPCRegion=$REGION,VPCId=$VPC_ID \
        --caller-reference dgxc-runai-dns-req1
    

    This command will return the hosted zone ID ($DNS_ZONE_ID)

  12. Create Route 53 Record Set

    Create the DNS record:

    aws route53 change-resource-record-sets \
        --hosted-zone-id $DNS_ZONE_ID \
        --change-batch '{
            "Changes": [{
                "Action": "CREATE",
                "ResourceRecordSet": {
                    "Name": "*.run.ai",
                    "Type": "CNAME",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": "'"$VPC_PT_DNS"'"}]
                }
            }]
        }'
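An alternative that avoids inline shell-quoting of JSON is to write the change batch to a file and pass it with file:// (the endpoint DNS name below is a hypothetical placeholder for the $VPC_PT_DNS value returned earlier):

```shell
# Hypothetical placeholder; substitute the real endpoint DNS name ($VPC_PT_DNS).
VPC_PT_DNS="vpce-0abc123-example.us-east-1.vpce.amazonaws.com"

# Write the change batch to a file so the variable expands cleanly:
cat > change-batch.json <<EOF
{
    "Changes": [{
        "Action": "CREATE",
        "ResourceRecordSet": {
            "Name": "*.run.ai",
            "Type": "CNAME",
            "TTL": 300,
            "ResourceRecords": [{"Value": "$VPC_PT_DNS"}]
        }
    }]
}
EOF

# Then pass the file to the CLI:
#   aws route53 change-resource-record-sets --hosted-zone-id $DNS_ZONE_ID \
#       --change-batch file://change-batch.json
```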