NVIDIA AI Enterprise customers have access to a pre-configured GPU Operator within the NVIDIA Enterprise Catalog. The GPU Operator is pre-configured to simplify the provisioning experience with NVIDIA AI Enterprise deployments.
The pre-configured GPU Operator differs from the GPU Operator in the public NGC catalog. The differences are:
It is configured to use a prebuilt vGPU driver image (only available to NVIDIA AI Enterprise customers).
It is configured to use the NVIDIA License System (NLS).
The GPU Operator with NVIDIA AI Enterprise requires some tasks to be completed prior to installation. Refer to the NVIDIA AI Enterprise documentation for instructions before running the commands below.
Add the NVIDIA AI Enterprise Helm repository, where api-key is the NGC API key that you generated for accessing the NVIDIA Enterprise Collection:
$ helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=api-key && helm repo update
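Optionally, you can confirm that the repository was added and list the available GPU Operator chart versions before installing; this check is not required by the procedure:
$ helm repo list | grep nvaie
$ helm search repo nvaie/gpu-operator --versions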
Install the GPU Operator:
$ helm install --wait --generate-name nvaie/gpu-operator -n gpu-operator
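If you need to deploy a specific chart version rather than the latest, helm install also accepts a --version flag. After installation, you can confirm that the release deployed successfully:
$ helm list -n gpu-operator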
License GPU Operator
Copy the NLS license token to a file named client_configuration_token.tok.
Create an empty gridd.conf file:
touch gridd.conf
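As an optional sanity check, confirm that both licensing files are present in your working directory before creating the ConfigMap:
$ ls -l gridd.conf client_configuration_token.tok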
Create a ConfigMap for NLS licensing:
kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok
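You can optionally verify that the ConfigMap contains both files:
$ kubectl describe configmap licensing-config -n gpu-operator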
Create a Kubernetes secret to access the NGC registry:
kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email='<YOUR EMAIL>' -n gpu-operator
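You can optionally confirm that the secret was created (the credential data itself is stored base64-encoded and is not displayed):
$ kubectl get secret ngc-secret -n gpu-operator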
GPU Operator with RDMA
Prerequisites
Install the NVIDIA Network Operator to ensure that the MOFED drivers are installed.
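You can confirm that the MOFED driver pods deployed by the Network Operator are running before continuing; the nvidia-network-operator-resources namespace matches the one used later in this section:
$ kubectl get pods -n nvidia-network-operator-resources | grep mofed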
After the NVIDIA Network Operator installation is complete, run the command below to install the GPU Operator so that it loads the nv_peer_mem modules.
$ helm install --wait gpu-operator nvaie/gpu-operator -n gpu-operator --set driver.rdma.enabled=true
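To confirm that the release was installed with RDMA enabled, you can inspect the values supplied to the Helm release; the output should include driver.rdma.enabled set to true:
$ helm get values gpu-operator -n gpu-operator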
Note that the GPU Operator installation can take a couple of minutes, depending on your internet speed. Verify that the GPU Operator pods are running with the following command:
kubectl get pods --all-namespaces | grep -v kube-system
Results:
NAMESPACE NAME READY STATUS RESTARTS AGE
default gpu-operator-1622656274-node-feature-discovery-master-5cddq96gq 1/1 Running 0 2m39s
default gpu-operator-1622656274-node-feature-discovery-worker-wr88v 1/1 Running 0 2m39s
default gpu-operator-7db468cfdf-mdrdp 1/1 Running 0 2m39s
gpu-operator-resources gpu-feature-discovery-g425f 1/1 Running 0 2m20s
gpu-operator-resources nvidia-container-toolkit-daemonset-mcmxj 1/1 Running 0 2m20s
gpu-operator-resources nvidia-cuda-validator-s6x2p 0/1 Completed 0 48s
gpu-operator-resources nvidia-dcgm-exporter-wtxnx 1/1 Running 0 2m20s
gpu-operator-resources nvidia-dcgm-jbz94 1/1 Running 0 2m20s
gpu-operator-resources nvidia-device-plugin-daemonset-hzzdt 1/1 Running 0 2m20s
gpu-operator-resources nvidia-device-plugin-validator-9nkxq 0/1 Completed 0 17s
gpu-operator-resources nvidia-driver-daemonset-kt8g5 1/1 Running 0 2m20s
gpu-operator-resources nvidia-operator-validator-cw4j5 1/1 Running 0 2m20s
Please refer to the GPU Operator page on NGC for more information.
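Optionally, you can also confirm that GPU resources are now advertised on your worker nodes; replace <node-name> with one of your GPU nodes:
$ kubectl describe node <node-name> | grep nvidia.com/gpu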
Run the command below to list the Mellanox NICs and their status:
$ kubectl exec -it $(kubectl get pods -n nvidia-network-operator-resources | grep mofed | awk '{print $1}') -n nvidia-network-operator-resources -- ibdev2netdev
Output:
mlx5_0 port 1 ==> ens192f0 (Up)
mlx5_1 port 1 ==> ens192f1 (Down)
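Optionally, you can verify that the nv_peer_mem modules are loaded inside the driver container; this assumes a single driver pod, and the exact module name can vary between driver releases:
$ kubectl exec $(kubectl get pods -n gpu-operator-resources | grep nvidia-driver | awk '{print $1}') -n gpu-operator-resources -- lsmod | grep nv_peer_mem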
Edit the networkdefinition.yaml file:
$ nano networkdefinition.yaml
Create the network definition for IPAM and replace ens192f0 with the active Mellanox device for the master:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_device_a
  name: rdma-net-ipam
  namespace: default
spec:
  config: |-
    {
        "cniVersion": "0.3.1",
        "name": "rdma-net-ipam",
        "plugins": [
            {
                "ipam": {
                    "datastore": "kubernetes",
                    "kubernetes": {
                        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
                    },
                    "log_file": "/tmp/whereabouts.log",
                    "log_level": "debug",
                    "range": "192.168.111.0/24",
                    "type": "whereabouts"
                },
                "type": "macvlan",
                "master": "ens192f0",
                "vlan": 111
            },
            {
                "mtu": 1500,
                "type": "tuning"
            }
        ]
    }
If you do not have VLAN-based networking on the high-performance side, set "vlan": 0 in the network definition.
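After saving the file, apply the network attachment definition and confirm that it was created. The apply and verification commands are not shown elsewhere in this section, so treat them as a minimal sketch:
$ kubectl apply -f networkdefinition.yaml
$ kubectl get network-attachment-definitions -n default
A workload can then attach to this network through the standard Multus annotation and request the shared RDMA device resource; the pod name and image below are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net-ipam
spec:
  containers:
  - name: rdma-test
    image: <your-rdma-capable-image>
    resources:
      limits:
        rdma/rdma_shared_device_a: 1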