NVIDIA AI Enterprise 2.0 or later
NVIDIA AI Enterprise customers have access to a pre-configured GPU Operator within the NVIDIA Enterprise Catalog. It is pre-configured to simplify the provisioning experience for NVIDIA AI Enterprise deployments.
The pre-configured GPU Operator differs from the GPU Operator in the public NGC catalog. The differences are:
It is configured to use a prebuilt vGPU driver image (only available to NVIDIA AI Enterprise customers).
It is configured to use the NVIDIA License System (NLS).
The GPU Operator with NVIDIA AI Enterprise requires some tasks to be completed prior to installation. Refer to the NVIDIA AI Enterprise documentation for instructions before running the commands below.
License GPU Operator for CLS
Add the NVIDIA AI Enterprise Helm repository, where api-key is the NGC API key you generated for accessing the NVIDIA Enterprise Collection:
$ helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=api-key && helm repo update
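To confirm that the repository was added, you can list the charts it provides:

$ helm search repo nvaie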
Copy the NLS license token into a file named client_configuration_token.tok, then create an empty gridd.conf file using the command below.

$ touch gridd.conf
Create a ConfigMap for NLS licensing using the command below.

$ kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok
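To verify that both files were captured, inspect the ConfigMap:

$ kubectl describe configmap licensing-config -n gpu-operator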
Create a Kubernetes Secret for access to the NGC registry.

$ kubectl create secret docker-registry ngc-secret \
    --docker-server="nvcr.io/nvaie" \
    --docker-username='$oauthtoken' \
    --docker-password='<YOUR API KEY>' \
    --docker-email='<YOUR EMAIL>' \
    -n gpu-operator
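You can confirm that the secret was created (the credentials themselves are not displayed):

$ kubectl get secret ngc-secret -n gpu-operator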
Install the GPU Operator with the command below.
$ helm install --wait --generate-name nvaie/gpu-operator -n gpu-operator
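Once the driver pod is running, you can verify that a vGPU license was acquired from CLS. This is a sketch, assuming the driver daemonset pods carry the label app=nvidia-driver-daemonset, which the GPU Operator typically applies:

$ kubectl exec -n gpu-operator \
    $(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -n 1) \
    -- nvidia-smi -q | grep -i license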
License GPU Operator for DLS
Add the NVIDIA Helm repository and update it:

$ helm repo add nvidia https://nvidia.github.io/gpu-operator \
    && helm repo update
Prior to GPU Operator v1.9, the operator was installed in the default namespace while all operands were installed in the gpu-operator-resources namespace. Starting with GPU Operator v1.9, both the operator and operands are installed in the same namespace. The namespace is configurable and is determined during installation. For example, to install the GPU Operator in the gpu-operator namespace:
$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator
If a namespace is not specified during installation, all GPU Operator components will be installed in the default namespace.
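Whichever namespace you choose, you can confirm where the components landed after installation; for example, with the gpu-operator namespace used above:

$ kubectl get pods -n gpu-operator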
GPU Operator with RDMA
Prerequisites
Please install the NVIDIA Network Operator to ensure that the MOFED drivers are installed.
After the NVIDIA Network Operator installation is complete, execute the command below to install the GPU Operator and load the nv_peer_mem modules.
$ helm install --wait gpu-operator nvaie/gpu-operator -n gpu-operator --set driver.rdma.enabled=true
Please note that the installation of the GPU Operator can take a couple of minutes; how long it takes depends on your internet speed. Check the status of the pods with the command below.

$ kubectl get pods --all-namespaces | grep -v kube-system
Results:
NAMESPACE NAME READY STATUS RESTARTS AGE
default gpu-operator-1622656274-node-feature-discovery-master-5cddq96gq 1/1 Running 0 2m39s
default gpu-operator-1622656274-node-feature-discovery-worker-wr88v 1/1 Running 0 2m39s
default gpu-operator-7db468cfdf-mdrdp 1/1 Running 0 2m39s
gpu-operator-resources gpu-feature-discovery-g425f 1/1 Running 0 2m20s
gpu-operator-resources nvidia-container-toolkit-daemonset-mcmxj 1/1 Running 0 2m20s
gpu-operator-resources nvidia-cuda-validator-s6x2p 0/1 Completed 0 48s
gpu-operator-resources nvidia-dcgm-exporter-wtxnx 1/1 Running 0 2m20s
gpu-operator-resources nvidia-dcgm-jbz94 1/1 Running 0 2m20s
gpu-operator-resources nvidia-device-plugin-daemonset-hzzdt 1/1 Running 0 2m20s
gpu-operator-resources nvidia-device-plugin-validator-9nkxq 0/1 Completed 0 17s
gpu-operator-resources nvidia-driver-daemonset-kt8g5 1/1 Running 0 2m20s
gpu-operator-resources nvidia-operator-validator-cw4j5 1/1 Running 0 2m20s
Please refer to the GPU Operator page on NGC for more information.
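With the pods running, you can also confirm that the nv_peer_mem module was loaded on the node. This is a sketch, assuming the driver daemonset pods carry the label app=nvidia-driver-daemonset:

$ kubectl exec -n gpu-operator \
    $(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -n 1) \
    -- lsmod | grep nv_peer_mem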
Execute the command below to list the Mellanox NICs and their status:
$ kubectl exec -it $(kubectl get pods -n nvidia-network-operator-resources | grep mofed | awk '{print $1}') -n nvidia-network-operator-resources -- ibdev2netdev
Output:
mlx5_0 port 1 ==> ens192f0 (Up)
mlx5_1 port 1 ==> ens192f1 (Down)
Edit the networkdefinition.yaml file.

$ nano networkdefinition.yaml
Create the network definition for IPAM, replacing ens192f0 with the active Mellanox device for the master field.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_device_a
  name: rdma-net-ipam
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "rdma-net-ipam",
      "plugins": [
        {
          "ipam": {
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "log_file": "/tmp/whereabouts.log",
            "log_level": "debug",
            "range": "192.168.111.0/24",
            "type": "whereabouts"
          },
          "type": "macvlan",
          "master": "ens192f0",
          "vlan": 111
        },
        {
          "mtu": 1500,
          "type": "tuning"
        }
      ]
    }
If you do not have VLAN-based networking on the high-performance side, please set "vlan": 0.
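After saving the file, apply it and confirm that the network attachment definition was created:

$ kubectl apply -f networkdefinition.yaml
$ kubectl get network-attachment-definitions

A workload can then attach to this network through the k8s.v1.cni.cncf.io/networks annotation. The pod spec below is a minimal sketch: the pod name rdma-test-pod and the container image are illustrative placeholders, and the rdma/rdma_shared_device_a resource name is taken from the resourceName annotation in the network definition above.

apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod                    # hypothetical name, for illustration
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net-ipam
spec:
  containers:
  - name: rdma-test
    image: <YOUR RDMA-CAPABLE IMAGE>     # substitute your workload image
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]                # commonly required for RDMA memory registration
    resources:
      limits:
        rdma/rdma_shared_device_a: 1     # matches the resourceName annotation above
        nvidia.com/gpu: 1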