Manual Installation#
Attention
This is an alternative to the automated installation. If you have used the automation, move to the Reference Applications section.
Important
Ensure the basic prerequisites have been met before proceeding with the setup.
Cluster Installation#
Follow these steps to install Cloud Native Stack (CNS) on your system:
Clone the Cloud Native Stack repository from GitHub, and navigate to the playbooks directory.
git clone --branch v25.7.2 https://github.com/NVIDIA/cloud-native-stack.git
cd cloud-native-stack/playbooks
Edit the cns_version.yaml file to specify version 15.1:
cns_version: 15.1
Open the cns_values_15.1.yaml file and configure the following settings according to your requirements, as described below:
enable_gpu_operator: yes
enable_network_operator: yes
enable_rdma: yes
deploy_ofed: yes
storage: no
monitoring: yes
loadbalancer: no
loadbalancer_ip: ""
cns_validation: yes
enable_gpu_operator and enable_network_operator must be set to yes so that the NVIDIA GPU Operator and the NVIDIA Network Operator are deployed.
If the DOCA-OFED driver is already installed on your host system, set deploy_ofed to no.
To enable persistent storage, set storage to yes. This deploys the Local Path Provisioner and the NFS Provisioner as storage options.
To deploy the monitoring stack, set monitoring to yes. This deploys Prometheus and Grafana with GPU metrics. After the stack is installed, access Grafana at http://<node-ip>:32222 with the credentials admin/cns-stack.
To deploy MetalLB, set loadbalancer to yes and set loadbalancer_ip to the node/host IP address (for example, 10.117.20.50/32).
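If monitoring is enabled, you can later confirm that Grafana is reachable with a quick HTTP check once the stack is installed (a sketch; replace <node-ip> with your node's address):
# Expect an HTTP status code (for example, 200 or a redirect) once Grafana is up.
curl -s -o /dev/null -w "%{http_code}\n" http://<node-ip>:32222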
Modify the hosts file located at ./hosts to reflect your local machine configuration:
[master]
localhost ansible_connection=local
[nodes]
Edit the ./files/network-operator-value.yaml file (not ./files/network-operator-values.yaml) to disable NFD deployment by the network operator:
nfd:
  enabled: false
  deployNodeFeatureRules: false
Modify the NicClusterPolicy Custom Resource (CR) by replacing the file located at ./files/nic-cluster-policy.yaml with the following file, which removes the rdmaSharedDevicePlugin configuration:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  {% if deploy_ofed %}
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
  {% endif %}
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: ghcr.io/k8snetworkplumbingwg
      version: v1.6.2-update.1
      imagePullSecrets: []
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.1.0
      imagePullSecrets: []
    ipamPlugin:
      image: whereabouts
      repository: ghcr.io/k8snetworkplumbingwg
      version: v0.7.0
      imagePullSecrets: []
Ensure you are in the ~/cloud-native-stack/playbooks directory. Create a Python virtual environment (venv) in which to install CNS:
sudo apt update
sudo apt install python3-pip python3-venv sshpass -y
python3 -m venv ".cns"
Run the following command to activate the virtual environment and begin the CNS installation:
source .cns/bin/activate
pip install --upgrade pip
bash setup.sh install
Wait for ten to fifteen minutes for the installation to complete.
Note
During the installation of CNS, it may need to reboot the system, which can result in the following error:
TASK [reboot the system] *************************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "elapsed": 0, "msg": "Running reboot with local connection would reboot the control node.", "rebooted": false}

If this occurs, manually reboot the system. After it restarts, re-activate the virtual environment and re-run the CNS installation command.
To deactivate the virtual environment, run:
deactivate
Verify the installation:
Check the status of the node:
kubectl get nodes
NAME   STATUS   ROLES                  AGE   VERSION
h4m    Ready    control-plane,worker   8h    v1.32.2
Verify that all NVIDIA Network Operator and NVIDIA GPU Operator pods are running:
kubectl get pods --all-namespaces | grep -E "network-operator|nvidia-gpu-operator"
NAMESPACE             NAME                                                              READY   STATUS      RESTARTS   AGE
network-operator      cni-plugins-ds-rbz46                                              1/1     Running     0          17h
network-operator      kube-multus-ds-jdwz5                                              1/1     Running     0          17h
network-operator      mofed-ubuntu22.04-84df8f497b-ds-kgjdr                             1/1     Running     0          17h
network-operator      network-operator-84798648dc-jhd9l                                 1/1     Running     0          17h
network-operator      whereabouts-2hc6n                                                 1/1     Running     0          17h
nvidia-gpu-operator   gpu-feature-discovery-7w6pq                                       1/1     Running     0          17h
nvidia-gpu-operator   gpu-operator-1727018588-node-feature-discovery-gc-65c5f8cf45tlp   1/1     Running     0          17h
nvidia-gpu-operator   gpu-operator-1727018588-node-feature-discovery-master-56b7qsghn   1/1     Running     0          17h
nvidia-gpu-operator   gpu-operator-1727018588-node-feature-discovery-worker-rckps       1/1     Running     0          17h
nvidia-gpu-operator   gpu-operator-849f9c989-gr4sv                                      1/1     Running     0          17h
nvidia-gpu-operator   nvidia-container-toolkit-daemonset-cnkv8                          1/1     Running     0          17h
nvidia-gpu-operator   nvidia-cuda-validator-fg28g                                       0/1     Completed   0          17h
nvidia-gpu-operator   nvidia-dcgm-exporter-vqpl5                                        1/1     Running     0          17h
nvidia-gpu-operator   nvidia-device-plugin-daemonset-5v5md                              1/1     Running     0          17h
nvidia-gpu-operator   nvidia-driver-daemonset-gmbjq                                     2/2     Running     0          17h
nvidia-gpu-operator   nvidia-operator-validator-x8527                                   1/1     Running     0          17h
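As an optional extra check (a sketch, assuming a single GPU node), confirm that the GPU Operator advertises the GPU as an allocatable resource:
# Should print a non-zero count such as "1" once the device plugin is running.
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'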
Installing Operators#
Deploy Cert Manager#
Cert-Manager is a Kubernetes add-on that automates the management and issuance of TLS certificates from various issuing sources. Deploying Cert-Manager in the cluster allows TLS certificates to be automatically managed as Kubernetes secrets.
Add the Jetstack Helm repository:
helm repo add jetstack https://charts.jetstack.io
helm repo update
Use Helm to install Cert-Manager into the cert-manager namespace:
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.18.2 \
  --set crds.enabled=true
Wait for two to three minutes for the installation to complete.
Verify that all pods in the cert-manager namespace have the Ready status:
kubectl get pods -n cert-manager -o wide
NAME                                       READY   STATUS    RESTARTS   AGE     IP              NODE             NOMINATED NODE   READINESS GATES
cert-manager-56cc584bd4-r8jbs              1/1     Running   0          2m48s   192.168.34.32   h4m-dev-system   <none>           <none>
cert-manager-cainjector-7cfc74b84b-bpdc7   1/1     Running   0          2m48s   192.168.34.31   h4m-dev-system   <none>           <none>
cert-manager-webhook-784f6dd68-qt48v       1/1     Running   0          2m48s   192.168.34.30   h4m-dev-system   <none>           <none>
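Optionally, you can go beyond pod status and confirm that certificates can actually be issued. The following sketch, modeled on the upstream cert-manager verification flow, creates a throwaway namespace, a self-signed Issuer, and a test Certificate (all names below are placeholders, not part of this installation):
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: cert-manager-test
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: test-selfsigned
  namespace: cert-manager-test
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: selfsigned-cert
  namespace: cert-manager-test
spec:
  dnsNames:
    - example.com
  secretName: selfsigned-cert-tls
  issuerRef:
    name: test-selfsigned
EOF
# The Certificate should report READY=True within a few seconds:
kubectl get certificate -n cert-manager-test
# Remove the test resources afterwards:
kubectl delete namespace cert-manager-test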
Install SR-IOV Network Operator#
SR-IOV Network Operator is responsible for configuring the SR-IOV components in the cluster.
Clone the SR-IOV network operator repository from GitHub and navigate to the sriov-network-operator directory:
git clone --branch v1.5.0 \
  https://github.com/k8snetworkplumbingwg/sriov-network-operator.git
cd sriov-network-operator
Edit the deployment/sriov-network-operator-chart/values.yaml file (relative to the repository root) to enable the admission controller configuration. This configures the operator-webhook and the network-resources-injector for installation; both are disabled by default. Refer to the file below to view the configuration changes relative to the default values, including the specific recommended image versions.
Or replace the default values.yaml file with this one:
operator:
  tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
  nodeSelector: {}
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: In
                values: [""]
        - weight: 1
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: In
                values: [""]
  nameOverride: ""
  fullnameOverride: ""
  resourcePrefix: "openshift.io"
  cniBinPath: "/opt/cni/bin"
  clusterType: "kubernetes"
  # minimal amount of time (in minutes) the operator will wait before removing
  # stale SriovNetworkNodeState objects (objects that doesn't match node with the daemon)
  # "0" means no extra delay, in this case the CR will be removed by the next reconciliation cycle (may take up to five minutes)
  staleNodeStateCleanupDelayMinutes: "30"
  metricsExporter:
    port: "9110"
    certificates:
      secretName: "metrics-exporter-cert"
    prometheusOperator:
      enabled: false
      serviceAccount: "prometheus-k8s"
      namespace: "monitoring"
      deployRules: false
  admissionControllers:
    enabled: true
    certificates:
      secretNames:
        operator: "operator-webhook-cert"
        injector: "network-resources-injector-cert"
      certManager:
        # When enabled, makes use of certificates managed by cert-manager.
        enabled: true
        # When enabled, certificates are generated via cert-manager and then name will match the name of the secrets
        # defined above
        generateSelfSigned: true
      # If not specified, no secret is created and secrets with the names defined above are expected to exist in the
      # cluster. In that case, the ca.crt must be base64 encoded twice since it ends up being an env variable.
      custom:
        enabled: false
        # operator:
        #   caCrt: |
        #     -----BEGIN CERTIFICATE-----
        #     MIIMIICLDCCAdKgAwIBAgIBADAKBggqhkjOPQQDAjB9MQswCQYDVQQGEwJCRTEPMA0G
        #     ...
        #     -----END CERTIFICATE-----
        #   tlsCrt: |
        #     -----BEGIN CERTIFICATE-----
        #     MIIMIICLDCCAdKgAwIBAgIBADAKBggqhkjOPQQDAjB9MQswCQYDVQQGEwJCRTEPMA0G
        #     ...
        #     -----END CERTIFICATE-----
        #   tlsKey: |
        #     -----BEGIN EC PRIVATE KEY-----
        #     MHcl4wOuDwKQa+upc8GftXE2C//4mKANBC6It01gUaTIpo=
        #     ...
        #     -----END EC PRIVATE KEY-----
        # injector:
        #   caCrt: |
        #     -----BEGIN CERTIFICATE-----
        #     MIIMIICLDCCAdKgAwIBAgIBADAKBggqhkjOPQQDAjB9MQswCQYDVQQGEwJCRTEPMA0G
        #     ...
        #     -----END CERTIFICATE-----
        #   tlsCrt: |
        #     -----BEGIN CERTIFICATE-----
        #     MIIMIICLDCCAdKgAwIBAgIBADAKBggqhkjOPQQDAjB9MQswCQYDVQQGEwJCRTEPMA0G
        #     ...
        #     -----END CERTIFICATE-----
        #   tlsKey: |
        #     -----BEGIN EC PRIVATE KEY-----
        #     MHcl4wOuDwKQa+upc8GftXE2C//4mKANBC6It01gUaTIpo=
        #     ...
        #     -----END EC PRIVATE KEY-----

sriovOperatorConfig:
  # deploy sriovOperatorConfig CR with the below values
  deploy: true
  # node selectors for sriov-network-config-daemon
  configDaemonNodeSelector:
    beta.kubernetes.io/os: "linux"
    network.nvidia.com/operator.mofed.wait: "false"
  # log level for both operator and sriov-network-config-daemon
  logLevel: 2
  # disable node draining when configuring SR-IOV, set to true in case of a single node
  # cluster or any other justifiable reason
  disableDrain: false
  # sriov-network-config-daemon configuration mode. either "daemon" or "systemd"
  configurationMode: daemon
  # feature gates to enable/disable
  featureGates: {}

# Example for supportedExtraNICs values ['MyNIC: "8086 1521 1520"']
supportedExtraNICs: []

# Image URIs for sriov-network-operator components
images:
  operator: nvcr.io/nvidia/mellanox/sriov-network-operator:network-operator-25.4.0
  sriovConfigDaemon: nvcr.io/nvidia/mellanox/sriov-network-operator-config-daemon:network-operator-25.4.0
  sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.1
  ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.2.1
  ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.38.2
  rdmaCni: ghcr.io/k8snetworkplumbingwg/rdma-cni:v1.3.0
  sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.9.0
  resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.7.0
  webhook: nvcr.io/nvidia/mellanox/sriov-network-operator-webhook:network-operator-25.4.0
  metricsExporter: ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter
  metricsExporterKubeRbacProxy: gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0

imagePullSecrets: []
extraDeploy: []
Install the SR-IOV Network Operator. Make sure you are in the root of the cloned sriov-network-operator repository:
helm install sriov-network-operator ./deployment/sriov-network-operator-chart \
  -n sriov-network-operator \
  --create-namespace \
  --wait
Wait for one to two minutes for the operator installation to complete.
Verify that all pods in the sriov-network-operator namespace have the Ready status:
kubectl get pods -n sriov-network-operator
NAME                                      READY   STATUS    RESTARTS   AGE
network-resources-injector-m8dwx          1/1     Running   0          75s
operator-webhook-9wwjm                    1/1     Running   0          75s
sriov-network-config-daemon-j2xw7         1/1     Running   0          75s
sriov-network-operator-5bfc88d89c-bkrzz   1/1     Running   0          82s
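Optionally, before creating the custom resources listed below, you can confirm that the config daemon has generated a SriovNetworkNodeState object for your node (a quick sanity check):
kubectl -n sriov-network-operator get sriovnetworknodestates.sriovnetwork.openshift.io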
Create the following custom resources for a proper network configuration:
SriovNetworkNodePolicy (refer to Configure SR-IOV Network Node Policy)
SriovNetwork (refer to Configure SR-IOV Network)
Configure SR-IOV Network Node Policy#
Identify the node name using the following command:
kubectl get nodes --no-headers -o custom-columns=NAME:.metadata.name
Use the retrieved node name to replace <node_name> in the following steps.
Use the following command to identify the interface names corresponding to your NIC.
Replace <node_name> with the name of the node determined in step 1, and adjust the link-speed filter based on your network environment:
kubectl -n sriov-network-operator \
  get sriovnetworknodestates.sriovnetwork.openshift.io <node_name> -o json | \
  jq '.status.interfaces[] | select(.linkSpeed | test("^[1-9][0-9]{4,} Mb/s$")) | .name'
"enp3s0f0"
"enp3s0f1"
Create an SriovNetworkNodePolicy CR with the following content in the sriov_policy.yaml file:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: media-a-tx-pool
  namespace: sriov-network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/rdma.capable: "true"
  resourceName: media_a_tx_pool
  priority: 99
  mtu: 1500
  numVfs: 16
  nicSelector:
    pfNames: ["<interface_name_0>#0-15"]
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: media-a-rx-pool
  namespace: sriov-network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/rdma.capable: "true"
  resourceName: media_a_rx_pool
  priority: 99
  mtu: 1500
  numVfs: 16
  nicSelector:
    pfNames: ["<interface_name_1>#0-15"]
  deviceType: netdevice
  isRdma: true
Replace <interface_name_0> and <interface_name_1> in the above snippet with the interface names you obtained when identifying the interface names.
Create the SriovNetworkNodePolicy CRs using the following command:
kubectl apply -f sriov_policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/media-a-tx-pool created
sriovnetworknodepolicy.sriovnetwork.openshift.io/media-a-rx-pool created
Note
After applying this file, the system might become temporarily unreachable. The system will become accessible again; allow some time for all components to fully initialize.
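One way to track progress while the node is being reconfigured is to poll the node state's sync status (a sketch; the syncStatus field is reported by the SriovNetworkNodeState CR and moves from InProgress to Succeeded):
kubectl -n sriov-network-operator \
  get sriovnetworknodestates.sriovnetwork.openshift.io <node_name> \
  -o jsonpath='{.status.syncStatus}{"\n"}'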
Wait for one to two minutes for the sriov-device-plugin pod to have Ready status.
Check the pod status using the following command:
kubectl get pods -n sriov-network-operator
NAME                                      READY   STATUS    RESTARTS   AGE
network-resources-injector-m8dwx          1/1     Running   0          7m55s
operator-webhook-9wwjm                    1/1     Running   0          7m55s
sriov-device-plugin-dql8q                 1/1     Running   0          67s
sriov-network-config-daemon-j2xw7         1/1     Running   0          7m55s
sriov-network-operator-5bfc88d89c-bkrzz   1/1     Running   0          8m2s
Wait one to two minutes for virtual functions to get created.
Verify that the two pools each have a positive value for the node before proceeding. For example:
kubectl get node <node_name> -o json | \
  jq '.status.allocatable | with_entries(select(.key|test("^openshift.io/.+pool$")))'
{
  "openshift.io/media_a_rx_pool": "16",
  "openshift.io/media_a_tx_pool": "16"
}
Replace <node_name> with the name of the node determined in step 1.
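As an optional host-level cross-check (replace <interface_name_0> with one of the interfaces selected earlier), you can confirm that the virtual functions exist on the physical NIC:
# Should print 16, matching numVfs in the policy.
cat /sys/class/net/<interface_name_0>/device/sriov_numvfs
# Lists one "vf N" entry per virtual function.
ip link show <interface_name_0> | grep "vf "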
Configure SR-IOV Network#
This configuration requires that you create an sriov_network.yaml file that refers to the resourceName values
defined in the SriovNetworkNodePolicy.
Create an SriovNetwork CR for the chosen network interfaces.
In the example below, two networks are created for each port. The first is configured to use the Whereabouts plugin for dynamic IP Address Management (IPAM). The second is configured with static IPAM to allow manual and fixed assignment of IP addresses.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: media-a-tx-net
  namespace: sriov-network-operator
spec:
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.100.0/24",
      "exclude": [
        "192.168.100.0/26",
        "192.168.100.128/25"
      ]
    }
  networkNamespace: default
  resourceName: media_a_tx_pool
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: media-a-rx-net
  namespace: sriov-network-operator
spec:
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.100.0/24",
      "exclude": [
        "192.168.100.0/25",
        "192.168.100.128/26"
      ]
    }
  networkNamespace: default
  resourceName: media_a_rx_pool
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: media-a-tx-net-static
  namespace: sriov-network-operator
spec:
  ipam: |
    {
      "type": "static"
    }
  networkNamespace: default
  resourceName: media_a_tx_pool
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: media-a-rx-net-static
  namespace: sriov-network-operator
spec:
  ipam: |
    {
      "type": "static"
    }
  networkNamespace: default
  resourceName: media_a_rx_pool
The IP ranges for each network are listed in the following table:

Resource Name     Network Names             Static IPAM Range     Dynamic IPAM Range
media_a_tx_pool   media-a-tx-net(-static)   192.168.100.0-63      192.168.100.64-127
media_a_rx_pool   media-a-rx-net(-static)   192.168.100.128-191   192.168.100.192-255
Create the SriovNetwork CRs using the following command:
kubectl apply -f sriov_network.yaml
sriovnetwork.sriovnetwork.openshift.io/media-a-tx-net created
sriovnetwork.sriovnetwork.openshift.io/media-a-rx-net created
sriovnetwork.sriovnetwork.openshift.io/media-a-tx-net-static created
sriovnetwork.sriovnetwork.openshift.io/media-a-rx-net-static created
Execute the following command to validate successful creation of the SriovNetwork:
kubectl get network-attachment-definitions
NAME                    AGE
media-a-rx-net          5m48s
media-a-rx-net-static   5m48s
media-a-tx-net          5m48s
media-a-tx-net-static   5m48s
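For reference, a workload consumes one of these networks by naming it in the k8s.v1.cni.cncf.io/networks annotation and requesting the matching SR-IOV resource. The following is a minimal sketch only; the pod name and image are placeholders and not part of this installation:
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: sriov-test-pod                 # placeholder name for illustration
  annotations:
    k8s.v1.cni.cncf.io/networks: media-a-tx-net
spec:
  containers:
  - name: test
    image: ubuntu:22.04                # any general-purpose image works here
    command: ["sleep", "infinity"]
    resources:
      requests:
        openshift.io/media_a_tx_pool: "1"
      limits:
        openshift.io/media_a_tx_pool: "1"
EOF
# The secondary interface (typically net1) should be listed inside the pod:
kubectl exec sriov-test-pod -- cat /proc/net/dev
kubectl delete pod sriov-test-pod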
Advanced Configuration#
Configure CPU Manager#
The CPU management policy in Kubernetes enables exclusive CPU allocation for containers within Guaranteed Quality of Service (QoS) pods. To enable this, change the cpuManagerPolicy from none to static.
Identify the name of the node and drain it:
kubectl get nodes
kubectl drain --ignore-daemonsets <node_name>
Stop the Kubelet:
sudo systemctl stop kubelet
Remove the old CPU manager state file. By default, the path to this file is /var/lib/kubelet/cpu_manager_state. This clears the state maintained by the CPU Manager so that the cpusets created by the new policy won't conflict with it.
sudo rm /var/lib/kubelet/cpu_manager_state
Edit the Kubelet configuration file, /var/lib/kubelet/config.yaml, and add the following lines:
cpuManagerPolicy: "static"
reservedSystemCPUs: "0-1"
Start the Kubelet:
sudo systemctl start kubelet
Uncordon the node:
kubectl uncordon <node_name>
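To confirm that the new policy took effect, you can inspect the regenerated state file (a quick check; the file is JSON):
# Should contain "policyName":"static" after the Kubelet restart.
sudo cat /var/lib/kubelet/cpu_manager_state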
Configuring GPU Time-Slicing#
By default, on a workstation with a single GPU, only one container (and pod) that uses the GPU can be scheduled. However, time-slicing allows the same GPU to be shared among different containers (and pods) by creating replicas.
Create time-slicing-config-all.yaml based on the following example. Configure the number of time-sliced GPU replicas to make available for shared access, for example, 4:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
Add the config map to the same namespace as the GPU operator:
kubectl create -n nvidia-gpu-operator -f time-slicing-config-all.yaml
Configure the device plugin with the config map and set the default time-slicing configuration:
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n nvidia-gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all","default": "any"}}}}'
Verify the GPU replica count on the node, using the following command:
kubectl describe node | grep nvidia.com/gpu.replicas
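You can also confirm that the node now advertises the replicated GPU count as schedulable resources (a sketch, assuming the 4-replica configuration above):
# Should print "4" once the device plugin has restarted with the new configuration.
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'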
Performance Configuration#
The Rivermax SDK, which enables ST 2110 streaming, takes advantage of GPUDirect and uses huge pages for performance. Huge pages are a memory-management technique used in modern computer systems to improve performance for memory-intensive applications and large memory transfers. Enabling them is strongly recommended to achieve the best performance.
Check that the nvidia-peermem module is loaded:
lsmod | grep nvidia
nvidia_peermem         16384  0
nvidia_uvm           4956160  4
nvidia_drm            122880  3
nvidia_modeset       1355776  5 nvidia_drm
nvidia              54296576 87 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_uverbs             196608  3 nvidia_peermem,rdma_ucm,mlx5_ib
video                  73728  4 asus_wmi,amdgpu,asus_nb_wmi,nvidia_modeset
If the nvidia-peermem module is not loaded, run the following commands to load the module and make it load automatically at boot:
sudo modprobe nvidia-peermem
echo "nvidia-peermem" | sudo tee /etc/modules-load.d/nvidia-peermem.conf
Configure shmmax and HugePages:
Configure shmmax as needed (for example, set to 2 GiB):
sudo sysctl -w kernel.shmmax=2147483648
If huge pages are being set for the first time on the system, the setting can be made persistent across reboots using:
echo 'kernel.shmmax=2147483648' | sudo tee -a /etc/sysctl.conf
If the persistent setting needs to be updated, edit the existing value in /etc/sysctl.conf and apply the settings using:
sudo sysctl -p
Enable the HugePages allocation:
Check HugePage size:
cat /proc/meminfo | grep Hugepagesize
Hugepagesize:       2048 kB
Calculate and configure HugePages (for example, allocate 10 GiB).
Calculate the required number of HugePages based on the HugePage size (a small shell sketch of this calculation appears after the note below):
10 GiB = 10240 MiB
Number of HugePages = 10240 MiB / 2 MiB = 5120
Set HugePages:
sudo sysctl -w vm.nr_hugepages=5120
If huge pages are being set for the first time on the system, the setting can be made persistent across reboots using:
echo 'vm.nr_hugepages=5120' | sudo tee -a /etc/sysctl.conf
If the persistent setting needs to be updated, edit the existing value in /etc/sysctl.conf and apply the settings using:
sudo sysctl -p
Note
Allocating HugePages reserves 10 GiB of memory, reducing the total memory available for other applications.
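The calculation above can also be scripted. The following sketch (variable names are illustrative) derives vm.nr_hugepages from the desired allocation and the HugePage size reported by the system:
DESIRED_GIB=10                                                  # total memory to reserve as HugePages
HUGEPAGE_KB=$(awk '/Hugepagesize/ {print $2}' /proc/meminfo)    # page size in kB (2048 for 2 MiB pages)
echo $(( DESIRED_GIB * 1024 * 1024 / HUGEPAGE_KB ))             # prints 5120 for 10 GiB with 2 MiB pages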
Restart the Kubelet:
sudo systemctl restart kubelet
Cluster Uninstallation#
Ensure you are in the ~/cloud-native-stack/playbooks directory.
To uninstall CNS, run the following commands:
sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
source .cns/bin/activate
bash setup.sh uninstall
After uninstalling CNS, it is important to reboot the system to completely remove any components loaded by CNS.