Date: 2026-02-27

This guide provides comprehensive instructions for deploying NVIDIA UFM Enterprise on Kubernetes using Helm charts.

UFM Enterprise now supports deployment on Kubernetes clusters using Helm charts. This deployment method provides:

Declarative Configuration: Define your UFM deployment using Helm values

Simplified Operations: Use standard Kubernetes tools for deployment, upgrades, and management

Plugin Support: Deploy UFM plugins as separate pods with automatic configuration

Ingress Integration: Expose UFM through Kubernetes Ingress controllers

Persistent Storage: Use Kubernetes PersistentVolumeClaims for data persistence

Kubernetes 1.28 or later.

UFM on Kubernetes supports the same operating systems as UFM Enterprise. See the Installation Notes for the complete list of supported operating systems.

UFM on Kubernetes has the same hardware requirements as UFM Enterprise. See the Installation Notes for detailed specifications.

Before deploying UFM Enterprise on Kubernetes, ensure the following requirements are met:

Kubernetes cluster version 1.28 or later

kubectl configured with cluster access

Cluster admin permissions for installation

Helm 3.x installed on the management workstation. Run:

    # Verify Helm installation
    helm version

A StorageClass that supports ReadWriteMany access mode

Minimum 10GB storage capacity

At least one node with InfiniBand interface

DOCA drivers installed on the worker node where UFM will be deployed

InfiniBand port configured and in "up" state. Run:

    # Verify InfiniBand interface
    ip link show | grep -E 'ib[0-9]|ibp'

UFM License

Valid UFM Enterprise license file

License file accessible from the management workstation
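The InfiniBand prerequisite above asks for a port in the "up" state, which the plain grep does not confirm on its own. A minimal sketch, assuming the one-line-per-interface format printed by `ip -o link show`, filters for interfaces whose name starts with `ib` and whose state is UP. The function name `ib_up_interfaces` is hypothetical.

```shell
# List InfiniBand interfaces that report "state UP".
# Reads `ip -o link show` output on stdin (one interface per line).
ib_up_interfaces() {
  awk -F': ' '$2 ~ /^ib/ && / state UP / { print $2 }'
}

# Usage on the worker node:
#   ip -o link show | ib_up_interfaces
```

An empty result suggests that no InfiniBand port is currently up on that node.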

UFM uses host network mode. Ensure these ports are available on the target node (more ports might be used):

Port     Protocol  Purpose                           Configurable
80/443   TCP       Apache HTTP/HTTPS                 Yes
8000     TCP       UFM Internal REST Server          No
8081     TCP       OpenSM Plugin Communication       Yes
8082     TCP       OpenSM Traps Listening            Yes
8087     TCP       Auth Service                      Yes
9001     TCP       Telemetry/Prometheus Endpoint     Yes
9002     TCP       Secondary Telemetry Endpoint      Yes
8401+    TCP       Plugin Ports (varies per plugin)  Yes

Warning: When using Ingress, UFM automatically switches to ports 8080/18443 to avoid conflicts with the Ingress controller.
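Because UFM binds host ports directly, it can help to check for existing listeners before installing. The sketch below assumes the default (non-Ingress) fixed ports listed above and parses the output of `ss -ltnH`, where the fourth column is the local address and port. The names `UFM_PORTS` and `ufm_port_conflicts` are hypothetical, and plugin ports (8401+) are not covered.

```shell
# Default host-network ports from the port list (plugin ports excluded).
UFM_PORTS="80 443 8000 8081 8082 8087 9001 9002"

# Print every UFM port that already has a TCP listener.
# Reads `ss -ltnH` output on stdin; column 4 is "local-address:port".
ufm_port_conflicts() {
  awk -v ports="$UFM_PORTS" '
    BEGIN { n = split(ports, p, " "); for (i = 1; i <= n; i++) want[p[i]] = 1 }
    { port = $4; sub(/.*:/, "", port); if (port in want) busy[port] = 1 }
    END { for (i = 1; i <= n; i++) if (p[i] in busy) print p[i] }
  '
}

# Usage on the target node:
#   ss -ltnH | ufm_port_conflicts
```

Any port printed is already in use on the node and may prevent the UFM pod from starting.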

UFM requires ReadWriteMany storage. Make sure that persistent storage supporting this access mode is configured in the cluster.

The UFM Docker image must be available in a location that your Kubernetes cluster can access. It can be preloaded on the node or hosted in your own registry.

    # Create the namespace
    kubectl create namespace ufm-enterprise

    # Create license ConfigMap
    kubectl create configmap ufm-license \
      --from-file=<license-filename>.lic=/path/to/your/<license-filename>.lic \
      -n ufm-enterprise

    helm install ufm ufm-enterprise-<version>-helm.tgz \
      --namespace ufm-enterprise \
      --set config.fabricInterface=<your_ib_interface> \
      --set storage.className=nfs-client \
      --set image.pullPolicy=Never \
      --set license.existingConfigMap=ufm-license \
      --set resources.requests.memory=4Gi \
      --set resources.requests.cpu=2 \
      --set resources.limits.memory=8Gi \
      --set resources.limits.cpu=4

Replace <your_ib_interface> with your InfiniBand interface name (e.g., ib0, ibp4s0f0).

Note: The Helm chart is distributed as a .tgz package.

Watch the pod status:

    kubectl get pods -n ufm-enterprise -w

Expected state transitions:

    NAME                            READY   STATUS            AGE
    ufm-ufm-enterprise-xxxxxxxxxx   0/1     Init:0/1          5s
    ufm-ufm-enterprise-xxxxxxxxxx   0/1     PodInitializing   30s
    ufm-ufm-enterprise-xxxxxxxxxx   0/1     Running           45s
    ufm-ufm-enterprise-xxxxxxxxxx   1/1     Running           2m

Note: The pod shows 0/1 Running while the startup probe waits for UFM to fully initialize. This can take several minutes, depending on the cluster size.
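For scripted installs, the READY column can be checked instead of watching the output by hand. This is a minimal sketch, assuming the column layout of `kubectl get pods --no-headers` (READY is the second column). The function name `all_pods_ready` is hypothetical.

```shell
# Succeed only when at least one pod is listed and every pod shows
# READY as n/n with n greater than zero.
# Reads `kubectl get pods --no-headers` output on stdin.
all_pods_ready() {
  awk '
    { split($2, r, "/"); seen = 1; if (r[1] != r[2] || r[2] == 0) bad = 1 }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Usage:
#   kubectl get pods -n ufm-enterprise --no-headers | all_pods_ready && echo "UFM is ready"
```

Where a blocking wait is preferred, `kubectl wait --for=condition=Ready pod -l app=ufm-enterprise -n ufm-enterprise --timeout=300s` is an alternative that needs no parsing.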





All configuration options are set via Helm values. Use --set key=value or a values file (-f values.yaml).

Parameter         Description           Default
namespace.create  Create the namespace  false
namespace.name    Namespace name        ufm-enterprise

Parameter         Description                                Default
image.repository  Image repository                           docker.io/mellanox/ufm-enterprise
image.tag         Image tag                                  latest
image.pullPolicy  Image pull policy (Required)               -
imagePullSecrets  Image pull secrets for private registries  []

Note: image.pullPolicy must be set to one of: Never, IfNotPresent, or Always.

Parameter               Description                        Default
config.fabricInterface  InfiniBand fabric interface name   "" (uses gv.cfg)
config.mgmtInterface    Management network interface name  "" (uses gv.cfg)
config.httpPort         Apache HTTP port                   80 (or 8080 with Ingress)
config.httpsPort        Apache HTTPS port                  443 (or 18443 with Ingress)

Parameter              Description                    Default
storage.enabled        Enable PVC creation            true
storage.existingClaim  Use existing PVC name          ""
storage.className      Storage class name (Required)  -
storage.size           Persistent volume size         10Gi
storage.accessMode     PVC access mode                ReadWriteMany

Parameter                  Description                Default
resources.requests.memory  Memory request (Required)  -
resources.requests.cpu     CPU request (Required)     -
resources.limits.memory    Memory limit (Required)    -
resources.limits.cpu       CPU limit (Required)       -
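Several parameters in the image, storage, and resource tables are marked "(Required)" and have no default. The following sketch checks a list of key=value pairs for those six keys before they are passed to helm. The names `REQUIRED_KEYS` and `missing_required_keys` are hypothetical, and the check deliberately ignores special cases such as supplying storage.existingClaim.

```shell
# Keys marked "(Required)" in the parameter tables.
REQUIRED_KEYS="image.pullPolicy storage.className resources.requests.memory resources.requests.cpu resources.limits.memory resources.limits.cpu"

# Print each required key that is absent (or empty) in the given
# key=value arguments.
missing_required_keys() {
  for key in $REQUIRED_KEYS; do
    found=0
    for arg in "$@"; do
      case "$arg" in
        "$key"=?*) found=1 ;;
      esac
    done
    if [ "$found" -eq 0 ]; then
      echo "$key"
    fi
  done
}

# Usage:
#   missing_required_keys image.pullPolicy=Never storage.className=nfs-client
```

An empty result means every required key has a value; anything printed still needs a `--set` flag or an entry in the values file.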

Parameter                  Description                           Default
license.existingConfigMap  ConfigMap containing license file(s)  ""
license.existingSecret     Secret containing license file(s)     ""

Parameter                         Description                Default
startupProbe.enabled              Enable startup probe       true
startupProbe.initialDelaySeconds  Initial delay              2
startupProbe.periodSeconds        Probe interval             10
startupProbe.timeoutSeconds       Probe timeout              2
startupProbe.failureThreshold     Failures before giving up  30

Note: With default settings, UFM has up to 5 minutes (10s × 30) to fully start.
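The 5-minute figure is the probe interval multiplied by the failure threshold. A small sketch of the same arithmetic, useful when sizing the window for a large fabric (the function name `startup_budget_seconds` is hypothetical):

```shell
# Startup window in seconds: periodSeconds x failureThreshold.
startup_budget_seconds() {
  period=$1
  failures=$2
  echo $(( period * failures ))
}

startup_budget_seconds 10 30   # defaults from the table: prints 300
startup_budget_seconds 10 60   # doubled threshold: prints 600
```

Under this calculation, passing `--set startupProbe.failureThreshold=60` would extend the window to 10 minutes.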





Parameter                          Description              Default
livenessProbe.enabled              Enable liveness probe    true
livenessProbe.initialDelaySeconds  Initial delay            0
livenessProbe.periodSeconds        Probe interval           10
livenessProbe.timeoutSeconds       Probe timeout            2
livenessProbe.failureThreshold     Failures before restart  3

Parameter         Description                    Default
service.enabled   Enable Kubernetes Service      false
service.type      Service type                   ClusterIP
service.nodePort  NodePort number (30000-32767)  ""

Note: A Service is automatically created when Ingress is enabled. Use service.enabled=true only if you need a standalone Service without Ingress (e.g., LoadBalancer type in cloud environments).

Parameter               Description                                                              Default
ingress.enabled         Expose UFM via Ingress controller for external access                    false
ingress.className       Ingress controller to use (e.g., nginx, traefik)                         ""
ingress.host            DNS hostname for accessing UFM (e.g., ufm.example.com)                   ""
ingress.annotations     Controller-specific annotations (e.g., backend protocol, timeouts)       {}
ingress.tls.secretName  Kubernetes TLS Secret for HTTPS (created via kubectl create secret tls)  ""

Parameter     Description                                                                           Default
nodeSelector  Schedule UFM on nodes with specific labels (e.g., kubernetes.io/hostname: ufm-node)  {}
tolerations   Allow UFM to run on tainted nodes (e.g., dedicated infrastructure nodes)             []
affinity      Advanced scheduling rules for node or pod affinity/anti-affinity                     {}

Example - Schedule on specific node:

    --set nodeSelector."kubernetes\.io/hostname"=ufm-node

Parameter                 Description                                                          Default
plugins.items             List of plugins to deploy (see example below)                        []
plugins.defaultResources  Default resource limits for plugins if not specified per-plugin      See below

Example - Deploy with a plugin:

    helm install ufm ufm-enterprise-<version>-helm.tgz \
      --namespace ufm-enterprise \
      --set config.fabricInterface=ib0 \
      --set storage.className=nfs-client \
      --set image.pullPolicy=Never \
      --set license.existingConfigMap=ufm-license \
      --set resources.requests.memory=4Gi \
      --set resources.requests.cpu=2 \
      --set resources.limits.memory=8Gi \
      --set resources.limits.cpu=4 \
      --set plugins.items[0].name=<plugin-name> \
      --set plugins.items[0].image=<plugin-image> \
      --set plugins.items[0].tag=<plugin-version> \
      --set plugins.items[0].port=<plugin-port> \
      --set plugins.items[0].imagePullPolicy=Always

This is the default and simplest deployment mode. UFM binds directly to the host's network ports.

    helm install ufm ./ufm-enterprise \
      --namespace ufm-enterprise \
      --set config.fabricInterface=ib0 \
      --set storage.className=nfs-client \
      --set image.pullPolicy=Never \
      --set license.existingConfigMap=ufm-license \
      --set resources.requests.memory=4Gi \
      --set resources.requests.cpu=2 \
      --set resources.limits.memory=8Gi \
      --set resources.limits.cpu=4

Access UFM at: https://<node-ip>:443

Use an Ingress controller for external access with TLS termination and hostname-based routing.

    helm install ufm ./ufm-enterprise \
      --namespace ufm-enterprise \
      --set config.fabricInterface=ib0 \
      --set storage.className=nfs-client \
      --set image.pullPolicy=Never \
      --set license.existingConfigMap=ufm-license \
      --set resources.requests.memory=4Gi \
      --set resources.requests.cpu=2 \
      --set resources.limits.memory=8Gi \
      --set resources.limits.cpu=4 \
      --set ingress.enabled=true \
      --set ingress.className=traefik \
      --set ingress.host=ufm.example.com

Access UFM at: https://ufm.example.com

Note: When Ingress is enabled, UFM automatically switches to ports 8080/18443 to avoid conflicts.

For complex configurations, use a YAML values file:

    # my-values.yaml
    namespace:
      name: ufm-enterprise

    image:
      pullPolicy: Never

    config:
      fabricInterface: ib0
      mgmtInterface: eth0

    storage:
      className: nfs-client
      size: 50Gi

    resources:
      requests:
        memory: 8Gi
        cpu: 4
      limits:
        memory: 16Gi
        cpu: 8

    license:
      existingConfigMap: ufm-license

    ingress:
      enabled: true
      className: nginx
      host: ufm.example.com
      annotations:
        nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"

    nodeSelector:
      kubernetes.io/hostname: ufm-node

Deploy with the values file:

    helm install ufm ./ufm-enterprise -f my-values.yaml -n ufm-enterprise

UFM plugins run as separate pods with pod affinity to ensure they are scheduled on the same node as UFM.

Warning (Limitation): In this version, you must manually specify the plugin port number. Refer to the plugin documentation for the correct port value.

Field                      Description                              Required
name                       Plugin name without ufm-plugin- prefix   Yes
image                      Plugin Docker image repository           Yes
tag                        Plugin image tag                         Yes
port                       Plugin service port (omit if no HTTP)    No
imagePullPolicy            Image pull policy                        No (default: IfNotPresent)
healthEndpoint             HTTP health endpoint path                No
healthPort                 Port for health endpoint                 No (defaults to port)
livenessInitialDelay       Seconds before first liveness probe      No (default: 60)
livenessPeriod             Seconds between liveness probes          No (default: 30)
livenessTimeout            Seconds before probe times out           No (default: 15)
livenessFailureThreshold   Failures before restart                  No (default: 3)
readinessInitialDelay      Seconds before first readiness probe     No (default: 10)
readinessPeriod            Seconds between readiness probes         No (default: 10)
readinessTimeout           Seconds before probe times out           No (default: 15)
readinessFailureThreshold  Failures before not-ready                No (default: 3)

    helm install ufm ./ufm-enterprise \
      --namespace ufm-enterprise \
      --set config.fabricInterface=ib0 \
      --set storage.className=nfs-client \
      --set image.pullPolicy=Never \
      --set license.existingConfigMap=ufm-license \
      --set resources.requests.memory=4Gi \
      --set resources.requests.cpu=2 \
      --set resources.limits.memory=8Gi \
      --set resources.limits.cpu=4 \
      --set plugins.items[0].name=<plugin-name> \
      --set plugins.items[0].image=<plugin-image> \
      --set plugins.items[0].tag=<plugin-version> \
      --set plugins.items[0].port=<plugin-port> \
      --set plugins.items[0].imagePullPolicy=Always

    helm install ufm ufm-enterprise-<version>-helm.tgz \
      --namespace ufm-enterprise \
      --set config.fabricInterface=ib0 \
      --set storage.className=nfs-client \
      --set image.pullPolicy=Never \
      --set license.existingConfigMap=ufm-license \
      --set resources.requests.memory=4Gi \
      --set resources.requests.cpu=2 \
      --set resources.limits.memory=8Gi \
      --set resources.limits.cpu=4 \
      --set plugins.items[0].name=<plugin1-name> \
      --set plugins.items[0].image=<plugin1-image> \
      --set plugins.items[0].tag=<plugin1-version> \
      --set plugins.items[0].port=<plugin1-port> \
      --set plugins.items[0].imagePullPolicy=Always \
      --set plugins.items[1].name=<plugin2-name> \
      --set plugins.items[1].image=<plugin2-image> \
      --set plugins.items[1].tag=<plugin2-version> \
      --set plugins.items[1].port=<plugin2-port> \
      --set plugins.items[1].imagePullPolicy=Always

Warning (Important): Plugin array indices must be sequential starting from 0.
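Hand-numbering the indices is easy to get wrong when plugins are added or removed. One way to keep them sequential is to generate the flags from an ordered list. The helper below is a sketch: the function name `plugin_set_flags` and the comma-separated "name,image,tag[,port]" input format are assumptions made for this example, not part of the chart.

```shell
# Emit one --set flag per field, numbering plugins 0, 1, 2, ... in
# argument order. Each argument is "name,image,tag[,port]"; the port
# flag is skipped when no port is given.
plugin_set_flags() {
  i=0
  for spec in "$@"; do
    IFS=, read -r name image tag port <<EOF
$spec
EOF
    echo "--set plugins.items[$i].name=$name"
    echo "--set plugins.items[$i].image=$image"
    echo "--set plugins.items[$i].tag=$tag"
    if [ -n "$port" ]; then
      echo "--set plugins.items[$i].port=$port"
    fi
    i=$((i + 1))
  done
}

# Usage (plugin names and images are placeholders):
#   plugin_set_flags 'alpha,repo/alpha,1.0,8401' 'beta,repo/beta,2.0'
```

The printed flags can be reviewed and then appended to the helm install command. For more than a couple of plugins, a values file is usually easier to maintain than generated flags.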





Some plugins don't expose an HTTP port. Omit the port field:

    --set plugins.items[0].name=<plugin-name> \
    --set plugins.items[0].image=<plugin-image> \
    --set plugins.items[0].tag=<plugin-version> \
    --set plugins.items[0].imagePullPolicy=Always

    # plugins-values.yaml
    plugins:
      items:
        - name: <plugin-name>
          image: <plugin-image>
          tag: <plugin-version>
          port: <plugin-port>
          imagePullPolicy: Always

    helm install ufm ufm-enterprise-<version>-helm.tgz -f my-values.yaml -f plugins-values.yaml -n ufm-enterprise

The Helm chart includes default UFM configuration files that can be customized.

File                                                      Description
gv.cfg                                                    Main UFM configuration
opensm/opensm.conf                                        OpenSM configuration
sharp/sharp_am.cfg                                        SHARP AM configuration
telemetry_defaults/primary_env.cfg                        Primary telemetry environment
telemetry_defaults/launch_ibdiagnet_config.ini            IBDiagNet configuration
secondary_telemetry_defaults/launch_ibdiagnet_config.ini  Secondary telemetry config

    # Extract the chart package
    tar xzf ufm-enterprise-<version>-helm.tgz

    # Edit the configuration files
    vim ufm-enterprise/files/config/gv.cfg
    vim ufm-enterprise/files/config/opensm/opensm.conf

    # Install from the extracted chart directory
    helm install ufm ./ufm-enterprise -n ufm-enterprise \
      --set storage.className=nfs-client \
      --set image.pullPolicy=Never \
      --set resources.requests.memory=4Gi \
      --set resources.requests.cpu=2 \
      --set resources.limits.memory=8Gi \
      --set resources.limits.cpu=4

Configuration is applied in this order (later wins):

1. Base install/upgrade - UFM default config files
2. Helm chart config files - Files from files/config/ directory
3. Helm values - config.fabricInterface, config.mgmtInterface
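The "later wins" rule can be illustrated with a toy merge of key=value files, where each file stands in for one of the three layers. This is only a sketch of the precedence idea and not how UFM or the chart actually applies configuration. The function name `merge_later_wins` and the keys in the usage example are hypothetical.

```shell
# Merge key=value files so that a key in a later file replaces the same
# key from an earlier file. Output is sorted by key.
merge_later_wins() {
  awk -F= '
    { v[$1] = substr($0, length($1) + 2) }
    END { for (k in v) print k "=" v[k] }
  ' "$@" | sort
}

# Usage (file names are placeholders for the three layers):
#   merge_later_wins base.cfg chart.cfg values.cfg
```

If all three files set the same key, the value from the last file (the Helm values layer in this analogy) is the one that survives, while keys set only in an earlier layer are carried through unchanged.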

Add custom Prometheus counter set files for telemetry customization:

    # Extract the chart package
    tar xzf ufm-enterprise-<version>-helm.tgz

    # Add the custom counter set file
    mkdir -p ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/
    cp my-custom-counters.cset ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/

    # Install from the extracted chart directory
    helm install ufm ./ufm-enterprise -n ufm-enterprise \
      --set storage.className=nfs-client \
      --set image.pullPolicy=Never \
      --set resources.requests.memory=4Gi \
      --set resources.requests.cpu=2 \
      --set resources.limits.memory=8Gi \
      --set resources.limits.cpu=4

Scale down the deployment to 0 replicas:

    kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=0

Verify UFM is stopped:

    kubectl get pods -n ufm-enterprise

Scale back up to 1 replica:

    kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1

Wait for the pod to be ready:

    kubectl get pods -n ufm-enterprise -w

Follow logs:

    kubectl logs -n ufm-enterprise -l app=ufm-enterprise -f

Previous container logs (after crash):

    kubectl logs -n ufm-enterprise -l app=ufm-enterprise --previous

    # kubectl exec does not accept a label selector, so resolve the pod name first
    POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')

    # List log files
    kubectl exec -n ufm-enterprise "$POD" -- ls -la /opt/ufm/files/log/

    # View specific log
    kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/console.log

    # Tail a log
    kubectl exec -n ufm-enterprise "$POD" -- tail -100 /opt/ufm/files/log/ufmhealth.log

https://<node-ip>:443/ufm_web/

    # Get UFM version
    curl https:

    # List resources
    curl https:

Run:

    helm uninstall ufm -n ufm-enterprise

Warning: This deletes all UFM resources including the PersistentVolumeClaim and data.

To delete the entire namespace and all associated resources:

    kubectl delete namespace ufm-enterprise

To delete selected resources instead of the full namespace:

    kubectl delete pvc -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
    kubectl delete configmap -n ufm-enterprise ufm-license
    kubectl delete secret -n ufm-enterprise ufm-tls

Port         Service       Description
80 / 8080    Apache HTTP   Web UI and REST API (HTTP)
443 / 18443  Apache HTTPS  Web UI and REST API (HTTPS)
8000         Flask         Internal REST server
8081         OpenSM        Plugin communication
8082         OpenSM        Trap listener
8087         Auth          Authentication service
9001         Telemetry     Prometheus metrics endpoint
9002         Telemetry     Secondary metrics endpoint
8401+        Plugins       Plugin-specific ports

UFM is deployed with hostNetwork: true, enabling direct access to:

InfiniBand interfaces for fabric management

Host ports for external connectivity

Low-latency communication with OpenSM

Implications:

UFM pods bind directly to the node’s network stack

Required ports must be available on the host

Port conflicts may prevent pod startup

Client → Ingress Controller (80/443) → UFM Service → UFM Pod (8080/18443)

When Ingress is enabled:

UFM listens internally on ports 8080 and 18443

The Ingress controller handles external traffic on ports 80 and 443

TLS may be terminated at the Ingress or passed through to UFM

All data stored under /opt/ufm/files/ is persisted via a PersistentVolumeClaim (PVC), ensuring data retention across pod restarts.

UFM runs in privileged mode to allow:

Direct access to InfiniBand hardware

Loading of kernel modules

Management of the InfiniBand Subnet Manager

Warning (Security Impact): Privileged containers have elevated access to the host and should be deployed with caution.

Using hostNetwork: true means:

UFM can access all host network interfaces

Service ports are exposed directly on the node

Kubernetes NetworkPolicies do not apply to pod traffic

Dedicated Nodes - Deploy UFM on dedicated infrastructure nodes.

Node Taints - Apply taints to prevent unrelated workloads from scheduling on UFM nodes.

Network Segmentation - Isolate UFM nodes on a management network.

RBAC Controls - Restrict access to the UFM namespace using Kubernetes RBAC.

Secrets Management - Store sensitive data in Kubernetes Secrets.

Regular Updates - Keep UFM and Kubernetes components up to date.

UFM uses two probes:

Probe     Purpose                      Check
Startup   Wait for UFM initialization  REST API returns HTTP 200
Liveness  Detect failures              UfmHealthRunner running, no failover flag

Startup Probe:

Calls /app/versioning/ on UFM Web server

Returns 503 during initialization, 200 when ready

Allows up to 5 minutes for startup

Liveness Probe:

Verifies UfmHealthRunner process is running

Checks for failover flag (critical failure indicator)

Verifies config_watcher.sh is running
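The three liveness conditions above can be expressed as a single check. The sketch below is not the chart's actual probe script: it takes a process listing on stdin and the path of the failover flag file as an argument. The function name `ufm_liveness_ok` is hypothetical, and the flag file location is left to the caller because it is not stated in this guide.

```shell
# Return 0 (healthy) when UfmHealthRunner and config_watcher.sh both
# appear in the process list on stdin and the failover flag file is absent.
# $1: path to the failover flag file (caller-supplied; real path not given here).
ufm_liveness_ok() {
  flag_file=$1
  procs=$(cat)
  case "$procs" in *UfmHealthRunner*) ;; *) return 1 ;; esac
  case "$procs" in *config_watcher.sh*) ;; *) return 1 ;; esac
  [ ! -e "$flag_file" ]
}

# Usage inside the UFM container:
#   ps aux | ufm_liveness_ok /path/to/failover-flag && echo healthy
```

A non-zero exit corresponds to the situation in which Kubernetes would count a liveness failure and, after the configured threshold, restart the container.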

Run the following command to review the liveness and startup probe configuration and status:

    kubectl describe pod -n ufm-enterprise -l app=ufm-enterprise | grep -A 5 -E "Liveness:|Startup:"

Use the commands below to list running UFM-related processes inside the pod:

    # kubectl exec does not accept a label selector, so resolve the pod name first
    POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -n ufm-enterprise "$POD" -- ps aux

Run the following commands to inspect the UFM health log file:

    POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/ufmhealth.log