NVIDIA UFM Enterprise User Manual v6.24.1

Appendix - UFM on Kubernetes

UFM Enterprise Kubernetes Deployment Guide

This guide provides comprehensive instructions for deploying NVIDIA UFM Enterprise on Kubernetes using Helm charts.

Overview

What's New

UFM Enterprise now supports deployment on Kubernetes clusters using Helm charts. This deployment method provides:

Declarative Configuration: Define your UFM deployment using Helm values
Simplified Operations: Use standard Kubernetes tools for deployment, upgrades, and management
Plugin Support: Deploy UFM plugins as separate pods with automatic configuration
Ingress Integration: Expose UFM through Kubernetes Ingress controllers
Persistent Storage: Use Kubernetes PersistentVolumeClaims for data persistence

Supported Environments

Node Operating Systems

UFM on Kubernetes supports the same operating systems as UFM Enterprise. See the Installation Notes for the complete list of supported operating systems.

Hardware Requirements

UFM on Kubernetes has the same hardware requirements as UFM Enterprise. See the Installation Notes for detailed specifications.

Prerequisites

Before deploying UFM Enterprise on Kubernetes, ensure the following requirements are met:

Kubernetes Cluster

Kubernetes cluster version 1.28 or later
kubectl configured with cluster access
Cluster admin permissions for installation
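
As a quick pre-flight check, the version requirement above can be tested with a small shell helper. This is a minimal sketch: `k8s_version_ok` is a hypothetical helper name (not part of UFM or kubectl), and it only compares the major and minor numbers of a version string such as `v1.29.2`.

```shell
# Return 0 if a Kubernetes version string (for example "v1.29.2") is 1.28 or later.
# Hypothetical helper for illustration; not shipped with UFM or kubectl.
k8s_version_ok() {
  ver="${1#v}"                # strip a leading "v"
  major="${ver%%.*}"          # text before the first dot
  rest="${ver#*.}"            # text after the first dot
  minor="${rest%%.*}"         # text before the next dot
  minor="${minor%%[!0-9]*}"   # drop suffixes such as "+" used by some distributions
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 28 ]; }
}
```

For example, `k8s_version_ok "$(kubectl version | awk '/Server Version/ {print $3}')"` checks the cluster, assuming a recent kubectl that prints a `Server Version: v<x.y.z>` line. Separately, `kubectl auth can-i '*' '*' --all-namespaces` reports whether the current user has cluster-wide permissions.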

Helm

Helm 3.x installed on the management workstation. Run:

# Verify Helm installation
helm version

Storage

A StorageClass that supports ReadWriteMany access mode
Minimum 10GB storage capacity

InfiniBand

At least one node with InfiniBand interface
DOCA drivers installed on the worker node where UFM will be deployed

InfiniBand port configured and in "up" state. Run:

# Verify InfiniBand interface
ip link show | grep -E 'ib[0-9]|ibp'

UFM License

Valid UFM Enterprise license file
License file accessible from the management workstation
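
The deployment examples in this guide pass `license.existingConfigMap=ufm-license`, so the license file has to be loaded into a ConfigMap beforehand. The following is a minimal sketch of one way to do that. The function name `create_ufm_license_configmap` is hypothetical, the namespace and ConfigMap names are taken from the examples in this guide, and the key name the chart expects inside the ConfigMap is not stated here, so the file name is used as the key.

```shell
# Create the namespace (if missing) and a ConfigMap holding the license file.
# Assumption: "ufm-enterprise" and "ufm-license" match the values passed to
# the chart; the license file path is supplied by the caller.
create_ufm_license_configmap() {
  license_file="$1"
  namespace="${2:-ufm-enterprise}"
  if [ ! -f "$license_file" ]; then
    echo "license file not found: $license_file" >&2
    return 1
  fi
  # namespace.create defaults to false, so create the namespace up front.
  kubectl get namespace "$namespace" >/dev/null 2>&1 ||
    kubectl create namespace "$namespace" || return 1
  kubectl create configmap ufm-license \
    --namespace "$namespace" \
    --from-file="$license_file"
}
```

Usage: `create_ufm_license_configmap /path/to/license-file`. A Secret can be used instead by switching to `kubectl create secret generic` and setting `license.existingSecret`.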

Network Ports

UFM runs in host network mode, so the following ports must be free on the target node. Additional ports may also be in use:

Port	Protocol	Purpose	Configurable
80/443	TCP	Apache HTTP/HTTPS	Yes
8000	TCP	UFM Internal REST Server	No
8081	TCP	OpenSM Plugin Communication	Yes
8082	TCP	OpenSM Traps Listening	Yes
8087	TCP	Auth Service	Yes
9001	TCP	Telemetry/Prometheus Endpoint	Yes
9002	TCP	Secondary Telemetry Endpoint	Yes
8401+	TCP	Plugin Ports (varies per plugin)	Yes

Warning

When using Ingress, UFM automatically switches to ports 8080/18443 to avoid conflicts with the Ingress controller.
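
Because a port conflict can prevent the pod from starting, it may help to check the target node for existing listeners before installing. The sketch below assumes the `ss` utility is available on the node; `busy_ports` is a hypothetical helper that filters `ss -ltn` output for the ports passed as arguments.

```shell
# Print each requested port that already has a TCP listener.
# Reads `ss -ltn` output on stdin; the ports to check are the arguments.
# Hypothetical helper for illustration only.
busy_ports() {
  awk -v want="$*" '
    BEGIN { n = split(want, w, " "); for (i = 1; i <= n; i++) need[w[i]] = 1 }
    $1 == "LISTEN" {
      k = split($4, part, ":")            # column 4 is the local address:port
      if (part[k] in need) busy[part[k]] = 1
    }
    END { for (i = 1; i <= n; i++) if (w[i] in busy) print w[i] }
  '
}
```

Run on the target node, `ss -ltn | busy_ports 80 443 8000 8081 8082 8087 9001 9002` prints the ports from the table that are already taken. Empty output means none of them has a listener.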

Configuration Reference

Namespace Configuration

Parameter	Description	Default
`namespace.create`	Create the namespace	`false`
`namespace.name`	Namespace name	`ufm-enterprise`

Image Configuration

Parameter	Description	Default
`image.repository`	Image repository	`docker.io/mellanox/ufm-enterprise`
`image.tag`	Image tag	`latest`
`image.pullPolicy`	Image pull policy (Required)	-
`imagePullSecrets`	Image pull secrets for private registries	`[]`

UFM Configuration

Parameter	Description	Default
`config.fabricInterface`	InfiniBand fabric interface name	`""` (uses gv.cfg)
`config.mgmtInterface`	Management network interface name	`""` (uses gv.cfg)
`config.httpPort`	Apache HTTP port	`80` (or `8080` with Ingress)
`config.httpsPort`	Apache HTTPS port	`443` (or `18443` with Ingress)

Storage Configuration

Parameter	Description	Default
`storage.enabled`	Enable PVC creation	`true`
`storage.existingClaim`	Use existing PVC name	`""`
`storage.className`	Storage class name (Required)	-
`storage.size`	Persistent volume size	`10Gi`
`storage.accessMode`	PVC access mode	`ReadWriteMany`

Resource Configuration

Parameter	Description	Default
`resources.requests.memory`	Memory request (Required)	-
`resources.requests.cpu`	CPU request (Required)	-
`resources.limits.memory`	Memory limit (Required)	-
`resources.limits.cpu`	CPU limit (Required)	-

License Configuration

Parameter	Description	Default
`license.existingConfigMap`	ConfigMap containing license file(s)	`""`
`license.existingSecret`	Secret containing license file(s)	`""`

Startup Probe Configuration

Parameter	Description	Default
`startupProbe.enabled`	Enable startup probe	`true`
`startupProbe.initialDelaySeconds`	Initial delay	`2`
`startupProbe.periodSeconds`	Probe interval	`10`
`startupProbe.timeoutSeconds`	Probe timeout	`2`
`startupProbe.failureThreshold`	Failures before giving up	`30`

Liveness Probe Configuration

Parameter	Description	Default
`livenessProbe.enabled`	Enable liveness probe	`true`
`livenessProbe.initialDelaySeconds`	Initial delay	`0`
`livenessProbe.periodSeconds`	Probe interval	`10`
`livenessProbe.timeoutSeconds`	Probe timeout	`2`
`livenessProbe.failureThreshold`	Failures before restart	`3`

Service Configuration

Parameter	Description	Default
`service.enabled`	Enable Kubernetes Service	`false`
`service.type`	Service type	`ClusterIP`
`service.nodePort`	NodePort number (30000-32767)	`""`

Note: A Service is automatically created when Ingress is enabled. Use service.enabled=true only if you need a standalone Service without Ingress (e.g., LoadBalancer type in cloud environments).

Ingress Configuration

Parameter	Description	Default
`ingress.enabled`	Expose UFM via Ingress controller for external access	`false`
`ingress.className`	Ingress controller to use (e.g., `nginx`, `traefik`)	`""`
`ingress.host`	DNS hostname for accessing UFM (e.g., `ufm.example.com`)	`""`
`ingress.annotations`	Controller-specific annotations (e.g., backend protocol, timeouts)	`{}`
`ingress.tls.secretName`	Kubernetes TLS Secret for HTTPS (created via `kubectl create secret tls`)	`""`
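
The TLS Secret referenced by `ingress.tls.secretName` can be created from an existing certificate and key. The following is a minimal sketch: the function name `create_ufm_tls_secret` is hypothetical, and the Secret name `ufm-tls` and namespace `ufm-enterprise` are illustrative choices rather than chart requirements.

```shell
# Create a TLS Secret for the Ingress from an existing certificate and key.
# Assumptions: the Secret name "ufm-tls" and the namespace "ufm-enterprise"
# are illustrative; the certificate and key files are supplied by the caller.
create_ufm_tls_secret() {
  cert_file="$1"
  key_file="$2"
  namespace="${3:-ufm-enterprise}"
  for f in "$cert_file" "$key_file"; do
    [ -f "$f" ] || { echo "file not found: $f" >&2; return 1; }
  done
  kubectl create secret tls ufm-tls \
    --namespace "$namespace" \
    --cert="$cert_file" \
    --key="$key_file"
}
```

After `create_ufm_tls_secret tls.crt tls.key`, add `--set ingress.tls.secretName=ufm-tls` to the `helm install` command.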

Scheduling Configuration

Parameter	Description	Default
`nodeSelector`	Schedule UFM on nodes with specific labels (e.g., `kubernetes.io/hostname: ufm-node`)	`{}`
`tolerations`	Allow UFM to run on tainted nodes (e.g., dedicated infrastructure nodes)	`[]`
`affinity`	Advanced scheduling rules for node or pod affinity/anti-affinity	`{}`

Example - Schedule on specific node:

--set nodeSelector."kubernetes\.io/hostname"=ufm-node

Plugin Configuration

Parameter	Description	Default
`plugins.items`	List of plugins to deploy (see example below)	`[]`
`plugins.defaultResources`	Default resource limits for plugins if not specified per-plugin	See below

Example - Deploy with a plugin:

helm install ufm ufm-enterprise-<version>-helm.tgz \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set plugins.items[0].name=<plugin-name> \
  --set plugins.items[0].image=<plugin-image> \
  --set plugins.items[0].tag=<plugin-version> \
  --set plugins.items[0].port=<plugin-port> \
  --set plugins.items[0].imagePullPolicy=Always

Deployment Options

Option 1: Host Network Mode (Default)

This is the default and simplest deployment mode. UFM binds directly to the host's network ports.

helm install ufm ./ufm-enterprise \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4

Access UFM at: https://<node-ip>:443

Option 2: With Ingress Controller

Use an Ingress controller for external access with TLS termination and hostname-based routing.

Step 1: Install Ingress Controller (if not installed)
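
Any Ingress controller can be used. As one possible approach, the sketch below installs Traefik from its public Helm chart, matching the `ingress.className=traefik` value used in Step 2. The function name `install_traefik_ingress` is hypothetical, and the repository URL, release name, and namespace are assumptions that do not come from the UFM chart.

```shell
# Install Traefik as the Ingress controller from its public Helm chart.
# Assumptions: the repository URL, release name, and namespace below are
# illustrative and are not taken from the UFM documentation.
install_traefik_ingress() {
  namespace="${1:-traefik}"
  helm repo add traefik https://traefik.github.io/charts &&
    helm repo update &&
    helm install traefik traefik/traefik \
      --namespace "$namespace" --create-namespace
}
```

Once the controller is running, `kubectl get ingressclass` lists the available class names. The value passed as `ingress.className` should match one of them.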

Step 2: Deploy UFM with Ingress

helm install ufm ./ufm-enterprise \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set ingress.enabled=true \
  --set ingress.className=traefik \
  --set ingress.host=ufm.example.com

Access UFM at: https://ufm.example.com

Note

When Ingress is enabled, UFM automatically switches to ports 8080/18443 to avoid conflicts with the Ingress controller.

Option 3: Using a Values File

For complex configurations, use a YAML values file:

# my-values.yaml
namespace:
  name: ufm-enterprise
 
image:
  pullPolicy: Never
 
config:
  fabricInterface: ib0
  mgmtInterface: eth0
 
storage:
  className: nfs-client
  size: 50Gi
 
resources:
  requests:
    memory: 8Gi
    cpu: 4
  limits:
    memory: 16Gi
    cpu: 8
 
license:
  existingConfigMap: ufm-license
 
ingress:
  enabled: true
  className: nginx
  host: ufm.example.com
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
 
nodeSelector:
  kubernetes.io/hostname: ufm-node

Deploy with the values file:

helm install ufm ./ufm-enterprise -f my-values.yaml -n ufm-enterprise

Plugin Deployment

UFM plugins run as separate pods with pod affinity to ensure they are scheduled on the same node as UFM.

Plugin Configuration Fields

Warning

Limitation: In this version, you must manually specify the plugin port number. Refer to the plugin documentation for the correct port value.

Field	Description	Required
`name`	Plugin name without `ufm-plugin-` prefix	Yes
`image`	Plugin Docker image repository	Yes
`tag`	Plugin image tag	Yes
`port`	Plugin service port (omit if no HTTP)	No
`imagePullPolicy`	Image pull policy	No (default: IfNotPresent)
`healthEndpoint`	HTTP health endpoint path	No
`healthPort`	Port for health endpoint	No (defaults to `port`)
`livenessInitialDelay`	Seconds before first liveness probe	No (default: 60)
`livenessPeriod`	Seconds between liveness probes	No (default: 30)
`livenessTimeout`	Seconds before probe times out	No (default: 15)
`livenessFailureThreshold`	Failures before restart	No (default: 3)
`readinessInitialDelay`	Seconds before first readiness probe	No (default: 10)
`readinessPeriod`	Seconds between readiness probes	No (default: 10)
`readinessTimeout`	Seconds before probe times out	No (default: 15)
`readinessFailureThreshold`	Failures before not-ready	No (default: 3)

Deploy Single Plugin

helm install ufm ./ufm-enterprise \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set plugins.items[0].name=<plugin-name> \
  --set plugins.items[0].image=<plugin-image> \
  --set plugins.items[0].tag=<plugin-version> \
  --set plugins.items[0].port=<plugin-port> \
  --set plugins.items[0].imagePullPolicy=Always

Deploy Multiple Plugins

helm install ufm ufm-enterprise-<version>-helm.tgz \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set plugins.items[0].name=<plugin1-name> \
  --set plugins.items[0].image=<plugin1-image> \
  --set plugins.items[0].tag=<plugin1-version> \
  --set plugins.items[0].port=<plugin1-port> \
  --set plugins.items[0].imagePullPolicy=Always \
  --set plugins.items[1].name=<plugin2-name> \
  --set plugins.items[1].image=<plugin2-image> \
  --set plugins.items[1].tag=<plugin2-version> \
  --set plugins.items[1].port=<plugin2-port> \
  --set plugins.items[1].imagePullPolicy=Always

Warning

Important: Plugin array indices must be sequential starting from 0.

Plugin Without HTTP Port

Some plugins don't expose an HTTP port. Omit the port field:

--set plugins.items[0].name=<plugin-name> \
--set plugins.items[0].image=<plugin-image> \
--set plugins.items[0].tag=<plugin-version> \
--set plugins.items[0].imagePullPolicy=Always

Plugins with Values File

# plugins-values.yaml
plugins:
  items:
    - name: <plugin-name>
      image: <plugin-image>
      tag: <plugin-version>
      port: <plugin-port>
      imagePullPolicy: Always
 
Install with both values files:

helm install ufm ufm-enterprise-<version>-helm.tgz -f my-values.yaml -f plugins-values.yaml -n ufm-enterprise

Custom Configuration Files

The Helm chart includes default UFM configuration files that can be customized.

Included Config Files

File	Description
`gv.cfg`	Main UFM configuration
`opensm/opensm.conf`	OpenSM configuration
`sharp/sharp_am.cfg`	SHARP AM configuration
`telemetry_defaults/primary_env.cfg`	Primary telemetry environment
`telemetry_defaults/launch_ibdiagnet_config.ini`	IBDiagNet configuration
`secondary_telemetry_defaults/launch_ibdiagnet_config.ini`	Secondary telemetry config

Method 1: Edit Files Before Install

Extract the chart:

tar xzf ufm-enterprise-<version>-helm.tgz

Edit config files:

vim ufm-enterprise/files/config/gv.cfg
vim ufm-enterprise/files/config/opensm/opensm.conf

Install with the modified files:

helm install ufm ./ufm-enterprise -n ufm-enterprise \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4

Configuration Priority

Configuration is applied in this order (later wins):

Base install/upgrade - UFM default config files
Helm chart config files - Files from files/config/ directory
Helm values - config.fabricInterface, config.mgmtInterface

Adding Custom Counter Sets

Add custom Prometheus counter set files for telemetry customization:

Extract the chart:

tar xzf ufm-enterprise-<version>-helm.tgz

Add custom cset file:

mkdir -p ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/
cp my-custom-counters.cset ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/

Install:

helm install ufm ./ufm-enterprise -n ufm-enterprise \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4

Operations

Start/Stop UFM

Stop UFM

Scale down the deployment to 0 replicas:

kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=0

Verify UFM is stopped:

kubectl get pods -n ufm-enterprise

Start UFM

Scale back up to 1 replica:

kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1

Wait for the pod to be ready:

kubectl get pods -n ufm-enterprise -w
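
In scripts, watching the pod list interactively is not practical. As an alternative, the sketch below blocks until the Deployment reports the `Available` condition. The function name `wait_for_ufm` and the 600-second default timeout are assumptions. The timeout is chosen to exceed the roughly five-minute startup probe window.

```shell
# Block until the UFM Deployment reports Available, or fail after a timeout.
# Hypothetical helper; the label and namespace are the ones used in this guide.
wait_for_ufm() {
  timeout="${1:-600s}"
  kubectl wait deployment \
    --namespace ufm-enterprise \
    --selector app=ufm-enterprise \
    --for=condition=Available \
    --timeout="$timeout"
}
```

`wait_for_ufm` returns a non-zero exit status if the Deployment does not become available in time, so it can gate later steps such as REST API calls.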

View Logs

Container Logs

Follow logs:

kubectl logs -n ufm-enterprise -l app=ufm-enterprise -f

Previous container logs (after crash):

kubectl logs -n ufm-enterprise -l app=ufm-enterprise --previous

UFM Application Logs

# kubectl exec takes a pod name, not a label selector, so look up the pod first
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
# List log files
kubectl exec -n ufm-enterprise "$POD" -- ls -la /opt/ufm/files/log/
# View specific log
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/console.log
# Tail a log
kubectl exec -n ufm-enterprise "$POD" -- tail -100 /opt/ufm/files/log/ufmhealth.log

Access UFM UI and REST API

Web UI

https://<node-ip>:443/ufm_web/

REST API

# Get UFM version
curl https://<node-ip>:443/ufmRest/app/ufm_version
 
# List resources
curl https://<node-ip>:443/ufmRest/resources/systems
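
In practice these requests usually need credentials, and a node that serves a self-signed certificate will fail curl's certificate verification. The sketch below wraps both concerns under stated assumptions: `ufm_get`, `UFM_USER`, and `UFM_PASSWORD` are hypothetical names, HTTP basic authentication is assumed, and `-k` (skip certificate verification) should be dropped once a trusted certificate is installed.

```shell
# Call a UFM REST endpoint with HTTP basic authentication.
# Assumptions: UFM_USER and UFM_PASSWORD are hypothetical environment variables
# holding valid UFM credentials; -k skips certificate verification and is only
# appropriate when UFM serves a self-signed certificate.
ufm_get() {
  node_ip="$1"
  path="$2"
  curl -sk -u "${UFM_USER}:${UFM_PASSWORD}" "https://${node_ip}:443${path}"
}
```

Usage: `ufm_get <node-ip> /ufmRest/app/ufm_version`.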

Uninstallation

Remove UFM

Run:

helm uninstall ufm -n ufm-enterprise

Warning

This deletes all UFM resources, including the PersistentVolumeClaim and its data.

Resource Cleanup

Remove All Resources

To delete the entire namespace and all associated resources:

kubectl delete namespace ufm-enterprise

Remove Specific Resources Only

To delete selected resources instead of the full namespace:

kubectl delete pvc -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
kubectl delete configmap -n ufm-enterprise ufm-license
kubectl delete secret -n ufm-enterprise ufm-tls

Networking

Port Reference

Port	Service	Description
80 / 8080	Apache HTTP	Web UI and REST API (HTTP)
443 / 18443	Apache HTTPS	Web UI and REST API (HTTPS)
8000	Flask	Internal REST server
8081	OpenSM	Plugin communication
8082	OpenSM	Trap listener
8087	Auth	Authentication service
9001	Telemetry	Prometheus metrics endpoint
9002	Telemetry	Secondary metrics endpoint
8401+	Plugins	Plugin-specific ports

Host Network Architecture

UFM is deployed with hostNetwork: true, enabling direct access to:

InfiniBand interfaces for fabric management
Host ports for external connectivity
Low-latency communication with OpenSM

Implications:

UFM pods bind directly to the node’s network stack
Required ports must be available on the host
Port conflicts may prevent pod startup

Traffic Flow with Ingress

Client → Ingress Controller (80/443) → UFM Service → UFM Pod (8080/18443)

When Ingress is enabled:

UFM listens internally on ports 8080 and 18443
The Ingress controller handles external traffic on ports 80 and 443
TLS may be terminated at the Ingress or passed through to UFM

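Security Considerations

Privileged Mode

The UFM container runs in privileged mode, which is required for:

Direct access to InfiniBand hardware
Loading of kernel modules
Management of the InfiniBand Subnet Manager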

Warning

Security Impact: Privileged containers have elevated access to the host and should be deployed with caution.

Host Network Implications

Using hostNetwork: true means:

UFM can access all host network interfaces
Service ports are exposed directly on the node
Kubernetes NetworkPolicies do not apply to pod traffic

Best Practices

Dedicated Nodes – Deploy UFM on dedicated infrastructure nodes.
Node Taints – Apply taints to prevent unrelated workloads from scheduling on UFM nodes.
Network Segmentation – Isolate UFM nodes on a management network.
RBAC Controls – Restrict access to the UFM namespace using Kubernetes RBAC.
Secrets Management – Store sensitive data in Kubernetes Secrets.
Regular Updates – Keep UFM and Kubernetes components up to date.

Monitoring

Kubernetes Probes

UFM uses two probes:

Probe	Purpose	Check
Startup	Wait for UFM initialization	REST API returns HTTP 200
Liveness	Detect failures	`UfmHealthRunner` and `config_watcher.sh` running, no failover flag

Health Check Details

Startup Probe:

Calls /app/versioning/ on UFM Web server
Returns 503 during initialization, 200 when ready
Allows up to 5 minutes for startup

Liveness Probe:

Verifies UfmHealthRunner process is running
Checks for failover flag (critical failure indicator)
Verifies config_watcher.sh is running
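
The five-minute startup allowance follows from the startup probe defaults listed in the chart parameters (`initialDelaySeconds` 2, `periodSeconds` 10, `failureThreshold` 30). The sketch below shows the arithmetic as an approximation. `startup_window_seconds` is a hypothetical helper, and the result ignores per-probe timeouts.

```shell
# Approximate upper bound, in seconds, on how long the startup probe waits
# before the container is restarted:
#   initialDelaySeconds + failureThreshold * periodSeconds
# Hypothetical helper for illustration only.
startup_window_seconds() {
  initial_delay="$1"
  period="$2"
  failure_threshold="$3"
  echo $(( initial_delay + failure_threshold * period ))
}
```

With the defaults, `startup_window_seconds 2 10 30` prints `302`, which is roughly five minutes. On a large fabric where initialization takes longer, raising `startupProbe.failureThreshold` to 60 would extend the window to about ten minutes.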

Monitoring Commands

Verify Probe Status

Run the following command to review the liveness and startup probe configuration and status:

kubectl describe pod -n ufm-enterprise -l app=ufm-enterprise | grep -A 5 -E "Liveness:|Startup:"

Verify UFM Processes

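Use the commands below to look up the UFM pod name and list the processes running inside it (kubectl exec takes a pod name, not a label selector):

POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- ps aux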

Check UFM Health Log

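Run the following commands to look up the UFM pod name and inspect the UFM health log file (kubectl exec takes a pod name, not a label selector):

POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/ufmhealth.log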

Known Limitations

Limitation	Description	Impact
Single Pod	Only one replica supported	No horizontal scaling
No Automatic Failover	Pod won't migrate on node failure	Manual intervention required
No High Availability	HA mode not supported in K8s	Use Docker HA for HA requirements
Privileged Mode	Container requires privileged access	Security considerations
Host Network	Uses host networking	Port conflicts possible
sysdump Unavailable	sysdump collector doesn't work	Use manual log collection
Recreate Strategy	Rolling updates not supported	Brief downtime during upgrades
Plugin Operations	Not all plugin operations are supported	Some plugin features may not work
Plugin Port Configuration	User must manually specify plugin ports	Refer to plugin documentation for port values
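
Since sysdump is unavailable, logs have to be gathered by hand. The sketch below shows one way to do it. `collect_ufm_logs` is a hypothetical helper, the pod label and log path are the ones used elsewhere in this guide, and `kubectl cp` only works if the container image provides `tar`.

```shell
# Copy the container log and the UFM log directory to a local directory.
# Assumptions: the pod label and log path are the ones used elsewhere in this
# guide, and the container image provides tar (required by kubectl cp).
collect_ufm_logs() {
  out_dir="${1:-ufm-logs}"
  ns=ufm-enterprise
  pod=$(kubectl get pods -n "$ns" -l app=ufm-enterprise \
    -o jsonpath='{.items[0].metadata.name}') || return 1
  [ -n "$pod" ] || { echo "no UFM pod found" >&2; return 1; }
  mkdir -p "$out_dir" || return 1
  kubectl logs -n "$ns" "$pod" > "$out_dir/container.log"
  kubectl cp "$ns/$pod:/opt/ufm/files/log" "$out_dir/ufm-log"
}
```

Usage: `collect_ufm_logs ./ufm-logs-$(date +%Y%m%d)`, then archive the resulting directory for support.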
