Appendix - UFM on Kubernetes
This guide provides comprehensive instructions for deploying NVIDIA UFM Enterprise on Kubernetes using Helm charts.
Overview
What's New
UFM Enterprise now supports deployment on Kubernetes clusters using Helm charts. This deployment method provides:
Declarative Configuration: Define your UFM deployment using Helm values
Simplified Operations: Use standard Kubernetes tools for deployment, upgrades, and management
Plugin Support: Deploy UFM plugins as separate pods with automatic configuration
Ingress Integration: Expose UFM through Kubernetes Ingress controllers
Persistent Storage: Use Kubernetes PersistentVolumeClaims for data persistence
Supported Environments
Kubernetes Version
Kubernetes 1.28 or later.
Node Operating Systems
UFM on Kubernetes supports the same operating systems as UFM Enterprise. See the Installation Notes for the complete list of supported operating systems.
Hardware Requirements
UFM on Kubernetes has the same hardware requirements as UFM Enterprise. See the Installation Notes for detailed specifications.
Prerequisites
Before deploying UFM Enterprise on Kubernetes, ensure the following requirements are met:
Kubernetes Cluster
Kubernetes cluster version 1.28 or later
kubectl configured with cluster access
Cluster admin permissions for installation
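As a minimal sketch of the version requirement above, the following helper compares a Kubernetes version string against 1.28. The function name k8s_version_at_least_1_28 is hypothetical and is not part of UFM or kubectl; it only parses a string such as v1.29.3 that you pass in.

```shell
# Hypothetical helper: succeeds when a version string such as "v1.29.3"
# is at least 1.28, the minimum stated in this guide.
k8s_version_at_least_1_28() {
  version=${1#v}              # drop a leading "v" if present
  major=${version%%.*}        # text before the first dot
  rest=${version#*.}          # text after the first dot
  minor=${rest%%.*}           # text before the next dot
  minor=${minor%%[!0-9]*}     # strip non-numeric suffixes such as "+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 28 ]; }
}
```

The version string could be taken, for example, from the VERSION column of kubectl get nodes.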
Helm
Helm 3.x installed on the management workstation. Run:
# Verify Helm installation
helm version
Storage
A StorageClass that supports ReadWriteMany access mode
Minimum 10 GB storage capacity
InfiniBand
At least one node with InfiniBand interface
DOCA drivers installed on the worker node where UFM will be deployed
InfiniBand port configured and in "up" state. Run:
# Verify InfiniBand interface
ip link show | grep -E 'ib[0-9]|ibp'
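The grep above only lists InfiniBand interfaces; it does not confirm that a port is up. The sketch below filters one-line-per-interface output (as produced by ip -o link show) for InfiniBand interfaces whose state is UP. The helper name ib_up_interfaces is hypothetical, and the name pattern is the same one used in the grep.

```shell
# Hypothetical helper: reads `ip -o link show` output on stdin and prints
# the names of InfiniBand interfaces (ib<N> or ibp...) reported as "state UP".
ib_up_interfaces() {
  awk '{
    name = $2
    sub(/:$/, "", name)                  # "ib0:" -> "ib0"
    if (name !~ /^ib([0-9]|p)/) next     # skip non-InfiniBand interfaces
    for (i = 3; i < NF; i++)
      if ($i == "state" && $(i + 1) == "UP") print name
  }'
}
```

A possible invocation is: ip -o link show | ib_up_interfaces. Empty output suggests that no InfiniBand port is in the "up" state.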
UFM License
Valid UFM Enterprise license file
License file accessible from the management workstation
Network Ports
UFM uses host network mode. Ensure the following ports are available on the target node (UFM may use additional ports):
Port | Protocol | Purpose | Configurable |
80/443 | TCP | Apache HTTP/HTTPS | Yes |
8000 | TCP | UFM Internal REST Server | No |
8081 | TCP | OpenSM Plugin Communication | Yes |
8082 | TCP | OpenSM Traps Listening | Yes |
8087 | TCP | Auth Service | Yes |
9001 | TCP | Telemetry/Prometheus Endpoint | Yes |
9002 | TCP | Secondary Telemetry Endpoint | Yes |
8401+ | TCP | Plugin Ports (varies per plugin) | Yes |
When using Ingress, UFM automatically switches to ports 8080/18443 to avoid conflicts with the Ingress controller.
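Because UFM binds host ports directly, it can help to look for conflicts before installing. The sketch below filters listening-socket output in the format printed by ss -tln (an assumption about the tooling available on the node) for a list of ports. The helper name busy_ports is hypothetical.

```shell
# Hypothetical helper: reads `ss -tln` output on stdin and prints each port
# from the argument list that already has a listener on the node.
busy_ports() {
  wanted=" $* "
  awk -v wanted="$wanted" '{
    n = split($4, parts, ":")            # column 4 is "Local Address:Port"
    port = parts[n]                      # the port follows the last colon
    if (index(wanted, " " port " ") && !seen[port]++) print port
  }'
}
```

A possible invocation is: ss -tln | busy_ports 80 443 8000 8081 8082 8087 9001 9002. Any port printed is already in use and may prevent the UFM pod from starting.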
Installation
Step 1: Set Up Storage
UFM requires storage that supports the ReadWriteMany access mode. Make sure persistent storage is configured in the cluster before you continue.
Step 2: UFM Docker Image
The UFM Docker image must be available to your Kubernetes cluster. It can be preloaded on the target node or hosted in your own registry.
Step 3: Create Namespace and License ConfigMap
# Create the namespace
kubectl create namespace ufm-enterprise
# Create license ConfigMap
kubectl create configmap ufm-license \
--from-file=<license-filename>.lic=/path/to/your/<license-filename>.lic \
-n ufm-enterprise
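The --from-file argument above uses the key=path form, where the key is the license file name. As a small sketch, the helper below derives that argument from a path; the name license_from_file_arg and the example path are hypothetical.

```shell
# Hypothetical helper: builds the --from-file=<key>=<path> argument so that
# the ConfigMap key matches the license file name.
license_from_file_arg() {
  path=$1
  key=$(basename "$path")
  printf '%s\n' "--from-file=$key=$path"
}
```

For example, license_from_file_arg /opt/licenses/ufm.lic prints --from-file=ufm.lic=/opt/licenses/ufm.lic.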
Step 4: Install UFM with Helm
helm install ufm ufm-enterprise-<version>-helm.tgz \
--namespace ufm-enterprise \
--set config.fabricInterface=<your_ib_interface> \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4
Replace <your_ib_interface> with your InfiniBand interface name (e.g., ib0, ibp4s0f0).
Note: The Helm chart is distributed as a .tgz package.
Step 5: Verify Installation
Watch the pod status:
kubectl get pods -n ufm-enterprise -w
Expected state transitions:
NAME READY STATUS AGE
ufm-ufm-enterprise-xxxxxxxxxx 0/1 Init:0/1 5s
ufm-ufm-enterprise-xxxxxxxxxx 0/1 PodInitializing 30s
ufm-ufm-enterprise-xxxxxxxxxx 0/1 Running 45s
ufm-ufm-enterprise-xxxxxxxxxx 1/1 Running 2m
Note: The pod shows 0/1 Running while the startup probe waits for UFM to fully initialize. This can take several minutes, depending on the cluster size.
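For scripted checks, the READY column shown above can be tested instead of watched. The sketch below reads the output of kubectl get pods with --no-headers and succeeds only when every listed pod reports all of its containers ready. The helper name all_pods_ready is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and each READY value is n/n.
all_pods_ready() {
  awk '{
    split($2, ready, "/")                # READY column, for example "0/1"
    if (ready[1] != ready[2]) bad = 1
    n++
  }
  END { if (n == 0 || bad) exit 1 }'
}
```

A possible invocation is: kubectl get pods -n ufm-enterprise --no-headers | all_pods_ready && echo "UFM is ready".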
Configuration Reference
All configuration options are set via Helm values. Use --set key=value or a values file (-f values.yaml).
Namespace Configuration
Parameter | Description | Default |
| Create the namespace | |
namespace.name | Namespace name | |
Image Configuration
Parameter | Description | Default |
| Image repository | |
| Image tag | |
image.pullPolicy | Image pull policy (Required) | - |
| Image pull secrets for private registries | |
Note: image.pullPolicy must be set to one of: Never, IfNotPresent, or Always.
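A wrapper script could reject an invalid pull policy before calling Helm. The sketch below checks a value against the three policies listed in the note; the helper name valid_pull_policy is hypothetical.

```shell
# Hypothetical helper: succeeds only for the pull policies named in this guide.
valid_pull_policy() {
  case "$1" in
    Never|IfNotPresent|Always) return 0 ;;
    *) return 1 ;;
  esac
}
```

The values are case-sensitive, so a value such as "never" is rejected by this check.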
UFM Configuration
Parameter | Description | Default |
config.fabricInterface | InfiniBand fabric interface name | |
config.mgmtInterface | Management network interface name | |
| Apache HTTP port | |
| Apache HTTPS port | |
Storage Configuration
Parameter | Description | Default |
| Enable PVC creation | |
| Use existing PVC name | |
storage.className | Storage class name (Required) | - |
storage.size | Persistent volume size | |
| PVC access mode | |
Resource Limits (Required)
Parameter | Description | Default |
resources.requests.memory | Memory request (Required) | - |
resources.requests.cpu | CPU request (Required) | - |
resources.limits.memory | Memory limit (Required) | - |
resources.limits.cpu | CPU limit (Required) | - |
License Configuration
Parameter | Description | Default |
license.existingConfigMap | ConfigMap containing license file(s) | |
| Secret containing license file(s) | |
Startup Probe Configuration
Parameter | Description | Default |
| Enable startup probe | |
| Initial delay | |
| Probe interval | |
| Probe timeout | |
| Failures before giving up | |
Note: With default settings, UFM has up to 5 minutes (10s × 30) to fully start.
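The startup window in the note is the probe interval multiplied by the failure threshold. The sketch below reproduces that arithmetic so that the effect of changing either value can be estimated; the helper name probe_budget_seconds is hypothetical, and the initial delay is not included.

```shell
# Hypothetical helper: startup window in seconds for a probe,
# computed as interval (seconds) x allowed failures.
probe_budget_seconds() {
  period=$1
  failures=$2
  echo $((period * failures))
}
```

With the values from the note, probe_budget_seconds 10 30 prints 300, which is the 5 minutes stated above. Raising the failure threshold to 60 would give 600 seconds.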
Liveness Probe Configuration
Parameter | Description | Default |
| Enable liveness probe | |
| Initial delay | |
| Probe interval | |
| Probe timeout | |
| Failures before restart | |
Service Configuration
Parameter | Description | Default |
service.enabled | Enable Kubernetes Service | |
| Service type | |
| NodePort number (30000-32767) | |
Note: A Service is automatically created when Ingress is enabled. Use service.enabled=true only if you need a standalone Service without Ingress (e.g., LoadBalancer type in cloud environments).
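The NodePort range in the table can be checked before a value is passed to Helm. The following sketch validates a candidate port against the 30000-32767 range; the helper name valid_node_port is hypothetical.

```shell
# Hypothetical helper: succeeds only for a numeric port within 30000-32767,
# the NodePort range given in the Service Configuration table.
valid_node_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;             # reject empty or non-numeric input
  esac
  [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}
```

For example, valid_node_port 30443 succeeds, while valid_node_port 443 fails.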
Ingress Configuration
Parameter | Description | Default |
ingress.enabled | Expose UFM via Ingress controller for external access | |
ingress.className | Ingress controller to use (e.g., | |
ingress.host | DNS hostname for accessing UFM (e.g., | |
ingress.annotations | Controller-specific annotations (e.g., backend protocol, timeouts) | |
| Kubernetes TLS Secret for HTTPS (created via | |
Scheduling Configuration
Parameter | Description | Default |
nodeSelector | Schedule UFM on nodes with specific labels (e.g., | |
| Allow UFM to run on tainted nodes (e.g., dedicated infrastructure nodes) | |
| Advanced scheduling rules for node or pod affinity/anti-affinity | |
Example - Schedule on specific node:
--set nodeSelector."kubernetes\.io/hostname"=ufm-node
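The backslashes in the example are needed because Helm treats a dot in a --set key as a path separator, so dots inside a label key must be escaped. The sketch below applies that escaping to an arbitrary label key; the helper name helm_escape_key is hypothetical.

```shell
# Hypothetical helper: escapes every dot in a label key so that Helm's --set
# treats the key as a single map entry rather than a nested path.
helm_escape_key() {
  printf '%s\n' "$1" | sed 's/\./\\./g'
}
```

For example, helm_escape_key kubernetes.io/hostname prints kubernetes\.io/hostname. In a values file no escaping is needed, because the key is written as a normal YAML map key.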
Plugin Configuration
Parameter | Description | Default |
plugins.items | List of plugins to deploy (see example below) | |
| Default resource limits for plugins if not specified per-plugin | See below |
Example - Deploy with a plugin:
helm install ufm ufm-enterprise-<version>-helm.tgz \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4 \
--set plugins.items[0].name=<plugin-name> \
--set plugins.items[0].image=<plugin-image> \
--set plugins.items[0].tag=<plugin-version> \
--set plugins.items[0].port=<plugin-port> \
--set plugins.items[0].imagePullPolicy=Always
Deployment Options
Option 1: Host Network Mode (Default)
This is the default and simplest deployment mode. UFM binds directly to the host's network ports.
helm install ufm ./ufm-enterprise \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4
Access UFM at: https://<node-ip>:443
Option 2: With Ingress Controller
Use an Ingress controller for external access with TLS termination and hostname-based routing.
Step 1: Install Ingress Controller (if not installed)
Step 2: Deploy UFM with Ingress
helm install ufm ./ufm-enterprise \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4 \
--set ingress.enabled=true \
--set ingress.className=traefik \
--set ingress.host=ufm.example.com
Access UFM at: https://ufm.example.com
Note: When Ingress is enabled, UFM automatically switches to ports 8080/18443 to avoid conflicts.
Option 3: Using a Values File
For complex configurations, use a YAML values file:
# my-values.yaml
namespace:
name: ufm-enterprise
image:
pullPolicy: Never
config:
fabricInterface: ib0
mgmtInterface: eth0
storage:
className: nfs-client
size: 50Gi
resources:
requests:
memory: 8Gi
cpu: 4
limits:
memory: 16Gi
cpu: 8
license:
existingConfigMap: ufm-license
ingress:
enabled: true
className: nginx
host: ufm.example.com
annotations:
nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
nodeSelector:
kubernetes.io/hostname: ufm-node
Deploy with the values file:
helm install ufm ./ufm-enterprise -f my-values.yaml -n ufm-enterprise
Plugin Deployment
UFM plugins run as separate pods with pod affinity to ensure they are scheduled on the same node as UFM.
Plugin Configuration Fields
Limitation: In this version, you must manually specify the plugin port number. Refer to the plugin documentation for the correct port value.
Field | Description | Required |
name | Plugin name without | Yes |
image | Plugin Docker image repository | Yes |
tag | Plugin image tag | Yes |
port | Plugin service port (omit if no HTTP) | No |
imagePullPolicy | Image pull policy | No (default: IfNotPresent) |
| HTTP health endpoint path | No |
| Port for health endpoint | No (defaults to |
| Seconds before first liveness probe | No (default: 60) |
| Seconds between liveness probes | No (default: 30) |
| Seconds before probe times out | No (default: 15) |
| Failures before restart | No (default: 3) |
| Seconds before first readiness probe | No (default: 10) |
| Seconds between readiness probes | No (default: 10) |
| Seconds before probe times out | No (default: 15) |
| Failures before not-ready | No (default: 3) |
Deploy Single Plugin
helm install ufm ./ufm-enterprise \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4 \
--set plugins.items[0].name=<plugin-name> \
--set plugins.items[0].image=<plugin-image> \
--set plugins.items[0].tag=<plugin-version> \
--set plugins.items[0].port=<plugin-port> \
--set plugins.items[0].imagePullPolicy=Always
Deploy Multiple Plugins
helm install ufm ufm-enterprise-<version>-helm.tgz \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4 \
--set plugins.items[0].name=<plugin1-name> \
--set plugins.items[0].image=<plugin1-image> \
--set plugins.items[0].tag=<plugin1-version> \
--set plugins.items[0].port=<plugin1-port> \
--set plugins.items[0].imagePullPolicy=Always \
--set plugins.items[1].name=<plugin2-name> \
--set plugins.items[1].image=<plugin2-image> \
--set plugins.items[1].tag=<plugin2-version> \
--set plugins.items[1].port=<plugin2-port> \
--set plugins.items[1].imagePullPolicy=Always
Important: Plugin array indices must be sequential starting from 0.
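Writing the indices by hand is error-prone when several plugins are deployed. As a sketch, the helper below generates the --set flags with sequential indices from a simple list. The helper name plugin_set_flags and the one-plugin-per-line input format ("name image tag [port]") are assumptions made for illustration; only the plugins.items[N] field names come from the examples above.

```shell
# Hypothetical helper: reads one plugin per line ("name image tag [port]")
# on stdin and prints --set flags with indices 0, 1, 2, ...
plugin_set_flags() {
  i=0
  while read -r name image tag port; do
    [ -n "$name" ] || continue           # skip blank lines
    printf '%s\n' "--set plugins.items[$i].name=$name"
    printf '%s\n' "--set plugins.items[$i].image=$image"
    printf '%s\n' "--set plugins.items[$i].tag=$tag"
    if [ -n "$port" ]; then              # omit the port for plugins without HTTP
      printf '%s\n' "--set plugins.items[$i].port=$port"
    fi
    i=$((i + 1))
  done
}
```

The generated lines can be reviewed and then appended to the helm install command. For more than a few plugins, a values file is likely easier to maintain.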
Plugin Without HTTP Port
Some plugins don't expose an HTTP port. Omit the port field:
--set plugins.items[0].name=<plugin-name> \
--set plugins.items[0].image=<plugin-image> \
--set plugins.items[0].tag=<plugin-version> \
--set plugins.items[0].imagePullPolicy=Always
Plugins with Values File
# plugins-values.yaml
plugins:
items:
- name: <plugin-name>
image: <plugin-image>
tag: <plugin-version>
port: <plugin-port>
imagePullPolicy: Always
helm install ufm ufm-enterprise-<version>-helm.tgz -f my-values.yaml -f plugins-values.yaml -n ufm-enterprise
Custom Configuration Files
The Helm chart includes default UFM configuration files that can be customized.
Included Config Files
File | Description |
gv.cfg | Main UFM configuration |
opensm/opensm.conf | OpenSM configuration |
| SHARP AM configuration |
| Primary telemetry environment |
| IBDiagNet configuration |
| Secondary telemetry config |
Method 1: Edit Files Before Install
Extract the chart:
tar xzf ufm-enterprise-<version>-helm.tgz
Edit config files:
vim ufm-enterprise/files/config/gv.cfg
vim ufm-enterprise/files/config/opensm/opensm.conf
Install with modified files:
helm install ufm ./ufm-enterprise -n ufm-enterprise \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4
Configuration Priority
Configuration is applied in this order (later wins):
Base install/upgrade - UFM default config files
Helm chart config files - Files from the files/config/ directory
Helm values - config.fabricInterface, config.mgmtInterface
Adding Custom Counter Sets
Add custom Prometheus counter set files for telemetry customization:
Extract the chart:
tar xzf ufm-enterprise-<version>-helm.tgz
Add custom cset file:
mkdir -p ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/
cp my-custom-counters.cset ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/
Install:
helm install ufm ./ufm-enterprise -n ufm-enterprise \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4
Operations
Start/Stop UFM
Stop UFM
Scale down the deployment to 0 replicas:
kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=0
Verify UFM is stopped:
kubectl get pods -n ufm-enterprise
Start UFM
Scale back up to 1 replica:
kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1
Wait for the pod to be ready:
kubectl get pods -n ufm-enterprise -w
View Logs
Container Logs
Follow logs:
kubectl logs -n ufm-enterprise -l app=ufm-enterprise -f
Previous container logs (after crash):
kubectl logs -n ufm-enterprise -l app=ufm-enterprise --previous
UFM Application Logs
# Resolve the UFM pod name (kubectl exec takes a pod name, not a label selector)
UFM_POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o name | head -n 1)
# List log files
kubectl exec -n ufm-enterprise "$UFM_POD" -- ls -la /opt/ufm/files/log/
# View specific log
kubectl exec -n ufm-enterprise "$UFM_POD" -- cat /opt/ufm/files/log/console.log
# Tail a log
kubectl exec -n ufm-enterprise "$UFM_POD" -- tail -100 /opt/ufm/files/log/ufmhealth.log
Access UFM UI and REST API
Web UI
https://<node-ip>:443/ufm_web/
REST API
# Get UFM version
curl https://<node-ip>:443/ufmRest/app/ufm_version
# List resources
curl https://<node-ip>:443/ufmRest/resources/systems
Uninstallation
Remove UFM
Run:
helm uninstall ufm -n ufm-enterprise
Warning: This deletes all UFM resources including the PersistentVolumeClaim and data.
Resource Cleanup
Remove All Resources
To delete the entire namespace and all associated resources:
kubectl delete namespace ufm-enterprise
Remove Specific Resources Only
To delete selected resources instead of the full namespace:
kubectl delete pvc -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
kubectl delete configmap -n ufm-enterprise ufm-license
kubectl delete secret -n ufm-enterprise ufm-tls
Networking
Port Reference
Port | Service | Description |
80 / 8080 | Apache HTTP | Web UI and REST API (HTTP) |
443 / 18443 | Apache HTTPS | Web UI and REST API (HTTPS) |
8000 | Flask | Internal REST server |
8081 | OpenSM | Plugin communication |
8082 | OpenSM | Trap listener |
8087 | Auth | Authentication service |
9001 | Telemetry | Prometheus metrics endpoint |
9002 | Telemetry | Secondary metrics endpoint |
8401+ | Plugins | Plugin-specific ports |
Host Network Architecture
UFM is deployed with hostNetwork: true, enabling direct access to:
InfiniBand interfaces for fabric management
Host ports for external connectivity
Low-latency communication with OpenSM
Implications:
UFM pods bind directly to the node’s network stack
Required ports must be available on the host
Port conflicts may prevent pod startup
Traffic Flow with Ingress
Client → Ingress Controller (80/443) → UFM Service → UFM Pod (8080/18443)
When Ingress is enabled:
UFM listens internally on ports 8080 and 18443
The Ingress controller handles external traffic on ports 80 and 443
TLS may be terminated at the Ingress or passed through to UFM
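Scripts that probe UFM directly on the node need to know which ports Apache is listening on in each mode. The sketch below returns the HTTP and HTTPS ports described in this guide, assuming the default port settings have not been overridden. The helper name ufm_listen_ports is hypothetical.

```shell
# Hypothetical helper: prints the "HTTP HTTPS" ports UFM listens on,
# assuming the default (non-customized) Apache port settings.
ufm_listen_ports() {
  if [ "$1" = "true" ]; then
    echo "8080 18443"                    # Ingress enabled
  else
    echo "80 443"                        # host network mode without Ingress
  fi
}
```

For example, ufm_listen_ports true prints 8080 18443.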
Storage
Data Persistence
All data stored under /opt/ufm/files/ is persisted via a PersistentVolumeClaim (PVC), ensuring data retention across pod restarts.
Security Considerations
Privileged Container Requirement
UFM runs in privileged mode to allow:
Direct access to InfiniBand hardware
Loading of kernel modules
Management of the InfiniBand Subnet Manager
Security Impact: Privileged containers have elevated access to the host and should be deployed with caution.
Host Network Implications
Using hostNetwork: true means:
UFM can access all host network interfaces
Service ports are exposed directly on the node
Kubernetes NetworkPolicies do not apply to pod traffic
Best Practices
Dedicated Nodes – Deploy UFM on dedicated infrastructure nodes.
Node Taints – Apply taints to prevent unrelated workloads from scheduling on UFM nodes.
Network Segmentation – Isolate UFM nodes on a management network.
RBAC Controls – Restrict access to the UFM namespace using Kubernetes RBAC.
Secrets Management – Store sensitive data in Kubernetes Secrets.
Regular Updates – Keep UFM and Kubernetes components up to date.
Monitoring
Kubernetes Probes
UFM uses two probes:
Probe | Purpose | Check |
Startup | Wait for UFM initialization | REST API returns HTTP 200 |
Liveness | Detect failures | |
Health Check Details
Startup Probe:
Calls /app/versioning/ on UFM Web server
Returns 503 during initialization, 200 when ready
Allows up to 5 minutes for startup
Liveness Probe:
Verifies UfmHealthRunner process is running
Checks for failover flag (critical failure indicator)
Verifies config_watcher.sh is running
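The startup probe behavior described above can be mirrored in a manual check by interpreting the HTTP status code. The sketch below maps a status code to a state; the helper name startup_state is hypothetical, and the full URL of the versioning endpoint is not given in this guide, so it is left to the caller.

```shell
# Hypothetical helper: interprets the HTTP status code returned by the
# endpoint used by the startup probe.
startup_state() {
  case "$1" in
    200) echo ready ;;
    503) echo initializing ;;
    *)   echo unknown ;;
  esac
}
```

The status code could be obtained with curl, for example with the options -s -o /dev/null -w '%{http_code}', and then passed to startup_state.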
Monitoring Commands
Verify Probe Status
Run the following command to review the liveness and startup probe configuration and status:
kubectl describe pod -n ufm-enterprise -l app=ufm-enterprise | grep -A 5 -E "Liveness:|Startup:"
Verify UFM Processes
Use the command below to list running UFM-related processes inside the pod:
kubectl exec -n ufm-enterprise "$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o name | head -n 1)" -- ps aux
Check UFM Health Log
Run the following command to inspect the UFM health log file:
kubectl exec -n ufm-enterprise "$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o name | head -n 1)" -- cat /opt/ufm/files/log/ufmhealth.log
Known Limitations
Limitation | Description | Impact |
Single Pod | Only one replica supported | No horizontal scaling |
No Automatic Failover | Pod won't migrate on node failure | Manual intervention required |
No High Availability | HA mode not supported in K8s | Use Docker HA for HA requirements |
Privileged Mode | Container requires privileged access | Security considerations |
Host Network | Uses host networking | Port conflicts possible |
sysdump Unavailable | sysdump collector doesn't work | Use manual log collection |
Recreate Strategy | Rolling updates not supported | Brief downtime during upgrades |
Plugin Operations | Not all plugin operations are supported | Some plugin features may not work |
Plugin Port Configuration | User must manually specify plugin ports | Refer to plugin documentation for port values |