Date: 2026-02-22
On This Page
- UFM Enterprise Kubernetes Deployment Guide
- Overview
- Supported Environments
- Prerequisites
- Installation
- Step 1: Set Up Storage
- Step 2: UFM Docker Image
- Step 3: Create Namespace and License ConfigMap
- Step 4: Install UFM with Helm
- Step 5: Verify Installation
- Configuration Reference
- Namespace Configuration
- Image Configuration
- UFM Configuration
- Storage Configuration
- Resource Limits (Required)
- License Configuration
- Startup Probe Configuration
- Liveness Probe Configuration
- Service Configuration
- Ingress Configuration
- Scheduling Configuration
- Plugin Configuration
- Deployment Options
- Plugin Deployment
- Custom Configuration Files
- Operations
- Uninstallation
- Networking
- Storage
- Security Considerations
- Best Practices
- Monitoring
- Known Limitations
Appendix - UFM on Kubernetes
This guide provides comprehensive instructions for deploying NVIDIA UFM Enterprise on Kubernetes using Helm charts.
Overview
What's New
UFM Enterprise now supports deployment on Kubernetes clusters using Helm charts. This deployment method provides:
Declarative Configuration: Define your UFM deployment using Helm values
Simplified Operations: Use standard Kubernetes tools for deployment, upgrades, and management
Plugin Support: Deploy UFM plugins as separate pods with automatic configuration
Ingress Integration: Expose UFM through Kubernetes Ingress controllers
Persistent Storage: Use Kubernetes PersistentVolumeClaims for data persistence
Supported Environments
Kubernetes Version
Kubernetes 1.28 or later.
Node Operating Systems
UFM on Kubernetes supports the same operating systems as UFM Enterprise. See the Installation Notes for the complete list of supported operating systems.
Hardware Requirements
UFM on Kubernetes has the same hardware requirements as UFM Enterprise. See the Installation Notes for detailed specifications.
Prerequisites
Before deploying UFM Enterprise on Kubernetes, ensure the following requirements are met:
Kubernetes Cluster
Kubernetes cluster version 1.28 or later
kubectl configured with cluster access
Cluster admin permissions for installation
Helm
Helm 3.x installed on the management workstation. Run:
# Verify Helm installation
helm version
Storage
A StorageClass that supports ReadWriteMany access mode
Minimum 10GB storage capacity
InfiniBand
At least one node with InfiniBand interface
DOCA drivers installed on the worker node where UFM will be deployed
InfiniBand port configured and in "up" state. Run:
# Verify InfiniBand interface
ip link show | grep -E 'ib[0-9]|ibp'
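The check above lists every matching interface regardless of state. As a minimal sketch (not part of the UFM tooling), the filter below keeps only InfiniBand interfaces that report state UP. The name `ib_up_interfaces` is hypothetical, and the helper assumes the one-line-per-interface format produced by `ip -o link show`.

```shell
# Hypothetical helper (not shipped with UFM): reads `ip -o link show` output
# on stdin and prints InfiniBand interfaces (ib0, ibp4s0f0, ...) in state UP.
ib_up_interfaces() {
  awk '{
    name = $2
    sub(/:$/, "", name)                 # "ib0:" -> "ib0"
    if (name ~ /^ib([0-9]|p)/ && $0 ~ / state UP /) print name
  }'
}

# Usage on a worker node:
#   ip -o link show | ib_up_interfaces
```

An empty result suggests that no InfiniBand port on the node is up yet.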
UFM License
Valid UFM Enterprise license file
License file accessible from the management workstation
Network Ports
UFM uses host network mode. Ensure these ports are available on the target node (additional ports may also be used):
| Port | Protocol | Purpose | Configurable |
| --- | --- | --- | --- |
| 80/443 | TCP | Apache HTTP/HTTPS | Yes |
| 8000 | TCP | UFM Internal REST Server | No |
| 8081 | TCP | OpenSM Plugin Communication | Yes |
| 8082 | TCP | OpenSM Traps Listening | Yes |
| 8087 | TCP | Auth Service | Yes |
| 9001 | TCP | Telemetry/Prometheus Endpoint | Yes |
| 9002 | TCP | Secondary Telemetry Endpoint | Yes |
| 8401+ | TCP | Plugin Ports (varies per plugin) | Yes |
When using Ingress, UFM automatically switches to ports 8080/18443 to avoid conflicts with the Ingress controller.
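Because UFM binds to host ports, it can help to check for existing listeners before installing. The following is a sketch under stated assumptions: `ports_in_use` and `UFM_PORTS` are hypothetical names, the port list covers only the default ports named in this guide, and the input is expected in the column layout printed by `ss -ltn`.

```shell
# Default ports from this guide (plugin ports 8401+ vary per plugin).
UFM_PORTS="80 443 8000 8081 8082 8087 9001 9002"

# Hypothetical helper: reads `ss -ltn` output on stdin and prints every
# port from UFM_PORTS that already has a TCP listener.
ports_in_use() {
  awk -v wanted="$UFM_PORTS" '
    BEGIN { n = split(wanted, w, " "); for (i = 1; i <= n; i++) want[w[i]] = 1 }
    $1 == "LISTEN" {
      m = split($4, part, ":")          # local address is HOST:PORT
      if (part[m] in want) busy[part[m]] = 1
    }
    END { for (i = 1; i <= n; i++) if (w[i] in busy) print w[i] }
  '
}

# Usage on the target node:
#   ss -ltn | ports_in_use
```

Any port printed is already taken on the node and would need to be freed or, where the port is configurable, changed.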
Installation
Step 1: Set Up Storage
UFM requires ReadWriteMany storage. Before you continue, make sure persistent storage is configured with a StorageClass that supports this access mode.
Step 2: UFM Docker Image
The UFM Docker image must be available to your Kubernetes cluster.
It can be preloaded on the cluster nodes or hosted in your own registry.
Step 3: Create Namespace and License ConfigMap
# Create the namespace
kubectl create namespace ufm-enterprise
# Create license ConfigMap
kubectl create configmap ufm-license \
--from-file=<license-filename>.lic=/path/to/your/<license-filename>.lic \
-n ufm-enterprise
Step 4: Install UFM with Helm
helm install ufm ufm-enterprise-<version>-helm.tgz \
--namespace ufm-enterprise \
--set config.fabricInterface=<your_ib_interface> \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4
Replace <your_ib_interface> with your InfiniBand interface name (e.g., ib0, ibp4s0f0).
Note: The Helm chart is distributed as a .tgz package.
Step 5: Verify Installation
Watch the pod status:
kubectl get pods -n ufm-enterprise -w
Expected state transitions:
NAME                            READY   STATUS            AGE
ufm-ufm-enterprise-xxxxxxxxxx   0/1     Init:0/1          5s
ufm-ufm-enterprise-xxxxxxxxxx   0/1     PodInitializing   30s
ufm-ufm-enterprise-xxxxxxxxxx   0/1     Running           45s
ufm-ufm-enterprise-xxxxxxxxxx   1/1     Running           2m
Note: The pod shows 0/1 Running while the startup probe waits for UFM to fully initialize. This can take several minutes, depending on the cluster size.
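Instead of watching the pod list by hand, a script can block until the pod reports Ready. This is a minimal sketch: `wait_for_ufm` is a hypothetical wrapper name, the 600s default timeout is an arbitrary choice, and the selector reuses the `app=ufm-enterprise` label that other commands in this guide rely on.

```shell
# Hypothetical wrapper: block until the UFM pod reports Ready or the timeout
# (default 600s, an arbitrary choice) expires.
wait_for_ufm() {
  kubectl wait pod \
    --namespace ufm-enterprise \
    --selector app=ufm-enterprise \
    --for=condition=Ready \
    --timeout="${1:-600s}"
}

# Usage:
#   wait_for_ufm        # wait up to 10 minutes
#   wait_for_ufm 15m    # wait up to 15 minutes
```

`kubectl wait` returns an error if no pod matches the selector when it is called, so run it after `helm install` has created the pod.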
Configuration Reference
All configuration options are set via Helm values. Use --set key=value or a values file (-f values.yaml).
Namespace Configuration
| Parameter | Description | Default |
| --- | --- | --- |
| | Create the namespace | |
| namespace.name | Namespace name | |
Image Configuration
| Parameter | Description | Default |
| --- | --- | --- |
| | Image repository | |
| | Image tag | |
| image.pullPolicy | Image pull policy (Required) | - |
| | Image pull secrets for private registries | |
Note: image.pullPolicy must be set to one of: Never, IfNotPresent, or Always.
UFM Configuration
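| Parameter | Description | Default |
| --- | --- | --- |
| config.fabricInterface | InfiniBand fabric interface name | |
| config.mgmtInterface | Management network interface name | |
| | Apache HTTP port | |
| | Apache HTTPS port | |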
Storage Configuration
| Parameter | Description | Default |
| --- | --- | --- |
| | Enable PVC creation | |
| | Use existing PVC name | |
| storage.className | Storage class name (Required) | - |
| storage.size | Persistent volume size | |
| | PVC access mode | |
Resource Limits (Required)
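| Parameter | Description | Default |
| --- | --- | --- |
| resources.requests.memory | Memory request (Required) | - |
| resources.requests.cpu | CPU request (Required) | - |
| resources.limits.memory | Memory limit (Required) | - |
| resources.limits.cpu | CPU limit (Required) | - |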
License Configuration
| Parameter | Description | Default |
| --- | --- | --- |
| license.existingConfigMap | ConfigMap containing license file(s) | |
| | Secret containing license file(s) | |
Startup Probe Configuration
| Parameter | Description | Default |
| --- | --- | --- |
| | Enable startup probe | |
| | Initial delay | |
| | Probe interval | |
| | Probe timeout | |
| | Failures before giving up | |
Note: With default settings, UFM has up to 5 minutes (10s × 30) to fully start.
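The five-minute figure follows from how Kubernetes startup probes work: the container is given roughly the initial delay plus the probe interval multiplied by the failure threshold. A small sketch of that arithmetic, with `startup_budget` as a hypothetical helper name and its three arguments standing in for the chart's probe values:

```shell
# Upper bound, in seconds, on how long a startup probe lets a container
# initialize: initial delay + (probe interval x failure threshold).
startup_budget() {
  initial_delay=$1
  period=$2
  failure_threshold=$3
  echo $(( initial_delay + period * failure_threshold ))
}

# The values cited in the note (10s interval, 30 failures), assuming no
# initial delay:
#   startup_budget 0 10 30    # prints 300 (seconds), i.e. 5 minutes
```

On large fabrics where UFM needs longer to start, raising either the interval or the failure threshold extends the budget proportionally.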
Liveness Probe Configuration
| Parameter | Description | Default |
| --- | --- | --- |
| | Enable liveness probe | |
| | Initial delay | |
| | Probe interval | |
| | Probe timeout | |
| | Failures before restart | |
Service Configuration
| Parameter | Description | Default |
| --- | --- | --- |
| service.enabled | Enable Kubernetes Service | |
| | Service type | |
| | NodePort number (30000-32767) | |
Note: A Service is automatically created when Ingress is enabled. Use service.enabled=true only if you need a standalone Service without Ingress (e.g., LoadBalancer type in cloud environments).
Ingress Configuration
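| Parameter | Description | Default |
| --- | --- | --- |
| ingress.enabled | Expose UFM via Ingress controller for external access | |
| ingress.className | Ingress controller to use (e.g., | |
| ingress.host | DNS hostname for accessing UFM (e.g., | |
| ingress.annotations | Controller-specific annotations (e.g., backend protocol, timeouts) | |
| | Kubernetes TLS Secret for HTTPS (created via | |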
Scheduling Configuration
| Parameter | Description | Default |
| --- | --- | --- |
| nodeSelector | Schedule UFM on nodes with specific labels (e.g., | |
| | Allow UFM to run on tainted nodes (e.g., dedicated infrastructure nodes) | |
| | Advanced scheduling rules for node or pod affinity/anti-affinity | |
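Label keys used with nodeSelector, such as kubernetes.io/hostname, contain dots. Helm's --set syntax reads an unescaped dot as a nested-key separator, so each dot inside the key must be escaped with a backslash. A minimal sketch of a helper that does the escaping; `escape_set_key` is a hypothetical name, not part of Helm or the UFM chart.

```shell
# Hypothetical helper: escape the dots in a label key so that `helm --set`
# does not treat them as path separators.
escape_set_key() {
  printf '%s\n' "$1" | sed 's/\./\\./g'
}

# Usage:
#   helm install ... --set "nodeSelector.$(escape_set_key kubernetes.io/hostname)=ufm-node"
```

A values file avoids the issue entirely, because YAML keys need no escaping.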
Example - Schedule on specific node:
--set nodeSelector."kubernetes\.io/hostname"=ufm-node
Plugin Configuration
| Parameter | Description | Default |
| --- | --- | --- |
| plugins.items | List of plugins to deploy (see example below) | |
| | Default resource limits for plugins if not specified per-plugin | See below |
Example - Deploy with a plugin:
helm install ufm ufm-enterprise-<version>-helm.tgz \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4 \
--set plugins.items[0].name=<plugin-name> \
--set plugins.items[0].image=<plugin-image> \
--set plugins.items[0].tag=<plugin-version> \
--set plugins.items[0].port=<plugin-port> \
--set plugins.items[0].imagePullPolicy=Always
Deployment Options
Option 1: Host Network Mode (Default)
This is the default and simplest deployment mode. UFM binds directly to the host's network ports.
helm install ufm ./ufm-enterprise \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4
Access UFM at:
https://<node-ip>:443
Option 2: With Ingress Controller
Use an Ingress controller for external access with TLS termination and hostname-based routing.
Step 1: Install Ingress Controller (if not installed)
Step 2: Deploy UFM with Ingress
helm install ufm ./ufm-enterprise \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4 \
--set ingress.enabled=true \
--set ingress.className=traefik \
--set ingress.host=ufm.example.com
Access UFM at: https://ufm.example.com
Note: When Ingress is enabled, UFM automatically switches to ports 8080/18443 to avoid conflicts.
Option 3: Using a Values File
For complex configurations, use a YAML values file:
# my-values.yaml
namespace:
  name: ufm-enterprise
image:
  pullPolicy: Never
config:
  fabricInterface: ib0
  mgmtInterface: eth0
storage:
  className: nfs-client
  size: 50Gi
resources:
  requests:
    memory: 8Gi
    cpu: 4
  limits:
    memory: 16Gi
    cpu: 8
license:
  existingConfigMap: ufm-license
ingress:
  enabled: true
  className: nginx
  host: ufm.example.com
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
nodeSelector:
  kubernetes.io/hostname: ufm-node
Deploy with the values file:
helm install ufm ./ufm-enterprise -f my-values.yaml -n ufm-enterprise
Plugin Deployment
UFM plugins run as separate pods with pod affinity to ensure they are scheduled on the same node as UFM.
Plugin Configuration Fields
Limitation: In this version, you must manually specify the plugin port number. Refer to the plugin documentation for the correct port value.
| Field | Description | Required |
| --- | --- | --- |
| name | Plugin name without | Yes |
| image | Plugin Docker image repository | Yes |
| tag | Plugin image tag | Yes |
| port | Plugin service port (omit if no HTTP) | No |
| imagePullPolicy | Image pull policy | No (default: IfNotPresent) |
| | HTTP health endpoint path | No |
| | Port for health endpoint | No (defaults to |
| | Seconds before first liveness probe | No (default: 60) |
| | Seconds between liveness probes | No (default: 30) |
| | Seconds before probe times out | No (default: 15) |
| | Failures before restart | No (default: 3) |
| | Seconds before first readiness probe | No (default: 10) |
| | Seconds between readiness probes | No (default: 10) |
| | Seconds before probe times out | No (default: 15) |
| | Failures before not-ready | No (default: 3) |
Deploy Single Plugin
helm install ufm ./ufm-enterprise \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4 \
--set plugins.items[0].name=<plugin-name> \
--set plugins.items[0].image=<plugin-image> \
--set plugins.items[0].tag=<plugin-version> \
--set plugins.items[0].port=<plugin-port> \
--set plugins.items[0].imagePullPolicy=Always
Deploy Multiple Plugins
helm install ufm ufm-enterprise-<version>-helm.tgz \
--namespace ufm-enterprise \
--set config.fabricInterface=ib0 \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4 \
--set plugins.items[0].name=<plugin1-name> \
--set plugins.items[0].image=<plugin1-image> \
--set plugins.items[0].tag=<plugin1-version> \
--set plugins.items[0].port=<plugin1-port> \
--set plugins.items[0].imagePullPolicy=Always \
--set plugins.items[1].name=<plugin2-name> \
--set plugins.items[1].image=<plugin2-image> \
--set plugins.items[1].tag=<plugin2-version> \
--set plugins.items[1].port=<plugin2-port> \
--set plugins.items[1].imagePullPolicy=Always
Important: Plugin array indices must be sequential starting from 0.
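Numbering the indices by hand is error-prone once several plugins are involved. As an illustrative sketch, the helper below generates the flags from a plain list so the indices are always sequential from 0. `plugin_set_flags` and the "name image tag [port]" input format are assumptions made for this example, not something the chart defines.

```shell
# Hypothetical helper: read one plugin per line on stdin as
# "name image tag [port]" and print the matching --set flags, numbering the
# entries 0, 1, 2, ... in input order. Lines with fewer than three fields
# (including blank lines) are skipped.
plugin_set_flags() {
  awk 'NF >= 3 {
    i = n++
    printf "--set plugins.items[%d].name=%s\n", i, $1
    printf "--set plugins.items[%d].image=%s\n", i, $2
    printf "--set plugins.items[%d].tag=%s\n", i, $3
    if (NF >= 4) printf "--set plugins.items[%d].port=%s\n", i, $4
  }'
}

# Usage (plugin names and images here are placeholders):
#   printf '%s\n' 'alpha registry.example.com/alpha 1.0 8401' \
#                 'beta registry.example.com/beta 2.0' | plugin_set_flags
```

The port flag is emitted only when a fourth field is present, which matches plugins that do not expose an HTTP port.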
Plugin Without HTTP Port
Some plugins don't expose an HTTP port. Omit the port field:
--set plugins.items[0].name=<plugin-name> \
--set plugins.items[0].image=<plugin-image> \
--set plugins.items[0].tag=<plugin-version> \
--set plugins.items[0].imagePullPolicy=Always
Plugins with Values File
# plugins-values.yaml
plugins:
  items:
    - name: <plugin-name>
      image: <plugin-image>
      tag: <plugin-version>
      port: <plugin-port>
      imagePullPolicy: Always
helm install ufm ufm-enterprise-<version>-helm.tgz -f my-values.yaml -f plugins-values.yaml -n ufm-enterprise
Custom Configuration Files
The Helm chart includes default UFM configuration files that can be customized.
Included Config Files
| File | Description |
| --- | --- |
| gv.cfg | Main UFM configuration |
| opensm/opensm.conf | OpenSM configuration |
| | SHARP AM configuration |
| | Primary telemetry environment |
| | IBDiagNet configuration |
| | Secondary telemetry config |
Method 1: Edit Files Before Install
Extract the chart:
tar xzf ufm-enterprise-<version>-helm.tgz
Edit config files:
vim ufm-enterprise/files/config/gv.cfg
vim ufm-enterprise/files/config/opensm/opensm.conf
Install with modified files:
helm install ufm ./ufm-enterprise -n ufm-enterprise \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4
Configuration Priority
Configuration is applied in this order (later wins):
1. Base install/upgrade - UFM default config files
2. Helm chart config files - Files from the files/config/ directory
3. Helm values - config.fabricInterface, config.mgmtInterface
Adding Custom Counter Sets
Add custom Prometheus counter set files for telemetry customization:
Extract the chart:
tar xzf ufm-enterprise-<version>-helm.tgz
Add custom cset file:
mkdir -p ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/
cp my-custom-counters.cset ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/
Install:
helm install ufm ./ufm-enterprise -n ufm-enterprise \
--set storage.className=nfs-client \
--set image.pullPolicy=Never \
--set resources.requests.memory=4Gi \
--set resources.requests.cpu=2 \
--set resources.limits.memory=8Gi \
--set resources.limits.cpu=4
Operations
Start/Stop UFM
Stop UFM
Scale down the deployment to 0 replicas:
kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=0
Verify UFM is stopped:
kubectl get pods -n ufm-enterprise
Start UFM
Scale back up to 1 replica:
kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1
Wait for the pod to be ready:
kubectl get pods -n ufm-enterprise -w
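Since stopping and starting differ only in the replica count, the two scale commands can be wrapped in one guarded helper. This is a sketch with a hypothetical function name, `ufm_scale`. It refuses any value other than 0 or 1 because this guide states that only a single UFM replica is supported.

```shell
# Hypothetical wrapper around the scale commands in this section. It rejects
# anything other than 0 or 1 because only one UFM replica is supported.
ufm_scale() {
  case "$1" in
    0|1) ;;
    *) echo "replicas must be 0 (stop) or 1 (start)" >&2; return 1 ;;
  esac
  kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas="$1"
}

# Usage:
#   ufm_scale 0    # stop UFM
#   ufm_scale 1    # start UFM
```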
View Logs
Container Logs
Follow logs:
kubectl logs -n ufm-enterprise -l app=ufm-enterprise -f
Previous container logs (after crash):
kubectl logs -n ufm-enterprise -l app=ufm-enterprise --previous
UFM Application Logs
# Resolve the UFM pod name (kubectl exec does not accept a label selector)
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
# List log files
kubectl exec -n ufm-enterprise "$POD" -- ls -la /opt/ufm/files/log/
# View specific log
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/console.log
# Tail a log
kubectl exec -n ufm-enterprise "$POD" -- tail -n 100 /opt/ufm/files/log/ufmhealth.log
Access UFM UI and REST API
Web UI
https://<node-ip>:443/ufm_web/
REST API
# Get UFM version
curl https://<node-ip>:443/ufmRest/app/ufm_version
# List resources
curl https://<node-ip>:443/ufmRest/resources/systems
Uninstallation
Remove UFM
Run:
helm uninstall ufm -n ufm-enterprise
Warning: This deletes all UFM resources including the PersistentVolumeClaim and data.
Resource Cleanup
Remove All Resources
To delete the entire namespace and all associated resources:
kubectl delete namespace ufm-enterprise
Remove Specific Resources Only
To delete selected resources instead of the full namespace:
kubectl delete pvc -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
kubectl delete configmap -n ufm-enterprise ufm-license
kubectl delete secret -n ufm-enterprise ufm-tls
Networking
Port Reference
| Port | Service | Description |
| --- | --- | --- |
| 80 / 8080 | Apache HTTP | Web UI and REST API (HTTP) |
| 443 / 18443 | Apache HTTPS | Web UI and REST API (HTTPS) |
| 8000 | Flask | Internal REST server |
| 8081 | OpenSM | Plugin communication |
| 8082 | OpenSM | Trap listener |
| 8087 | Auth | Authentication service |
| 9001 | Telemetry | Prometheus metrics endpoint |
| 9002 | Telemetry | Secondary metrics endpoint |
| 8401+ | Plugins | Plugin-specific ports |
Host Network Architecture
UFM is deployed with hostNetwork: true, enabling direct access to:
InfiniBand interfaces for fabric management
Host ports for external connectivity
Low-latency communication with OpenSM
Implications:
UFM pods bind directly to the node’s network stack
Required ports must be available on the host
Port conflicts may prevent pod startup
Traffic Flow with Ingress
Client → Ingress Controller (80/443) → UFM Service → UFM Pod (8080/18443)
When Ingress is enabled:
UFM listens internally on ports 8080 and 18443
The Ingress controller handles external traffic on ports 80 and 443
TLS may be terminated at the Ingress or passed through to UFM
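The port switch described above can be captured in a small lookup, which is handy in scripts that need to probe UFM directly on the node. This is a sketch only: `ufm_web_ports` is a hypothetical name, and the values are the default ports from this guide, which may have been reconfigured in a given deployment.

```shell
# Print the Apache HTTP and HTTPS ports UFM listens on, per this guide:
# 80/443 in host network mode, 8080/18443 when Ingress is enabled.
# $1: "true" if ingress.enabled=true was set, anything else otherwise.
ufm_web_ports() {
  if [ "$1" = "true" ]; then
    echo "8080 18443"
  else
    echo "80 443"
  fi
}

# Usage:
#   ufm_web_ports true     # ports behind an Ingress
#   ufm_web_ports false    # ports in plain host network mode
```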
Storage
Data Persistence
All data stored under /opt/ufm/files/ is persisted via a PersistentVolumeClaim (PVC), ensuring data retention across pod restarts.
Security Considerations
Privileged Container Requirement
UFM runs in privileged mode to allow:
Direct access to InfiniBand hardware
Loading of kernel modules
Management of the InfiniBand Subnet Manager
Security Impact: Privileged containers have elevated access to the host and should be deployed with caution.
Host Network Implications
Using hostNetwork: true means:
UFM can access all host network interfaces
Service ports are exposed directly on the node
Kubernetes NetworkPolicies do not apply to pod traffic
Best Practices
Dedicated Nodes – Deploy UFM on dedicated infrastructure nodes.
Node Taints – Apply taints to prevent unrelated workloads from scheduling on UFM nodes.
Network Segmentation – Isolate UFM nodes on a management network.
RBAC Controls – Restrict access to the UFM namespace using Kubernetes RBAC.
Secrets Management – Store sensitive data in Kubernetes Secrets.
Regular Updates – Keep UFM and Kubernetes components up to date.
Monitoring
Kubernetes Probes
UFM uses two probes:
| Probe | Purpose | Check |
| --- | --- | --- |
| Startup | Wait for UFM initialization | REST API returns HTTP 200 |
| Liveness | Detect failures | |
Health Check Details
Startup Probe:
Calls /app/versioning/ on UFM Web server
Returns 503 during initialization, 200 when ready
Allows up to 5 minutes for startup
Liveness Probe:
Verifies UfmHealthRunner process is running
Checks for failover flag (critical failure indicator)
Verifies config_watcher.sh is running
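To make the three liveness conditions concrete, here is an illustrative approximation of such a check. It is not the probe script shipped in the chart: `ufm_liveness` is a hypothetical name, and `FAILOVER_FLAG` is a placeholder path, since this guide does not say where the failover flag is stored.

```shell
# Illustrative approximation of the liveness checks listed above; this is
# NOT the chart's actual probe. FAILOVER_FLAG is a placeholder path.
FAILOVER_FLAG="${FAILOVER_FLAG:-/tmp/ufm-failover.flag}"

ufm_liveness() {
  pgrep -f UfmHealthRunner >/dev/null || return 1     # health runner alive?
  pgrep -f config_watcher.sh >/dev/null || return 1   # config watcher alive?
  [ ! -e "$FAILOVER_FLAG" ] || return 1               # flag present = failure
  return 0
}

# Usage inside the container:
#   ufm_liveness && echo healthy || echo unhealthy
```

A non-zero return from any single condition fails the whole check, which mirrors how Kubernetes treats a failing exec probe.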
Monitoring Commands
Verify Probe Status
Run the following command to review the liveness and startup probe configuration and status:
kubectl describe pod -n ufm-enterprise -l app=ufm-enterprise | grep -A 5 -E "Liveness:|Startup:"
Verify UFM Processes
Use the command below to list running UFM-related processes inside the pod:
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- ps aux
Check UFM Health Log
Run the following command to inspect the UFM health log file:
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/ufmhealth.log
Known Limitations
| Limitation | Description | Impact |
| --- | --- | --- |
| Single Pod | Only one replica supported | No horizontal scaling |
| No Automatic Failover | Pod won't migrate on node failure | Manual intervention required |
| No High Availability | HA mode not supported in K8s | Use Docker HA for HA requirements |
| Privileged Mode | Container requires privileged access | Security considerations |
| Host Network | Uses host networking | Port conflicts possible |
| sysdump Unavailable | sysdump collector doesn't work | Use manual log collection |
| Recreate Strategy | Rolling updates not supported | Brief downtime during upgrades |
| Plugin Operations | Not all plugin operations are supported | Some plugin features may not work |
| Plugin Port Configuration | User must manually specify plugin ports | Refer to plugin documentation for port values |