NVIDIA UFM Enterprise User Manual v6.24.1

Appendix - UFM on Kubernetes

This guide provides comprehensive instructions for deploying NVIDIA UFM Enterprise on Kubernetes using Helm charts.

Overview

What's New

UFM Enterprise now supports deployment on Kubernetes clusters using Helm charts. This deployment method provides:

  • Declarative Configuration: Define your UFM deployment using Helm values

  • Simplified Operations: Use standard Kubernetes tools for deployment, upgrades, and management

  • Plugin Support: Deploy UFM plugins as separate pods with automatic configuration

  • Ingress Integration: Expose UFM through Kubernetes Ingress controllers

  • Persistent Storage: Use Kubernetes PersistentVolumeClaims for data persistence

Supported Environments

Kubernetes Version

Kubernetes 1.28 or later.

Node Operating Systems

UFM on Kubernetes supports the same operating systems as UFM Enterprise. See the Installation Notes for the complete list of supported operating systems.

Hardware Requirements

UFM on Kubernetes has the same hardware requirements as UFM Enterprise. See the Installation Notes for detailed specifications.

Prerequisites

Before deploying UFM Enterprise on Kubernetes, ensure the following requirements are met:

Kubernetes Cluster

  • Kubernetes cluster version 1.28 or later

  • kubectl configured with cluster access

  • Cluster admin permissions for installation

Helm

  • Helm 3.x installed on the management workstation. Run:

    # Verify Helm installation
    helm version

Storage

  • A StorageClass that supports ReadWriteMany access mode

  • Minimum 10GB storage capacity

InfiniBand

  • At least one node with InfiniBand interface

  • DOCA drivers installed on the worker node where UFM will be deployed

  • InfiniBand port configured and in "up" state. Run:

    # Verify InfiniBand interface
    ip link show | grep -E 'ib[0-9]|ibp'
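The grep above lists InfiniBand interfaces but does not filter on their state. The following is a minimal sketch, assuming the one-line output format of `ip -o link show`, that prints only the `ib*` interfaces reported as `state UP`. The helper name `ib_interfaces_up` is illustrative and not part of UFM.

```shell
# ib_interfaces_up: reads `ip -o link show` output on stdin and prints the
# names of InfiniBand interfaces (ib0, ibp4s0f0, ...) whose state is UP.
ib_interfaces_up() {
  awk '$2 ~ /^ib/ && / state UP / {
    name = $2
    sub(/:$/, "", name)   # strip the trailing colon from "ib0:"
    print name
  }'
}

# Usage on the worker node:
#   ip -o link show | ib_interfaces_up
```

An empty result suggests that no InfiniBand interface is up on that node. Note that this checks the network device state as the kernel reports it, which may not tell the whole story about the fabric port itself.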

UFM License

  • Valid UFM Enterprise license file

  • License file accessible from the management workstation

Network Ports

UFM uses host network mode. Ensure the following ports are available on the target node. The list is not exhaustive; UFM may use additional ports:

| Port | Protocol | Purpose | Configurable |
| --- | --- | --- | --- |
| 80/443 | TCP | Apache HTTP/HTTPS | Yes |
| 8000 | TCP | UFM Internal REST Server | No |
| 8081 | TCP | OpenSM Plugin Communication | Yes |
| 8082 | TCP | OpenSM Traps Listening | Yes |
| 8087 | TCP | Auth Service | Yes |
| 9001 | TCP | Telemetry/Prometheus Endpoint | Yes |
| 9002 | TCP | Secondary Telemetry Endpoint | Yes |
| 8401+ | TCP | Plugin Ports (varies per plugin) | Yes |

Warning

When using Ingress, UFM automatically switches to ports 8080/18443 to avoid conflicts with the Ingress controller.
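Because UFM binds to host ports, it can help to check for existing listeners before installing. The sketch below is one possible pre-flight check, not an official tool: the function names are illustrative, the port list is copied from the table above (plugin ports are left out because they vary per plugin), and the parsing assumes the column layout of `ss -Htln` from iproute2.

```shell
# ufm_required_ports [ingress]: print the fixed TCP ports from the table above.
# Pass "ingress" to use the 8080/18443 pair that UFM switches to behind Ingress.
ufm_required_ports() {
  if [ "${1:-}" = "ingress" ]; then
    echo "8080 18443 8000 8081 8082 8087 9001 9002"
  else
    echo "80 443 8000 8081 8082 8087 9001 9002"
  fi
}

# busy_ports "<port list>": reads `ss -Htln` output on stdin and prints each
# listed port that already has a TCP listener.
busy_ports() {
  awk -v wanted="$1" '
    BEGIN { n = split(wanted, w, " ") }
    {
      m = split($4, parts, ":")   # $4 is the local address, e.g. 0.0.0.0:443
      seen[parts[m]] = 1
    }
    END { for (i = 1; i <= n; i++) if (w[i] in seen) print w[i] }
  '
}

# Usage on the target node:
#   ss -Htln | busy_ports "$(ufm_required_ports)"
#   ss -Htln | busy_ports "$(ufm_required_ports ingress)"
```

Any port printed is already in use and would likely conflict with UFM on that node.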

Installation

Step 1: Set Up Storage

UFM requires ReadWriteMany storage. Make sure persistent storage is configured in the cluster and that its StorageClass supports the ReadWriteMany access mode.
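One way to confirm that a StorageClass can provision ReadWriteMany volumes is to create a small throwaway claim against it. The sketch below only generates the manifest. The function name and the claim name `ufm-rwx-check` are hypothetical, and `nfs-client` is the StorageClass used in this guide's examples.

```shell
# rwx_test_pvc NAME STORAGE_CLASS SIZE: print a PVC manifest that requests
# ReadWriteMany access from the given StorageClass.
rwx_test_pvc() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: $2
  resources:
    requests:
      storage: $3
EOF
}

# Usage (any existing namespace works for the check):
#   rwx_test_pvc ufm-rwx-check nfs-client 10Gi | kubectl apply -n default -f -
#   kubectl get pvc ufm-rwx-check -n default
#   kubectl delete pvc ufm-rwx-check -n default
```

If the StorageClass uses the WaitForFirstConsumer binding mode, the claim stays Pending until a pod mounts it, so a Pending status alone does not prove that ReadWriteMany is unsupported.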

Step 2: UFM Docker Image

The UFM Docker image must be available to your Kubernetes cluster. Either preload it on the node where UFM will run or push it to a registry that the cluster can pull from.

Step 3: Create Namespace and License ConfigMap

# Create the namespace
kubectl create namespace ufm-enterprise

# Create license ConfigMap
kubectl create configmap ufm-license \
  --from-file=<license-filename>.lic=/path/to/your/<license-filename>.lic \
  -n ufm-enterprise
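The Configuration Reference also lists `license.existingSecret` as an alternative to the ConfigMap. The following is a minimal sketch of creating such a Secret, assuming the chart reads license files from the Secret's keys the same way it does from a ConfigMap (this guide does not spell that out). The function name and the default Secret name are illustrative.

```shell
# create_license_secret LICENSE_FILE [SECRET_NAME] [NAMESPACE]
# Stores the license file in a generic Secret, keyed by the file's base name.
create_license_secret() {
  kubectl create secret generic "${2:-ufm-license}" \
    --from-file="$1" \
    --namespace "${3:-ufm-enterprise}"
}

# Usage:
#   create_license_secret /path/to/your/<license-filename>.lic
#   helm install ... --set license.existingSecret=ufm-license
```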

Step 4: Install UFM with Helm

helm install ufm ufm-enterprise-<version>-helm.tgz \
  --namespace ufm-enterprise \
  --set config.fabricInterface=<your_ib_interface> \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4

Replace <your_ib_interface> with your InfiniBand interface name (e.g., ib0, ibp4s0f0).

Note: The Helm chart is distributed as a .tgz package.


Step 5: Verify Installation

Watch the pod status:

kubectl get pods -n ufm-enterprise -w

Expected state transitions:

NAME                            READY   STATUS            AGE
ufm-ufm-enterprise-xxxxxxxxxx   0/1     Init:0/1          5s
ufm-ufm-enterprise-xxxxxxxxxx   0/1     PodInitializing   30s
ufm-ufm-enterprise-xxxxxxxxxx   0/1     Running           45s
ufm-ufm-enterprise-xxxxxxxxxx   1/1     Running           2m

Note: The pod shows 0/1 Running while the startup probe waits for UFM to fully initialize. This can take several minutes, depending on the cluster size.
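In scripts, watching the pod interactively is inconvenient. A possible alternative, sketched here with `kubectl wait`, blocks until the pod reports Ready or a timeout expires. The function name is illustrative, and the `app=ufm-enterprise` label is the one used by the operational commands later in this guide.

```shell
# wait_for_ufm [NAMESPACE] [TIMEOUT]: block until the UFM pod reports Ready.
wait_for_ufm() {
  kubectl wait pod \
    --namespace "${1:-ufm-enterprise}" \
    --selector app=ufm-enterprise \
    --for=condition=Ready \
    --timeout="${2:-300s}"
}

# Usage:
#   wait_for_ufm ufm-enterprise 600s && echo "UFM is ready"
```

Run it after the pod has been created, since `kubectl wait` may return an error straight away when the selector matches nothing.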


Configuration Reference

All configuration options are set via Helm values. Use --set key=value or a values file (-f values.yaml).

Namespace Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| namespace.create | Create the namespace | false |
| namespace.name | Namespace name | ufm-enterprise |


Image Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| image.repository | Image repository | docker.io/mellanox/ufm-enterprise |
| image.tag | Image tag | latest |
| image.pullPolicy | Image pull policy (Required) | - |
| imagePullSecrets | Image pull secrets for private registries | [] |

Note: image.pullPolicy must be set to one of: Never, IfNotPresent, or Always.


UFM Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| config.fabricInterface | InfiniBand fabric interface name | "" (uses gv.cfg) |
| config.mgmtInterface | Management network interface name | "" (uses gv.cfg) |
| config.httpPort | Apache HTTP port | 80 (or 8080 with Ingress) |
| config.httpsPort | Apache HTTPS port | 443 (or 18443 with Ingress) |


Storage Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| storage.enabled | Enable PVC creation | true |
| storage.existingClaim | Use existing PVC name | "" |
| storage.className | Storage class name (Required) | - |
| storage.size | Persistent volume size | 10Gi |
| storage.accessMode | PVC access mode | ReadWriteMany |


Resource Limits (Required)

| Parameter | Description | Default |
| --- | --- | --- |
| resources.requests.memory | Memory request (Required) | - |
| resources.requests.cpu | CPU request (Required) | - |
| resources.limits.memory | Memory limit (Required) | - |
| resources.limits.cpu | CPU limit (Required) | - |


License Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| license.existingConfigMap | ConfigMap containing license file(s) | "" |
| license.existingSecret | Secret containing license file(s) | "" |


Startup Probe Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| startupProbe.enabled | Enable startup probe | true |
| startupProbe.initialDelaySeconds | Initial delay | 2 |
| startupProbe.periodSeconds | Probe interval | 10 |
| startupProbe.timeoutSeconds | Probe timeout | 2 |
| startupProbe.failureThreshold | Failures before giving up | 30 |

Note: With default settings, UFM has up to 5 minutes (10s × 30) to fully start.
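The startup window follows from the probe parameters: roughly the initial delay plus the probe period multiplied by the failure threshold. As a rough sketch (the function names are illustrative), the arithmetic can be scripted to size `startupProbe.failureThreshold` for fabrics that need longer to initialize:

```shell
# startup_budget_seconds INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Approximate time allowed before the startup probe gives up.
startup_budget_seconds() {
  echo $(( $1 + $2 * $3 ))
}

# failure_threshold_for BUDGET_SECONDS PERIOD
# Smallest failureThreshold that covers the budget (rounded up).
failure_threshold_for() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

startup_budget_seconds 2 10 30    # chart defaults: prints 302
failure_threshold_for 900 10      # 15-minute budget: prints 90
```

With the second result, a 15-minute window could be requested by adding `--set startupProbe.failureThreshold=90` to the install command.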


Liveness Probe Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| livenessProbe.enabled | Enable liveness probe | true |
| livenessProbe.initialDelaySeconds | Initial delay | 0 |
| livenessProbe.periodSeconds | Probe interval | 10 |
| livenessProbe.timeoutSeconds | Probe timeout | 2 |
| livenessProbe.failureThreshold | Failures before restart | 3 |


Service Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| service.enabled | Enable Kubernetes Service | false |
| service.type | Service type | ClusterIP |
| service.nodePort | NodePort number (30000-32767) | "" |

Note: A Service is automatically created when Ingress is enabled. Use service.enabled=true only if you need a standalone Service without Ingress (e.g., LoadBalancer type in cloud environments).


Ingress Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| ingress.enabled | Expose UFM via Ingress controller for external access | false |
| ingress.className | Ingress controller to use (e.g., nginx, traefik) | "" |
| ingress.host | DNS hostname for accessing UFM (e.g., ufm.example.com) | "" |
| ingress.annotations | Controller-specific annotations (e.g., backend protocol, timeouts) | {} |
| ingress.tls.secretName | Kubernetes TLS Secret for HTTPS (created via kubectl create secret tls) | "" |


Scheduling Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| nodeSelector | Schedule UFM on nodes with specific labels (e.g., kubernetes.io/hostname: ufm-node) | {} |
| tolerations | Allow UFM to run on tainted nodes (e.g., dedicated infrastructure nodes) | [] |
| affinity | Advanced scheduling rules for node or pod affinity/anti-affinity | {} |

Example - Schedule on specific node:

--set nodeSelector."kubernetes\.io/hostname"=ufm-node
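The backslashes in the example matter: Helm's `--set` syntax treats an unescaped dot as a path separator, so dots inside a label key must be escaped. A small sketch of a helper that does the escaping (the function name is illustrative):

```shell
# helm_escape_key KEY: backslash-escape the dots in KEY so that --set treats
# it as a single map key instead of a nested path.
helm_escape_key() {
  printf '%s\n' "$1" | sed 's/\./\\./g'
}

key=$(helm_escape_key "kubernetes.io/hostname")
printf '%s\n' "--set nodeSelector.\"$key\"=ufm-node"
```

The last line prints the same argument as the example above. Using a values file avoids the escaping altogether, because YAML keys may contain dots.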

Plugin Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| plugins.items | List of plugins to deploy (see example below) | [] |
| plugins.defaultResources | Default resource limits for plugins if not specified per-plugin | See below |

Example - Deploy with a plugin:

helm install ufm ufm-enterprise-<version>-helm.tgz \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set plugins.items[0].name=<plugin-name> \
  --set plugins.items[0].image=<plugin-image> \
  --set plugins.items[0].tag=<plugin-version> \
  --set plugins.items[0].port=<plugin-port> \
  --set plugins.items[0].imagePullPolicy=Always

Deployment Options

Option 1: Host Network Mode (Default)

This is the default and simplest deployment mode. UFM binds directly to the host's network ports.

helm install ufm ./ufm-enterprise \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4

Access UFM at: https://<node-ip>:443

Option 2: With Ingress Controller

Use an Ingress controller for external access with TLS termination and hostname-based routing.

Step 1: Install Ingress Controller (if not installed)

Step 2: Deploy UFM with Ingress

helm install ufm ./ufm-enterprise \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set ingress.enabled=true \
  --set ingress.className=traefik \
  --set ingress.host=ufm.example.com

Access UFM at: https://ufm.example.com

Note: When Ingress is enabled, UFM automatically switches to ports 8080/18443 to avoid conflicts.
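To serve HTTPS from the Ingress, the chart accepts a TLS Secret through `ingress.tls.secretName`. A minimal sketch of creating one from an existing certificate and key follows. The function name is illustrative, and `ufm-tls` is the Secret name that appears in this guide's cleanup commands.

```shell
# create_ufm_tls_secret CERT_FILE KEY_FILE [SECRET_NAME] [NAMESPACE]
# Creates a kubernetes.io/tls Secret from a PEM certificate and private key.
create_ufm_tls_secret() {
  kubectl create secret tls "${3:-ufm-tls}" \
    --cert="$1" \
    --key="$2" \
    --namespace "${4:-ufm-enterprise}"
}

# Usage:
#   create_ufm_tls_secret tls.crt tls.key
#   helm install ... --set ingress.enabled=true --set ingress.tls.secretName=ufm-tls
```

The certificate should cover the hostname given in `ingress.host`.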

Option 3: Using a Values File

For complex configurations, use a YAML values file:

# my-values.yaml
namespace:
  name: ufm-enterprise

image:
  pullPolicy: Never

config:
  fabricInterface: ib0
  mgmtInterface: eth0

storage:
  className: nfs-client
  size: 50Gi

resources:
  requests:
    memory: 8Gi
    cpu: 4
  limits:
    memory: 16Gi
    cpu: 8

license:
  existingConfigMap: ufm-license

ingress:
  enabled: true
  className: nginx
  host: ufm.example.com
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"

nodeSelector:
  kubernetes.io/hostname: ufm-node

Deploy with the values file:

helm install ufm ./ufm-enterprise -f my-values.yaml -n ufm-enterprise

Plugin Deployment

UFM plugins run as separate pods with pod affinity to ensure they are scheduled on the same node as UFM.

Plugin Configuration Fields

Warning

Limitation: In this version, you must manually specify the plugin port number. Refer to the plugin documentation for the correct port value.

| Field | Description | Required |
| --- | --- | --- |
| name | Plugin name without ufm-plugin- prefix | Yes |
| image | Plugin Docker image repository | Yes |
| tag | Plugin image tag | Yes |
| port | Plugin service port (omit if no HTTP) | No |
| imagePullPolicy | Image pull policy | No (default: IfNotPresent) |
| healthEndpoint | HTTP health endpoint path | No |
| healthPort | Port for health endpoint | No (defaults to port) |
| livenessInitialDelay | Seconds before first liveness probe | No (default: 60) |
| livenessPeriod | Seconds between liveness probes | No (default: 30) |
| livenessTimeout | Seconds before probe times out | No (default: 15) |
| livenessFailureThreshold | Failures before restart | No (default: 3) |
| readinessInitialDelay | Seconds before first readiness probe | No (default: 10) |
| readinessPeriod | Seconds between readiness probes | No (default: 10) |
| readinessTimeout | Seconds before probe times out | No (default: 15) |
| readinessFailureThreshold | Failures before not-ready | No (default: 3) |


Deploy Single Plugin

helm install ufm ./ufm-enterprise \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set plugins.items[0].name=<plugin-name> \
  --set plugins.items[0].image=<plugin-image> \
  --set plugins.items[0].tag=<plugin-version> \
  --set plugins.items[0].port=<plugin-port> \
  --set plugins.items[0].imagePullPolicy=Always

Deploy Multiple Plugins

helm install ufm ufm-enterprise-<version>-helm.tgz \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set plugins.items[0].name=<plugin1-name> \
  --set plugins.items[0].image=<plugin1-image> \
  --set plugins.items[0].tag=<plugin1-version> \
  --set plugins.items[0].port=<plugin1-port> \
  --set plugins.items[0].imagePullPolicy=Always \
  --set plugins.items[1].name=<plugin2-name> \
  --set plugins.items[1].image=<plugin2-image> \
  --set plugins.items[1].tag=<plugin2-version> \
  --set plugins.items[1].port=<plugin2-port> \
  --set plugins.items[1].imagePullPolicy=Always

Warning

Important: Plugin array indices must be sequential starting from 0.
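Hand-numbering indices is easy to get wrong when plugins are added or removed. As an illustration only, the sketch below generates the `--set` arguments from a plain list, so the indices always run 0, 1, 2 and so on. The function name and the input format (one plugin per line: name, image, tag, optional port) are assumptions made for this example.

```shell
# plugin_set_args: reads "name image tag [port]" lines on stdin and prints
# one --set argument per line, numbering the plugins sequentially from 0.
plugin_set_args() {
  i=0
  while read -r name image tag port; do
    if [ -z "$name" ]; then
      continue                      # skip blank lines
    fi
    echo "--set plugins.items[$i].name=$name"
    echo "--set plugins.items[$i].image=$image"
    echo "--set plugins.items[$i].tag=$tag"
    if [ -n "$port" ]; then
      echo "--set plugins.items[$i].port=$port"
    fi
    i=$((i + 1))
  done
}

# Usage: print the arguments, review them, then append them to helm install.
#   printf '%s\n' "<plugin1-name> <plugin1-image> <plugin1-version> <plugin1-port>" \
#                 "<plugin2-name> <plugin2-image> <plugin2-version>" | plugin_set_args
```

A plugin line without a port produces no `port` argument, which matches the case of plugins that expose no HTTP port.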


Plugin Without HTTP Port

Some plugins don't expose an HTTP port. Omit the port field:

--set plugins.items[0].name=<plugin-name> \
--set plugins.items[0].image=<plugin-image> \
--set plugins.items[0].tag=<plugin-version> \
--set plugins.items[0].imagePullPolicy=Always

Plugins with Values File

# plugins-values.yaml
plugins:
  items:
    - name: <plugin-name>
      image: <plugin-image>
      tag: <plugin-version>
      port: <plugin-port>
      imagePullPolicy: Always

Deploy with both values files:

helm install ufm ufm-enterprise-<version>-helm.tgz -f my-values.yaml -f plugins-values.yaml -n ufm-enterprise

Custom Configuration Files

The Helm chart includes default UFM configuration files that can be customized.

Included Config Files

| File | Description |
| --- | --- |
| gv.cfg | Main UFM configuration |
| opensm/opensm.conf | OpenSM configuration |
| sharp/sharp_am.cfg | SHARP AM configuration |
| telemetry_defaults/primary_env.cfg | Primary telemetry environment |
| telemetry_defaults/launch_ibdiagnet_config.ini | IBDiagNet configuration |
| secondary_telemetry_defaults/launch_ibdiagnet_config.ini | Secondary telemetry config |


Method 1: Edit Files Before Install

Extract the chart:

tar xzf ufm-enterprise-<version>-helm.tgz


Edit config files:

vim ufm-enterprise/files/config/gv.cfg
vim ufm-enterprise/files/config/opensm/opensm.conf


Install with modified files:

helm install ufm ./ufm-enterprise -n ufm-enterprise \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4

Configuration Priority

Configuration is applied in this order (later wins):

  1. Base install/upgrade - UFM default config files

  2. Helm chart config files - Files from files/config/ directory

  3. Helm values - config.fabricInterface, config.mgmtInterface
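The "later wins" rule can be pictured as stacking layers and reading each setting from the last layer that defines it. The following toy model illustrates the ordering only. It uses flat `key=value` files with hypothetical key names, not the real gv.cfg format, and is not how the chart itself is implemented.

```shell
# effective_value KEY FILE...: print the value KEY ends up with when the
# key=value files are applied in the order given (lowest priority first).
effective_value() {
  key=$1
  shift
  awk -F= -v key="$key" '
    $1 == key { value = $2; found = 1 }
    END { if (found) print value }
  ' "$@"
}

layers=$(mktemp -d)
printf 'fabric_interface=ib0\nmgmt_interface=eth0\n' > "$layers/1-base"
printf 'fabric_interface=ib1\n' > "$layers/2-chart-files"
printf 'fabric_interface=ibp4s0f0\n' > "$layers/3-helm-values"

# fabric_interface is set in all three layers, so the Helm value wins.
effective_value fabric_interface "$layers/1-base" "$layers/2-chart-files" "$layers/3-helm-values"
# mgmt_interface is only set in the base layer, so that value survives.
effective_value mgmt_interface "$layers/1-base" "$layers/2-chart-files" "$layers/3-helm-values"
rm -r "$layers"
```

The two calls print `ibp4s0f0` and `eth0` respectively.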

Adding Custom Counter Sets

Add custom Prometheus counter set files for telemetry customization:

Extract the chart:

tar xzf ufm-enterprise-<version>-helm.tgz


Add custom cset file:

mkdir -p ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/
cp my-custom-counters.cset ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/


Install:

helm install ufm ./ufm-enterprise -n ufm-enterprise \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4

Operations

Start/Stop UFM

Stop UFM

Scale down the deployment to 0 replicas:

kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=0

Verify UFM is stopped:

kubectl get pods -n ufm-enterprise

Start UFM

Scale back up to 1 replica:

kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1

Wait for the pod to be ready:

kubectl get pods -n ufm-enterprise -w

View Logs

Container Logs

Follow logs:

kubectl logs -n ufm-enterprise -l app=ufm-enterprise -f

Previous container logs (after crash):

kubectl logs -n ufm-enterprise -l app=ufm-enterprise --previous

UFM Application Logs

# Resolve the UFM pod name (kubectl exec does not accept a label selector)
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')

# List log files
kubectl exec -n ufm-enterprise "$POD" -- ls -la /opt/ufm/files/log/

# View specific log
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/console.log

# Tail a log
kubectl exec -n ufm-enterprise "$POD" -- tail -100 /opt/ufm/files/log/ufmhealth.log

Access UFM UI and REST API

Web UI

https://<node-ip>:443/ufm_web/

REST API

# Get UFM version
curl https://<node-ip>:443/ufmRest/app/ufm_version

# List resources
curl https://<node-ip>:443/ufmRest/resources/systems

Uninstallation

Remove UFM

Run:

helm uninstall ufm -n ufm-enterprise

Warning: This deletes all UFM resources including the PersistentVolumeClaim and data.


Resource Cleanup

Remove All Resources

To delete the entire namespace and all associated resources:

kubectl delete namespace ufm-enterprise

Remove Specific Resources Only

To delete selected resources instead of the full namespace:

kubectl delete pvc -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
kubectl delete configmap -n ufm-enterprise ufm-license
kubectl delete secret -n ufm-enterprise ufm-tls

Networking

Port Reference

| Port | Service | Description |
| --- | --- | --- |
| 80 / 8080 | Apache HTTP | Web UI and REST API (HTTP) |
| 443 / 18443 | Apache HTTPS | Web UI and REST API (HTTPS) |
| 8000 | Flask | Internal REST server |
| 8081 | OpenSM | Plugin communication |
| 8082 | OpenSM | Trap listener |
| 8087 | Auth | Authentication service |
| 9001 | Telemetry | Prometheus metrics endpoint |
| 9002 | Telemetry | Secondary metrics endpoint |
| 8401+ | Plugins | Plugin-specific ports |

Host Network Architecture

UFM is deployed with hostNetwork: true, enabling direct access to:

  • InfiniBand interfaces for fabric management

  • Host ports for external connectivity

  • Low-latency communication with OpenSM

Implications:

  • UFM pods bind directly to the node’s network stack

  • Required ports must be available on the host

  • Port conflicts may prevent pod startup

Traffic Flow with Ingress

Client → Ingress Controller (80/443) → UFM Service → UFM Pod (8080/18443)

When Ingress is enabled:

  • UFM listens internally on ports 8080 and 18443

  • The Ingress controller handles external traffic on ports 80 and 443

  • TLS may be terminated at the Ingress or passed through to UFM

Storage

Data Persistence

All data stored under /opt/ufm/files/ is persisted via a PersistentVolumeClaim (PVC), ensuring data retention across pod restarts.

Security Considerations

Privileged Container Requirement

UFM runs in privileged mode to allow:

  • Direct access to InfiniBand hardware

  • Loading of kernel modules

  • Management of the InfiniBand Subnet Manager

Warning

Security Impact: Privileged containers have elevated access to the host and should be deployed with caution.

Host Network Implications

Using hostNetwork: true means:

  • UFM can access all host network interfaces

  • Service ports are exposed directly on the node

  • Kubernetes NetworkPolicies do not apply to pod traffic


Best Practices

  1. Dedicated Nodes – Deploy UFM on dedicated infrastructure nodes.

  2. Node Taints – Apply taints to prevent unrelated workloads from scheduling on UFM nodes.

  3. Network Segmentation – Isolate UFM nodes on a management network.

  4. RBAC Controls – Restrict access to the UFM namespace using Kubernetes RBAC.

  5. Secrets Management – Store sensitive data in Kubernetes Secrets.

  6. Regular Updates – Keep UFM and Kubernetes components up to date.
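The first two practices can be combined by tainting the UFM node and giving the UFM pod a matching toleration and node selector. The sketch below prints a values fragment for that. The function name and the taint `dedicated=ufm` are hypothetical, and whether plugin pods inherit these tolerations depends on the chart, which this guide does not specify.

```shell
# dedicated_node_values NODE_NAME [TAINT_KEY] [TAINT_VALUE]
# Prints a values fragment that pins UFM to NODE_NAME and tolerates the taint.
dedicated_node_values() {
  cat <<EOF
nodeSelector:
  kubernetes.io/hostname: $1
tolerations:
  - key: ${2:-dedicated}
    operator: Equal
    value: ${3:-ufm}
    effect: NoSchedule
EOF
}

# Usage:
#   kubectl taint nodes ufm-node dedicated=ufm:NoSchedule
#   dedicated_node_values ufm-node > dedicated-values.yaml
#   helm install ufm ./ufm-enterprise -f my-values.yaml -f dedicated-values.yaml -n ufm-enterprise
```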

Monitoring

Kubernetes Probes

UFM uses two probes:

| Probe | Purpose | Check |
| --- | --- | --- |
| Startup | Wait for UFM initialization | REST API returns HTTP 200 |
| Liveness | Detect failures | UfmHealthRunner running, no failover flag |


Health Check Details

Startup Probe:

  • Calls /app/versioning/ on UFM Web server

  • Returns 503 during initialization, 200 when ready

  • Allows up to 5 minutes for startup

Liveness Probe:

  • Verifies UfmHealthRunner process is running

  • Checks for failover flag (critical failure indicator)

  • Verifies config_watcher.sh is running

Monitoring Commands

Verify Probe Status

Run the following command to review the liveness and startup probe configuration and status:

kubectl describe pod -n ufm-enterprise -l app=ufm-enterprise | grep -A 5 -E "Liveness:|Startup:"

Verify UFM Processes

Use the command below to list running UFM-related processes inside the pod:

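# kubectl exec needs a pod name, so resolve it from the label first
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- ps aux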

Check UFM Health Log

Run the following command to inspect the UFM health log file:

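# kubectl exec needs a pod name, so resolve it from the label first
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/ufmhealth.log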

Known Limitations

| Limitation | Description | Impact |
| --- | --- | --- |
| Single Pod | Only one replica supported | No horizontal scaling |
| No Automatic Failover | Pod won't migrate on node failure | Manual intervention required |
| No High Availability | HA mode not supported in K8s | Use Docker HA for HA requirements |
| Privileged Mode | Container requires privileged access | Security considerations |
| Host Network | Uses host networking | Port conflicts possible |
| sysdump Unavailable | sysdump collector doesn't work | Use manual log collection |
| Recreate Strategy | Rolling updates not supported | Brief downtime during upgrades |
| Plugin Operations | Not all plugin operations are supported | Some plugin features may not work |
| Plugin Port Configuration | User must manually specify plugin ports | Refer to plugin documentation for port values |
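Since the sysdump collector is unavailable, logs have to be gathered by hand. One possible approach, sketched below, copies the UFM log directory and the container log to the workstation. The function name and output layout are illustrative, and `kubectl cp` relies on a `tar` binary being present in the container, which is assumed here.

```shell
# collect_ufm_logs [OUT_DIR] [NAMESPACE]: copy /opt/ufm/files/log and the
# container log of the running UFM pod into OUT_DIR on the workstation.
collect_ufm_logs() {
  out_dir=${1:-ufm-logs}
  ns=${2:-ufm-enterprise}
  # Resolve the pod name from the label used throughout this guide.
  pod=$(kubectl get pods -n "$ns" -l app=ufm-enterprise \
        -o jsonpath='{.items[0].metadata.name}') || return 1
  if [ -z "$pod" ]; then
    echo "no UFM pod found in namespace $ns" >&2
    return 1
  fi
  mkdir -p "$out_dir"
  kubectl cp "$ns/$pod:/opt/ufm/files/log" "$out_dir/log"
  kubectl logs -n "$ns" "$pod" > "$out_dir/container.log"
}

# Usage:
#   collect_ufm_logs ./ufm-logs-$(date +%Y%m%d) ufm-enterprise
```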


© Copyright 2026, NVIDIA. Last updated on Feb 20, 2026