NVIDIA UFM Enterprise User Manual v6.24.1

UFM Infra

The UFM Infra feature introduces a structured architecture where services are divided into two categories, each deployed differently based on functionality:

  • UFM Infra: A set of persistent infrastructure services that run on all nodes. These services support system-level operations and ensure distributed availability.

  • UFM Enterprise: Services that run exclusively on the master node, responsible for management, orchestration, and user-facing functionality.

Separating the services in this way provides the following benefits:

  • Faster API Availability after Failover: By limiting service transitions during node failures, recovery times are significantly reduced.

  • Improved Modularity: Separating core infrastructure from enterprise logic simplifies maintenance and troubleshooting.

  • Enhanced Scalability: Services can be scaled and managed independently across nodes.

Users can enable or disable the UFM Infra feature without requiring a reinstallation of the UFM system. For more information, refer to Enabling or Disabling UFM Infra.

Installation instructions are available at UFM Infra Installation.

The Redis image must be loaded, or the is_external_redis flag must be enabled in gv.cfg.

ufm-infra.service

Manages the following infrastructure components:

Component               Description
Redis Server            Inter-node communication and topology storage
Apache Web Server       HTTP/HTTPS web server for UFM API and UI
Authentication Server   User authentication and session management
UFM Health (Infra)      Infrastructure health monitoring
Infra Plugins           Plugins running in infra context (e.g., Fast API)
UTM Telemetry           Telemetry services (when UTM mode enabled)


ufm-enterprise.service

Manages the following enterprise components:

Component             Description
OpenSM                Subnet Manager for InfiniBand fabric
UFM Main Process      Core UFM fabric management engine
Enterprise Plugins    Plugins running in enterprise context
Topology Publishing   Publishes fabric topology to Redis (Infra mode)


Shared Resources

In Infra mode, the following resources are shared between services:

  • Docker Volume (ufm-shared-data): Shared Apache configuration between containers

  • Shared Configuration Files: /opt/ufm/files/ mounted to both containers

  • Redis: Used for topology publishing and inter-service communication

Key                    Type      Default Value   Description
enabled                boolean   false           Enable or disable UFM Infra mode
redis_host             string    localhost       Redis server hostname or IP address
redis_port             integer   6379            Redis server port number
redis_socket_timeout   integer   5               Redis connection timeout in seconds
is_external_redis      boolean   false           Use external Redis server instead of internal
is_tls_redis           boolean   false           Enable TLS encryption for Redis connections
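
As an illustration of how these keys and defaults fit together, the following minimal sketch reads them from gv.cfg-style text with Python's standard configparser module. It assumes that all of the keys live in the [UFMInfra] section (this page only confirms that for enabled), and load_infra_settings is a hypothetical helper name, not part of UFM.

```python
import configparser


def load_infra_settings(cfg_text):
    """Parse gv.cfg-style text and return typed UFM Infra settings.

    Assumption: every key from the table is stored under [UFMInfra].
    Missing keys fall back to the documented default values.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("UFMInfra"):
        parser.add_section("UFMInfra")
    section = parser["UFMInfra"]
    return {
        "enabled": section.getboolean("enabled", fallback=False),
        "redis_host": section.get("redis_host", fallback="localhost"),
        "redis_port": section.getint("redis_port", fallback=6379),
        "redis_socket_timeout": section.getint("redis_socket_timeout", fallback=5),
        "is_external_redis": section.getboolean("is_external_redis", fallback=False),
        "is_tls_redis": section.getboolean("is_tls_redis", fallback=False),
    }


if __name__ == "__main__":
    sample = "[UFMInfra]\nenabled = true\nredis_port = 6380\n"
    print(load_infra_settings(sample))
```

Keys that are absent from the parsed text resolve to the defaults listed in the table, so a file that only sets enabled still yields a complete settings dictionary.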

Fast API Configuration

The following parameters can be modified within the Fast API configuration file:

Section             Default Value   Description
smCommunicator      600             Default Time-to-live (TTL) for SM-related transactions before expiration (in seconds)
sharpCommunicator   600             Default Time-to-live (TTL) for SHARP-related transactions before expiration (in seconds)
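
To make the TTL semantics concrete, here is a small sketch of an expiry check. It is not the Fast API implementation: is_expired and its parameters are hypothetical, and treating a transaction as expired once its age reaches the TTL is an assumption.

```python
import time

# Documented default TTLs, in seconds, keyed by configuration section.
DEFAULT_TTL_SECONDS = {"smCommunicator": 600, "sharpCommunicator": 600}


def is_expired(created_at, section, now=None, ttl_overrides=None):
    """Return True when a transaction is at least as old as its TTL.

    created_at and now are POSIX timestamps in seconds.
    ttl_overrides (hypothetical) maps a section name to a custom TTL.
    """
    ttl = (ttl_overrides or {}).get(section, DEFAULT_TTL_SECONDS[section])
    current = time.time() if now is None else now
    return current - created_at >= ttl


if __name__ == "__main__":
    # A transaction created 10 minutes ago has reached the default TTL.
    print(is_expired(time.time() - 600, "smCommunicator"))
```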


UFM Infra mode can be enabled or disabled after installation using the ufm_infra_feature_flag.py script.

Script Location

/opt/ufm/files/scripts/ufm_infra_feature_flag.py

Command Line Options

Usage:

ufm_infra_feature_flag.py [-h] (-e | -d) [--rootless]
                          [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                          [--timeout-seconds TIMEOUT_SECONDS] [--ufm-user UFM_USER]
                          [--force] [--skip-ha-validation]
                          [--infra-plugins-dir <path>]

Control UFM Infra feature flags

Flag                  Description
-e, --enable          Enable the Infra feature
-d, --disable         Disable the Infra feature
--rootless            Use rootless Podman mode (default: root Docker mode)
--log-level           Set logging level (default: INFO)
--timeout-seconds     Timeout for waiting for containers to stop (default: 120)
--ufm-user            User for rootless Podman commands (default: ufmadm)
--force               Automatically stop/start UFM services
--skip-ha-validation  Skip HA configuration validation
--infra-plugins-dir   Directory containing plugin images to load and install
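
The option set above can be mirrored with Python's argparse, which helps when validating arguments in wrapper tooling before calling the real script. This is a sketch reconstructed from the documented usage only; it is not the script's actual source, and the integer type of --timeout-seconds is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="ufm_infra_feature_flag.py",
        description="Control UFM Infra feature flags",
    )
    # Exactly one of -e / -d must be given, matching "(-e | -d)" in the usage.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-e", "--enable", action="store_true",
                      help="Enable the Infra feature")
    mode.add_argument("-d", "--disable", action="store_true",
                      help="Disable the Infra feature")
    parser.add_argument("--rootless", action="store_true",
                        help="Use rootless Podman mode")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])
    parser.add_argument("--timeout-seconds", type=int, default=120)
    parser.add_argument("--ufm-user", default="ufmadm")
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--skip-ha-validation", action="store_true")
    parser.add_argument("--infra-plugins-dir", metavar="<path>")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["--disable", "--rootless", "--force"]))
```

Passing both -e and -d, or neither, makes argparse exit with a usage error, which matches the mutually exclusive and required grouping shown in the usage line.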


Enabling Infra Mode

Standalone Mode (Docker)

Without Automatic Service Management

  1. Stop UFM services manually:

    systemctl stop ufm-enterprise
    systemctl stop ufm-infra

  2. Enable Infra mode:

    cd /opt/ufm/files/scripts/
    ./ufm_infra_feature_flag.py --enable

  3. Start UFM services manually:

    systemctl start ufm-infra
    systemctl start ufm-enterprise

    Note

    The script automatically detects whether the system is running in HA mode and manages cluster resources accordingly.
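
The manual sequence above depends on ordering: the enterprise service stops first and starts last. The following sketch captures that order as data and prints it in dry-run mode by default. The helper names manual_enable_plan and run_plan are hypothetical, and real execution (dry_run=False) assumes root privileges on a UFM host.

```python
import shlex
import subprocess

SCRIPT_DIR = "/opt/ufm/files/scripts/"


def manual_enable_plan():
    """Return the documented manual enable sequence as argv lists, in order."""
    return [
        ["systemctl", "stop", "ufm-enterprise"],
        ["systemctl", "stop", "ufm-infra"],
        ["./ufm_infra_feature_flag.py", "--enable"],
        ["systemctl", "start", "ufm-infra"],
        ["systemctl", "start", "ufm-enterprise"],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    rendered = []
    for argv in plan:
        line = shlex.join(argv)
        rendered.append(line)
        print(line)
        if not dry_run:
            # check=True aborts the sequence at the first failing step.
            # cwd mirrors the documented "cd /opt/ufm/files/scripts/".
            subprocess.run(argv, check=True, cwd=SCRIPT_DIR)
    return rendered


if __name__ == "__main__":
    run_plan(manual_enable_plan())
```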

Disabling Infra Mode

Standalone Mode (Docker)

cd /opt/ufm/files/scripts/
./ufm_infra_feature_flag.py --disable --force


Standalone Mode (Rootless Podman)

cd /opt/ufm/files/scripts/
./ufm_infra_feature_flag.py --disable --rootless --force


High Availability (HA) Mode

cd /opt/ufm/files/scripts/
./ufm_infra_feature_flag.py --disable --force
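
The three disable invocations differ only in the --rootless flag; the HA invocation is the same as the standalone Docker one. A hedged sketch of a helper that assembles the argument list for automation is shown below. toggle_command is a hypothetical name, and the command is expected to run from /opt/ufm/files/scripts/.

```python
def toggle_command(enable, rootless=False, force=False):
    """Build the argv for ufm_infra_feature_flag.py from the documented flags."""
    argv = ["./ufm_infra_feature_flag.py", "--enable" if enable else "--disable"]
    if rootless:
        argv.append("--rootless")
    if force:
        argv.append("--force")
    return argv


if __name__ == "__main__":
    # Standalone (Docker) and HA mode use the same invocation.
    print(" ".join(toggle_command(enable=False, force=True)))
    # Standalone (rootless Podman).
    print(" ".join(toggle_command(enable=False, rootless=True, force=True)))
```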

Script Behavior

When Enabling Infra Mode

The script performs the following actions:

  • Stops UFM services (standalone) or the HA cluster

  • Waits for all UFM containers to stop

  • Updates gv.cfg to set:

    [UFMInfra]
    enabled = true

  • Updates the Redis trigger file to enabled

  • Validates HA resources (if running in HA mode)

  • Loads and installs Infra plugins if --infra-plugins-dir is specified

  • Restarts UFM services or the HA cluster


When Disabling Infra Mode

The script performs the following actions:

  • Stops UFM services (standalone) or the HA cluster

  • Waits for all UFM containers to stop

  • Updates gv.cfg to set:

    [UFMInfra]
    enabled = false

  • Updates the Redis trigger file to disabled

  • Restarts UFM services or the HA cluster
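
For reference, the gv.cfg change that both procedures perform can be sketched with configparser as follows. This illustrates the resulting [UFMInfra] enabled value only; it is not the script's implementation, set_infra_enabled is a hypothetical helper, and note that configparser drops comments when it rewrites text, so it should not be pointed at a production gv.cfg as-is.

```python
import configparser
import io


def set_infra_enabled(cfg_text, enabled):
    """Return cfg_text with [UFMInfra] enabled set to true or false."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original case of option names
    parser.read_string(cfg_text)
    if not parser.has_section("UFMInfra"):
        parser.add_section("UFMInfra")
    parser.set("UFMInfra", "enabled", "true" if enabled else "false")
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(set_infra_enabled("[UFMInfra]\nenabled = false\n", True))
```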

As part of the updated architecture, a Fast API plugin can be deployed as an Infra plugin, and a Redis server is required for inter-service communication. Redis can be configured in two ways:

  • As an internal service (installed with UFM)

  • As an external Redis instance, depending on deployment needs.

The following sequence describes how communication is handled between Fast API, Redis, and SM/SHARP components:

  1. Request Submission via Fast API

    Users send REST API requests (e.g., for PKey creation or SHARP reservation actions) to the Fast API. These requests are placed into Redis queues, and a Transaction ID (TID) is returned to the user for tracking purposes.

  2. Processing by Communicators

    • The SM Communicator or SHARP Communicator monitors Redis queues for new requests.

    • Upon receiving a request, the communicator forwards it to the relevant component (SM or SHARP) for execution.

    • After processing, the communicator captures the response and status.

  3. Status Updates

    The communicators update the status of each request back into Redis. Users can query the status of their transaction using the TID provided during request submission.

  4. Configuration Storage and Retrieval

    • Communicators store the configuration in Redis.

    • This allows the Fast API to retrieve the configuration data and expose it via REST APIs, so users can inspect cluster-level settings.
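
The request flow described in the steps above can be sketched with an in-memory stand-in for Redis. Everything here is illustrative: FakeRedis, the queue names, the status values (pending, completed, failed) and the TID format are assumptions, not the actual Fast API or communicator implementation.

```python
import uuid
from collections import deque


class FakeRedis:
    """In-memory stand-in for the Redis request queues and status records."""

    def __init__(self):
        self.queues = {"sm": deque(), "sharp": deque()}
        self.status = {}


def submit_request(store, target, payload):
    """Step 1: queue a request and return a Transaction ID (TID)."""
    tid = uuid.uuid4().hex
    store.status[tid] = {"state": "pending", "response": None}
    store.queues[target].append((tid, payload))
    return tid


def communicator_poll(store, target, handler):
    """Steps 2 and 3: drain the queue, run each request, record the outcome."""
    processed = 0
    while store.queues[target]:
        tid, payload = store.queues[target].popleft()
        try:
            response = handler(payload)  # forward to SM or SHARP
            store.status[tid] = {"state": "completed", "response": response}
        except Exception as exc:
            store.status[tid] = {"state": "failed", "response": str(exc)}
        processed += 1
    return processed


def get_status(store, tid):
    """Look up a transaction by the TID returned at submission time."""
    return store.status.get(tid, {"state": "unknown", "response": None})


if __name__ == "__main__":
    store = FakeRedis()
    tid = submit_request(store, "sm", {"action": "create_pkey"})
    print(get_status(store, tid)["state"])
    communicator_poll(store, "sm", lambda payload: "ok")
    print(get_status(store, tid)["state"])
```

Because the TID is returned before any processing happens, the caller is decoupled from SM or SHARP execution time and polls for the outcome instead of blocking on the request.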


© Copyright 2026, NVIDIA. Last updated on Feb 20, 2026