NVIDIA UFM Enterprise User Manual v6.21.2

UFM Infra

The UFM Infra feature introduces a structured architecture where services are divided into two categories, each deployed differently based on functionality:

  • UFM Infra: A set of persistent infrastructure services that run on all nodes. These services support system-level operations and ensure distributed availability.

  • UFM Enterprise: Services that run exclusively on the master node, responsible for management, orchestration, and user-facing functionality.

Key Benefits

  • Faster Failover: Because the infrastructure services already run on every node, fewer services must transition when a node fails, which shortens recovery time.

  • Improved Modularity: Separating core infrastructure from enterprise logic simplifies maintenance and troubleshooting.

  • Enhanced Scalability: Services can be scaled and managed independently across nodes.
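The split described above can be illustrated with a minimal sketch. The service names below are hypothetical placeholders (the manual does not list the individual services), and the functions are not part of UFM. The sketch only shows why a failover needs to start the Enterprise services on the new master while the Infra services are already in place.

```python
# Hypothetical service names, for illustration only.
INFRA_SERVICES = frozenset({"infra-a", "infra-b"})
ENTERPRISE_SERVICES = frozenset({"enterprise-a", "enterprise-b"})


def placement(nodes, master):
    """Map each node to the set of services it runs."""
    if master not in nodes:
        raise ValueError(f"master {master!r} is not one of the nodes")
    result = {}
    for node in nodes:
        services = set(INFRA_SERVICES)          # Infra services run on all nodes
        if node == master:
            services |= ENTERPRISE_SERVICES     # Enterprise services run on the master only
        result[node] = services
    return result


def services_to_start_on_failover(nodes, old_master, new_master):
    """Services that must be started on the new master after a failover."""
    before = placement(nodes, old_master)
    after = placement(nodes, new_master)
    return after[new_master] - before[new_master]


nodes = ["node1", "node2"]
moved = services_to_start_on_failover(nodes, "node1", "node2")
print(sorted(moved))  # ['enterprise-a', 'enterprise-b']
```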

Users can enable or disable the UFM Infra feature without requiring a reinstallation of the UFM system. For more information, refer to Enabling or Disabling UFM Infra.

Installation instructions are available at Installing UFM Infra Using Rootless with Podman.

As part of the updated architecture, a FAST-API plugin is deployed and a Redis server is required for inter-service communication. Redis can be configured in two ways:

  • As an internal service (installed with UFM)

  • As an external Redis instance, depending on deployment needs.

    For more information, refer to Redis-Related Configuration.

The following sequence describes how communication is handled between Fast API, Redis, and SM/SHARP components:

  1. Request Submission via Fast API

    Users send REST API requests (e.g., for PKey creation or SHARP reservation actions) to the Fast API. These requests are placed into Redis queues, and a Transaction ID (TID) is returned to the user for tracking purposes.

  2. Processing by Communicators

    • The SM Communicator or SHARP Communicator monitors Redis queues for new requests.

    • Upon receiving a request, the communicator forwards it to the relevant component (SM or SHARP) for execution.

    • After processing, the communicator captures the response and status.

  3. Status Updates

    The communicators update the status of each request back into Redis. Users can query the status of their transaction using the TID provided during request submission.

  4. Configuration Storage and Retrieval

    • Communicators store the configuration in Redis.

    • This allows the Fast API to retrieve the configuration and expose it through REST APIs, so users can inspect cluster-level settings.

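Steps 1 to 3 of the sequence above can be sketched as follows. This is a simplified model, not UFM code: FakeRedis is an in-memory stand-in for a Redis server, and the queue names, key names, status strings, and function names are all assumptions made for illustration.

```python
import json
import uuid
from collections import deque


class FakeRedis:
    """In-memory stand-in for Redis: named queues plus a key-value store."""

    def __init__(self):
        self.queues = {}
        self.store = {}

    def push(self, queue, item):
        self.queues.setdefault(queue, deque()).append(item)

    def pop(self, queue):
        q = self.queues.get(queue)
        return q.popleft() if q else None


def submit_request(db, target, payload):
    """Step 1: enqueue a request and return a Transaction ID (TID)."""
    tid = str(uuid.uuid4())
    db.store[f"status:{tid}"] = "pending"          # hypothetical key layout
    db.push(f"queue:{target}", json.dumps({"tid": tid, "payload": payload}))
    return tid


def communicator_poll(db, target, execute):
    """Steps 2 and 3: take one request, execute it, and write the status back.

    Returns False when the queue is empty, True otherwise.
    """
    raw = db.pop(f"queue:{target}")
    if raw is None:
        return False
    request = json.loads(raw)
    try:
        result = execute(request["payload"])       # forward to SM or SHARP
        status = "completed"
    except Exception as exc:
        result, status = str(exc), "failed"
    db.store[f"status:{request['tid']}"] = status
    db.store[f"result:{request['tid']}"] = result
    return True


def get_status(db, tid):
    """Look up the status of a transaction by its TID."""
    return db.store.get(f"status:{tid}", "unknown")


db = FakeRedis()
tid = submit_request(db, "sm", {"action": "create_pkey", "pkey": "0x1"})
print(get_status(db, tid))   # pending
communicator_poll(db, "sm", lambda payload: {"ok": True})
print(get_status(db, tid))   # completed
```

The caller never talks to the SM or SHARP component directly. It only holds the TID and reads the status that the communicator wrote back.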

Redis-Related Configuration

Redis configuration parameters can be modified within the UFMInfra section of the gv.cfg file. This allows for customization of Redis behavior to better suit UFM infrastructure requirements.


    [UFMInfra]
    ...
    # What is the host where the Redis server is running
    redis_host = localhost

    # What is the Redis port
    redis_port = 6379

    # Redis timeout in seconds
    redis_socket_timeout = 5

    # Flag that shows if we use external Redis database
    is_external_redis = False

    # Flag that shows if we use TLS connection to Redis database
    is_tls_redis = False
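As a rough illustration of how these parameters might be consumed, the sketch below reads a [UFMInfra] section with Python's standard configparser module. It assumes gv.cfg uses INI syntax that configparser accepts. The function name and the returned dictionary keys are hypothetical, and the sample text is inlined so the example does not depend on a real installation.

```python
import configparser

# Sample text mirroring the [UFMInfra] parameters shown above.
SAMPLE_GV_CFG = """
[UFMInfra]
# What is the host where the Redis server is running
redis_host = localhost
# What is the Redis port
redis_port = 6379
# Redis timeout in seconds
redis_socket_timeout = 5
# Flag that shows if we use external Redis database
is_external_redis = False
# Flag that shows if we use TLS connection to Redis database
is_tls_redis = False
"""


def load_redis_settings(cfg_text):
    """Parse the Redis-related options into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["UFMInfra"]
    return {
        "host": section.get("redis_host", fallback="localhost"),
        "port": section.getint("redis_port", fallback=6379),
        "socket_timeout": section.getfloat("redis_socket_timeout", fallback=5.0),
        "external": section.getboolean("is_external_redis", fallback=False),
        "tls": section.getboolean("is_tls_redis", fallback=False),
    }


settings = load_redis_settings(SAMPLE_GV_CFG)
print(settings)
```

With the third-party redis-py client, the host, port, and socket_timeout values could be passed to redis.Redis(...), with the TLS flag mapped to its ssl argument. Whether UFM itself does this is not stated in this section.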


Fast API Configuration

The following parameters can be modified within the Fast API configuration file:

Section            Default Value  Description
smCommunicator     600            Default Time-to-live (TTL) for SM-related transactions before expiration (in seconds)
sharpCommunicator  600            Default Time-to-live (TTL) for SHARP-related transactions before expiration (in seconds)
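To make the effect of the TTL concrete, the sketch below models a transaction record that can no longer be queried once its time-to-live has elapsed. This section does not describe how the TTL is actually enforced, so the TransactionStore class and its injectable clock are assumptions for illustration. Only the 600-second defaults come from the table above.

```python
import time

# Defaults taken from the table above, in seconds.
DEFAULT_TTL_SECONDS = {"smCommunicator": 600, "sharpCommunicator": 600}


class TransactionStore:
    """Hypothetical store that forgets a transaction after its TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be demonstrated
        self._items = {}

    def put(self, tid, status):
        """Record a status and the time at which it expires."""
        self._items[tid] = (status, self.clock() + self.ttl)

    def get(self, tid):
        """Return the status, or None if the TID is unknown or expired."""
        entry = self._items.get(tid)
        if entry is None:
            return None
        status, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[tid]
            return None
        return status


now = [0.0]                          # fake clock, advanced manually
store = TransactionStore(DEFAULT_TTL_SECONDS["smCommunicator"], clock=lambda: now[0])
store.put("tid-1", "pending")
now[0] = 599.0
print(store.get("tid-1"))            # pending
now[0] = 600.0
print(store.get("tid-1"))            # None
```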


Prerequisites

Before enabling or disabling the UFM Infra feature, ensure the following conditions are met:

  • The UFM Docker image has been installed using the deploy_rootless_ufm script. Refer to Installing UFM Infra Using Rootless with Podman.

  • UFM High Availability (HA) is deployed using the Enterprise Multinode setup.

  • The control script for managing the feature is available on the host at: /opt/ufm/files/scripts/ufm_infra_feature_flag.py

  • Example:

    ufm_infra_feature_flag.py -h
    usage: ufm_infra_feature_flag.py [-h] (-e | -d)
                                     [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                                     [--timeout-seconds TIMEOUT_SECONDS]
                                     [--ufm-user UFM_USER]

    Control UFM Infra feature flags

    This script turns on/off the UFM Infra (multi node) feature. It manages the
    UFM Infrastructure feature by controlling both the configuration and HA
    cluster resources. The script follows these flows:

    Prerequisites check:
      1. Verifies Python version is 3.6 or higher
      2. Verifies script is run with root privileges
      3. Verifies ufm_user user exists (default is ufmadm but can be overridden with --ufm-user)
      4. Validates HA configuration and UFM Infra installation

    Enable flow:
      1. Stops the HA cluster and waits for all UFM containers to stop
      2. Updates the UFM configuration to enable the Infra feature
      3. Updates the Redis trigger file to enable topology publishing
      4. Enables the HA resources
      5. Starts the HA cluster (only if previous steps succeeded)

    Disable flow:
      1. Stops the HA cluster and waits for all UFM containers to stop
      2. Updates the UFM configuration to disable the Infra feature
      3. Updates the Redis trigger file to disable topology publishing
      4. Disables the HA resources
      5. Starts the HA cluster (only if previous steps succeeded)

    Note: This script requires root privileges to modify the UFM configuration.
    If any step fails, the script will exit without starting the HA cluster.
    In case of failure, manual intervention will be required to restore the
    system to a working state. The HA cluster may need to be started manually
    using 'ufm_ha_cluster start' command.

    optional arguments:
      -h, --help            show this help message and exit
      -e, --enable          Enable the Infra feature (mutually exclusive with -d)
      -d, --disable         Disable the Infra feature (mutually exclusive with -e)
      --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                            Set the logging level (default: INFO)
      --timeout-seconds TIMEOUT_SECONDS
                            Timeout for waiting for containers to go down (default: 120 seconds)
      --ufm-user UFM_USER   The user to run the command as (default: ufmadm)
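The command-line interface and the "start the cluster only if every earlier step succeeded" rule from the help text above can be sketched as follows. This is reconstructed from the help output alone and is not the script's real source. The integer type of --timeout-seconds, the run_flow helper, and the step callables are assumptions.

```python
import argparse

LOG_LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def build_parser():
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="ufm_infra_feature_flag.py",
        description="Control UFM Infra feature flags",
    )
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-e", "--enable", action="store_true",
                      help="Enable the Infra feature")
    mode.add_argument("-d", "--disable", action="store_true",
                      help="Disable the Infra feature")
    parser.add_argument("--log-level", choices=LOG_LEVELS, default="INFO",
                        help="Set the logging level")
    # Assumption: the timeout is parsed as an integer number of seconds.
    parser.add_argument("--timeout-seconds", type=int, default=120,
                        help="Timeout for waiting for containers to go down")
    parser.add_argument("--ufm-user", default="ufmadm",
                        help="The user to run the command as")
    return parser


def run_flow(steps, start_cluster):
    """Run (name, callable) steps in order; start the cluster only on success.

    Each callable returns True on success. On the first failure the flow
    stops and the cluster is left stopped, as the help text describes.
    """
    for name, step in steps:
        if not step():
            return f"failed at: {name}"
    start_cluster()
    return "ok"


args = build_parser().parse_args(["-e", "--timeout-seconds", "300"])
print(args.enable, args.disable, args.timeout_seconds, args.ufm_user)
# True False 300 ufmadm
```

Because -e and -d belong to a required mutually exclusive group, passing both or neither makes the parser exit with a usage error.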

When deploying a plugin with ufm_infra installed, users can choose one of the following methods:

  • Via the UI: Use the UFM user interface to deploy the plugin. For instructions, refer to Plugin Management.

  • Via REST API: Deploy the plugin through UFM's REST API. For more information, refer to NVIDIA UFM Enterprise REST API Guide.

  • Using the Plugin Management Script: Run the manage_ufm_plugins script inside the UFM container (not the ufm_infra container). For more information, refer to UFM Plugins Management.

© Copyright 2025, NVIDIA. Last updated on May 28, 2025.