NVIDIA UFM Enterprise User Manual v6.22.1

Installing UFM Infra Using Rootless with Podman

Prerequisites

  1. Download the UFM and plugins bundle tar file to /tmp.

  2. Extract the contents using the command:

    tar -xvf <bundle tar>

This archive (tar file) includes the following components:

  • Relevant UFM container image

  • Relevant FAST-API container image

  • Relevant Infra container image (for internal Redis usage). Refer to Redis-Related Configuration for more information.

  • Default plugin bundle for UFM

  • UFM-HA package
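
Before loading any images, it can help to confirm which of these components a given bundle actually contains. The following is a minimal sketch, assuming a tar implementation that supports the -t (list) option; the function names list_bundle and count_bundle_members are illustrative and are not part of UFM.

```shell
# list_bundle: print the member names of a bundle archive without extracting it.
# (Hypothetical helper name; tar -tf only reads the archive index.)
list_bundle() {
    tar -tf "$1"
}

# count_bundle_members: print the number of entries in the archive.
count_bundle_members() {
    list_bundle "$1" | wc -l | tr -d ' '
}
```

For example, list_bundle /tmp/<bundle tar> prints one line per file in the bundle.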

To enable the UFM Infra feature, UFM HA must be installed in a new mode (`external-storage`), using a new product (`enterprise-multinode`).

Additionally, NFS must be configured as follows:

NFS Setup Prerequisites

  • Select a dedicated NFS server to host the shared directories.

  • Create a shared directory on the NFS server for UFM configuration and logs.

  • Install the NFS client on each UFM node if not already present.
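
The last prerequisite can be checked from a shell on each UFM node. The sketch below assumes the NFS client provides the mount.nfs helper, which is shipped by nfs-utils on the RHEL family and by nfs-common on Debian and Ubuntu; the function names are illustrative. Run it as root, because mount.nfs usually lives in /sbin.

```shell
# have_cmd: succeed if the named program is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# check_nfs_client: report whether the NFS mount helper is installed.
check_nfs_client() {
    if have_cmd mount.nfs; then
        echo "NFS client present"
    else
        echo "NFS client missing: install nfs-utils or nfs-common"
        return 1
    fi
}
```

If the check passes and the server still answers NFSv3-style export queries, showmount -e <server> lists the directories it exports. Servers that offer only NFSv4 may not respond to that query.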

Enable HA ports in Firewall

If your firewall rules block non-standard ports, open the HA ports so that the high-availability services on the HA nodes can communicate with each other. To do so, run these commands:

firewall-cmd --permanent --add-service=high-availability
# or
firewall-cmd --add-service=high-availability
# and then reload the rules
firewall-cmd --reload
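
To confirm that the rule took effect, the active service list can be inspected. This is a sketch rather than an official procedure: firewall-cmd --list-services prints the services enabled in the runtime configuration of the default zone (add --permanent to inspect the stored configuration instead), and the helper names below are illustrative.

```shell
# service_in_list: succeed if service $2 appears in the space-separated list $1.
service_in_list() {
    printf '%s\n' "$1" | tr -s ' ' '\n' | grep -qxF -- "$2"
}

# ha_service_open: check firewalld's runtime rules for the high-availability service.
ha_service_open() {
    service_in_list "$(firewall-cmd --list-services)" high-availability
}
```

Running ha_service_open on each HA node returns success only when high-availability is listed.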


Create and Mount the UFM Directory

Note

At this stage, apply point #2 (Mount the UFM directory) only on the master machine.

The other nodes are mounted later, when running in HA mode.

  1. Create the UFM directory:

    mkdir -p /opt/ufm/shared_files

  2. Mount the UFM directory:

    • If using NFS 4.2:

      mount -t nfs4 -o context="system_u:object_r:container_file_t:s0" <server>:/shared_folder /opt/ufm/shared_files

    • If using NFS 3:

      mount -t nfs -o vers=3,context="system_u:object_r:container_file_t:s0" <server>:/shared_folder /opt/ufm/shared_files

  3. Ensure the NFS version and mount options are compatible with the NFS server.

  4. Verify that the following HA packages are installed: pcs, pacemaker, and corosync. Install them if they are missing.

  5. Follow the HA installation steps in Run the HA Installation.
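
Points 2 to 4 can be verified before moving on to the HA installation. The sketch below assumes util-linux findmnt is available and that the HA packages install the pcs, pacemakerd, and corosync binaries; the function names are illustrative. The kernel reports an NFS 3 mount with filesystem type nfs and an NFS 4.x mount with type nfs4.

```shell
# is_nfs_fstype: succeed for the filesystem types reported for NFS mounts.
is_nfs_fstype() {
    case "$1" in
        nfs|nfs4) return 0 ;;
        *) return 1 ;;
    esac
}

# shared_dir_is_nfs: check that the given path (default /opt/ufm/shared_files)
# is itself a mount point and that it is backed by NFS.
shared_dir_is_nfs() {
    is_nfs_fstype "$(findmnt -n -o FSTYPE --mountpoint "${1:-/opt/ufm/shared_files}")"
}

# missing_ha_tools: print each HA binary that is not found on PATH.
missing_ha_tools() {
    for tool in pcs pacemakerd corosync; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

An empty result from missing_ha_tools means all three tools were found. Any name it prints points to a package that still needs to be installed.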

Run the HA Installation

Follow the HA installation instructions at UFM High-Availability Installation and Configuration.

When running the HA installation script, use the following command:

./install.sh -p enterprise-multinode -l /opt/ufm/shared_files

  • The -l flag must always point to the shared directory path: /opt/ufm/shared_files

  • No need to provide the DRBD disk argument to the installation script.
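
Because the -l path must be the mounted shared directory, a small guard can stop the installer from running against an unmounted local directory. This is an optional sketch, assuming the util-linux mountpoint command; require_mountpoint is a hypothetical helper name.

```shell
# require_mountpoint: fail with a message unless the path is an active mount point.
require_mountpoint() {
    if mountpoint -q "$1"; then
        return 0
    fi
    echo "error: $1 is not mounted" >&2
    return 1
}
```

It can be chained in front of the installation command:

    require_mountpoint /opt/ufm/shared_files && ./install.sh -p enterprise-multinode -l /opt/ufm/shared_files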

Installation Instructions

  1. Check firewall status:

    systemctl status firewalld

  2. Configure Firewall (if active):

    # check if firewalld is running
    systemctl status firewalld
    # Permanently add port 8443 to firewalld
    firewall-cmd --permanent --add-port=8443/tcp
    # reload firewalld config
    firewall-cmd --reload

  3. Create UFM directory:

    mkdir -p /opt/ufm

  4. Create UFM group:

    groupadd ufmadm -g 733

  5. Create a UFM user:

    useradd -d /opt/ufm -m -u 733 -g ufmadm ufmadm

  6. Set directory ownership:

    chown -R ufmadm:ufmadm /opt/ufm
    chown -R ufmadm:ufmadm /opt/ufm/shared_files

  7. Configure SubUID and SubGID:

    echo "ufmadm:100000:65536" >> /etc/subuid
    echo "ufmadm:100000:65536" >> /etc/subgid

  8. Enable Login Linger for UFM user:

    loginctl enable-linger ufmadm

  9. Configure Rootless Podman storage:

    sudo -u ufmadm mkdir -p /opt/ufm/.config/containers
    cat <<EOF | sudo -u ufmadm tee /opt/ufm/.config/containers/storage.conf > /dev/null
    [storage]
    driver = "overlay"
    runroot = "/run/user/733"
    EOF

  10. Create Podman UFM socket:

    cat <<EOF > /usr/lib/systemd/system/podman-ufm.socket
    [Unit]
    Description=Podman API Socket For Nvidia UFM

    [Socket]
    SocketUser=ufmadm
    SocketGroup=ufmadm
    ListenStream=%t/podman-ufm/podman-ufm.sock
    SocketMode=0660

    [Install]
    WantedBy=sockets.target
    EOF

  11. Create Podman UFM service:

    cat <<EOF > /usr/lib/systemd/system/podman-ufm.service
    [Unit]
    Description=Podman API Service for Nvidia UFM
    Requires=podman-ufm.socket
    After=podman-ufm.socket
    StartLimitIntervalSec=0

    [Service]
    Delegate=true
    Type=exec
    User=ufmadm
    Group=ufmadm
    KillMode=process
    Environment=LOGGING="--log-level=info"
    ExecStart=/usr/bin/podman \$LOGGING system service
    LimitMEMLOCK=infinity

    [Install]
    WantedBy=default.target
    EOF

  12. Create Podman cleanup service:

    cat <<EOF > /usr/lib/systemd/system/podman-ufm-cleanup.service
    [Unit]
    Description=podman-ufm-cleanup - clean stuck rootless containers at boot
    After=podman-ufm.service
    Before=ufm-enterprise.service

    [Service]
    Type=oneshot
    User=ufmadm
    Group=ufmadm
    ExecStart=/usr/bin/podman system migrate

    [Install]
    WantedBy=multi-user.target
    EOF

  13. Enable and start Podman services:

    systemctl daemon-reload
    systemctl enable --now podman-ufm.socket
    systemctl enable --now podman-ufm.service
    systemctl enable --now podman-ufm-cleanup.service

  14. Create Udev Rules for InfiniBand Devices:

    cat <<EOF > /etc/udev/rules.d/70-umad.rules
    KERNEL=="umad*", SUBSYSTEM=="infiniband_mad", MODE="0600", OWNER="ufmadm", GROUP="ufmadm"
    KERNEL=="issm*", SUBSYSTEM=="infiniband_mad", MODE="0600", OWNER="ufmadm", GROUP="ufmadm"
    EOF

    udevadm control --reload-rules
    udevadm trigger

  15. Clean and create UFM directories:

    rm -rf /opt/ufm/systemd
    sudo -u ufmadm mkdir -p /opt/ufm/ufm_plugins_data
    sudo -u ufmadm mkdir -p /opt/ufm/systemd
    sudo -u ufmadm mkdir -p /opt/ufm/etc/apache2

  16. Load UFM image and extract version:

    # UFM_IMAGE_FILE must hold the path to the UFM image file extracted from the bundle
    # Extract UFM version from filename (e.g., ufm_6.22.0-7.ubuntu24.x86_64-docker.img.gz -> 6_22_0_7)
    UFM_VERSION=$(basename "$UFM_IMAGE_FILE" | sed 's/ufm_\([0-9][^.]*\.[^.]*\.[^.]*-[^.]*\)\.ubuntu.*/\1/' | tr '.-' '_')
    echo "UFM Version: $UFM_VERSION"

    # Load the UFM image
    sudo -u ufmadm podman load -i "$UFM_IMAGE_FILE"

  17. Create version-specific directory and soft link:

    # Create version-specific directory in shared storage
    sudo -u ufmadm mkdir -p /opt/ufm/shared_files/ufm-${UFM_VERSION}

    # Remove existing files link if it exists
    rm -f /opt/ufm/files

    # Create soft link to version-specific directory
    sudo -u ufmadm ln -s /opt/ufm/shared_files/ufm-${UFM_VERSION} /opt/ufm/files

    # Verify the soft link
    ls -la /opt/ufm/files

  18. Run UFM installer:

    sudo -u ufmadm podman run -it --rm --name=ufm_installer \
      -v /run/podman-ufm/podman-ufm.sock:/var/run/docker.sock \
      -v /opt/ufm/:/installation/ufm_files/ \
      -v /opt/ufm/files:/installation/ufm_files/files \
      -v /opt/ufm/systemd:/etc/systemd_files/ \
      mellanox/ufm-enterprise:latest \
      --install \
      --fabric-interface ib0 \
      --rootless \
      --plugin-path /opt/ufm/ufm_plugins_data \
      --ufm-user ufmadm \
      --ufm-group ufmadm \
      --ufm-infra

    Note

    Replace ib0 with your actual InfiniBand interface name if it is not the default ib0.

    All other UFM install flags are supported and can be added to the command.

  19. Load Redis Image (if not using external Redis):

    # Load the given Redis image (in case you are not using external Redis)
    sudo -u ufmadm podman load -i "<PATH TO GIVEN REDIS IMAGE>"

  20. Load Fast API Plugin image:

    sudo -u ufmadm podman run --hostname $HOSTNAME --rm --name=ufm_plugin_mgmt --entrypoint="" \
      -v /run/podman-ufm/podman-ufm.sock:/var/run/docker.sock \
      -v /opt/ufm/files:/opt/ufm/shared_config_files \
      -v /dev/log:/dev/log \
      -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
      -v /lib/modules:/lib/modules:ro \
      -v /opt/ufm/ufm_plugins_data:/opt/ufm/ufm_plugins_data \
      -e UFM_CONTEXT=ufm-infra \
      mellanox/ufm-enterprise:latest \
      /opt/ufm/scripts/manage_ufm_plugins.sh add -p fast_api -t ${FAST_API_VERSION} -c ufm-infra

  21. Install service files:

    mv /opt/ufm/systemd/ufm-enterprise.service /etc/systemd/system/ufm-enterprise.service
    mv /opt/ufm/systemd/ufm-infra.service /etc/systemd/system/ufm-infra.service
    systemctl daemon-reload
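
Step 16 expects UFM_IMAGE_FILE to hold the path of the UFM image file. The sketch below wraps the same sed and tr pipeline in a function and runs it on the file name quoted in that step. The function name ufm_version_from_file and the /tmp location are illustrative assumptions.

```shell
# ufm_version_from_file: derive the version string from an image file name,
# using the same sed/tr pipeline as step 16.
ufm_version_from_file() {
    basename "$1" \
        | sed 's/ufm_\([0-9][^.]*\.[^.]*\.[^.]*-[^.]*\)\.ubuntu.*/\1/' \
        | tr '.-' '_'
}

# Example with the file name quoted in step 16 (the /tmp path is illustrative).
UFM_IMAGE_FILE="/tmp/ufm_6.22.0-7.ubuntu24.x86_64-docker.img.gz"
UFM_VERSION=$(ufm_version_from_file "$UFM_IMAGE_FILE")
echo "UFM Version: $UFM_VERSION"   # prints "UFM Version: 6_22_0_7"
```

The pattern only matches file names that contain .ubuntu after the version. For any other name, sed passes the input through unchanged and the result is not a usable version string, so it is worth checking the echoed value before creating the version-specific directory in step 17.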

To start UFM as a standalone instance, run:

systemctl daemon-reload
systemctl start ufm-infra
systemctl start ufm-enterprise
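
After a standalone start, the state of both units can be checked in one pass. This is a minimal sketch built on systemctl is-active, which prints the unit state and returns success only for active units; inactive_units is a hypothetical helper name.

```shell
# inactive_units: print "<unit>: <state>" for each listed unit that systemd
# does not report as active. Prints nothing when all units are active.
inactive_units() {
    for unit in "$@"; do
        state=$(systemctl is-active "$unit" 2>/dev/null || true)
        [ "$state" = "active" ] || echo "$unit: ${state:-unknown}"
    done
    return 0
}
```

For example, inactive_units ufm-infra ufm-enterprise prints nothing when both services are running.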

Running in HA Mode

Note

Do not manually start any services.

  1. Ensure UFM and UFM-HA are installed on all nodes as described in the above sections.

  2. Mount the UFM directory on all standby nodes, as described in point #2 (Mount the UFM directory).

  3. On one node, edit the HA configuration file:

    /etc/ufm_ha/ha_nodes.cfg

    Fill in the parameters for each node:

    [Node.1]
    # valid role options: master/standby
    role = master
    # Mandatory
    primary_ip =
    # Mandatory if dual_link = true
    secondary_ip =

    [Node.2]
    role = standby
    primary_ip =
    secondary_ip =

    [Node.3]
    role = standby
    primary_ip =
    secondary_ip =

  4. Ensure the file sync mode is set to external-storage, and that the shared file system is mounted prior to HA configuration.

    [FileSync]
    # valid options are: drbd/external-storage
    # in case of external-storage the user MUST mount the files system PRIOR to ha configuration
    mode = external-storage

  5. Copy the edited file to all nodes at the same path.

  6. Configure the cluster, starting from standby nodes and ending with the master node:

    ufm_ha_cluster config -p <password>

    Note

    Use the same password on all nodes.

  7. After finishing the configuration on all nodes, run:

    ufm_ha_cluster status

  8. Start the cluster:

    ufm_ha_cluster start

  9. Check cluster status again to ensure all services have started successfully.
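
Before copying the edited file to the other nodes (step 5), a quick sanity check can catch a missing primary_ip or a second master entry. The sketch below is not part of the UFM-HA tooling. It assumes the key = value layout shown in step 3, with comments on their own lines, and check_ha_nodes is a hypothetical helper name.

```shell
# check_ha_nodes: validate an ha_nodes.cfg-style file.
# Succeeds only if exactly one node has role = master and every
# [Node.N] section has a non-empty primary_ip.
check_ha_nodes() {
    awk -F'=' '
        /^[[:space:]]*#/ { next }            # skip comment lines
        /^\[Node\./      { nodes++; next }   # a new node section starts
        {
            key = $1; val = $2
            gsub(/[[:space:]]/, "", key)
            gsub(/[[:space:]]/, "", val)
            if (key == "role" && val == "master") masters++
            if (key == "primary_ip" && val != "") ips++
        }
        END {
            if (masters != 1) {
                printf "expected exactly one master, found %d\n", masters
                bad = 1
            }
            if (ips != nodes) {
                printf "primary_ip set on %d of %d nodes\n", ips, nodes
                bad = 1
            }
            exit bad + 0
        }
    ' "$1"
}
```

Running check_ha_nodes /etc/ufm_ha/ha_nodes.cfg prints a line for each problem found and returns a non-zero status if any check fails.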

© Copyright 2025, NVIDIA. Last updated on Aug 7, 2025.