UFM Enterprise Installation

NVIDIA Unified Fabric Manager (UFM) is a powerful platform for managing InfiniBand scale-out computing environments.
UFM enables data center operators to efficiently monitor and operate the entire fabric, boost application performance and maximize fabric resource utilization.

Note

If you do not have a valid license, please fill out the NVIDIA Enterprise Account Registration form to get a UFM evaluation license.

Save the license file on the master server at /tmp/license_file/

Note

Before installing UFM server software in High Availability mode, ensure that the requirements at this link are met.

Note

The UFM HA package requires a dedicated DRBD partition with the same size and name on both servers.

Note

After installing the UFM server software, make sure to configure the fabric_interface parameter in gv.cfg.

The fabric interface should be set to one of the InfiniBand IPoIB interfaces, which connect the UFM to the fabric.

Installing UFM on Docker Container - High Availability Mode

Pre-deployment Requirements

Install pacemaker, pcs, and drbd-utils on both servers

For Ubuntu:

            apt install pcs pacemaker drbd-utils

For CentOS/Red Hat, install one of the following package sets (DRBD 8.4 or DRBD 9.0):

            yum install pcs pacemaker drbd84-utils kmod-drbd84

            yum install pcs pacemaker drbd90-utils kmod-drbd90

A partition for DRBD on each server, with the same name on both servers (for example, /dev/sdd1). The recommended partition size is 10-20 GB; with a larger partition, DRBD sync takes a long time to complete.
The command hostname -i must return the IP address of the management interface used for pacemaker sync (if it does not, update the /etc/hosts file with the machine IP).
Create the directory /opt/ufm/files/ on each server with read/write permissions. UFM uses this directory to mount its files, and DRBD keeps it in sync between the servers.
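
The prerequisites above can be checked before installation with a short script. The following is a minimal sketch, not part of the UFM tooling: the function names are hypothetical, and the default partition /dev/sdd1 is only the example value used above. It treats a loopback address from hostname -i as a sign that /etc/hosts still needs updating.

```shell
# Hypothetical pre-flight helpers; all function names are illustrative only.

# Succeeds if the address is a loopback address (127.x.x.x or ::1).
is_loopback_ip() {
    case "$1" in
        127.*|::1) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the path is an existing, readable, and writable directory.
dir_is_rw() {
    [ -d "$1" ] && [ -r "$1" ] && [ -w "$1" ]
}

# Prints OK/FAIL lines for the three prerequisites listed above.
# $1: DRBD partition (assumed /dev/sdd1), $2: UFM files directory.
preflight() {
    part="${1:-/dev/sdd1}"
    files_dir="${2:-/opt/ufm/files/}"
    # hostname -i may print several addresses; only the first is checked.
    ip="$(hostname -i 2>/dev/null | awk '{print $1}')"

    if [ -n "$ip" ] && ! is_loopback_ip "$ip"; then
        echo "OK   hostname -i returns $ip"
    else
        echo "FAIL hostname -i returns '${ip}'; check /etc/hosts"
    fi
    if [ -b "$part" ]; then
        echo "OK   $part is a block device"
    else
        echo "FAIL $part is not a block device"
    fi
    if dir_is_rw "$files_dir"; then
        echo "OK   $files_dir is readable and writable"
    else
        echo "FAIL $files_dir is missing or not read/write"
    fi
}
```

Running preflight /dev/sdd1 /opt/ufm/files/ on each server prints one line per check. The script only reports; it does not change /etc/hosts or create the directory.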

Installing UFM Containers

On the master server, install the UFM Enterprise container with the command below:

            docker run -it --name=ufm_installer --rm \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /etc/systemd/system/:/etc/systemd_files/ \
-v /opt/ufm/files/:/installation/ufm_files/ \
-v /tmp/license_file/:/installation/ufm_licenses/ \
mellanox/ufm-enterprise:latest \
--install
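
Because the command above mounts /tmp/license_file/ into the installer, it may be worth confirming beforehand that the directory actually contains a license file. A minimal sketch follows; has_license_file is a hypothetical helper, and it makes no assumption about the license file's name or extension.

```shell
# Hypothetical helper: succeeds if the directory holds at least one regular file.
has_license_file() {
    dir="$1"
    [ -d "$dir" ] || return 1
    # find lists regular files directly inside the directory; one is enough.
    [ -n "$(find "$dir" -maxdepth 1 -type f 2>/dev/null | head -n 1)" ]
}
```

For example, has_license_file /tmp/license_file/ || echo "no license file found" could be run on the master server before starting the installer.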

On the standby (secondary) server, install the UFM Enterprise container with the command below:

            docker run -it --name=ufm_installer --rm \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /etc/systemd/system/:/etc/systemd_files/ \
-v /opt/ufm/files/:/installation/ufm_files/ \
mellanox/ufm-enterprise:latest \
--install

Downloading UFM HA Package

Download the UFM-HA package on both servers using the following command:

            wget https://www.mellanox.com/downloads/UFM/ufm_ha_5.5.0-9.tgz

For the SHA256 checksum file:

            wget https://download.nvidia.com/ufm/ufm_ha/ufm_ha_5.5.0-9.sha256
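
Once both files are downloaded, the package can be compared against the checksum. The sketch below assumes that the first whitespace-separated field on the first line of the .sha256 file is the hex digest; the file's actual layout is not documented here, so adjust the parsing if it differs. verify_sha256 is a hypothetical helper name.

```shell
# Hypothetical helper: compares a file's SHA-256 digest with the first field
# of a checksum file (assumed layout: "<hex digest> [file name]").
verify_sha256() {
    pkg="$1"
    sumfile="$2"
    expected="$(awk 'NR==1 {print tolower($1)}' "$sumfile" 2>/dev/null)"
    actual="$(sha256sum "$pkg" 2>/dev/null | awk '{print $1}')"
    # Fail if either digest is missing or they differ.
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Example (run on both servers after the downloads above):
# verify_sha256 ufm_ha_5.5.0-9.tgz ufm_ha_5.5.0-9.sha256 && echo "checksum OK"
```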

Installing UFM HA Package

For more information on the UFM-HA package and all installation and configuration options, please refer to UFM High Availability User Guide.

[On Both Servers] Extract the downloaded UFM-HA package under /tmp/
[On Both Servers] Go to the extracted directory /tmp/ufm_ha_XXX and run the installation script. For example, if your DRBD partition is /dev/sda5 run the following command:
```
./install.sh -l /opt/ufm/files/ -d /dev/sda5 -p enterprise
```
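
The two steps above can be scripted so that the extracted directory does not have to be typed by hand. This is a minimal sketch: find_ha_dir is a hypothetical helper that looks for a directory named ufm_ha_* containing install.sh, and the archive name in the commented example is the one from the download step.

```shell
# Hypothetical helper: prints the first directory under $1 (default /tmp) that
# matches ufm_ha_* and contains install.sh; fails if none is found.
find_ha_dir() {
    base="${1:-/tmp}"
    for d in "$base"/ufm_ha_*/; do
        if [ -f "${d}install.sh" ]; then
            printf '%s\n' "${d%/}"
            return 0
        fi
    done
    return 1
}

# Example (on both servers; /dev/sda5 is the example partition from above):
# tar -xzf ufm_ha_5.5.0-9.tgz -C /tmp/
# cd "$(find_ha_dir /tmp)" && ./install.sh -l /opt/ufm/files/ -d /dev/sda5 -p enterprise
```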

Configuring UFM HA

There are three methods to configure the HA cluster:

Configure HA with SSH Trust (Dual Link Configuration) - Requires passwordless SSH connection between the servers.
Configure HA without SSH Trust (Dual Link Configuration) - Does not require passwordless SSH connection between the servers, but requires you to run configuration commands on both servers.
Configure HA without SSH Trust (Single Link Configuration) - Can be used in cases where only one link is available between the two UFM HA nodes/servers.

Configure HA with SSH Trust (Dual Link Configuration)

On the master server only, configure the HA nodes. To do so, run the configure_ha_nodes.sh command from /tmp, as shown in the example below:
```
configure_ha_nodes.sh \
--cluster-password 12345678 \
--master-primary-ip 10.10.50.1 \
--standby-primary-ip 10.10.50.2 \
--master-secondary-ip 192.168.10.1 \
--standby-secondary-ip 192.168.10.2 \
--no-vip
```
Note

The script configure_ha_nodes.sh is located under /usr/local/bin/; therefore, by default, you do not need to use the full path to run it.

Note

The --cluster-password must be at least 8 characters long.

Note

When using back-to-back ports with local IP addresses for HA sync interfaces, ensure that you add your IP addresses and hostnames to the /etc/hosts file. This is needed to allow the HA configuration to resolve hostnames correctly based on the IP addresses you are using.

Note

configure_ha_nodes.sh requires an SSH connection to the standby server. If SSH trust is not configured, you are prompted to enter the SSH password of the standby server while the configuration runs.
Depending on the size of your partition, wait for the configuration process to complete and DRBD sync to finish. To check the DRBD sync status, run:
```
ufm_ha_cluster status
```
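
Waiting for the sync can be automated with a polling loop. The output format of ufm_ha_cluster status is not shown on this page, so the sketch below does not try to parse it; it is a generic, hypothetical wait_for helper that re-runs a check you supply until the check succeeds or a timeout expires.

```shell
# Hypothetical helper: wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; fails once TIMEOUT_SECONDS have elapsed.
wait_for() {
    wf_timeout="$1"
    wf_interval="$2"
    shift 2
    wf_elapsed=0
    while ! "$@"; do
        if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
            return 1
        fi
        sleep "$wf_interval"
        wf_elapsed=$((wf_elapsed + wf_interval))
    done
    return 0
}

# Example, assuming you write a sync_done function that inspects the output of
# "ufm_ha_cluster status" on your system and succeeds once the sync is complete:
# wait_for 3600 30 sync_done && echo "DRBD sync finished"
```

Keeping the check as a separate command means the loop stays valid even if the status output differs between UFM HA versions.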

Configure HA without SSH Trust (Dual Link Configuration)

If you cannot establish an SSH trust between your HA servers, you can use ufm_ha_cluster directly to configure HA. You can see all the options for configuring HA in the Help menu:

            ufm_ha_cluster config -h

To configure HA, follow the instructions below:

Note

Please change the variables in the commands below based on your setup.

[On Standby Server] Run the following command to configure Standby Server:

            ufm_ha_cluster config -r standby -e <peer ip address> -l <local ip address> -p <cluster_password>

[On Master Server] Run the following command to configure Master Server:

            ufm_ha_cluster config -r master -e <peer ip address> -l <local ip address> -p <cluster_password> -i <virtual ip address>
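
Since the two commands differ only in the role, the swapped peer/local addresses, and the virtual IP, a small helper can assemble them from variables and reduce copy-paste mistakes. The sketch below is hypothetical (ha_config_cmd is not part of the UFM HA package) and only prints the command; it uses the -r, -e, -l, -p, and -i options exactly as shown above. Applying the 8-character minimum to -p is an assumption carried over from the --cluster-password note for configure_ha_nodes.sh.

```shell
# Hypothetical helper: prints the ufm_ha_cluster config command for one node.
# Usage: ha_config_cmd ROLE PEER_IP LOCAL_IP PASSWORD [VIRTUAL_IP]
ha_config_cmd() {
    role="$1"
    peer="$2"
    local_ip="$3"
    password="$4"
    vip="${5:-}"
    case "$role" in
        master|standby) ;;
        *) echo "role must be master or standby" >&2; return 1 ;;
    esac
    # Assumption: the 8-character minimum also applies to -p.
    if [ "${#password}" -lt 8 ]; then
        echo "cluster password must be at least 8 characters" >&2
        return 1
    fi
    cmd="ufm_ha_cluster config -r $role -e $peer -l $local_ip -p $password"
    # -i (virtual IP) appears only in the master command above.
    if [ "$role" = "master" ] && [ -n "$vip" ]; then
        cmd="$cmd -i $vip"
    fi
    printf '%s\n' "$cmd"
}

# Example with placeholder addresses; review the printed command, then run it:
# ha_config_cmd standby 10.10.50.1 10.10.50.2 12345678
# ha_config_cmd master  10.10.50.2 10.10.50.1 12345678 10.10.50.100
```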

Configure HA without SSH Trust (Single Link Configuration)

Warning

This is not the recommended configuration and, in case of network failure, it might cause HA cluster split brain.

If you cannot establish an SSH trust between your HA servers, you can use ufm_ha_cluster directly to configure HA. To configure HA, follow the instructions below:

Note

Please change the variables in the commands below based on your setup.

[On Standby Server] Run the following command to configure Standby Server:

            ufm_ha_cluster config \
-r standby \
-e 10.212.145.5 \
-l 10.212.145.6 \
--enable-single-link

[On Master Server] Run the following command to configure Master Server:

            ufm_ha_cluster config -r master \
-e 10.212.145.6 \
-l 10.212.145.5 \
-i 10.212.145.50 \
--enable-single-link

After configuration, wait for the DRBD sync to finish; the time required depends on the size of your partition. To check the DRBD sync status, run:

            ufm_ha_cluster status

IPv6 Example:

            ufm_ha_cluster config -r standby -l fcfc:fcfc:209:224:20c:29ff:fee7:d5f2 -e fcfc:fcfc:209:224:20c:29ff:fecb:4962 --enable-single-link -p some_secret

Starting HA Cluster

To start UFM HA cluster:

             ufm_ha_cluster start

To check UFM HA cluster status:

            ufm_ha_cluster status

To stop UFM HA cluster:

            ufm_ha_cluster stop

To uninstall UFM HA, first stop the cluster and then run the uninstallation command as follows:
```
/opt/ufm/ufm_ha/uninstall_ha.sh
```
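
The stop-then-uninstall order can be enforced with a small wrapper so the uninstall script never runs against a cluster that failed to stop. This is a sketch only: run and uninstall_ufm_ha are hypothetical names, and the DRY_RUN variable is an assumption added here to allow previewing the commands without executing them.

```shell
# Hypothetical wrapper. With DRY_RUN=1 it only prints the commands it would run.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Stops the HA cluster first; runs the uninstall script only if the stop succeeded.
uninstall_ufm_ha() {
    run ufm_ha_cluster stop || return 1
    run /opt/ufm/ufm_ha/uninstall_ha.sh
}

# Example: preview first, then run for real.
# DRY_RUN=1; uninstall_ufm_ha
# DRY_RUN=0; uninstall_ufm_ha
```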