NVIDIA UFM Cyber-AI Documentation v2.4.0

High Availability

Overview

UFM HA provides host-level high availability for UFM products (UFM Enterprise, UFM Appliance, and UFM Cyber-AI). The solution uses Pacemaker to monitor services and DRBD to synchronize file-system state between the nodes. The HA package can be used with both bare-metal and Dockerized UFM products.

UFM HA should be installed on two machines, master and standby.

Supported Platforms

Ubuntu
CentOS Master

Prerequisites

Pacemaker packages

pacemaker
pcs
corosync

DRBD Package

DRBD utils 8.4 or later.
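
A quick way to confirm the prerequisites before installing is to check that the relevant executables are on the PATH. The sketch below assumes the packages provide the `pacemakerd`, `pcs`, `corosync`, and `drbdadm` binaries, which is typical but may differ by distribution; the function name `missing_cmds` is illustrative and not part of UFM HA.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed binary names; adjust them to match your distribution's packages.
missing=$(missing_cmds pacemakerd pcs corosync drbdadm)
if [ -n "$missing" ]; then
    echo "Missing prerequisites:" $missing
else
    echo "All prerequisite commands found."
fi
```

This only checks that the commands exist. It does not verify the DRBD utils version, which still needs to be checked against the 8.4 minimum.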

Configuration

ufm_ha_cluster usage

            ufm_ha_cluster --help
Usage: ufm_ha_cluster [-h|--help] <command> [<options>]
This script manages ufm HA cluster.
 
OPTIONS:
   -h|--help                        Show this message
 
COMMANDS:
    config          Configure HA cluster
    set-password    Change hacluster password
    status          Check HA cluster status
    failover        Master node failover
    takeover        Standby node takeover
    start           Start HA services
    stop            Stop HA services 
    attach          attach new standby node from cluster
    detach          detach the old standby to cluster
 
For more help about each command, type:
  ufm_ha_cluster <command> --help

Setting HA Cluster Password

The HA cluster user (hacluster) is the account Pacemaker uses for synchronization. Its password must be the same on both machines. To set the password, run the following command on both machines (the order does not matter).

            ufm_ha_cluster set-password -p <new-password>

Configuring Pacemaker and DRBD

            ufm_ha_cluster config --help
Usage: ufm_ha_cluster config [<options>]
 
The config command configures ha add-on for ufm server.
 
OPTIONS:
    -r | --role <node role>             Node role (master or standby)
                                        mandatory.
    -n | --peer-node <node-hostname>    Peer node name.
                                        mandatory.
    -s | --peer-sync-ip <ip address>    Peer node sync ip adreess
                                        mandatory.
    -c | --sync-interface               Local interface to be used for drbd sync
                                        mandatory.
    -i | --virtual-ip <virtual-ip>      Cluster virtual IP.
                                        mandatory.
    -f | --ha-config-file <file path>   HA configuration file.
                                        default: ufm-ha.conf
    -p | --hacluster-pwd <pwd>          hacluster user password
                                        default: default password
    -h | --help                         Show this message

Warning

Run the configuration script on the standby machine first, then on the master machine.
The config command does not start the UFM services; you must start them separately from the master machine.
The initial file-system sync between master and standby may take a few minutes, depending on the speed of your sync interface.
Wait for the sync to complete before starting the services. You can monitor its progress with the status command.
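
The ordering described in this warning can be sketched as a small dry-run helper that only prints the commands to run on each machine. The flags come from the config help text above. The host names, IP addresses, interface name, and the function name `config_cmd` are placeholders for illustration, and the final step assumes that the `start` command from the usage text is what launches the services.

```shell
#!/bin/sh
# Build (but do not run) a config command line from its mandatory options.
# Arguments: role, peer hostname, peer sync IP, local sync interface, virtual IP.
config_cmd() {
    printf 'ufm_ha_cluster config -r %s -n %s -s %s -c %s -i %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Placeholder values for illustration only.
echo "# 1. On the standby machine:"
config_cmd standby ufm-master 10.0.0.1 eth1 192.168.1.100
echo "# 2. On the master machine:"
config_cmd master ufm-standby 10.0.0.2 eth1 192.168.1.100
echo "# 3. On either machine, repeat until the sync has finished:"
echo "ufm_ha_cluster status"
echo "# 4. On the master machine, once the sync is complete:"
echo "ufm_ha_cluster start"
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; on real nodes each line would be run by hand on the machine named in the preceding comment.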

If you are using high availability for both UFM Cyber-AI and UFM Enterprise, change the following line in the ufm-ha.conf file:

            systemd_services=ufm-cyberai

to:

            systemd_services=ufm-cyberai ufm-ha-watcher ufm-enterprise
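
If you prefer to script this change, the following is a minimal sketch using `sed`. The function name `enable_enterprise_ha` is hypothetical, and the location of ufm-ha.conf is not stated here, so the example works on a scratch file; pass the real path once you have located it.

```shell
#!/bin/sh
# Rewrite the systemd_services line in the given HA config file.
# A backup copy is kept next to it with a .bak suffix.
enable_enterprise_ha() {
    conf_file=$1
    sed -i.bak \
        's/^\([[:space:]]*\)systemd_services=ufm-cyberai[[:space:]]*$/\1systemd_services=ufm-cyberai ufm-ha-watcher ufm-enterprise/' \
        "$conf_file"
}

# Demonstrate on a scratch copy rather than the real ufm-ha.conf.
demo=$(mktemp)
printf 'systemd_services=ufm-cyberai\n' > "$demo"
enable_enterprise_ha "$demo"
cat "$demo"
rm -f "$demo" "$demo.bak"
```

The pattern is anchored to the end of the line, so running the function a second time leaves an already updated file unchanged.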

Stopping UFM Services

You may stop UFM services using the following stop command.

            ufm_ha_cluster stop

Takeover Services

Run the takeover command on the standby machine to make it the master.

            ufm_ha_cluster takeover

Master Failover

Run the failover command on the master machine to make it the standby.

            ufm_ha_cluster failover

Replace HA Node

To replace the old standby node, detach it from the cluster, configure the new standby, and then attach the new standby to the cluster.

On the master, run the detach command:

            ufm_ha_cluster detach

On the new standby, run the config command described under Configuring Pacemaker and DRBD. For more information, refer to ufm-cai-jobs.

On the master node, run the attach command:

            ufm_ha_cluster attach -n <peer_node> -s <peer_sync_ip> -p <hacluster-pwd> -c <sync-interface>
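
The three replacement steps can be summarized in a dry-run sketch that prints the commands without running them. It assumes the options shown above are passed to the `attach` subcommand listed in the usage text; the host name, IP address, interface name, and the function name `attach_cmd` are placeholders for illustration.

```shell
#!/bin/sh
# Build (but do not run) the attach command for a new standby node.
# Arguments: peer hostname, peer sync IP, hacluster password, sync interface.
attach_cmd() {
    printf 'ufm_ha_cluster attach -n %s -s %s -p %s -c %s\n' "$1" "$2" "$3" "$4"
}

echo "# 1. On the master:"
echo "ufm_ha_cluster detach"
echo "# 2. On the new standby: run ufm_ha_cluster config with -r standby"
echo "# 3. On the master:"
attach_cmd new-standby 10.0.0.3 '<hacluster-pwd>' eth1
```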
