High Availability

NVIDIA UFM Enterprise User Manual v6.10.0

UFM provides High Availability (HA) mechanisms to allow smooth fabric operation even if the UFM server fails or the connection between the UFM server and the rest of the fabric is not operating optimally.

UFM High Availability requires two distinct servers running UFM software: one is initially configured as the UFM active server and the other as the UFM standby server. When the UFM active server fails or communication with it is lost, the UFM standby server takes over and becomes the new UFM active server. After such a failover, the old active UFM server can be repaired and brought back online as the new UFM standby server.

Warning

Throughout this document, the following terms are used interchangeably:

  • Master and Active

  • Slave and Standby

UFM recovery relies on three mechanisms:

  • UFM Database replication (from active to standby server)

  • UFM Keep Alive (heartbeat) mechanism

  • UFM server failover

For information about installing and running the UFM software for High Availability, see Installing UFM Server Software for High Availability.

HA-Related Events

When the UFM server fails over to the UFM standby server, a UFM Failover event is generated.

HA-Related Considerations

We recommend that you locate the active and standby UFM servers in different sections of the fabric, so a single failure of an edge switch or a line card will not disconnect both UFM servers from the fabric.

We recommend that you bring up the failed UFM server as quickly as possible, to enable the fabric to sustain a possible secondary failure of the new active UFM server.

Important

A secondary failover (from the new active server to the newly brought-up standby server) will succeed only after the initial replication of the UFM database to the new standby server has completed. UFM can therefore sustain a second failover only a few minutes after the new UFM standby server is up and running. The exact time depends on the size of the replicated partition and on the speed of the link between the active and standby servers.
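
As a rough illustration of how partition size and link speed drive that window, the following Python sketch estimates a lower bound for the initial replication time. The function name, the 10 GiB partition size, and the 0.8 efficiency factor are illustrative assumptions and not UFM or drbd defaults. A configured drbd resync rate limit or disk throughput may make the real time longer.

```python
def estimate_initial_sync_seconds(partition_gib: float,
                                  link_gbps: float,
                                  efficiency: float = 0.8) -> float:
    """Estimate a lower bound, in seconds, for a full initial replication.

    partition_gib: size of the replicated partition in GiB (assumed value).
    link_gbps:     nominal link speed between the two servers in Gb/s.
    efficiency:    assumed fraction of the nominal link speed that is usable.
    """
    if partition_gib <= 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("sizes and speeds must be positive; efficiency in (0, 1]")
    partition_bits = partition_gib * 1024 ** 3 * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return partition_bits / usable_bits_per_second


if __name__ == "__main__":
    # Hypothetical 10 GiB partition over a 1 Gb Ethernet management link.
    seconds = estimate_initial_sync_seconds(10, 1.0)
    print(f"about {seconds / 60:.1f} minutes")
```

Under these assumptions the estimate is a little under two minutes, which is consistent with the "few minutes" stated above. Larger partitions or slower links extend the period during which a second failover cannot succeed.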


The high availability capability is based on standard Linux packages – heartbeat and drbd.

Both heartbeat and drbd are installed on the master and slave nodes:

  • drbd synchronizes a replicated partition between the two servers (the partition itself, /dev/drbd0, is visible only on the master node).
    /opt/ufm/files is mounted on the drbd device, and all data under this directory (partition) is replicated.

  • Heartbeat is responsible for starting UFM on the master node and stopping it on the slave node.
    Heartbeat sends "keep alive" messages between the two servers, and when the master fails, the slave assumes mastership.

    Warning

    For high availability, use a reliable and high-capacity out-of-band management network (1 Gb Ethernet is recommended). Using in-band IPoIB will cause an HA split-brain condition if there is an InfiniBand network failure.
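
The relationship between the drbd device and /opt/ufm/files described above can be checked from a node's mount table. The following Python sketch parses text in the Linux /proc/mounts format and reports whether the mount point is backed by a drbd device. The function names and the sample mount table (including the ext4 file system type) are illustrative assumptions; this is not a UFM tool.

```python
from typing import Optional

# Sample text in /proc/mounts format. The file system type and options are
# assumptions; only /dev/drbd0 and /opt/ufm/files come from the text above.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/drbd0 /opt/ufm/files ext4 rw,relatime 0 0
"""


def find_mount_device(mounts_text: str, mount_point: str) -> Optional[str]:
    """Return the device mounted at mount_point, or None if nothing is mounted."""
    device = None
    for line in mounts_text.splitlines():
        parts = line.split()
        # Fields are: device, mount point, type, options, dump, pass.
        if len(parts) >= 2 and parts[1] == mount_point:
            device = parts[0]  # a later entry overrides an earlier one
    return device


def is_drbd_backed(mounts_text: str, mount_point: str = "/opt/ufm/files") -> bool:
    """True if mount_point is mounted from a /dev/drbd* device."""
    device = find_mount_device(mounts_text, mount_point)
    return device is not None and device.startswith("/dev/drbd")


if __name__ == "__main__":
    print(is_drbd_backed(SAMPLE_MOUNTS))
```

On a real node, the contents of /proc/mounts could be passed in place of the sample text. Given the description above, the check would be expected to return True only on the master node, where the drbd device is visible.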

A virtual IP (VIP) address is an IP address that is not tied to a specific computer or network interface card (NIC) on a computer. Incoming packets are sent to the VIP address, but all packets travel through real network interfaces.

The VIP address belongs to the master node; when the system fails over, the virtual IP fails over to the second node as well. When using UFM with HA, always use the virtual IP instead of a server's own IP address to ensure that you are connected to the UFM master server.

Failover does not take place if the standby server is not ready to assume mastership. UFM Health periodically checks the standby server's readiness against the following:

  • management network connectivity

  • DRBD state

  • disk space availability

  • connection to the same InfiniBand cluster

  • management InfiniBand port state (Active) and IPoIB interface state (UP and RUNNING)

If any of the conditions above is not met, UFM Health sends a critical event. It is strongly recommended that you repair the standby server as soon as possible to reduce the risk of cluster malfunction.
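
The readiness rule above (all conditions must hold, otherwise the standby is not ready) can be sketched in Python as follows. This is a minimal illustration of the logic only and not UFM Health's implementation. The class, field, and function names are hypothetical, and how each condition is actually measured is outside the scope of the sketch.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class StandbyReadiness:
    """One boolean per readiness condition listed above (names are hypothetical)."""
    management_network_ok: bool
    drbd_state_ok: bool
    disk_space_ok: bool
    same_infiniband_cluster: bool
    ib_port_active_and_ipoib_up: bool


def failed_checks(readiness: StandbyReadiness) -> List[str]:
    """Return the names of the conditions that are not met."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


def is_ready_for_failover(readiness: StandbyReadiness) -> bool:
    """The standby is ready only if every condition is met."""
    return not failed_checks(readiness)


if __name__ == "__main__":
    status = StandbyReadiness(True, False, True, True, True)
    if not is_ready_for_failover(status):
        # In UFM, an unmet condition results in a critical event.
        print("standby not ready:", ", ".join(failed_checks(status)))
```

Reporting the names of the failed conditions, rather than a single yes or no value, makes it easier to see which part of the standby server needs repair.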

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.