NVIDIA UFM High-Availability User Guide v5.3.2

Overview

UFM HA provides High-Availability on the host level for UFM products (UFM Enterprise/UFM Appliance Gen 3.0 and UFM Cyber-AI). The solution is based on Pacemaker to monitor host resources, services, and applications; and DRBD to sync file-system states. The HA package can be used with both bare-metal and Dockerized UFM deployments.

UFM HA should be installed on the master and standby nodes. The below figure describes the UFM Enterprise HA Architecture.

(Figure: UFM Enterprise HA Architecture)

The below files are replicated between the master and standby nodes:

/opt/ufm/files/*

Examples: log files, events, SQLite DB files (configuration, telemetry history, persistent states, topology groups).
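
As a quick sanity check on the active node, the replicated directory should reside on the DRBD-backed filesystem. The commands below are a minimal sketch; the device and resource names they report depend on the specific deployment.

    mount | grep /opt/ufm/files     # on the master, the directory is mounted from the DRBD-backed device
    drbdadm status                  # the DRBD resource should be Primary and UpToDate on the master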

The master and standby nodes communicate with each other to establish and monitor the High-Availability solution. This connectivity is used by both Pacemaker and DRBD. The connectivity options are:

  1. Cloud Connectivity. The following figure describes the external network infrastructure.

    (Figure: External network infrastructure)

  2. Back-to-back Connectivity, described in the following figure.

    (Figure: Back-to-back connectivity)

UFM-HA employs a dual-link configuration, comprising primary and secondary connections, to enhance system stability and reduce the risk of connectivity failures. It uses two prioritized IP addresses, primary and secondary, which Pacemaker uses to establish two connectivity links. DRBD synchronizes data over the primary IP address; it is therefore recommended to assign this address to an interface with a high transfer rate, such as an InfiniBand interface (IP over IB), for optimal performance and rapid DRBD synchronization. The secondary connectivity link can be established over the management interface, typically an Ethernet interface.

DRBD and Pacemaker can use the same network interface or utilize different interfaces. For example, while the Pacemaker connectivity can be done through the management interface (usually an Ethernet interface), the DRBD synchronization could be done on an InfiniBand interface for better performance (IP over IB).
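
For illustration, the sketch below shows one way to verify which addresses the two links use on a node. The interface names (ib0 for IP over IB, eno1 for the Ethernet management interface) are assumptions and will differ per deployment.

    ip -br addr show ib0       # assumed IPoIB interface carrying the primary (DRBD sync) IP
    ip -br addr show eno1      # assumed Ethernet management interface carrying the secondary IP
    drbdadm status             # DRBD resource state and the peer connection used for synchronization
    corosync-cfgtool -s        # Corosync/Pacemaker link (ring) status and the addresses in use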

See below the configuration options for selecting a connectivity setup:

  1. No VIP Connectivity Option

    (Figure: No VIP connectivity)

For some constrained network environments, the no-VIP connectivity option is supported. In this architecture, every UFM node has two physical IP addresses, primary and secondary. There is no VIP (floating) IP address representing the whole cluster. This option allows the two cluster nodes to reside in different subnets. In such a setup, clients that communicate with the UFM cluster must either be aware of which node is currently active or try both nodes in turn.
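
The following is a minimal sketch of such a client-side fallback using curl. The node addresses and the REST endpoint path are assumptions for illustration only, and authentication options are omitted.

    # Assumed physical IP addresses of the two UFM nodes (illustration only)
    NODE_A=10.0.0.11
    NODE_B=10.0.0.12

    # Query the first node; if it does not answer, fall back to the second node
    curl -k -s --connect-timeout 5 "https://${NODE_A}/ufmRest/app/ufm_version" \
      || curl -k -s --connect-timeout 5 "https://${NODE_B}/ufmRest/app/ufm_version"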

The cluster software monitors the following HA cluster resources:

  • UFM Enterprise

    A systemd service runs and monitors all UFM Enterprise processes.

  • UFM HA Watcher

    The ufm-ha-watcher service monitors the health status of UFM Enterprise and performs a failover when the ufm-health process determines that one is required.

  • Virtual IP

    Also known as Cluster IP, a virtual IP is a unique IP resource allocated on the master node. The virtual IP address should be reachable from any machine that uses it (REST API or UI). Virtual IP is not a mandatory configuration and can be omitted.

  • DRBD and File System

    DRBD needs its own block device on each node. This can be a physical disk partition or a logical volume. The volume size should be planned according to the specific cluster sizing. UFM-HA creates a DRBD resource and a filesystem resource whose primary/secondary state depends on whether the node is the master or the standby node (see the sketch after this list).
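
As a rough illustration of the backing-device planning described above, the sketch below creates a dedicated logical volume for DRBD and then checks the resource role. The volume group name and volume size are assumptions and must be sized for the specific cluster.

    # Assumed volume group and size -- plan the size according to your cluster sizing
    lvcreate -L 300G -n ufm_drbd_data vg_ufm

    # After UFM-HA configures DRBD, the roles reflect the node state
    drbdadm status             # Primary on the master node, Secondary on the standby node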

Cluster network access is used to consume UFM REST APIs and the UI, or to perform management or monitoring tasks (SSH, SCP, syslog, etc.).

For access to the UFM cluster, the below five IP addresses should be configured:

  • Primary Physical IP1 – For the master node

  • Secondary Physical IP1 – For the master node

  • Primary Physical IP2 – For the standby node

  • Secondary Physical IP2 – For the standby node

  • Virtual (floating) IP (VIP)

Each pair of IP addresses of the same class should be configured in the same subnet and be accessible (routable) by both cluster nodes. The virtual IP address should be in the subnet of one of these classes. The cluster manages the state of the virtual IP address. By default, the VIP is assigned to the master node; if the master node fails, the cluster software moves the VIP to the standby node.

Network failures between clients and the UFM cluster are not monitored or handled by the HA cluster, and network infrastructure redundancy is outside the scope of the UFM HA solution. UFM HA cluster components use L3 communication protocols (TCP/IP) for their internal communication and are agnostic to the underlying L2 networking infrastructure.
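
A minimal sketch for checking where the VIP currently resides is shown below; the VIP address used is an assumption for the example.

    # Assumed virtual (floating) IP address (illustration only)
    VIP=10.0.0.10

    pcs status resources            # the VIP resource is listed as started on the current master
    ip addr show | grep "$VIP"      # run on each node; the VIP appears only on the current master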

UFM HA is supported on the following Linux distributions:

  1. Ubuntu 18.04, 20.04, and 22.04

  2. CentOS 7.7-7.9

  3. CentOS 8 Stream, RHEL 8.5

  4. CentOS 9 Stream, RHEL 9.x (2023)

© Copyright 2024, NVIDIA. Last updated on Jul 8, 2024.