NVIDIA UFM High-Availability User Guide v6.0.0

Standby Node Replacement

UFM Standby Node Replacement automates the process of replacing a failed standby node in a UFM High Availability (HA) cluster, minimizing downtime and reducing the need for manual operations.

This capability streamlines the preparation, configuration, and reintegration of a replacement node into the existing UFM-HA cluster.

The ufm_replace_standby CLI tool manages the entire workflow, hiding the underlying complexity behind a simple, user-friendly interface for preparing, executing, and validating the standby node replacement. When a standby node fails, this allows a replacement node to be integrated smoothly into the UFM cluster with minimal service interruption and manual effort.

Before starting the replacement procedure, ensure the following conditions are met:

  • The failed standby node has been completely removed from the cluster and disconnected from the network.

  • The replacement standby node is physically connected to the cluster network using the correct interfaces — both InfiniBand (IB) and Management links — in a Back‑to‑Back topology.

  • The replacement standby node is running the same UFM appliance version as the master node.

  • UFM‑HA version 6.0.0‑6 (with Seamless RMA support) is installed on both the master and standby nodes.

The ufm_replace_standby tool automates the replacement of a UFM‑HA standby node. It manages the complete workflow — from preparing and validating the new standby node to fully integrating it into the cluster.

To run the full replacement procedure, use:

ufm_replace_standby run <Options> 

Command Options:

--standby-primary-ip=<ip>
    Primary IP address of the new standby node.

--standby-secondary-ip=<ip>
    Secondary IP address of the new standby node.

--hacluster-password=<pwd>
    Optional. Password for the hacluster user. If the --reset_hacluster-password option is also specified, this value is set as the new hacluster password. If not provided, the user is prompted to enter the password interactively during the replacement process.

--reset_hacluster-password
    Optional. Resets the existing hacluster password. Defaults to false.

--preserve-ssh-trust
    Optional. Preserves the established SSH trust after the replacement process completes, whether it succeeds or fails. If not set, SSH trust created during the run is removed at the end of a successful run; trust that existed before execution is left in place.

--verbose
    Optional. Enables extended debug output in the console.
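For example, a typical full run might look like the following (the IP addresses are placeholders for the replacement node's actual addresses):

ufm_replace_standby run --standby-primary-ip=10.10.10.2 --standby-secondary-ip=192.168.10.2 --preserve-ssh-trust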

The ufm_replace_standby run command executes the complete standby node replacement sequence, which includes:

  1. Establish SSH Trust

    • Checks if SSH trust is already in place.

    • If not, generates SSH keys and prompts the user to copy them securely to enable passwordless access (see the sketch after this list).

  2. Validate the New Standby Node

    • Confirms readiness for integration by:

      • Verifying required network interfaces.

      • Ensuring UFM‑HA and HA stack versions match the master node.

      • Checking that ufm_ha is installed and supports standby replacement.

    • All validation steps are performed remotely over SSH.

  3. Transfer Files

    • Copies ha_nodes.cfg from the master node to /etc/ufm_ha/ha_nodes.cfg on the standby node to maintain consistent HA configuration (see the sketch after this list).

  4. Detach Old Standby

    • Removes the old standby node from HA configuration (PCS and DRBD), even if it is already disconnected.

    • Ensures the cluster is ready to accept the new standby.

    • Performed on the master node.

  5. Configure the New Standby Node

    • Prepares the standby node by:

      • Initializing the DRBD disk (if applicable).

      • Setting the hacluster user password.

      • Performing required pre‑join setup tasks.

    • Executed remotely on the standby node via SSH.

  6. Attach the New Standby

    • Adds the standby node to the UFM‑HA cluster by:

      • Authenticating with the hacluster user.

      • Updating PCS configuration.

      • Adjusting DRBD settings to include the new standby.

    • Actions performed from the master node.

  7. Clean‑up

    • Removes temporary SSH trust files (e.g., encrypted password, secret file).

    • If --preserve-ssh-trust is not set, removes SSH trust created during the process.

  8. Cluster Validation

    • Runs automated checks to confirm:

      • The new standby is successfully integrated into the cluster.

      • HA daemons and UFM services are running.

      • DRBD connectivity and disk state (if applicable) are healthy.
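For illustration, the SSH trust and file transfer described in steps 1 and 3 correspond roughly to standard OpenSSH operations like the ones below. This is a simplified sketch, not the tool's exact implementation; <new-standby-ip> is a placeholder, and the master-side location of ha_nodes.cfg is assumed to match the standby-side path.

# Step 1 (sketch): establish passwordless SSH trust from the master to the new standby
ssh-keygen -t rsa                    # generate a key pair if none exists
ssh-copy-id root@<new-standby-ip>    # install the public key on the standby node

# Step 3 (sketch): copy the HA nodes configuration to the standby
scp /etc/ufm_ha/ha_nodes.cfg root@<new-standby-ip>:/etc/ufm_ha/ha_nodes.cfg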

If the replacement process fails at any stage, ufm_replace_standby can resume from the failure point using its built‑in progress tracking.

After fixing the cause of the failure, resume without starting over:

ufm_replace_standby run --resume [--hacluster-password=<pwd>]

Short form:

ufm_replace_standby --resume [--hacluster-password=<pwd>]

Note: If the process fails, the hacluster password is not retained, even if it was provided in the original run. You must either re-enter it using --hacluster-password or provide it interactively.

To abort an incomplete process and start fresh:

ufm_replace_standby run --abort

Short form:

ufm_replace_standby --abort

Once ufm_replace_standby completes all replacement steps, it automatically runs a series of validation checks to confirm successful integration of the new standby node.

The validation process includes:

  • Confirming that the new standby has joined the UFM‑HA cluster.

  • Verifying that HA stack daemons and UFM services are running.

  • Checking disk state and DRBD connectivity (if applicable).

Each validation step is subject to a predefined timeout. For example, the tool will wait up to 10 seconds for DRBD to reach a connected and active state. If a timeout is exceeded, the validation process stops.

Note: A validation timeout does not necessarily indicate a cluster failure. In such cases, manually verify the cluster status using ufm_ha_cluster and other HA monitoring tools to ensure the cluster is functioning properly.
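For example, the following standard commands can help with a manual check (ufm_ha_cluster is the UFM-HA management CLI; pcs and drbdadm are the underlying Pacemaker and DRBD tools, so exact subcommands and output may vary by release):

ufm_ha_cluster status    # overall UFM-HA cluster and service status
pcs status               # Pacemaker/Corosync node and resource status
drbdadm status           # DRBD connection and disk state (if DRBD is used)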

© Copyright 2025, NVIDIA. Last updated on Aug 7, 2025.