NVIDIA UFM High-Availability User Guide v5.8.0

Monitoring and Troubleshooting


Check UFM Status

Run the below command on the master node:

            systemctl status ufm-enterprise.service
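
For scripted monitoring, the same check can be reduced to a single state word. The following is a minimal sketch, not part of the UFM tooling: it relies on the standard `systemctl is-active` query, and the `describe_unit_state` helper and its messages are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turns the state word printed by `systemctl is-active`
# into a short human-readable message.
describe_unit_state() {
    case "$1" in
        active)     echo "UFM is running" ;;
        activating) echo "UFM is starting" ;;
        inactive)   echo "UFM is stopped" ;;
        failed)     echo "UFM failed; inspect the service logs" ;;
        *)          echo "UFM state unknown: $1" ;;
    esac
}

# Query systemd only when systemctl is available on this host.
if command -v systemctl >/dev/null 2>&1; then
    # is-active exits non-zero for anything but "active", so ignore the exit code
    # and interpret the printed state instead.
    state="$(systemctl is-active ufm-enterprise.service 2>/dev/null || true)"
    describe_unit_state "$state"
fi
```

Run it on the master node, as with the command above. The full `systemctl status` output remains the better source when you need recent log lines.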

Check HA Status

Run the below commands:

            ufm_ha_cluster status
            pcs status

Check DRBD Status

Run the below command:

            ufm_ha_cluster status

Show DRBD Resource

Run the below command:

            drbdadm sh-resources

Show DRBD Disk State

Run the below command:

            drbdadm dstate ha_data

Show DRBD Role

Run the below command:

            drbdadm role ha_data

Show DRBD Connectivity

Run the below command:

            drbdadm cstate ha_data
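
The three `drbdadm` queries above can be combined into one health check. The sketch below assumes DRBD 8.4-style output, where the disk state and role are printed as `local/peer` pairs (for example `UpToDate/UpToDate` and `Primary/Secondary`). Other DRBD versions may print these values differently. The `drbd_is_healthy` function is a hypothetical helper, not a DRBD or UFM command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 only when all three values look healthy.
# $1 = disk state, $2 = role, $3 = connection state (DRBD 8.4-style output assumed).
drbd_is_healthy() {
    [ "$1" = "UpToDate/UpToDate" ] || return 1   # both disks fully synchronized
    [ "$3" = "Connected" ] || return 1           # replication link established
    case "$2" in
        Primary/Secondary|Secondary/Primary) return 0 ;;  # exactly one Primary
        *) return 1 ;;
    esac
}

# Query DRBD only when drbdadm is installed on this host.
if command -v drbdadm >/dev/null 2>&1; then
    res="ha_data"
    dstate="$(drbdadm dstate "$res")"
    role="$(drbdadm role "$res")"
    cstate="$(drbdadm cstate "$res")"
    if drbd_is_healthy "$dstate" "$role" "$cstate"; then
        echo "$res: healthy"
    else
        echo "$res: needs attention (disk=$dstate role=$role conn=$cstate)"
    fi
fi
```

The check is deliberately strict: a resource that is still resynchronizing is reported as needing attention, which is usually the safer default for monitoring.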

Split-Brain Recovery

For an automated HA solution, it is recommended to configure STONITH agents to kill (power off) a peer node.

Manually choose the node whose data modifications will be discarded. This node is called the split-brain victim. Choose carefully: all modifications on the victim will be lost. When in doubt, back up the victim’s data before you continue.

When running a Pacemaker cluster, you can enable maintenance mode.

            ufm_ha_cluster enable-maintain

If the split-brain victim is in the Primary role, bring down all applications using this resource.

Now, switch the victim to the Secondary role:

            victim# ufm_ha_cluster reset standby

Resync starts automatically if the survivor is in the WFConnection connection state. If the split-brain survivor is still in the StandAlone connection state, reconnect it:

            survivor# ufm_ha_cluster reset master

The resynchronization from the survivor (SyncSource) to the victim (SyncTarget) now starts immediately. No full sync is initiated. Instead, all modifications on the victim are overwritten by the survivor’s data, and modifications made on the survivor are applied to the victim.
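
The connection states mentioned in this procedure can be mapped to the action they call for. The following is a minimal sketch under the assumption that `drbdadm cstate` prints DRBD 8.4-style state names. The `next_step_for_cstate` function and its result words are hypothetical, chosen for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggests the next step for a DRBD connection state,
# following the split-brain procedure described above.
next_step_for_cstate() {
    case "$1" in
        StandAlone)            echo "reconnect" ;;          # node dropped the connection
        WFConnection)          echo "wait-for-peer" ;;      # resync starts once the peer connects
        SyncSource|SyncTarget) echo "resync-in-progress" ;;
        Connected)             echo "none" ;;
        *)                     echo "investigate" ;;
    esac
}

# Query DRBD only when drbdadm is installed on this host.
if command -v drbdadm >/dev/null 2>&1; then
    cstate="$(drbdadm cstate ha_data)"
    echo "ha_data: $cstate -> $(next_step_for_cstate "$cstate")"
fi
```

A result of `reconnect` on the survivor corresponds to the `ufm_ha_cluster reset master` step. The sketch only reports a suggestion and never changes cluster state itself.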

Communication Timeout during HA Configuration

During the HA configuration phase, you may encounter connectivity errors, such as 'Error: Unable to communicate with <master/standby IP>', or connection timeouts, even when connectivity between the servers appears fine. In that case, check the ypbind service, as it may be interfering with communication.

Stop the ypbind service on both the master and standby nodes, and then configure HA. After the configuration succeeds, start the ypbind service again.

            systemctl stop ypbind
            # configure HA
            systemctl start ypbind
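
The stop, configure, start sequence can be wrapped so that ypbind is not left stopped by accident. This is a minimal sketch, not part of the UFM tooling: `run_with_ypbind_stopped` and the `SYSTEMCTL` override are hypothetical names, and the HA configuration command is passed in by the caller because it is not specified here. Unlike the manual procedure, the wrapper starts ypbind again even when the configuration command fails.

```shell
#!/usr/bin/env bash
# SYSTEMCTL is a hypothetical override so the wrapper can be dry-run
# (for example with SYSTEMCTL=echo) without touching real services.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Stops ypbind, runs the command given as arguments, then starts ypbind again.
# Returns the exit code of the wrapped command.
run_with_ypbind_stopped() {
    "$SYSTEMCTL" stop ypbind || return 1
    "$@"
    rc=$?
    # Start ypbind again regardless of how the wrapped command ended.
    "$SYSTEMCTL" start ypbind
    return "$rc"
}
```

Call it on each node as `run_with_ypbind_stopped <your HA configuration command>`. Setting `SYSTEMCTL=echo` first prints the systemctl actions instead of executing them, which is a convenient way to preview the sequence.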