NVIDIA UFM Enterprise Appliance Software User Manual v1.8.1

Troubleshooting

The split-brain problem is a DRBD synchronization issue (the HA status shows DUnknown in the DRBD disk state) that occurs when both HA nodes are rebooted, for example after a power outage. To recover, follow the steps below.

Split-Brain Recovery in HA Installation

Step 1: Run the following command to clear the cluster failure.
```
pcs resource cleanup
```
If the split-brain issue is not resolved, perform the steps below.
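One way to tell whether the issue persists after the cleanup is to look for DUnknown in the DRBD disk state, the symptom described at the top of this section. The following is a minimal sketch, not part of the product. The helper name `has_dunknown` is hypothetical, and the sketch assumes that `drbdadm dstate ha_data` prints the local and peer disk states on one line, as DRBD 8.x does.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a DRBD disk-state string contains DUnknown.
has_dunknown() {
    case "$1" in
        *DUnknown*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Query DRBD only when the tool is installed (this is a read-only query).
if command -v drbdadm >/dev/null 2>&1; then
    dstate=$(drbdadm dstate ha_data 2>/dev/null)
    if has_dunknown "$dstate"; then
        echo "Disk state is '$dstate': continue with the manual recovery steps."
    else
        echo "Disk state is '$dstate': DUnknown is not reported."
    fi
fi
```

The string check is kept separate from the `drbdadm` call so that it can be tried on any machine without touching the cluster.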
Step 2: Manually choose the node whose data modifications will be discarded.
This node is called the split-brain victim. Choose carefully: all modifications made on the victim will be lost. When in doubt, back up the victim's data before you continue.
When running a Pacemaker cluster, you can enable maintenance mode. If the split-brain victim is in the Primary role, bring down all applications using this resource. Then switch the victim to the Secondary role:
```
victim# drbdadm secondary ha_data
```
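Before demoting the victim, it can help to confirm its current role. The sketch below is illustrative only. `local_role` is a hypothetical helper, and it assumes that `drbdadm role ha_data` prints either `Local/Peer` roles (DRBD 8.x) or just the local role. The maintenance-mode command is shown as a comment because whether to enable it depends on your cluster setup.

```shell
#!/bin/sh
# Optional, on a Pacemaker cluster (standard pcs cluster property):
#   pcs property set maintenance-mode=true

# Hypothetical helper: extract the local role from "Local/Peer" output.
# A string without a slash is returned unchanged.
local_role() {
    printf '%s\n' "${1%%/*}"
}

# Query DRBD only when the tool is installed (this is a read-only query).
if command -v drbdadm >/dev/null 2>&1; then
    role=$(local_role "$(drbdadm role ha_data 2>/dev/null)")
    if [ "$role" = "Primary" ]; then
        echo "This node is Primary: stop applications using ha_data before demoting it."
    else
        echo "Local role is '$role'."
    fi
fi
```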

Step 3: If the resource is in the WFConnection connection state, disconnect it:
```
victim# drbdadm disconnect ha_data
```

Step 4: Force the split-brain victim to discard all of its modifications:
```
victim# drbdadm connect --discard-my-data ha_data
```

Step 5: Resync starts automatically if the survivor is in the WFConnection connection state. If the split-brain survivor is still in the StandAlone connection state, reconnect it:
```
survivor# drbdadm connect ha_data
```
Resynchronization from the survivor (SyncSource) to the victim (SyncTarget) now starts immediately. No full sync is initiated. Instead, all modifications on the victim are overwritten by the survivor's data, and modifications on the survivor are applied to the victim.
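The survivor-side decision in Step 5 can be summarized as a small lookup. This is a sketch under stated assumptions rather than a supported tool. `survivor_action` is a hypothetical helper, and it assumes the connection state is read with `drbdadm cstate ha_data`. It only prints a suggestion and never changes the cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the suggested survivor-side action
# for a given DRBD connection state.
survivor_action() {
    case "$1" in
        StandAlone)   echo "run: drbdadm connect ha_data" ;;
        WFConnection) echo "wait: resync starts automatically" ;;
        SyncSource)   echo "resync in progress" ;;
        *)            echo "no action defined in this sketch for: $1" ;;
    esac
}

# Query DRBD only when the tool is installed (this is a read-only query).
if command -v drbdadm >/dev/null 2>&1; then
    survivor_action "$(drbdadm cstate ha_data 2>/dev/null)"
fi
```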