Troubleshooting Guide#
Common Issues and Solutions#
Connection Issues#
- BMC Connection Failures
  - Symptoms:
    - An “Unable to connect to BMC” error.
    - Connection timeouts.
    - The SSH connection is refused.
  - Solutions:
    - Check the configuration file and verify that the BMC IP address is correct.
    - Complete a ping test to ensure network connectivity.
    - Check the username and password.
    - Verify the firewall rules and ensure that SSH is allowed.
    - Confirm that the BMC is operational by checking the IPMI status.
Example Verification:
# Test BMC connectivity
ping <BMC_IP>
# Test SSH port
nc -zv <BMC_IP> 22
# Test IPMI
ipmitool -I lanplus -H <BMC_IP> -U <USERNAME> -P <PASSWORD> chassis status
- Host Connection Issues
  - Symptoms:
    - Host log collection fails.
    - SSH connections to the host time out.
    - Permission denied errors.
  - Solutions:
    - Check the configuration file and verify the host IP address and credentials.
    - Verify that the SSH key file permissions are correct.
    - Complete a ping test to ensure that the host is running and accessible.
    - Check the sudoers file and verify sudo access.
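The host-side checks above can be sketched as a small helper. This is a minimal sketch, not part of the tool: `check_key_perms` is a hypothetical function name, the key path is whatever key your configuration points at, and `stat -c` assumes GNU coreutils on Linux.

```shell
# Sketch: report whether a private key file has owner-only permissions.
# Assumes GNU coreutils stat (Linux); check_key_perms is a hypothetical helper.
check_key_perms() {
    key_file="$1"
    if [ ! -f "$key_file" ]; then
        echo "missing"
        return 2
    fi
    mode=$(stat -c '%a' "$key_file")
    case "$mode" in
        600|400) echo "ok" ;;
        *)       echo "too-open ($mode)"; return 1 ;;
    esac
}

# Example usage (placeholders as in the rest of this guide):
#   check_key_perms ~/.ssh/id_rsa
#   ping -c 3 <HOST_IP>
#   ssh -o BatchMode=yes -o ConnectTimeout=10 <HOST_USER>@<HOST_IP> "sudo -n true"
```

`BatchMode=yes` makes SSH fail instead of prompting for a password, and `sudo -n true` fails instead of prompting when passwordless sudo is not configured, so both are convenient for scripted checks.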
Log Collection Issues#
- Insufficient Space
  - Symptoms:
    - A “No space left on device” error.
    - A failed .zip creation.
    - An incomplete log collection.
  - Solutions:
    - Check the disk usage and free up disk space.
    - Use the --skipzip option to skip the .zip creation.
    - Check the configuration file and set a smaller --zipsplit-threshold.
    - Check the configuration file and specify an alternate output location.
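Checking free space before a collection run might look like the following. This is a sketch under assumptions: `avail_kb` and `has_free_space` are hypothetical helper names, and the 1 GiB threshold in the usage comment is only an illustration, not a documented requirement.

```shell
# Print the available space, in KiB, on the filesystem that holds a directory.
# df -P forces the POSIX single-line format; -k reports 1024-byte blocks.
avail_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed only if at least the requested number of KiB is available.
has_free_space() {
    have_kb=$(avail_kb "$1")
    [ -n "$have_kb" ] && [ "$have_kb" -ge "$2" ]
}

# Example usage (the threshold is an arbitrary illustration, 1 GiB):
#   has_free_space <output_dir> 1048576 || echo "Free up space or change the output location"
#   du -sh <output_dir>    # size of an existing collection
```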
- Timeout Issues
  - Symptoms:
    - The collection process hangs.
    - A partial log collection.
    - Timeout errors.
  - Solutions:
    - Increase the timeout values in the configuration file for applicable collectors.
    - Collect logs in smaller batches.
    - Check the network stability.
    - Verify the system load.
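One rough way to check network stability is to look at packet loss over a longer ping run. The helper below is a sketch: `packet_loss_pct` is a hypothetical name, and it assumes the summary line format printed by the common iputils and BSD `ping` implementations (`... 0% packet loss ...`).

```shell
# Read ping output on stdin and print the packet-loss percentage.
# Assumes a summary line containing "<number>% packet loss".
packet_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example usage:
#   ping -c 20 <BMC_IP> | packet_loss_pct
#   uptime    # load averages on the machine running the collection
```

Any non-zero loss over a short run, or load averages well above the CPU count, suggests addressing the network or system load before raising timeouts.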
Platform-Specific Issues#
- HGX-HMC issues
  - Symptoms:
    - Port forwarding failures.
    - Access to the HMC is denied.
    - I2C bus errors.
  - Solutions:
    - Enable port forwarding on the BMC.
    - Verify the HMC status.
    - Check the I2C bus configuration.
- DGX system issues
  - Symptoms:
    - GPU log collection failures.
    - NVSwitch errors.
    - Fabric manager issues.
  - Solutions:
    - Verify that the NVIDIA drivers are installed and are current.
    - Check the GPU status by running nvidia-smi.
    - Validate the fabric manager service by running systemctl status nvidia-fabricmanager.
Diagnostic Commands#
System Status Checks#
# Check BMC health
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis status
# Verify Redfish service
curl -k https://<BMC_IP>/redfish/v1/Systems
# Check host system status
ssh <HOST_USER>@<HOST_IP> "systemctl status nvidia-fabricmanager"
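When the Redfish check fails, the HTTP status code usually narrows down the cause. The following is a sketch: `redfish_code` and `http_code_hint` are hypothetical helper names, and the interpretation of each code is a general HTTP reading rather than behavior documented for a specific BMC.

```shell
# Print only the HTTP status code of a Redfish request.
# curl prints 000 when it gets no HTTP response at all.
redfish_code() {
    curl -k -s -o /dev/null --max-time 10 -w '%{http_code}' \
        "https://$1/redfish/v1/Systems"
}

# Translate a status code into a short troubleshooting hint.
http_code_hint() {
    case "$1" in
        000)     echo "no HTTP response (network, TLS, or timeout problem)" ;;
        2??)     echo "Redfish service responded" ;;
        401|403) echo "service is up but credentials are missing or rejected" ;;
        *)       echo "unexpected HTTP status $1" ;;
    esac
}

# Example usage:
#   http_code_hint "$(redfish_code <BMC_IP>)"
```

If the hint reports missing credentials, repeating the request with `curl -k -u <USER>:<PASS> ...` is a reasonable next step.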
Log Analysis#
# View collection summary
cat <output_dir>/collection_summary.txt
# Check error log
cat <output_dir>/error.log
# Analyze specific collector output
cat <output_dir>/redfish/BMC_Redfish_R1_system_event_log.txt
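For larger logs, counting matching lines can be quicker than reading the whole file. This is a sketch: `count_matches` is a hypothetical helper, and the search patterns in the usage comment are guesses at useful keywords, since the exact format of error.log is not described here.

```shell
# Count lines in a file that match an extended regex, ignoring case.
# grep -c prints 0 but exits non-zero when nothing matches, so the
# exit status is neutralized to keep the helper safe under "set -e".
count_matches() {
    grep -ciE -- "$2" "$1" || true
}

# Example usage (patterns are illustrative):
#   count_matches <output_dir>/error.log 'timeout|refused|denied'
#   grep -iE 'timeout|refused|denied' <output_dir>/error.log | tail -n 20
```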
Debug Mode#
Enable verbose logging for detailed troubleshooting:
nvdebug -i <BMC_IP> -u <USER> -p <PASS> -t <PLATFORM> -v
This provides:
- Detailed error messages (if applicable) 
- API call traces (if applicable) 
- Timing information (if applicable) 
- Collection progress (if applicable) 
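Verbose output is easier to review and share when it is also saved to a file. A minimal sketch, assuming a POSIX shell with `tee` available; `run_logged` is a hypothetical wrapper and the log file name is arbitrary.

```shell
# Run a command, show its output, and save stdout and stderr to a file.
# Note: without "set -o pipefail" the exit status is that of tee,
# not of the wrapped command.
run_logged() {
    log_file="$1"
    shift
    "$@" 2>&1 | tee "$log_file"
}

# Example usage:
#   run_logged nvdebug_verbose.log nvdebug -i <BMC_IP> -u <USER> -p <PASS> -t <PLATFORM> -v
```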
Support Information#
When reporting issues, include:
- Full command output with the -v flag.
- System information:
  - Platform type.
  - OS version.
  - Network configuration.
- Error logs. 
- Collection summary. 
- Relevant BMC/Host logs.
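The system information part of a report can be gathered with standard Linux commands. This is a sketch, not a tool feature: `collect_sysinfo` is a hypothetical helper, and it assumes a Linux machine where `/etc/os-release` and the iproute2 `ip` command may or may not be present.

```shell
# Print kernel, OS release, and network interface details for a support report.
collect_sysinfo() {
    echo "== Kernel =="
    uname -a
    echo "== OS release =="
    if [ -r /etc/os-release ]; then
        cat /etc/os-release
    else
        echo "(no /etc/os-release)"
    fi
    echo "== Network interfaces =="
    if command -v ip >/dev/null 2>&1; then
        ip -brief addr || echo "(ip command failed)"
    else
        echo "(ip command not found)"
    fi
}

# Example usage:
#   collect_sysinfo > sysinfo.txt
#   tar -czf support_bundle.tar.gz sysinfo.txt <output_dir>/collection_summary.txt <output_dir>/error.log
```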