Troubleshooting Guide#
Common Issues and Solutions#
Connection Issues#
BMC Connection Failures
Symptoms:
- An “Unable to connect to BMC” error.
- Connection timeouts.
- The SSH connection is refused.
Solutions:
- Check the configuration file and verify that the BMC IP address is correct.
- Run a ping test to confirm network connectivity.
- Check the username and password.
- Verify that the firewall rules allow SSH.
- Confirm that the BMC is operational by checking the IPMI status.
Example Verification:
# Test BMC connectivity
ping <BMC_IP>
# Test SSH port
nc -zv <BMC_IP> 22
# Test IPMI
ipmitool -I lanplus -H <BMC_IP> -U <USERNAME> -P <PASSWORD> chassis status
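The verification commands above only work if `ping`, `nc`, and `ipmitool` are installed on the machine you run them from. The following is a minimal sketch, assuming a POSIX shell, of a check you can run first. The function names `require_cmd` and `check_tools` are hypothetical helpers and are not part of nvdebug.

```shell
# Hypothetical helper (not part of nvdebug): report whether a command is on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Check every listed tool; return non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        require_cmd "$tool" || missing=1
    done
    return "$missing"
}

# Example: check_tools ping nc ipmitool
```

If `check_tools` reports a missing tool, install it before you treat a failed verification command as a BMC problem.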
Host Connection Issues
Symptoms:
- Host log collection fails.
- SSH connections to the host time out.
- Permission denied errors.
Solutions:
- Check the configuration file and verify the host IP address and credentials.
- Check that the SSH key file permissions are correct.
- Run a ping test to confirm that the host is running and accessible.
- Check the sudoers file and verify sudo access.
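One common cause of permission errors is a private key file that other users can read, because OpenSSH refuses to use such a key. The sketch below, which assumes a POSIX shell and the standard `ls -l` mode string, checks that no group or other permission bits are set. The function name `key_perms_ok` and the key path in the example are hypothetical.

```shell
# Hypothetical helper: succeed only if the key file grants nothing to group or others.
key_perms_ok() {
    # Characters 5-10 of the `ls -l` mode string are the group and other bits.
    perms=$(ls -ld "$1" 2>/dev/null | cut -c5-10)
    if [ "$perms" = "------" ]; then
        return 0
    fi
    echo "$1: group/other permission bits are '$perms'; consider chmod 600" >&2
    return 1
}

# Example: key_perms_ok "$HOME/.ssh/id_ed25519" || chmod 600 "$HOME/.ssh/id_ed25519"
```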
Log Collection Issues#
Insufficient Space
Symptoms:
- A “No space left on device” error.
- A failed .zip creation.
- An incomplete log collection.
Solutions:
- Check the disk usage and free up disk space.
- Use the --skipzip option to skip the .zip creation.
- Check the configuration file and set a smaller --zipsplit-threshold.
- Check the configuration file and specify an alternate output location.
Timeout Issues
Symptoms:
- The collection process hangs.
- A partial log collection.
- Timeout errors.
Solutions:
- Increase the timeout values in the configuration file for applicable collectors.
- Collect logs in smaller batches.
- Check the network stability.
- Verify the system load.
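The system load check can be scripted. This is a rough sketch, assuming a POSIX shell with `awk`. The function name `load_below` is hypothetical, and the example reads `/proc/loadavg` and calls `nproc`, which are Linux-specific.

```shell
# Hypothetical helper: succeed when a load average ($1) is below a limit ($2).
load_below() {
    # Adding 0 forces a numeric comparison rather than a string comparison.
    awk -v cur="$1" -v max="$2" 'BEGIN { exit !(cur + 0 < max + 0) }'
}

# Example on Linux: compare the 1-minute load average with the CPU count.
# load_below "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" || echo "system is busy"
```

Using the CPU count as the limit is only a rule of thumb. Choose a threshold that suits the system you collect from.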
Platform-Specific Issues#
HGX-HMC Issues
Symptoms:
- Port forwarding failures.
- Access to the HMC is denied.
- I2C bus errors.
Solutions:
- Enable port forwarding on the BMC.
- Verify the HMC status.
- Check the I2C bus configuration.
DGX System Issues
Symptoms:
- GPU log collection failures.
- NVSwitch errors.
- Fabric manager issues.
Solutions:
- Verify that the NVIDIA drivers are installed and are current.
- Check the GPU status by running nvidia-smi.
- Validate the fabric manager service by running systemctl status nvidia-fabricmanager.
Diagnostic Commands#
System Status Checks#
# Check BMC health
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis status
# Verify Redfish service
curl -k https://<BMC_IP>/redfish/v1/Systems
# Check host system status
ssh <HOST_USER>@<HOST_IP> "systemctl status nvidia-fabricmanager"
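When the Redfish check fails, the HTTP status code usually narrows down the cause. The sketch below assumes a POSIX shell. The function name `explain_redfish_status` and its hint texts are illustrative, and the example relies on curl's `-w '%{http_code}'` option, which prints `000` when no HTTP response was received.

```shell
# Hypothetical helper: translate an HTTP status code into a troubleshooting hint.
explain_redfish_status() {
    case "$1" in
        200) echo "Redfish service is reachable" ;;
        401|403) echo "Redfish service is up, but credentials are missing or were rejected" ;;
        404) echo "Endpoint not found; check the URL path" ;;
        000) echo "No HTTP response; check the network path and the BMC IP address" ;;
        *) echo "Unexpected HTTP status: $1" ;;
    esac
}

# Example (set BMC_IP first):
# code=$(curl -k -s -o /dev/null -w '%{http_code}' "https://$BMC_IP/redfish/v1/Systems")
# explain_redfish_status "$code"
```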
Log Analysis#
# View collection summary
cat <output_dir>/collection_summary.txt
# Check error log
cat <output_dir>/error.log
# Analyze specific collector output
cat <output_dir>/redfish/BMC_Redfish_R1_system_event_log.txt
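For a long error log, a quick tally of the symptoms listed in this guide shows where to look first. This is a minimal sketch, assuming a POSIX shell and a plain-text error log. The function name `tally_symptoms` is hypothetical, and the real log wording may differ from these search strings.

```shell
# Hypothetical helper: count log lines that mention each symptom from this guide.
tally_symptoms() {
    for pattern in "timeout" "permission denied" "connection refused" "no space left on device"; do
        # grep -c prints the number of matching lines; -i ignores case.
        count=$(grep -ci -- "$pattern" "$1" || true)
        echo "$pattern: $count"
    done
}

# Example: tally_symptoms "$OUTPUT_DIR/error.log"
```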
Debug Mode#
Enable verbose logging for detailed troubleshooting:
nvdebug -i <BMC_IP> -u <USER> -p <PASS> -t <PLATFORM> -v
This provides:
- Detailed error messages (if applicable)
- API call traces (if applicable)
- Timing information (if applicable)
- Collection progress (if applicable)
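Verbose output is easier to review and share when it is saved to a file. The sketch below assumes a POSIX shell. The wrapper name `run_logged` and the log file name are hypothetical, and the nvdebug flags in the example are the ones shown in the command above.

```shell
# Hypothetical wrapper: run a command, save stdout and stderr to a file,
# print the saved output, and return the command's exit status.
run_logged() {
    log_file=$1
    shift
    status=0
    "$@" >"$log_file" 2>&1 || status=$?
    cat "$log_file"
    return "$status"
}

# Example (set the variables first):
# run_logged nvdebug_verbose.log nvdebug -i "$BMC_IP" -u "$BMC_USER" -p "$BMC_PASS" -t "$PLATFORM" -v
```

This version prints the output only after the command finishes. Piping through `tee` shows output live, but in a plain POSIX shell the pipeline then returns the exit status of `tee` rather than that of the command.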
Support Information#
When reporting issues, include:
- Full command output with the -v flag.
- System information:
  - Platform type.
  - OS version.
  - Network configuration.
- Error logs.
- Collection summary.
- Relevant BMC/Host logs.
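Some of these items can be gathered into a single archive before you file a report. This is a rough sketch, assuming a POSIX shell with `tar` and the file names `collection_summary.txt` and `error.log` used earlier in this guide. The function name `make_support_bundle` and the `system_info.txt` file it writes into the output directory are hypothetical.

```shell
# Hypothetical helper: pack basic system information and collection logs.
# $1: collection output directory, $2: absolute path of the archive to create.
make_support_bundle() {
    out_dir=$1
    bundle=$2
    # Record kernel and OS details next to the logs.
    uname -a >"$out_dir/system_info.txt"
    if [ -r /etc/os-release ]; then
        cat /etc/os-release >>"$out_dir/system_info.txt"
    fi
    # Build the member list from the files that actually exist.
    set -- system_info.txt
    for name in collection_summary.txt error.log; do
        if [ -f "$out_dir/$name" ]; then
            set -- "$@" "$name"
        fi
    done
    tar -czf "$bundle" -C "$out_dir" "$@"
}

# Example: make_support_bundle "$OUTPUT_DIR" /tmp/nvdebug_support.tar.gz
```

The verbose command output, the network configuration, and the BMC and host logs are not collected by this sketch, so attach them separately.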