Troubleshooting Guide

Common Issues and Solutions

Connection Issues

  • BMC Connection Failures

    Symptoms:
    • An “Unable to connect to BMC” error.

    • Connection timeouts.

    • The SSH connection was refused.

    Solutions:
    • Check the configuration file and verify that the BMC IP address is correct.

    • Run a ping test to confirm network connectivity.

    • Check the username and password.

    • Verify that the firewall rules allow SSH.

    • Confirm that the BMC is operational by checking the IPMI status.

    Example Verification:

    # Test BMC connectivity
    ping <BMC_IP>
    
    # Test SSH port
    nc -zv <BMC_IP> 22
    
    # Test IPMI
    ipmitool -I lanplus -H <BMC_IP> -U <USERNAME> -P <PASSWORD> chassis status
    
  • Host Connection Issues

    Symptoms:
    • Host log collection fails.

    • SSH connections to the host time out.

    • Permission denied errors.

    Solutions:
    • Verify the host IP address and credentials by checking the configuration file.

    • Check that the SSH key file has the correct permissions.

    • Run a ping test to confirm that the host is running and reachable.

    • Check the sudoers file and verify sudo access.
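
The host checks above can be run in sequence before starting a collection. The following is a minimal sketch, not part of nvdebug: the function names, the key path argument, and the use of sudo -n are illustrative assumptions, and stat -c and ping -W assume a Linux system with GNU coreutils and iputils.

```shell
# Sketch only: key_perms_ok and check_host are hypothetical helper names.

# OpenSSH refuses private keys that group or others can access.
# This sketch accepts only the two common modes, 600 and 400.
key_perms_ok() {
    mode=$(stat -c '%a' "$1" 2>/dev/null) || return 1
    [ "$mode" = "600" ] || [ "$mode" = "400" ]
}

# Runs the checks in order and stops at the first failure.
# Arguments: host user, host IP address, path to the private key.
check_host() {
    user=$1; host=$2; key=$3
    ping -c 3 -W 2 "$host" >/dev/null 2>&1 \
        || { echo "host unreachable: $host"; return 1; }
    key_perms_ok "$key" \
        || { echo "fix key permissions: chmod 600 $key"; return 1; }
    # BatchMode makes ssh fail instead of prompting for a password.
    ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=10 "$user@$host" true \
        || { echo "SSH login failed for $user@$host"; return 1; }
    # sudo -n fails instead of prompting, so it also fails when a password is required.
    ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=10 "$user@$host" 'sudo -n true' \
        || { echo "sudo -n failed: $user may lack sudo rights or need a password"; return 1; }
    echo "host checks passed"
}
```

Example call: check_host <HOST_USER> <HOST_IP> ~/.ssh/id_rsa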

Log Collection Issues

  • Insufficient Space

    Symptoms:
    • A “No space left on device” error.

    • The .zip creation fails.

    • Log collection is incomplete.

    Solutions:
    • Check the disk usage and free up disk space.

    • Use the --skipzip option to skip the .zip creation.

    • Check the configuration file and set a smaller --zipsplit-threshold value.

    • Check the configuration file and specify an alternate output location.
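
Checking free space before a run can catch this problem early. The sketch below is an illustration under assumptions: the helper names are hypothetical, and the threshold passed to has_space is whatever size you expect the collection to need, which this guide does not specify.

```shell
# Sketch only: free_kib and has_space are hypothetical helper names.

# Prints the free space, in KiB, on the filesystem that holds directory $1.
# df -P gives the POSIX layout, where column 4 is the available space.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds if directory $1 has at least $2 KiB free.
has_space() {
    avail=$(free_kib "$1") || return 1
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

Example call, using an arbitrary 10 GiB threshold: has_space <output_dir> 10485760 || echo "low disk space"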

  • Timeout Issues

    Symptoms:
    • The collection process hangs.

    • A partial log collection.

    • Timeout errors.

    Solutions:
    • Increase the timeout values in the configuration file for applicable collectors.

    • Collect logs in smaller batches.

    • Check the network stability.

    • Verify the system load.
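
Network stability and system load can be checked with standard tools. The following sketch assumes a Linux system (iputils ping, /proc/loadavg, and nproc); the function names and the default of 20 probes are illustrative choices, not nvdebug settings.

```shell
# Sketch only: loss_pct, probe_loss, and load_vs_cpus are hypothetical helper names.

# Reads ping output on stdin and prints the packet-loss percentage.
loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Sends $2 probes (default 20) to host $1 and prints the loss percentage.
probe_loss() {
    ping -c "${2:-20}" "$1" 2>/dev/null | loss_pct
}

# Prints the 1-minute load average next to the CPU count.
# A load well above the CPU count suggests that the system is overloaded.
load_vs_cpus() {
    echo "load1=$(cut -d' ' -f1 /proc/loadavg) cpus=$(nproc)"
}
```

Example calls: probe_loss <BMC_IP> and load_vs_cpus. Any nonzero loss on a local management network is worth investigating before you increase the timeouts.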

Platform-Specific Issues

  • HGX-HMC Issues

    Symptoms:
    • Port forwarding failures.

    • Access to the HMC is denied.

    • I2C bus errors.

    Solutions:
    • Enable port forwarding on the BMC.

    • Verify the HMC status.

    • Check the I2C bus configuration.

  • DGX System Issues

    Symptoms:
    • GPU log collection failures.

    • NVSwitch errors.

    • Fabric manager issues.

    Solutions:
    • Verify that the NVIDIA drivers are installed and are current.

    • Check the GPU status by running nvidia-smi.

    • Check the fabric manager service by running systemctl status nvidia-fabricmanager.
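
The DGX checks above can be combined into one function that is run on the host. This is a sketch under assumptions: the helper names are hypothetical, and the query fields were chosen for illustration from the ones that nvidia-smi supports.

```shell
# Sketch only: have_cmd and check_dgx are hypothetical helper names.

# Prints "ok <name>" if the command is on PATH, otherwise "missing <name>" and fails.
have_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok $1"
    else
        echo "missing $1"
        return 1
    fi
}

check_dgx() {
    have_cmd nvidia-smi || return 1
    # One line per GPU: index, name, and driver version.
    nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader || return 1
    # Prints the service state and succeeds only when the service is active.
    systemctl is-active nvidia-fabricmanager
}
```

A "missing nvidia-smi" result usually means that the driver is not installed or is not on the PATH of the current user.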

Diagnostic Commands

System Status Checks

# Check BMC health
ipmitool -I lanplus -H <BMC_IP> -U <USER> -P <PASS> chassis status

# Verify Redfish service (resources below the service root normally require credentials)
curl -k -u <USER>:<PASS> https://<BMC_IP>/redfish/v1/Systems

# Check host system status
ssh <HOST_USER>@<HOST_IP> "systemctl status nvidia-fabricmanager"

Log Analysis

# View collection summary
cat <output_dir>/collection_summary.txt

# Check error log
cat <output_dir>/error.log

# Analyze specific collector output
cat <output_dir>/redfish/BMC_Redfish_R1_system_event_log.txt
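
When the output directory holds many files, a keyword search can show where to look first. The sketch below is an illustration: the function name is hypothetical, and the keyword pattern is a guess at useful terms rather than a list of strings that nvdebug is known to emit.

```shell
# Sketch only: count_errors is a hypothetical helper name.

# Lists the files under directory $1 that contain error-like lines,
# as "path:count", with the busiest file first.
count_errors() {
    grep -rciE 'error|fail|timed out|timeout' "$1" \
        | grep -v ':0$' \
        | sort -t: -k2,2nr
}
```

Example call: count_errors <output_dir>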

Debug Mode

Enable verbose logging for detailed troubleshooting:

nvdebug -i <BMC_IP> -u <USER> -p <PASS> -t <PLATFORM> -v

Verbose output includes the following, where applicable:

  • Detailed error messages

  • API call traces

  • Timing information

  • Collection progress
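
Verbose output is easier to review and share when it is saved to a file. The following is a minimal sketch: run_logged is a hypothetical helper and the log file name is an arbitrary choice.

```shell
# Sketch only: run_logged is a hypothetical helper name.

# Runs the command given after the log path, shows its output on the terminal,
# and saves both stdout and stderr to the log file.
# Note: the return status is that of tee, not of the command itself.
run_logged() {
    log=$1
    shift
    "$@" 2>&1 | tee "$log"
}
```

Example call: run_logged nvdebug_verbose.log nvdebug -i <BMC_IP> -u <USER> -p <PASS> -t <PLATFORM> -v

Review the saved file for credentials or other sensitive data before you share it.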

Support Information

When reporting issues, include:

  • Full command output with the -v flag.

  • System information:

    • Platform type.

    • OS version.

    • Network configuration.

  • Error logs.

  • Collection summary.

  • Relevant BMC/Host logs.
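
The items above can be gathered into one archive. This sketch is an illustration only: the function names and the system_info.txt file name are hypothetical, it assumes a Linux system, and the bundle path must be outside the output directory so that the archive does not include itself.

```shell
# Sketch only: write_system_info and make_bundle are hypothetical helper names.

# Writes kernel, OS version, and network configuration to $1/system_info.txt.
write_system_info() {
    out="$1/system_info.txt"
    {
        echo "Kernel: $(uname -srm)"
        # /etc/os-release is present on most modern Linux distributions.
        [ -r /etc/os-release ] && grep -E '^(NAME|VERSION)=' /etc/os-release
        # "ip" comes from iproute2; it is skipped if it is not installed.
        command -v ip >/dev/null 2>&1 && ip -brief address
        true
    } > "$out"
}

# Adds system_info.txt to output directory $2, then archives that directory as $1.
make_bundle() {
    bundle=$1
    outdir=$2
    write_system_info "$outdir" || return 1
    tar -czf "$bundle" -C "$outdir" .
}
```

Example call: make_bundle /tmp/nvdebug_support.tar.gz <output_dir>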