Configuration Guide#

NVDebug Runtime Configuration File (config.yaml)#

The NVDebug runtime configuration file is a YAML file that specifies the tool's runtime settings.

PLATFORM: "arm64"
TargetBaseboard: "GB200 NVL"
LogSanitization: true
SKIP_BMC_SSH_LOGS: true

DUT Configuration File (dut_config.yaml)#

The DUT configuration file uses YAML format to specify target systems and their properties.

Basic Configuration for a single DUT#

DUT_Defaults: &dut_defaults
  NodeType: "Compute"
  BMC_IP: "192.168.1.100"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_password"
  ipmi_cipher: "-C17"
  HOST_IP: "192.168.2.100"
  HOST_USERNAME: "host_user"
  HOST_PASSWORD: "host_password"
  RF_DEFAULT_PREFIX: "/redfish/v1"
  RF_AUTH: true
  SETUP_PORT_FORWARDING: True
  FORCE_PORT_FW: False
  IP_NETWORK: 'ipv4'

dut-1:
  <<: *dut_defaults
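
The "<<: *dut_defaults" line is a YAML merge key: it copies every key under DUT_Defaults into dut-1, and any key written explicitly in the entry takes precedence over the merged value. As a minimal sketch of that override behavior (the address is a placeholder, and the need for an override in a single-DUT setup is assumed purely for illustration), the dut-1 entry could be written as:

```yaml
# Sketch only: replaces the dut-1 entry above; DUT_Defaults stays unchanged.
dut-1:
  <<: *dut_defaults          # inherit every key from DUT_Defaults
  BMC_IP: "192.168.1.101"    # placeholder; an explicit key overrides the merged default
```

The rack configuration relies on the same mechanism to share credentials across nodes while giving each node its own addresses.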

Rack Configuration for multiple DUTs#

DUT_Defaults: &dut_defaults ## User should not modify this line or add anything before this section
  NodeType: "Compute"
  BMC_IP: ""
  BMC_USERNAME: ""
  BMC_PASSWORD: ""
  BMC_SSH_USERNAME: ""
  BMC_SSH_PASSWORD: ""
  RF_User: ""
  RF_Pass: ""
  TUNNEL_TCP_PORT: ""
  ipmi_cipher: "-C17"
  HOST_IP: ""
  HOST_USERNAME: ""
  HOST_PASSWORD: ""
  RF_DEFAULT_PREFIX: "/redfish/v1"
  RF_AUTH: true
  SETUP_PORT_FORWARDING: True
  FORCE_PORT_FW: False
  IP_NETWORK: 'ipv4'

nvl-compute-1: &compute_defaults
  <<: *dut_defaults
  NodeType: "Compute"
  ConfigFileToUse: "config_compute.yaml"
  HOST_USERNAME: "host_user"
  HOST_PASSWORD: "host_password"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_password"
  BMC_SSH_USERNAME: "bmc_ssh_user"
  BMC_SSH_PASSWORD: "bmc_ssh_password"
  HOST_IP: 192.168.1.6
  BMC_IP: 192.168.1.134
nvl-compute-2:
  <<: *compute_defaults
  HOST_IP: 192.168.1.7
  BMC_IP: 192.168.1.135

...

nvl-compute-9:
  <<: *compute_defaults
  HOST_IP: 192.168.1.14
  BMC_IP: 192.168.1.142

nvl-switch-1: &switch_defaults
  <<: *dut_defaults
  NodeType: "SwitchTray"
  ConfigFileToUse: "config_switch.yaml"
  HOST_USERNAME: "host_user"
  HOST_PASSWORD: "host_password"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_password"
  BMC_SSH_USERNAME: "bmc_ssh_user"
  BMC_SSH_PASSWORD: "bmc_ssh_password"
  HOST_IP: 192.168.1.101
  BMC_IP: 192.168.1.229
nvl-switch-2:
  <<: *switch_defaults
  HOST_IP: 192.168.1.102
  BMC_IP: 192.168.1.230

...

nvl-switch-9:
  <<: *switch_defaults
  HOST_IP: 192.168.1.109
  BMC_IP: 192.168.1.237

Support Log Collectors#

List Available Collectors for a Platform

arm64#

$ ./nvdebug -l -t arm64
Redfish
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  R1   system_event_log                      Redfish_R1_system_event_log_{system_id}.json
  R2   manager_existing_log_dump             Redfish_R2_existing_dump_{id}.tar.xz
  R3   hgx_manager_on_demand_log_dump        Redfish_R3_hgx_manager_dump_{manager_id}_{task_id}.tar.xz
        -> R3 Supported Boards: Blackwell-HGX-8-GPU, HGX B300, Hopper-HGX-8-GPU, GB200 NVL, GB300 NVL, GH200 NVL

  R4   manager_journal_log                   Redfish_R4_journal_log_entries_{manager_id}.json
  R5   manager_fpga_register_dump            Redfish_R5_fpga_dump_{system_id}_{task_id}.tar.xz
  R6   manager_erot_dump                     Redfish_R6_erot_dump_{system_id}_{task_id}.tar.xz
  R8   firmware_inventory                    Redfish_R8_firmware_inventory.json
  R9   firmware_inventory_expand_query       Redfish_R9_firmware_inventory_expand_query.json
  R10   chassis_info                          Redfish_R10_chassis_info.json
  R11   chassis_expand_query                  Redfish_R11_chassis_expand_query.json
  R12   system_info                           Redfish_R12_system_info.json
  R13   system_expand_query                   Redfish_R13_system_expand_query.json
  R14   manager_info                          Redfish_R14_manager_info.json
  R15   manager_expand_query                  Redfish_R15_manager_expand_query.json
  R16   hgx_manager_retimer_dump              Redfish_R16_hgx_retimer_dump_{system_id}_{task_id}.tar.xz
        -> R16 Supported Boards: Blackwell-HGX-8-GPU, HGX B300, Hopper-HGX-8-GPU

  R18   telemetry_metric_reports              Redfish_R18_report_{metric_report}.json
  R19   chassis_thermal_metrics               Redfish_R19_chassis_{chassis}_thermal_metrics.json
  R20   firmware_inventory_table              Redfish_R20_firmware_inventory_table.txt
  R21   system_cper_logs                      Redfish_R21_cper_logs_{system}_{cper_id}.tar.xz
  R22   task_details                          Redfish_R22_task_{task_id}.json
  R23   nvlink_oob_logs                       Redfish_R23_NVLINK_OOB_Log_{id}.json
  R24   hgx_system_fw_attributes_dump         Redfish_R24_hgx_system_{system}_fw_attributes_{task_id}.tar.xz
  R25   additional_oob_logs                   Redfish_R25_OOB_Log_{id}.json
  R26   chassis_certificates                  Redfish_R26_chassis_{chassis_id}_certificate.json
  R27   spdm_erot_measurements                Redfish_R27_spdm_{erot_id}_index_{index}.json
        -> R27 Supported Boards: Blackwell-HGX-8-GPU, HGX B300, Hopper-HGX-8-GPU, GB200 NVL, GB300 NVL, GH200 NVL

  R28   hgx_system_hardware_checkout_dump     Redfish_R28_hgx_system_{system}_hardware_checkout_{task_id}.tar.xz
  R29   background_copy_status                Redfish_R29_{chassis_id}_copy_status.json
  R30   software_inventory                    Redfish_R30_software_inventory.json
  R31   hmc_fdr_log_dump                      Redfish_R31_{system_id}_hmc_fdr_log_dump
        -> R31 Supported Boards: Hopper-HGX-8-GPU

  R32   system_post_codes                     Redfish_R32_system_post_codes
  R35   network_device_debug_dump             Redfish_R35_network_device_debug_dump
        -> R35 Supported Boards: Blackwell-HGX-8-GPU, HGX B300

  R36   network_switch_debug_dump             Redfish_R36_network_switch_debug_dump
        -> R36 Supported Boards: Blackwell-HGX-8-GPU, HGX B300

  R37   gpu_debug_dump                        Redfish_R37_gpu_debug_dump
        -> R37 Supported Boards: Blackwell-HGX-8-GPU, HGX B300, GB200 NVL, GB300 NVL

  R38   gpu_diagnostic_dump                   Redfish_R38_gpu_diagnostic_dump
        -> R38 Supported Boards: Blackwell-HGX-8-GPU, HGX B300, GB200 NVL, GB300 NVL

  R39   sma_debug_dump                        Redfish_R39_sma_debug_dump
        -> R39 Supported Boards: HGX B300, GB300 NVL

  R40   custom_dump_service                   Redfish_R40_custom_dump_service

  R41   chassis_thermal_subsystem_leak_detection Redfish_R41_chassis_thermal_subsystem_leak_detection


IPMI
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  I1   mc_info                               IPMI_I1_mc_info.txt
  I2   lan_info                              IPMI_I2_lan_info.txt
  I3   session_info                          IPMI_I3_session_info.txt
  I4   fru_info                              IPMI_I4_fru_info.txt
  I5   sdr_info                              IPMI_I5_sdr_info.txt
  I6   sel_info                              IPMI_I6_sel_info.txt
  I7   sensor_list                           IPMI_I7_sensor_list.txt
  I8   sel_list                              IPMI_I8_sel_list.txt
  I9   sel_raw_dump                          IPMI_I9_sel_raw_dump.txt
  I10   chassis_status                        IPMI_I10_chassis_status.txt
  I11   chassis_restart_cause                 IPMI_I11_chassis_restart_cause.txt
  I12   user_list                             IPMI_I12_user_list.txt
  I13   channel_info                          IPMI_I13_channel_info.txt
  I14   sdr_elist                             IPMI_I14_sdr_elist.txt

SSH
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  S1   bmc_status                            BMC_SSH_S1_bmc_status.txt
  S2   bmc_dmesg                             BMC_SSH_S2_bmc_dmesg.txt
  S3   network_info                          BMC_SSH_S3_network_info/...
  S4   openbmc_stack_info                    BMC_SSH_S4_openbmc_stack_info.txt
  S5   bmc_list_kernel_modules               BMC_SSH_S5_bmc_list_kernel_modules.txt
  S6   openbmc_pldm_journal_log              BMC_SSH_S6_openbmc_pldm_journal_log.txt
  S7   i2c_device_list                       BMC_SSH_S7_i2c_device_bus_scan{number}.txt
  S8   bmc_mem_cpu_utilization               BMC_SSH_S8_bmc_mem_cpu_utilization/...
  S9   openbmc_boot_status                   BMC_SSH_S9_openbmc_boot_status.txt
  S10   var_log_dir_zip                       BMC_SSH_S10_var_log_dir_zip{timestamp}.tar.gz
  S11   uptime                                BMC_SSH_S11_uptime.txt
  S14   hmc_boot_progress                     BMC_SSH_S14_hmc_boot_progress.txt
        -> S14 Supported Boards: Blackwell-HGX-8-GPU, HGX B300, Hopper-HGX-8-GPU, GB200 NVL, GB300 NVL, GH200 NVL

  S15   bmc_power_status                      BMC_SSH_S15_bmc_power_status/...
  S16   virtual_eeprom_data                   BMC_SSH_S16_virtual_eeprom_data/...
  S17   smbus_power_temperature_telemetry     BMC_SSH_S17_smbus_power_temperature_telemetry/...
  S18   smbpbi_system_status                  BMC_SSH_S18_smbpbi_system_status/...

Host
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  H1   node_dmesg                            Host_H1_node_dmesg.tar.gz
  H2   node_lspci                            Host_H2_node_lspci*.txt
  H3   node_smbios                           Host_H3_dmidecode*.txt
  H4   node_lshw                             Host_H4_lshw*.txt
  H5   node_nvidia_smi                       Host_H5_nvidia-smi*.txt
  H6   node_kern_log                         Host_H6_node_kern_log.tar.gz
  H7   node_crash_dump                       Host_H7_node_crash_dump.tar.gz
  H8   node_nvme_list                        Host_H8_nvme_list_-v.txt
  H9   node_fabric_manager_log               Host_H9_fabricmanager.log
  H10   node_nvflash_log                      Host_H10_nvflash_--check_-i_{num}.txt
  H11   nvidia_bug_report                     Host_H11_nvidia_bug_report_op.log.gz
  H15   node_subnet_manager                   Host_H15_node_subnet_manager/
  H16   one_diag_dump                         Host_H16_one_diag_dump/
  H17   node_nvme_log_dump                    Host_H17_node_nvme_log_dump/
  H18   node_os_info                          Host_H18_cat_etc_os-release.txt
  H19   node_memory_info                      Host_H19_memory_info/
  H20   sos_report                            Host_H20_sos_report/
  H21   nvos_cli_dumps                        Host_H21_nvos_cli_dumps/

HealthCheck
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  C1   out_of_band_health_check              HealthCheck_C1_out_of_band_health_check.json

x86_64#

$ ./nvdebug -l -t x86_64
Redfish
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  R1   system_event_log                      Redfish_R1_system_event_log_{system_id}.json
  R2   manager_existing_log_dump             Redfish_R2_existing_dump_{id}.tar.xz
  R4   manager_journal_log                   Redfish_R4_journal_log_entries_{manager_id}.json
  R5   manager_fpga_register_dump            Redfish_R5_fpga_dump_{system_id}_{task_id}.tar.xz
  R6   manager_erot_dump                     Redfish_R6_erot_dump_{system_id}_{task_id}.tar.xz
  R8   firmware_inventory                    Redfish_R8_firmware_inventory.json
  R9   firmware_inventory_expand_query       Redfish_R9_firmware_inventory_expand_query.json
  R10   chassis_info                          Redfish_R10_chassis_info.json
  R11   chassis_expand_query                  Redfish_R11_chassis_expand_query.json
  R12   system_info                           Redfish_R12_system_info.json
  R13   system_expand_query                   Redfish_R13_system_expand_query.json
  R14   manager_info                          Redfish_R14_manager_info.json
  R15   manager_expand_query                  Redfish_R15_manager_expand_query.json
  R18   telemetry_metric_reports              Redfish_R18_report_{metric_report}.json
  R19   chassis_thermal_metrics               Redfish_R19_chassis_{chassis}_thermal_metrics.json
  R20   firmware_inventory_table              Redfish_R20_firmware_inventory_table.txt
  R22   task_details                          Redfish_R22_task_{task_id}.json
  R23   nvlink_oob_logs                       Redfish_R23_NVLINK_OOB_Log_{id}.json
  R24   hgx_system_fw_attributes_dump         Redfish_R24_hgx_system_{system}_fw_attributes_{task_id}.tar.xz
  R25   additional_oob_logs                   Redfish_R25_OOB_Log_{id}.json
  R26   chassis_certificates                  Redfish_R26_chassis_{chassis_id}_certificate.json
  R28   hgx_system_hardware_checkout_dump     Redfish_R28_hgx_system_{system}_hardware_checkout_{task_id}.tar.xz
  R29   background_copy_status                Redfish_R29_{chassis_id}_copy_status.json
  R30   software_inventory                    Redfish_R30_software_inventory.json
  R32   system_post_codes                     Redfish_R32_system_post_codes
  R35   network_device_debug_dump             Redfish_R35_network_device_debug_dump
        -> R35 Supported Boards: Blackwell-HGX-8-GPU, HGX B300

  R36   network_switch_debug_dump             Redfish_R36_network_switch_debug_dump
        -> R36 Supported Boards: Blackwell-HGX-8-GPU, HGX B300

  R37   gpu_debug_dump                        Redfish_R37_gpu_debug_dump
        -> R37 Supported Boards: Blackwell-HGX-8-GPU, HGX B300

  R38   gpu_diagnostic_dump                   Redfish_R38_gpu_diagnostic_dump
        -> R38 Supported Boards: Blackwell-HGX-8-GPU, HGX B300


IPMI
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  I1   mc_info                               IPMI_I1_mc_info.txt
  I2   lan_info                              IPMI_I2_lan_info.txt
  I3   session_info                          IPMI_I3_session_info.txt
  I4   fru_info                              IPMI_I4_fru_info.txt
  I5   sdr_info                              IPMI_I5_sdr_info.txt
  I6   sel_info                              IPMI_I6_sel_info.txt
  I7   sensor_list                           IPMI_I7_sensor_list.txt
  I8   sel_list                              IPMI_I8_sel_list.txt
  I9   sel_raw_dump                          IPMI_I9_sel_raw_dump.txt
  I10   chassis_status                        IPMI_I10_chassis_status.txt
  I11   chassis_restart_cause                 IPMI_I11_chassis_restart_cause.txt
  I12   user_list                             IPMI_I12_user_list.txt
  I13   channel_info                          IPMI_I13_channel_info.txt
  I14   sdr_elist                             IPMI_I14_sdr_elist.txt

SSH
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  S2   bmc_dmesg                             BMC_SSH_S2_bmc_dmesg.txt
  S3   network_info                          BMC_SSH_S3_network_info/...
  S5   bmc_list_kernel_modules               BMC_SSH_S5_bmc_list_kernel_modules.txt
  S6   openbmc_pldm_journal_log              BMC_SSH_S6_openbmc_pldm_journal_log.txt
  S7   i2c_device_list                       BMC_SSH_S7_i2c_device_bus_scan{number}.txt
  S8   bmc_mem_cpu_utilization               BMC_SSH_S8_bmc_mem_cpu_utilization/...
  S10   var_log_dir_zip                       BMC_SSH_S10_var_log_dir_zip{timestamp}.tar.gz
  S11   uptime                                BMC_SSH_S11_uptime.txt
  S15   bmc_power_status                      BMC_SSH_S15_bmc_power_status/...

Host
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  H1   node_dmesg                            Host_H1_node_dmesg.tar.gz
  H2   node_lspci                            Host_H2_node_lspci*.txt
  H3   node_smbios                           Host_H3_dmidecode*.txt
  H4   node_lshw                             Host_H4_lshw*.txt
  H5   node_nvidia_smi                       Host_H5_nvidia-smi*.txt
  H6   node_kern_log                         Host_H6_node_kern_log.tar.gz
  H7   node_crash_dump                       Host_H7_node_crash_dump.tar.gz
  H8   node_nvme_list                        Host_H8_nvme_list_-v.txt
  H9   node_fabric_manager_log               Host_H9_fabricmanager.log
  H10   node_nvflash_log                      Host_H10_nvflash_--check_-i_{num}.txt
  H11   nvidia_bug_report                     Host_H11_nvidia_bug_report_op.log.gz
  H15   node_subnet_manager                   Host_H15_node_subnet_manager/
  H16   one_diag_dump                         Host_H16_one_diag_dump/
  H17   node_nvme_log_dump                    Host_H17_node_nvme_log_dump/
  H18   node_os_info                          Host_H18_cat_etc_os-release.txt
  H19   node_memory_info                      Host_H19_memory_info/

HealthCheck
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  C1   out_of_band_health_check              HealthCheck_C1_out_of_band_health_check.json

DGX#

$ ./nvdebug -l -t DGX
Redfish
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  R8   firmware_inventory                    Redfish_R8_firmware_inventory.json
  R9   firmware_inventory_expand_query       Redfish_R9_firmware_inventory_expand_query.json
  R10   chassis_info                          Redfish_R10_chassis_info.json
  R11   chassis_expand_query                  Redfish_R11_chassis_expand_query.json
  R12   system_info                           Redfish_R12_system_info.json
  R13   system_expand_query                   Redfish_R13_system_expand_query.json
  R14   manager_info                          Redfish_R14_manager_info.json
  R15   manager_expand_query                  Redfish_R15_manager_expand_query.json
  R17   dgx_manager_oem_log_dump              Redfish_R17_dgx_oem_dump_{manager_id}_{task_id}.tar.xz
  R18   telemetry_metric_reports              Redfish_R18_report_{metric_report}.json
  R19   chassis_thermal_metrics               Redfish_R19_chassis_{chassis}_thermal_metrics.json
  R20   firmware_inventory_table              Redfish_R20_firmware_inventory_table.txt
  R22   task_details                          Redfish_R22_task_{task_id}.json
  R23   nvlink_oob_logs                       Redfish_R23_NVLINK_OOB_Log_{id}.json
  R25   additional_oob_logs                   Redfish_R25_OOB_Log_{id}.json
  R26   chassis_certificates                  Redfish_R26_chassis_{chassis_id}_certificate.json
  R30   software_inventory                    Redfish_R30_software_inventory.json
  R32   system_post_codes                     Redfish_R32_system_post_codes

IPMI
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  I1   mc_info                               IPMI_I1_mc_info.txt
  I2   lan_info                              IPMI_I2_lan_info.txt
  I3   session_info                          IPMI_I3_session_info.txt
  I4   fru_info                              IPMI_I4_fru_info.txt
  I5   sdr_info                              IPMI_I5_sdr_info.txt
  I6   sel_info                              IPMI_I6_sel_info.txt
  I7   sensor_list                           IPMI_I7_sensor_list.txt
  I8   sel_list                              IPMI_I8_sel_list.txt
  I9   sel_raw_dump                          IPMI_I9_sel_raw_dump.txt
  I10   chassis_status                        IPMI_I10_chassis_status.txt
  I11   chassis_restart_cause                 IPMI_I11_chassis_restart_cause.txt
  I12   user_list                             IPMI_I12_user_list.txt
  I13   channel_info                          IPMI_I13_channel_info.txt
  I14   sdr_elist                             IPMI_I14_sdr_elist.txt

SSH
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  S2   bmc_dmesg                             BMC_SSH_S2_bmc_dmesg.txt
  S3   network_info                          BMC_SSH_S3_network_info/...
  S5   bmc_list_kernel_modules               BMC_SSH_S5_bmc_list_kernel_modules.txt
  S8   bmc_mem_cpu_utilization               BMC_SSH_S8_bmc_mem_cpu_utilization/...
  S11   uptime                                BMC_SSH_S11_uptime.txt
  S12   fpga_register_table                   BMC_SSH_S12_fpga_register_table.txt
  S13   hmc_boot_status                       BMC_SSH_S13_hmc_boot_status.txt
  S15   bmc_power_status                      BMC_SSH_S15_bmc_power_status/...

Host
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  H1   node_dmesg                            Host_H1_node_dmesg.tar.gz
  H2   node_lspci                            Host_H2_node_lspci*.txt
  H3   node_smbios                           Host_H3_dmidecode*.txt
  H4   node_lshw                             Host_H4_lshw*.txt
  H5   node_nvidia_smi                       Host_H5_nvidia-smi*.txt
  H6   node_kern_log                         Host_H6_node_kern_log.tar.gz
  H7   node_crash_dump                       Host_H7_node_crash_dump.tar.gz
  H8   node_nvme_list                        Host_H8_nvme_list_-v.txt
  H9   node_fabric_manager_log               Host_H9_fabricmanager.log
  H10   node_nvflash_log                      Host_H10_nvflash_--check_-i_{num}.txt
  H11   nvidia_bug_report                     Host_H11_nvidia_bug_report_op.log.gz
  H15   node_subnet_manager                   Host_H15_node_subnet_manager/
  H16   one_diag_dump                         Host_H16_one_diag_dump/
  H17   node_nvme_log_dump                    Host_H17_node_nvme_log_dump/
  H18   node_os_info                          Host_H18_cat_etc_os-release.txt
  H19   node_memory_info                      Host_H19_memory_info/

HealthCheck
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  C1   out_of_band_health_check              HealthCheck_C1_out_of_band_health_check.json

HGX-HMC#

$ ./nvdebug -l -t HGX-HMC
Redfish
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  R1   system_event_log                      Redfish_R1_system_event_log_{system_id}.json
  R2   manager_existing_log_dump             Redfish_R2_existing_dump_{id}.tar.xz
  R3   hgx_manager_on_demand_log_dump        Redfish_R3_hgx_manager_dump_{manager_id}_{task_id}.tar.xz
  R4   manager_journal_log                   Redfish_R4_journal_log_entries_{manager_id}.json
  R5   manager_fpga_register_dump            Redfish_R5_fpga_dump_{system_id}_{task_id}.tar.xz
  R6   manager_erot_dump                     Redfish_R6_erot_dump_{system_id}_{task_id}.tar.xz
  R7   hgx_manager_self_test_report          Redfish_R7_hgx_manager_{system_id}_selftest_{task_id}.tar.xz
        -> R7 Supported Boards: Hopper-HGX-8-GPU, GH200 NVL, MGX-GH200, MGX C2, MGX-GH200-NVL2, MGX-PCIE-NVL16, DC-Hopper-PCIe

  R8   firmware_inventory                    Redfish_R8_firmware_inventory.json
  R9   firmware_inventory_expand_query       Redfish_R9_firmware_inventory_expand_query.json
  R10   chassis_info                          Redfish_R10_chassis_info.json
  R11   chassis_expand_query                  Redfish_R11_chassis_expand_query.json
  R12   system_info                           Redfish_R12_system_info.json
  R13   system_expand_query                   Redfish_R13_system_expand_query.json
  R14   manager_info                          Redfish_R14_manager_info.json
  R15   manager_expand_query                  Redfish_R15_manager_expand_query.json
  R16   hgx_manager_retimer_dump              Redfish_R16_hgx_retimer_dump_{system_id}_{task_id}.tar.xz
  R18   telemetry_metric_reports              Redfish_R18_report_{metric_report}.json
  R19   chassis_thermal_metrics               Redfish_R19_chassis_{chassis}_thermal_metrics.json
  R20   firmware_inventory_table              Redfish_R20_firmware_inventory_table.txt
  R22   task_details                          Redfish_R22_task_{task_id}.json
  R23   nvlink_oob_logs                       Redfish_R23_NVLINK_OOB_Log_{id}.json
  R24   hgx_system_fw_attributes_dump         Redfish_R24_hgx_system_{system}_fw_attributes_{task_id}.tar.xz
  R25   additional_oob_logs                   Redfish_R25_OOB_Log_{id}.json
  R26   chassis_certificates                  Redfish_R26_chassis_{chassis_id}_certificate.json
  R27   spdm_erot_measurements                Redfish_R27_spdm_{erot_id}_index_{index}.json
  R28   hgx_system_hardware_checkout_dump     Redfish_R28_hgx_system_{system}_hardware_checkout_{task_id}.tar.xz
  R29   background_copy_status                Redfish_R29_{chassis_id}_copy_status.json
  R30   software_inventory                    Redfish_R30_software_inventory.json
  R31   hmc_fdr_log_dump                      Redfish_R31_{system_id}_hmc_fdr_log_dump
  R35   network_device_debug_dump             Redfish_R35_network_device_debug_dump
        -> R35 Supported Boards: Blackwell-HGX-8-GPU, HGX B300

  R36   network_switch_debug_dump             Redfish_R36_network_switch_debug_dump
        -> R36 Supported Boards: Blackwell-HGX-8-GPU, HGX B300

  R37   gpu_debug_dump                        Redfish_R37_gpu_debug_dump
  R38   gpu_diagnostic_dump                   Redfish_R38_gpu_diagnostic_dump

Note: Collecting HGX board logs via Redfish requires TCP port forwarding support on the host BMC.


IPMI
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  I1   mc_info                               IPMI_I1_mc_info.txt
  I2   lan_info                              IPMI_I2_lan_info.txt
  I3   session_info                          IPMI_I3_session_info.txt
  I4   fru_info                              IPMI_I4_fru_info.txt
  I5   sdr_info                              IPMI_I5_sdr_info.txt
  I6   sel_info                              IPMI_I6_sel_info.txt
  I7   sensor_list                           IPMI_I7_sensor_list.txt
  I8   sel_list                              IPMI_I8_sel_list.txt
  I9   sel_raw_dump                          IPMI_I9_sel_raw_dump.txt
  I10   chassis_status                        IPMI_I10_chassis_status.txt
  I11   chassis_restart_cause                 IPMI_I11_chassis_restart_cause.txt
  I12   user_list                             IPMI_I12_user_list.txt
  I13   channel_info                          IPMI_I13_channel_info.txt
  I14   sdr_elist                             IPMI_I14_sdr_elist.txt

SSH
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  S14   hmc_boot_progress                     BMC_SSH_S14_hmc_boot_progress.txt
        -> S14 Supported Boards: Unknown

  S15   bmc_power_status                      BMC_SSH_S15_bmc_power_status/...

Host
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  H1   node_dmesg                            Host_H1_node_dmesg.tar.gz
  H2   node_lspci                            Host_H2_node_lspci*.txt
  H3   node_smbios                           Host_H3_dmidecode*.txt
  H4   node_lshw                             Host_H4_lshw*.txt
  H5   node_nvidia_smi                       Host_H5_nvidia-smi*.txt
  H6   node_kern_log                         Host_H6_node_kern_log.tar.gz
  H7   node_crash_dump                       Host_H7_node_crash_dump.tar.gz
  H8   node_nvme_list                        Host_H8_nvme_list_-v.txt
  H9   node_fabric_manager_log               Host_H9_fabricmanager.log
  H10   node_nvflash_log                      Host_H10_nvflash_--check_-i_{num}.txt
  H11   nvidia_bug_report                     Host_H11_nvidia_bug_report_op.log.gz
  H15   node_subnet_manager                   Host_H15_node_subnet_manager/
  H16   one_diag_dump                         Host_H16_one_diag_dump/
  H17   node_nvme_log_dump                    Host_H17_node_nvme_log_dump/
  H18   node_os_info                          Host_H18_cat_etc_os-release.txt
  H19   node_memory_info                      Host_H19_memory_info/

HealthCheck
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  C1   out_of_band_health_check              HealthCheck_C1_out_of_band_health_check.json

NVSwitch#

$ ./nvdebug -l -t NVSwitch
Redfish
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  R1   system_event_log                      Redfish_R1_system_event_log_{system_id}.json
  R2   manager_existing_log_dump             Redfish_R2_existing_dump_{id}.tar.xz
  R8   firmware_inventory                    Redfish_R8_firmware_inventory.json
  R9   firmware_inventory_expand_query       Redfish_R9_firmware_inventory_expand_query.json
  R10   chassis_info                          Redfish_R10_chassis_info.json
  R11   chassis_expand_query                  Redfish_R11_chassis_expand_query.json
  R12   system_info                           Redfish_R12_system_info.json
  R13   system_expand_query                   Redfish_R13_system_expand_query.json
  R14   manager_info                          Redfish_R14_manager_info.json
  R15   manager_expand_query                  Redfish_R15_manager_expand_query.json
  R19   chassis_thermal_metrics               Redfish_R19_chassis_{chassis}_thermal_metrics.json
  R20   firmware_inventory_table              Redfish_R20_firmware_inventory_table.txt
  R22   task_details                          Redfish_R22_task_{task_id}.json
  R25   additional_oob_logs                   Redfish_R25_OOB_Log_{id}.json
  R26   chassis_certificates                  Redfish_R26_chassis_{chassis_id}_certificate.json
  R29   background_copy_status                Redfish_R29_{chassis_id}_copy_status.json
  R30   software_inventory                    Redfish_R30_software_inventory.json

IPMI
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  NA   NA                                    NA

SSH
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  S2   bmc_dmesg                             BMC_SSH_S2_bmc_dmesg.txt
  S3   network_info                          BMC_SSH_S3_network_info/...
  S5   bmc_list_kernel_modules               BMC_SSH_S5_bmc_list_kernel_modules.txt
  S11   uptime                                BMC_SSH_S11_uptime.txt

Host
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  H1   node_dmesg                            Host_H1_node_dmesg.tar.gz
  H2   node_lspci                            Host_H2_node_lspci*.txt
  H3   node_smbios                           Host_H3_dmidecode*.txt
  H6   node_kern_log                         Host_H6_node_kern_log.tar.gz
  H7   node_crash_dump                       Host_H7_node_crash_dump.tar.gz
  H9   node_fabric_manager_log               Host_H9_fabricmanager.log
  H12   nvos_inventory                        Host_H12_nv_show_platform_{type}.txt
  H13   nvos_gnmi_config                      Host_H13_nvos_gnmi_config.txt
  H14   nvos_tech_support_dump                Host_H14_nvos_tech_support_dump/
  H19   node_memory_info                      Host_H19_memory_info/

HealthCheck
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  NA   NA                                    NA

PowerShelf#

$ ./nvdebug -l -t PowerShelf
Redfish
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  R1   system_event_log                      Redfish_R1_system_event_log_{system_id}.json
  R2   manager_existing_log_dump             Redfish_R2_existing_dump_{id}.tar.xz
  R8   firmware_inventory                    Redfish_R8_firmware_inventory.json
  R9   firmware_inventory_expand_query       Redfish_R9_firmware_inventory_expand_query.json
  R10   chassis_info                          Redfish_R10_chassis_info.json
  R11   chassis_expand_query                  Redfish_R11_chassis_expand_query.json
  R14   manager_info                          Redfish_R14_manager_info.json
  R15   manager_expand_query                  Redfish_R15_manager_expand_query.json
  R19   chassis_thermal_metrics               Redfish_R19_chassis_{chassis}_thermal_metrics.json
  R20   firmware_inventory_table              Redfish_R20_firmware_inventory_table.txt
  R22   task_details                          Redfish_R22_task_{task_id}.json
  R25   additional_oob_logs                   Redfish_R25_OOB_Log_{id}.json
  R26   chassis_certificates                  Redfish_R26_chassis_{chassis_id}_certificate.json
  R33   power_equipment_info                  Redfish_R33_power_equipment_info
  R34   power_equipment_expand_query          Redfish_R34_power_equipment_expand_query

IPMI
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  NA   NA                                    NA

SSH
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  NA   NA                                    NA

Host
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  NA   NA                                    NA

HealthCheck
  CID   Collector Name                        Log Location
------+-------------------------------------+-------------------------------
  NA   NA                                    NA

Running Specific Collector Groups and Collectors#

Redfish Collectors#

$ nvdebug -i $BMC_IP -u $BMC_USER -p $BMC_PASS -t $TARGET -g Redfish

Run Specific Firmware Inventory Collector#

$ nvdebug -i $BMC_IP -u $BMC_USER -p $BMC_PASS -t $TARGET -S R8

Configuration Considerations#

Running nvdebug in IPv6 Networks#

By default, nvdebug uses IPv4. To use IPv6, set IP_NETWORK to 'ipv6' in the DUT configuration file, and enter the BMC and host IPv6 addresses without square brackets.
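
As an illustrative sketch of these settings, a DUT entry for an IPv6 network might look like the following. The addresses come from the 2001:db8::/32 documentation range and are placeholders, and the entry name dut-1 is borrowed from the single-DUT example. Quoting the addresses is a precaution so that YAML parsers do not misread the colons.

```yaml
# Sketch only: placeholder addresses from the 2001:db8::/32 documentation range.
dut-1:
  <<: *dut_defaults
  IP_NETWORK: 'ipv6'
  BMC_IP: "2001:db8::100"     # no square brackets, i.e. not "[2001:db8::100]"
  HOST_IP: "2001:db8::200"
```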