NVIDIA NVOS User Manual for InfiniBand Switches v25.02.3000

Troubleshooting

Resetting NVOS Password

To reset a forgotten password for a default user account, see the Reset Local Users' Passwords section.

Image Upgrade Recovery

If the system encounters issues after an image upgrade, you can switch back to the partition that holds the previous image.

Check the current partition.

admin@nvos:~$ nv show system image
            operational              
----------  ------------------------ 
current     nvos-25.02.1500 
next        nvos-25.02.1500 
partition1  nvos-25.02.1500 
partition2  nvos-25.02.1400

Set the other partition as the next boot image.

admin@nvos:~$ nv action boot-next system image partition2
admin@nvos:~$ nv show system image
            operational              
----------  ------------------------ 
current     nvos-25.02.1500 
next        nvos-25.02.1400 
partition1  nvos-25.02.1500 
partition2  nvos-25.02.1400

Reboot the system.

admin@nvos:~$ nv action reboot system
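
The partition choice in the steps above can be sketched in Python. This is a minimal illustration that assumes the output of nv show system image has been captured as text in the layout shown earlier; the helper names parse_image_table and rollback_partition are hypothetical and are not part of NVOS.

```python
def parse_image_table(text):
    """Parse the two-column table printed by `nv show system image` into a dict."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only "key value" rows; skip the header and the dashed separator.
        if len(parts) == 2 and not parts[0].startswith("-"):
            rows[parts[0]] = parts[1]
    return rows


def rollback_partition(rows):
    """Return the partition whose image differs from the running one, if any."""
    current = rows["current"]
    for name in ("partition1", "partition2"):
        if rows.get(name) and rows[name] != current:
            return name
    return None  # both partitions hold the running image


# Sample text copied from the example output in this section.
SAMPLE = """\
            operational
----------  ------------------------
current     nvos-25.02.1500
next        nvos-25.02.1500
partition1  nvos-25.02.1500
partition2  nvos-25.02.1400
"""

rows = parse_image_table(SAMPLE)
target = rollback_partition(rows)
if target:
    # Print the command to run on the switch; nothing is executed here.
    print(f"nv action boot-next system image {target}")
```

With the sample output, the sketch selects partition2, which matches the command used in the example.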

Management Interface Has No IPv4/IPv6 Address

Check that the DHCP client is running for the relevant protocol (dhcp-client for IPv4, dhcp-client6 for IPv6):

admin@nvos:~$ nv show interface eth0 ip
                operational                             applied
--------------  --------------------------------------  -------
vrf             default                                 default
arp-timeout     1800                                    1800
autoconf        enabled                                 enabled
dhcp-client
  state         enabled                                 enabled
  set-hostname  enabled                                 enabled
  is-running    yes
  has-lease     no
dhcp-client6
  state         enabled                                 enabled
  set-hostname  enabled                                 enabled
  is-running    yes
  has-lease     yes
[address]       10:10:10:10:10:10:10:10/10
[address]       fdfd:fdfd:7:80:eaeb:d3ff:fe4b:70b8/64
[gateway]

If the interface does not have a lease, run the following action to renew the DHCP lease:

admin@nvos:~$ nv action renew interface eth0 ip dhcp-client
Action executing ...
Renewing DHCP lease for eth0, connection may be interrupted
Action executing ...
DHCP lease for eth0 was renewed
Action succeeded
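
The lease check described above can be sketched in Python. This is an illustration only: it assumes the output of nv show interface eth0 ip has been captured as text with the indentation shown in the example, and the function names dhcp_lease_status and clients_needing_renew are hypothetical.

```python
def dhcp_lease_status(text):
    """Map each DHCP client section to its is-running / has-lease values."""
    status = {}
    section = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] in ("dhcp-client", "dhcp-client6"):
            section = parts[0]
            status[section] = {}
        elif not line.startswith(" "):
            section = None  # any other top-level row ends the section
        elif section and parts[0] in ("is-running", "has-lease") and len(parts) >= 2:
            status[section][parts[0]] = parts[1]
    return status


def clients_needing_renew(status):
    """List clients that are running but currently hold no lease."""
    return [
        name
        for name, fields in status.items()
        if fields.get("is-running") == "yes" and fields.get("has-lease") == "no"
    ]


# Trimmed sample in the same layout as the example output above.
SAMPLE = """\
                operational  applied
--------------  -----------  -------
dhcp-client
  state         enabled      enabled
  is-running    yes
  has-lease     no
dhcp-client6
  state         enabled      enabled
  is-running    yes
  has-lease     yes
[gateway]
"""

status = dhcp_lease_status(SAMPLE)
print(clients_needing_renew(status))
```

For the sample, only dhcp-client is reported, which is the client renewed by the action shown above.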

System Fatal Recovery

The system has a mechanism that detects when an ASIC encounters a health or firmware-burn issue and attempts to recover from it.

Events are also raised during fatal-state detection and recovery. For more information, see ASIC-Related Events in the Event Management section.

Detecting a Fatal State

The system's fatal state is indicated by the [System_Fatal_State] prefix in the CLI prompt and by the FATAL status in the output of the nv show system health command.

Example:

[System_Fatal_State]admin@nvos~$ nv show system health
            operational  applied
----------  -----------  -------
status      FATAL
status-led  amber

Health issues
================
    Component    Status information
    -----------  ---------------------------
    ASIC-HEALTH  Switch ASIC in fatal state.
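
Both indicators can be checked from captured text. The following Python sketch assumes the prompt and the nv show system health output are available as strings in the layout shown above; health_status and prompt_indicates_fatal are hypothetical helper names.

```python
def health_status(text):
    """Return the value of the `status` row from `nv show system health` output."""
    for line in text.splitlines():
        parts = line.split()
        # Match the exact key so that `status-led` is not picked up.
        if len(parts) >= 2 and parts[0] == "status":
            return parts[1]
    return None


def prompt_indicates_fatal(prompt):
    """True when the CLI prompt carries the fatal-state prefix."""
    return prompt.startswith("[System_Fatal_State]")


# Sample text copied from the example in this section.
SAMPLE = """\
            operational  applied
----------  -----------  -------
status      FATAL
status-led  amber
"""

print(health_status(SAMPLE) == "FATAL")
print(prompt_indicates_fatal("[System_Fatal_State]admin@nvos~$"))
```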

Automatic Recovery Mechanism

The system has an internal mechanism to recover from a fatal state without user intervention. The recovery process involves the following steps:

Restart the ASICs of the system.
If restarting the ASICs does not resolve the issue, the system attempts to recover through a system reboot.
If the system still encounters ASIC issues after the reboot, another reboot is performed.
After the second reboot, the system starts without configuring the ASICs, leaving all ports down. NVOS is still running, so logs can be collected for analysis.
To try to revive the switch, perform a power cycle by running the command nv action power-cycle system.
If the system enters the fatal state again, contact NVIDIA support.

Note

Any reboot or power cycle initiated by the user also resets the system's fatal detection and recovery mechanism, so the recovery steps start again from the beginning.
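
The escalation order and the reset behavior from the note can be pictured as a small state model. The Python sketch below is one reading of the steps above, not the NVOS implementation; the class name RecoveryLadder and the step labels are hypothetical.

```python
# Automatic steps, in the order described above (labels are illustrative).
AUTOMATIC_STEPS = (
    "restart ASICs",
    "system reboot",
    "second reboot, start with ASICs unconfigured (ports down)",
)


class RecoveryLadder:
    """Tracks which automatic recovery step would be tried next."""

    def __init__(self):
        self.index = 0

    def next_action(self):
        """Return the next automatic step, or None once only manual actions remain."""
        if self.index >= len(AUTOMATIC_STEPS):
            return None
        step = AUTOMATIC_STEPS[self.index]
        self.index += 1
        return step

    def user_reboot(self):
        """A user-initiated reboot or power cycle restarts the ladder."""
        self.index = 0


ladder = RecoveryLadder()
tried = [ladder.next_action() for _ in range(4)]
print(tried)  # three automatic steps, then None (manual power cycle or support)

ladder.user_reboot()
print(ladder.next_action())  # the ladder starts again from the first step
```

Once next_action returns None, the remaining options are the manual ones listed above: the power cycle and, if the fatal state returns, contacting NVIDIA support.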

Recovery Timeframe

After the recovery steps are completed and the system remains operational for 10 minutes without any health issues, it will exit the fatal state.
During this 10-minute observation period, the system may still appear in a fatal state as reflected in the CLI prompt and system health command.
Once the system exits the fatal state, the CLI prompt and system health command will confirm the recovery.
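
The observation period can be expressed as a simple check. This Python sketch assumes time is measured in seconds from the end of the recovery steps; the function name and parameters are hypothetical, and the source does not say whether the timer restarts after a new health issue, so that case is only reported as still fatal.

```python
OBSERVATION_SECONDS = 10 * 60  # the 10-minute window described above


def still_reported_fatal(seconds_since_recovery, health_issue_seen):
    """True while the prompt and health command are expected to show the fatal state."""
    if health_issue_seen:
        return True  # the system did not stay healthy during the window
    return seconds_since_recovery < OBSERVATION_SECONDS


print(still_reported_fatal(120, False))
print(still_reported_fatal(600, False))
```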
