The chassis manager provides the user access to the following information:

Accessible Parameter     Description
-------------------------------------------------------------------------------
switch temperatures      Displays the system’s temperature
power supply voltages    Displays the power supplies’ voltage levels
fan unit                 Displays the system fans’ status
power unit               Displays the system power consumers
Flash memory             Displays information about system memory utilization

Additionally, it monitors:

  • AC power to the PSUs
  • DC power out from the PSUs
  • Chassis failures

System Health Monitor

The system health monitor scans the system to determine whether it is healthy. When the monitor discovers that one of the system's modules (a fan or a power supply) has entered or recovered from an unhealthy state, it notifies users through the following methods:

  • System logs: accessible to the user at any time, because they are saved permanently on the system
  • Status LEDs: changed by the system health monitor when an error is found in the system and again when it is resolved
  • Email/SNMP traps: a notification is sent for every error found in the system and for every error resolved

Re-Notification on Errors

When the system is in an unhealthy state, the system health monitor re-notifies the user about each unresolved issue every X seconds. To configure this re-notification interval, run the “health notif-cntr <counter>” command.
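The re-notification behavior described above can be illustrated with a minimal Python sketch. It is not the switch's implementation: the `RenotifyTimer` class, its field names, and the 30-second interval are hypothetical and chosen only to show how a fixed interval spaces out repeated alerts for an unresolved issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RenotifyTimer:
    """Hypothetical helper: decides when an unresolved issue is reported again."""

    interval_s: float                    # the re-notification interval ("X seconds")
    last_sent_s: Optional[float] = None  # time of the last notification, if any

    def should_notify(self, now_s: float, healthy: bool) -> bool:
        if healthy:
            # Issue resolved: stop repeating and reset the timer.
            self.last_sent_s = None
            return False
        if self.last_sent_s is None or now_s - self.last_sent_s >= self.interval_s:
            self.last_sent_s = now_s
            return True
        return False


# Poll every 10 seconds while the system stays unhealthy (assumed values).
timer = RenotifyTimer(interval_s=30.0)
sent = [t for t in range(0, 100, 10) if timer.should_notify(float(t), healthy=False)]
print(sent)  # [0, 30, 60, 90]
```

With a 10-second polling loop and a 30-second interval, only every third poll produces a notification, and a healthy reading resets the timer so that the next failure is reported immediately.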

System Health Monitor Alert Scenarios

The system health monitor sends notification alerts in the following cases:

Alert message: <fan_name> speed is below minimal range
  • Scenario: A chassis fan speed is below the minimal threshold (15% of maximum speed)
  • Notification indicator: Email, fan LED and system status LED set red, log alert, SNMP
  • Recovery action: Check the fan and replace it if required
  • Recovery message: “<fan_name> has been restored to its normal state”

Alert message: <fan_name> is unresponsive
  • Scenario: A chassis fan is not responsive on the switch system
  • Notification indicator: Email, fan LED and system status LED set red, log alert, SNMP
  • Recovery action: Check fan connectivity and replace it if required
  • Recovery message: “<fan_name> has been restored to its normal state”

Alert message: <fan_name> is not present
  • Scenario: A chassis fan is missing
  • Notification indicator: Email, fan LED and system status LED set red, log alert, SNMP
  • Recovery action: Insert a fan unit
  • Recovery message: “<fan_name> has been restored to its normal state”

Alert message: Insufficient number of working fans in the system
  • Scenario: Insufficient number of working fans in the system
  • Notification indicator: Email, fan LED and system status LED set red, log alert, SNMP
  • Recovery action: Plug in additional fans or replace faulty fans
  • Recovery message: “The system currently has sufficient number of working fans”

Alert message: Power Supply <ps_number> voltage is out of range
  • Scenario: The power supply voltage is out of range
  • Notification indicator: Email, power supply LED and system status LED set red, log alert, SNMP
  • Recovery action: Check the power connection of the PS
  • Recovery message: “Power Supply <ps_number> voltage is in range”

Alert message: Power supply <ps_number> temperature is too hot
  • Scenario: A power supply unit temperature is higher than the maximum threshold of 70 degrees Celsius on the switch system
  • Notification indicator: Email, power supply LED and system status LED set red, log alert, SNMP
  • Recovery action: Check the chassis fan connections. On switch systems, check the system fan connections.
  • Recovery message: “Power supply <ps_number> temperature is back to normal”

Alert message: Power Supply <number> is unresponsive
  • Scenario: A power supply is malfunctioning or disconnected
  • Notification indicator: Email, system status LED and power supply LED set red, log alert, SNMP
  • Recovery action: Connect the power cable or replace the malfunctioning PS
  • Recovery message: “Power supply has been removed” or “PS has been restored to its normal state”

Alert message: ASIC temperature is too hot
  • Scenario: An ASIC unit temperature is higher than the maximum threshold of 105 degrees Celsius on switch systems
  • Notification indicator: Email, system status LED set red, log alert, SNMP
  • Recovery action: Check the system’s fans
  • Recovery message: “ASIC temperature is back to normal”
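As a rough illustration of the threshold-based alerts listed above, the following Python sketch compares sensor readings against the three numeric limits stated in this section (15% of maximum fan speed, 70 degrees Celsius for a power supply, 105 degrees Celsius for the ASIC). The function name, its parameters, and the maximum fan speed of 20000 RPM are assumptions made for the example; the switch software is not claimed to work this way.

```python
from typing import Dict, List

FAN_MIN_FRACTION = 0.15   # fan alert: speed below 15% of maximum speed
PSU_MAX_TEMP_C = 70.0     # power supply temperature threshold from this section
ASIC_MAX_TEMP_C = 105.0   # ASIC temperature threshold from this section


def collect_alerts(fan_rpm: Dict[str, float], fan_max_rpm: float,
                   psu_temp_c: Dict[int, float], asic_temp_c: float) -> List[str]:
    """Return the alert messages that the given readings would trigger."""
    alerts: List[str] = []
    for name, rpm in fan_rpm.items():
        if rpm < FAN_MIN_FRACTION * fan_max_rpm:
            alerts.append(f"{name} speed is below minimal range")
    for number, temp in psu_temp_c.items():
        if temp > PSU_MAX_TEMP_C:
            alerts.append(f"Power supply {number} temperature is too hot")
    if asic_temp_c > ASIC_MAX_TEMP_C:
        alerts.append("ASIC temperature is too hot")
    return alerts


alerts = collect_alerts(
    fan_rpm={"FAN1": 9305.0, "FAN2": 2100.0},
    fan_max_rpm=20000.0,  # assumed value; this section does not state a maximum speed
    psu_temp_c={1: 22.0, 2: 74.5},
    asic_temp_c=61.0,
)
print(alerts)
# ['FAN2 speed is below minimal range', 'Power supply 2 temperature is too hot']
```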

Power Management

Width Reduction Power Saving

Link width reduction (LWR) is an NVIDIA proprietary power-saving feature that reduces the power usage of the fabric. LWR can be configured, manually or automatically, on a connection between NVIDIA switch systems to lower the width of a link from 4X operation to 1X based on the traffic flow.

LWR is relevant only for 40GbE speeds, at which the links operate at a 4X width.

When “show interfaces” is used, a port’s speed appears unchanged even when only one lane is active.

LWR has three operating modes per interface:

  • Disabled: LWR does not operate and the link remains in 4X under all circumstances.
  • Automatic: the link automatically alternates between 4X and 1X based on traffic flow.
  • Force: the port is forced to operate in 1X mode, which lowers its throughput capability. Choose this mode when constant low throughput is expected on the port for a certain time period. Afterwards, configure the port to one of the other two modes to allow higher throughput to pass through it.


The following table describes LWR configuration behavior:

Switch-A Configuration   Switch-B Configuration   Behavior
------------------------------------------------------------------------------------------------
Disable                  Disable                  LWR is disabled.
Disable                  Force                    Transmission from Switch-B to Switch-A operates at 1X. In the opposite direction, LWR is disabled.
Disable                  Auto                     Depending on traffic flow, transmission from Switch-B to Switch-A may operate at 1X. In the opposite direction, LWR is disabled.
Auto                     Force                    Transmission from Switch-B to Switch-A operates at 1X. Transmission from Switch-A to Switch-B may operate at 1X depending on the traffic.
Auto                     Auto                     The width of the connection depends on the traffic flow.
Force                    Force                    The connection between the switches operates at 1X.
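The LWR behavior table is consistent with a simple rule: the width of each transmit direction follows the mode configured on the transmitting switch. The Python sketch below encodes that reading of the table. The rule itself is an inference from the listed rows, and `LwrMode`, `link_behavior`, and the description strings are hypothetical names introduced for the example.

```python
from enum import Enum
from typing import Dict


class LwrMode(Enum):
    DISABLE = "Disable"
    AUTO = "Auto"
    FORCE = "Force"


# Assumed rule inferred from the table: the transmitter's mode sets the width
# of the direction it transmits on.
_TX_WIDTH = {
    LwrMode.DISABLE: "4X (LWR disabled)",
    LwrMode.AUTO: "4X or 1X, depending on traffic",
    LwrMode.FORCE: "1X",
}


def link_behavior(switch_a: LwrMode, switch_b: LwrMode) -> Dict[str, str]:
    """Describe the expected width of each direction of the link."""
    return {
        "A->B": _TX_WIDTH[switch_a],
        "B->A": _TX_WIDTH[switch_b],
    }


print(link_behavior(LwrMode.DISABLE, LwrMode.FORCE))
# {'A->B': '4X (LWR disabled)', 'B->A': '1X'}
print(link_behavior(LwrMode.AUTO, LwrMode.FORCE))
# {'A->B': '4X or 1X, depending on traffic', 'B->A': '1X'}
```

The table lists each pair of modes only once, so the sketch also assumes the behavior is symmetric when the two switch configurations are swapped.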

Monitoring Environmental Conditions

  1. Display the modules’ temperatures. Run:

    switch (config) # show temperature 
    ---------------------------------------------------------
    Module      Component              Reg  CurTemp    Status
                                            (Celsius) 
    ---------------------------------------------------------
    MGMT        SIB                    T1   33.00      OK    
    MGMT        Board AMB temp         T1   24.50      OK    
    MGMT        Ports AMB temp         T1   27.00      OK    
    MGMT        CPU package Sensor     T1   29.00      OK    
    MGMT        CPU Core Sensor        T1   28.00      OK    
    MGMT        CPU Core Sensor        T2   24.00      OK    
    PS1         power-mon              T1   22.00      OK    
    PS2         power-mon              T1   23.00      OK 
  2. Display measured voltage levels of power supplies. Run: 

    switch (config) # show voltage
    ------------------------------------------------------------------------------------------------
    Module   Power Meter              Reg                    Expected  Actual   Status  High   Low  
                                                             Voltage   Voltage          Range  Range
    ------------------------------------------------------------------------------------------------
    MGMT     acdc-monitor1            DDR3 0.675V            0.68      0.67     OK      0.78   0.57 
    MGMT     acdc-monitor1            CPU 0.9V               0.78      0.78     OK      0.89   0.66 
    MGMT     acdc-monitor1            SYS 3.3V               3.30      3.34     OK      3.79   2.80 
    MGMT     acdc-monitor1            CPU 1.8V               1.80      1.79     OK      2.07   1.53 
    MGMT     acdc-monitor1            CPU/PCH 1.05V          1.05      1.05     OK      1.21   0.89 
    MGMT     acdc-monitor1            CPU 1.05V              1.05      1.05     OK      1.21   0.89 
    MGMT     acdc-monitor1            DDR3 1.35V             1.35      1.35     OK      1.55   1.15 
    MGMT     acdc-monitor1            USB 5V                 5.00      5.04     OK      5.75   4.25 
    MGMT     acdc-monitor1            1.05V LAN              1.50      1.50     OK      1.72   1.27 
    MGMT     ASICVoltMonitor1         Asic 1.2V              1.20      1.21     OK      1.38   1.02 
    MGMT     ASICVoltMonitor1         Asic 3.3V              3.30      3.32     OK      3.79   2.80 
    MGMT     ASICVoltMonitor2         Vcore SPC              0.95      0.96     OK      1.09   0.81 
    MGMT     acdc-monitor2            1.8V Switch SPC        1.80      1.82     OK      2.07   1.53 
    PS1      power-mon                N/A                    0.00      0.00     FAIL    0.00   0.00 
    PS2      power-mon                vout 12V               12.00     11.98    OK      13.80  10.20

  3. Display the fan speed and status. Run: 

    switch (config) # show fan
    -----------------------------------------------------
    Module          Device          Fan  Speed     Status
                                         (RPM)
    -----------------------------------------------------
    FAN1            FAN             F1   9305.00   OK
    FAN2            FAN             F1   8823.00   OK
    FAN3            FAN             F1   9057.00   OK
    FAN4            FAN             F1   9369.00   OK
    PS1             FAN             F1   10288.00  OK
    PS2             FAN             -    -         NOT PRESENT

  4. Display the power, voltage, current, and status of each module in the system. Run:

    switch (config) # show power consumers
    ------------------------------------------------------------------
    Module  Device            Sensor  Power   Voltage  Current  Status
                                      [Watts] [Volts]  [Amp]
    ------------------------------------------------------------------
    PS1     power-mon         input   37.50   12.02    3.19     OK
    MGMT    acdc-monitor2     input   -       -        -        OK
    
    Total power used : 37.50 Watts
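When this output is collected by a script, the fixed column order makes it straightforward to parse. The following Python sketch reads rows in the layout of the “show voltage” sample above and lists the rails whose status is not OK. It is an illustration only: `parse_voltage_rows` is a hypothetical helper, the sample rows are copied from the output above, and the parser assumes the last five columns are always Expected, Actual, Status, High, and Low.

```python
from typing import Dict, List, Union

SAMPLE = """\
Module   Power Meter              Reg                    Expected  Actual   Status  High   Low
MGMT     acdc-monitor1            SYS 3.3V               3.30      3.34     OK      3.79   2.80
MGMT     ASICVoltMonitor1         Asic 1.2V              1.20      1.21     OK      1.38   1.02
PS1      power-mon                N/A                    0.00      0.00     FAIL    0.00   0.00
PS2      power-mon                vout 12V               12.00     11.98    OK      13.80  10.20
"""

Row = Dict[str, Union[str, float]]


def parse_voltage_rows(text: str) -> List[Row]:
    """Parse data rows; header, ruler, and prompt lines are skipped."""
    rows: List[Row] = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 8:
            continue  # too short to be a data row
        try:
            expected, actual = float(tokens[-5]), float(tokens[-4])
            high, low = float(tokens[-2]), float(tokens[-1])
        except ValueError:
            continue  # non-numeric columns: a header line
        rows.append({
            "module": tokens[0],
            "meter": tokens[1],
            "reg": " ".join(tokens[2:-5]),  # the Reg name may contain spaces
            "expected": expected,
            "actual": actual,
            "status": tokens[-3],
            "high": high,
            "low": low,
        })
    return rows


rows = parse_voltage_rows(SAMPLE)
failed = [r["module"] for r in rows if r["status"] != "OK"]
print(failed)  # ['PS1']
```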

USB Access

The OS can access USB devices attached to switch systems. USB devices are automatically recognized and mounted upon insertion. To access a USB device for reading or writing a file, you need to provide the path to the file on the mounted USB device in the following format: 

scp://username:password@hostname/var/mnt/usb1/<file name>

Here, username and password are the admin username and password, and hostname is the IP address of the switch.

Examples:

  • To fetch an image from a USB device, run the command: 

    switch (config) # image fetch scp://username:password@hostname/var/mnt/usb1/<image filename>
  • To save a log file (my-logfile) to a USB device under the name “test_logfile” using the “logging files” command, run:

    switch (config) # logging files upload my-logfile scp://username:password@hostname/var/mnt/usb1/test_logfile
  • To safely remove the USB device and flush the cache after writing to it (log files, for example), use the “usb eject” command:

    switch (config) # usb eject
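For scripts that generate these commands, the path format can be assembled with a small helper. The Python sketch below only formats the string shown in this section. The function name is hypothetical, the credentials and address in the example are placeholders, and no escaping of special characters is attempted because this section does not say how the CLI treats them.

```python
USB_MOUNT_DIR = "/var/mnt/usb1"  # mount point used in the path format of this section


def usb_scp_url(username: str, password: str, hostname: str, filename: str) -> str:
    """Build scp://username:password@hostname/var/mnt/usb1/<file name>."""
    if not all((username, password, hostname, filename)):
        raise ValueError("username, password, hostname, and filename are all required")
    return f"scp://{username}:{password}@{hostname}{USB_MOUNT_DIR}/{filename}"


# Placeholder values for illustration only.
url = usb_scp_url("admin", "admin-password", "192.0.2.10", "test_logfile")
print(url)  # scp://admin:admin-password@192.0.2.10/var/mnt/usb1/test_logfile
```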

Unit Identification LED

The unit identification (UID) LED is a hardware feature that helps locate a specific switch system in a server room.

To activate the UID LED on a switch system, run: 

switch (config) # led MGMT uid on

To verify the LED status, run: 

switch (config) # show leds
Module   LED            Status
--------------------------------------------------------------------------
MGMT     UID            Blue

To deactivate the UID LED on a switch system, run: 

switch (config) # led MGMT uid off

System Reboot

To reboot your switch system, run: 

switch (config) # reload

Viewing Active Events

NVIDIA Onyx supports viewing all active events on the system. The following events can be observed using the “show system hardware events” command.

Event Name                              Description
------------------------------------------------------------------------------------------------

Ethernet Family
Invalid Mac (SMAC=MC)                   Source MAC is a multicast address
Invalid Mac (SMAC=DMAC)                 Source MAC is the same as the destination MAC address
Invalid Ethertype                       Packet has an unknown Ethertype (0x05DC < ethertype < 0x600)

IP Routing Family
Ingress Router interface is disabled    Ingress packet has been dropped because the incoming L3 interface is admin down
Mismatched IP (UC DIP over MC/BC Mac)   Packet MAC is multicast/broadcast but destination IP is unicast
Invalid IP (DIP=loopback)               Destination IP is a loopback IP (IPv6: DIP == ::1/128 or DIP == 0:0:0:0:0:ffff:7f00:0/104; IPv4: DIP == 127.0.0.0/8)
Invalid IP (SIP=MC)                     Source IP is a multicast address (IPv6: SIP == FF00::/8; IPv4: SIP == 224.0.0.0 to 239.255.255.255, that is, 224.0.0.0/4)
Invalid IP (SIP=unspecified)            Source IP is unspecified
Invalid IP (SIP=DIP)                    Source IP is identical to destination IP
Mismatched MC Mac                       Packet’s multicast MAC does not correspond to packet’s MC IP address
IPv6 neighbor not resolved              IPv6 neighbor not resolved
Invalid IPv6 (SIP=Link Local)           Source IP is link local (IPv6)
MC RPF check failure                    Multicast RPF check failure
TTL expired                             TTL value is zero
Egress Router interface is disabled     Egress packet has been dropped because the outgoing L3 interface is admin or oper down
IPv4 neighbor not resolved              Entry not found for destination

Tunnel Family
NVE Decap fragmentation error           Fragmentation error during decapsulation
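Several of the Ethernet and IP events above are defined by simple address checks, which the following Python sketch reproduces with the standard `ipaddress` module. It is meant only to make the conditions concrete. The function names are hypothetical, the returned strings reuse the event names from the table, and the switch hardware is not claimed to evaluate packets in this way or in this order.

```python
import ipaddress
from typing import List

# Address ranges quoted in the event descriptions above.
_LOOPBACK_NETS = [ipaddress.ip_network(n)
                  for n in ("127.0.0.0/8", "::1/128", "::ffff:7f00:0/104")]
_MULTICAST_NETS = [ipaddress.ip_network(n) for n in ("224.0.0.0/4", "ff00::/8")]


def ethernet_events(smac: str, dmac: str, ethertype: int) -> List[str]:
    """Check colon-separated MAC strings and the Ethertype value."""
    events: List[str] = []
    # A MAC address is multicast when the lowest bit of its first octet is set.
    if int(smac.split(":")[0], 16) & 0x01:
        events.append("Invalid Mac (SMAC=MC)")
    if smac.lower() == dmac.lower():
        events.append("Invalid Mac (SMAC=DMAC)")
    if 0x05DC < ethertype < 0x600:
        events.append("Invalid Ethertype")
    return events


def ip_events(sip: str, dip: str) -> List[str]:
    """Check source and destination IP strings (IPv4 or IPv6)."""
    src, dst = ipaddress.ip_address(sip), ipaddress.ip_address(dip)
    events: List[str] = []
    if any(dst in net for net in _LOOPBACK_NETS):
        events.append("Invalid IP (DIP=loopback)")
    if any(src in net for net in _MULTICAST_NETS):
        events.append("Invalid IP (SIP=MC)")
    if src.is_unspecified:
        events.append("Invalid IP (SIP=unspecified)")
    if src == dst:
        events.append("Invalid IP (SIP=DIP)")
    return events


print(ethernet_events("01:00:5e:00:00:01", "aa:bb:cc:dd:ee:ff", 0x0800))
# ['Invalid Mac (SMAC=MC)']
print(ip_events("224.0.0.5", "127.0.0.1"))
# ['Invalid IP (DIP=loopback)', 'Invalid IP (SIP=MC)']
```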