Chassis Management
The chassis manager provides the user access to the following information:
Accessible Parameters |
Description |
switch temperatures |
Displays system’s temperature |
power supply voltages |
Displays power supplies’ voltage levels |
fan unit |
Displays system fans’ status |
power unit |
Displays system power consumers |
Flash memory |
Displays information about system memory utilization. |
Additionally, it monitors:
AC power to the PSUs
DC power out from the PSUs
Chassis failures
The system health monitor scans the system to decide whether or not the system is healthy. When the monitor discovers that one of the system's modules (fan or power supply) is in an unhealthy state or has returned from an unhealthy state, it notifies the users through the following methods:
System logs: accessible to the user at any time, as they are saved permanently on the system
Status LEDs: changed by the system health monitor when an error is found in the system and again when it is resolved
Email/SNMP traps: notification of any error found in the system and of its resolution
Re-Notification on Errors
When the system is in an unhealthy state, the system health monitor notifies the user about the current unresolved issue every X seconds, where X is the re-notification gap. The user can configure this gap by running the “health notif-cntr <counter>” command.
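The re-notification behavior can be pictured as a simple timer. The Python sketch below is illustrative only and is not the switch's implementation: the function name and its parameters are hypothetical, and it assumes that the configured counter is a gap in seconds and that reminders stop as soon as the issue is resolved.

```python
def renotification_times(first_alert, resolved_at, gap_seconds):
    """Return the times (in seconds) at which reminders would be sent.

    Assumption: one reminder is sent every `gap_seconds` after the first
    alert for as long as the issue remains unresolved.
    """
    if gap_seconds <= 0:
        raise ValueError("gap_seconds must be positive")
    times = []
    t = first_alert + gap_seconds
    while t < resolved_at:
        times.append(t)
        t += gap_seconds
    return times


# An issue raised at t=0 and resolved at t=100 with a 30-second gap
reminders = renotification_times(0, 100, 30)
```

A shorter gap produces more reminders for the same outage, which is the trade-off the configurable counter exposes.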
System Health Monitor Alert Scenarios
The system health monitor sends notification alerts in the following cases:
Alert Message |
Scenario |
Notification Indicator |
Recovery Action |
Recovery Message |
<fan_name> speed is below minimal range |
A chassis fan speed is below minimal threshold (15% of maximum speed) |
Email, fan LED and system status LED set red, log alert, SNMP |
Check the fan and replace it if required |
“<fan_name> has been restored to its normal state” |
<fan_name> is unresponsive |
A chassis fan is not responsive on the switch system |
Email, fan LED and system status LED set red, log alert, SNMP |
Check fan connectivity and replace it if required |
“<fan_name> has been restored to its normal state” |
<fan_name> is not present |
A chassis fan is missing |
Email, fan LED and system status LED set red, log alert, SNMP |
Insert a fan unit |
“<fan_name> has been restored to its normal state” |
Insufficient number of working fans in the system |
Insufficient number of working fans in the system |
Email, fan LED and system status LED set red, log alert, SNMP |
Plug in additional fans or change faulty fans |
“The system currently has sufficient number of working fans” |
Power Supply <ps_number> voltage is out of range |
The power supply voltage is out of range |
Email, power supply LED and system status LED set red, log alert, SNMP |
Check the power connection of the PS |
“Power Supply <ps_number> voltage is in range” |
Power supply <ps_number> temperature is too hot |
A power supply unit temperature is higher than the maximum threshold of 70 Celsius on the switch system |
Email, power supply LED and system status LED set red, log alert, SNMP |
Check chassis fan connections. On switch systems, check system fan connections. |
“Power supply <ps_number> temperature is back to normal” |
Power Supply <number> is unresponsive |
A power supply is malfunctioning or disconnected |
Email, system status and power supply LED set red, log alert, SNMP |
Connect power cable or replace malfunctioning PS |
“Power supply has been removed” or “PS has been restored to its normal state” |
ASIC temperature is too hot |
An ASIC unit temperature is higher than the maximum threshold of 105 Celsius on switch systems |
Email, system status LED set red, log alert, SNMP |
Check the system’s fans |
“ASIC temperature is back to normal” |
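The numeric thresholds in the table above (fan speed below 15% of maximum, power supply temperature above 70 Celsius, ASIC temperature above 105 Celsius) can be expressed as simple checks. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, the maximum fan speed is supplied by the caller because the table does not give it, and the returned strings merely mirror the alert messages listed above.

```python
FAN_MIN_FRACTION = 0.15   # fan alert: speed below 15% of maximum speed
PS_MAX_TEMP_C = 70.0      # power supply temperature threshold (Celsius)
ASIC_MAX_TEMP_C = 105.0   # ASIC temperature threshold (Celsius)


def fan_speed_alert(fan_name, rpm, max_rpm):
    """Return the alert text if the fan is below the minimal range, else None."""
    if rpm < FAN_MIN_FRACTION * max_rpm:
        return f"{fan_name} speed is below minimal range"
    return None


def ps_temp_alert(ps_number, temp_c):
    """Return the alert text if the power supply is above its threshold."""
    if temp_c > PS_MAX_TEMP_C:
        return f"Power supply {ps_number} temperature is too hot"
    return None


def asic_temp_alert(temp_c):
    """Return the alert text if the ASIC is above its threshold."""
    if temp_c > ASIC_MAX_TEMP_C:
        return "ASIC temperature is too hot"
    return None
```

For example, with a hypothetical maximum of 10000 RPM, a fan reporting 1000 RPM would trigger the fan alert, while one reporting 9305 RPM would not.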
Width Reduction Power Saving
Link width reduction (LWR) is an NVIDIA proprietary power-saving feature used to economize the power usage of the fabric. LWR may be used to manually or automatically configure a connection between NVIDIA switch systems to lower the width of a link from 4X operation to 1X based on the traffic flow.
LWR is relevant only for 40GbE speeds, at which the links operate at a 4X width.
When “show interfaces” is used, a port’s speed appears unchanged even when only one lane is active.
LWR has three operating modes per interface:
Disabled: LWR does not operate and the link remains in 4X under all circumstances.
Automatic: the link automatically alternates between 4X and 1X based on traffic flow.
Force: the port is forced to operate in 1X mode, which lowers the throughput capability of the port. Choose this mode when constant low throughput is expected on the port for a certain time period. Afterwards, configure the port to one of the other two modes to allow higher throughput to pass through the port.
The following table describes LWR configuration behavior:
Switch-A Configuration |
Switch-B Configuration |
Behavior |
Disable |
Disable |
LWR is disabled |
Disable |
Force |
Transmission from Switch-B to Switch-A operates at 1X. In the opposite direction, LWR is disabled. |
Disable |
Auto |
Depending on traffic flow, transmission from Switch-B to Switch-A may operate at 1X. In the opposite direction, LWR is disabled. |
Auto |
Force |
Transmission from Switch-B to Switch-A operates at 1X. Transmission from Switch-A to Switch-B may operate at 1X depending on the traffic. |
Auto |
Auto |
Width of the connection depends on the traffic flow |
Force |
Force |
Connection between the switches operates at 1X |
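One way to read the table above is that each direction of the link follows the LWR mode of the transmitting switch. The Python sketch below encodes that reading; it is an interpretation of the table rather than a documented rule, and the enum and function names are hypothetical (they are not CLI keywords).

```python
from enum import Enum


class LwrMode(Enum):
    DISABLE = "disable"
    AUTO = "auto"
    FORCE = "force"


def tx_width(mode):
    """Width used for traffic transmitted by a switch configured with `mode`."""
    if mode is LwrMode.FORCE:
        return "1X"
    if mode is LwrMode.AUTO:
        return "1X or 4X, depending on traffic"
    return "4X"  # LWR disabled: the direction stays at full width


def link_behavior(switch_a, switch_b):
    """Return the expected width per direction for a Switch-A/Switch-B pair."""
    return {"A->B": tx_width(switch_a), "B->A": tx_width(switch_b)}


# Switch-A disabled, Switch-B forced: only the B-to-A direction drops to 1X
example = link_behavior(LwrMode.DISABLE, LwrMode.FORCE)
```

Under this reading, every row of the table is reproduced by looking up the two directions independently.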
Display module’s temperature. Run:
switch (config) # show temperature
---------------------------------------------------------
Module  Component            Reg  CurTemp    Status
                                  (Celsius)
---------------------------------------------------------
MGMT    SIB                  T1   33.00      OK
MGMT    Board AMB temp       T1   24.50      OK
MGMT    Ports AMB temp       T1   27.00      OK
MGMT    CPU package Sensor   T1   29.00      OK
MGMT    CPU Core Sensor      T1   28.00      OK
MGMT    CPU Core Sensor      T2   24.00      OK
PS1     power-mon            T1   22.00      OK
PS2     power-mon            T1   23.00      OK
Display measured voltage levels of power supplies. Run:
switch (config) # show voltage
------------------------------------------------------------------------------------------------
Module  Power Meter       Reg               Expected  Actual   Status  High    Low
                                            Voltage   Voltage          Range   Range
------------------------------------------------------------------------------------------------
MGMT    acdc-monitor1     DDR3 0.675V       0.68      0.67     OK      0.78    0.57
MGMT    acdc-monitor1     CPU 0.9V          0.78      0.78     OK      0.89    0.66
MGMT    acdc-monitor1     SYS 3.3V          3.30      3.34     OK      3.79    2.80
MGMT    acdc-monitor1     CPU 1.8V          1.80      1.79     OK      2.07    1.53
MGMT    acdc-monitor1     CPU/PCH 1.05V     1.05      1.05     OK      1.21    0.89
MGMT    acdc-monitor1     CPU 1.05V         1.05      1.05     OK      1.21    0.89
MGMT    acdc-monitor1     DDR3 1.35V        1.35      1.35     OK      1.55    1.15
MGMT    acdc-monitor1     USB 5V            5.00      5.04     OK      5.75    4.25
MGMT    acdc-monitor1     1.05V LAN         1.50      1.50     OK      1.72    1.27
MGMT    ASICVoltMonitor1  Asic 1.2V         1.20      1.21     OK      1.38    1.02
MGMT    ASICVoltMonitor1  Asic 3.3V         3.30      3.32     OK      3.79    2.80
MGMT    ASICVoltMonitor2  Vcore SPC         0.95      0.96     OK      1.09    0.81
MGMT    acdc-monitor2     1.8V Switch SPC   1.80      1.82     OK      2.07    1.53
PS1     power-mon         N/A               0.00      0.00     FAIL    0.00    0.00
PS2     power-mon         vout 12V          12.00     11.98    OK      13.80   10.20
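When post-processing this output, a reader may want to confirm that each measured voltage lies between the Low Range and High Range columns. The Python sketch below shows such a check on three rows taken from the sample output. The class and field names are hypothetical, and the sketch does not claim that the switch derives the Status column this way.

```python
from dataclasses import dataclass


@dataclass
class VoltageReading:
    reg: str
    expected: float
    actual: float
    high: float
    low: float

    def in_range(self):
        """True when the measured voltage lies within [low, high]."""
        return self.low <= self.actual <= self.high


# Values copied from the sample "show voltage" output
readings = [
    VoltageReading("DDR3 0.675V", 0.68, 0.67, 0.78, 0.57),
    VoltageReading("SYS 3.3V", 3.30, 3.34, 3.79, 2.80),
    VoltageReading("vout 12V", 12.00, 11.98, 13.80, 10.20),
]

# Collect the registers whose measured voltage falls outside the range
out_of_range = [r.reg for r in readings if not r.in_range()]
```

All three sample rows fall inside their ranges, so the resulting list is empty.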
Display the fan speed and status. Run:
switch (config) # show fan
-----------------------------------------------------
Module  Device  Fan  Speed      Status
                     (RPM)
-----------------------------------------------------
FAN1    FAN     F1   9305.00    OK
FAN2    FAN     F1   8823.00    OK
FAN3    FAN     F1   9057.00    OK
FAN4    FAN     F1   9369.00    OK
PS1     FAN     F1   10288.00   OK
PS2     FAN     -    -          NOT PRESENT
Display the voltage, current, and status of each module in the system. Run:
switch (config) # show power consumers
------------------------------------------------------------------
Module  Device         Sensor  Power    Voltage  Current  Status
                               [Watts]  [Volts]  [Amp]
------------------------------------------------------------------
PS1     power-mon      input   37.50    12.02    3.19     OK
MGMT    acdc-monitor2  input   -        -        -        OK
Total power used : 37.50 Watts
The OS can access USB devices attached to switch systems. USB devices are automatically recognized and mounted upon insertion. To access a USB device for reading or writing a file, you need to provide the path to the file on the mounted USB device in the following format:
scp://username:password@hostname/var/mnt/usb1/<file name>
Here, username and password are the admin username and password, and hostname is the IP address of the switch.
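For scripted use, the path format above can be assembled with a small helper. The Python sketch below is only an illustration: the function name and its arguments are hypothetical, and it does not handle special characters in the credentials, since the text does not say how the switch expects them to be written.

```python
USB_MOUNT = "/var/mnt/usb1"  # mount path given in the format above


def usb_file_url(username, password, hostname, filename):
    """Build the scp:// path to a file on the mounted USB device."""
    if not filename or "/" in filename:
        raise ValueError("filename must be a plain file name")
    return f"scp://{username}:{password}@{hostname}{USB_MOUNT}/{filename}"


# Placeholder credentials and address, for illustration only
url = usb_file_url("admin", "admin", "10.0.0.1", "test_logfile")
```

The resulting string can be pasted wherever a command expects the path in this format.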
Examples:
To fetch an image from a USB device, run the command:
switch (config) # image fetch scp://username:password@hostname/var/mnt/usb1/<image filename>
To save a log file (my-logfile) to a USB device under the name “test_logfile” using the “logging files” command, run:
switch (config) # logging files upload my-logfile scp://username:password@hostname/var/mnt/usb1/test_logfile
To safely remove the USB device and flush the cache after writing to it (log files, for example), use the “usb eject” command:
switch (config) # usb eject
The unit identification (UID) LED is a hardware feature used as a means of locating a specific switch system in a server room.
To activate the UID LED on a switch system, run:
switch (config) # led MGMT uid on
To verify the LED status, run:
switch (config) # show leds
Module   LED   Status
--------------------------------------------------------------------------
MGMT     UID   Blue
To deactivate the UID LED on a switch system, run:
switch (config) # led MGMT uid off
To reboot your switch system, run:
switch (config) # reload
NVIDIA Onyx supports viewing all active events on the system. The following events may be observed with the “show system hardware events” command.
Event Name |
Description |
Ethernet Family |
|
Invalid Mac (SMAC=MC) |
Source MAC is a multicast address |
Invalid Mac (SMAC=DMAC) |
Source MAC is the same as the destination MAC address |
Invalid Ethertype |
Packet has an unknown Ethertype (0x05DC < ethertype < 0x600) |
IP Routing Family |
|
Ingress Router interface is disabled |
Ingress packet has been dropped because incoming L3 interface is admin down |
Mismatched IP (UC DIP over MC/BC Mac) |
Packet MAC is multicast/broadcast but destination IP is unicast |
Invalid IP (DIP=loopback) |
Destination IP is a loopback IP (for IPv6: DIP==::1/128 or DIP==0:0:0:0:0:ffff:7f00:0/104; for IPv4: DIP==127.0.0.0/8) |
Invalid IP (SIP=MC) |
Source IP is a multicast address (for IPv6: SIP==FF00::/8; for IPv4: SIP==224.0.0.0 to 239.255.255.255, that is, 224.0.0.0/4) |
Invalid IP (SIP=unspecified) |
Source IP is unspecified |
Invalid IP (SIP=DIP) |
Source IP is identical to destination IP |
Mismatched MC Mac |
Packet’s multicast MAC does not correspond to packet’s MC IP address |
IPv6 neighbor not resolved |
IPv6 neighbor not resolved |
Invalid IPv6 (SIP=Link Local) |
Source IP is link local (IPv6) |
MC RPF check failure |
Multicast RPF check failure |
TTL expired |
TTL value is zero |
Egress Router interface is disabled |
Egress packet has been dropped because the outgoing L3 interface is admin or oper down |
IPv4 neighbor not resolved |
Entry not found for destination |
Tunnel Family |
|
NVE Decap fragmentation error |
Fragmentation error during decapsulation |
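Several of the conditions in the table above are defined by address ranges, so they can be reproduced offline when analyzing a packet capture. The Python sketch below uses the standard `ipaddress` module to apply the loopback, multicast, unspecified, SIP=DIP, and IPv6 link-local rules, plus the Ethertype and multicast-MAC checks. The function names are hypothetical, the returned strings simply reuse the event names from the table, and the sketch makes no claim about how or in what order the switch hardware evaluates these conditions.

```python
import ipaddress

# Ranges copied from the table above
LOOPBACK_NETS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("::1/128"),
    ipaddress.ip_network("::ffff:7f00:0/104"),
]
MULTICAST_NETS = [
    ipaddress.ip_network("224.0.0.0/4"),
    ipaddress.ip_network("ff00::/8"),
]


def _in_any(addr, nets):
    # Compare only against networks of the same IP version
    return any(addr.version == n.version and addr in n for n in nets)


def is_multicast_mac(mac):
    """True when the group bit (LSB of the first octet) is set."""
    return bool(int(mac.split(":")[0], 16) & 0x01)


def is_invalid_ethertype(ethertype):
    """True for values in the range the table lists as unknown."""
    return 0x05DC < ethertype < 0x0600


def classify_ip_pair(sip, dip):
    """Return the event names from the table that match this SIP/DIP pair."""
    src = ipaddress.ip_address(sip)
    dst = ipaddress.ip_address(dip)
    events = []
    if _in_any(dst, LOOPBACK_NETS):
        events.append("Invalid IP (DIP=loopback)")
    if _in_any(src, MULTICAST_NETS):
        events.append("Invalid IP (SIP=MC)")
    if src.is_unspecified:
        events.append("Invalid IP (SIP=unspecified)")
    if src == dst:
        events.append("Invalid IP (SIP=DIP)")
    if src.version == 6 and src.is_link_local:
        events.append("Invalid IPv6 (SIP=Link Local)")
    return events
```

For instance, a packet with source 224.0.0.5 and destination 10.0.0.1 matches only the “Invalid IP (SIP=MC)” rule in this sketch.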