Appendix – Supported Port Counters and Events
Port counters and events are available in the following views:
Events and Port Counters area, at the bottom of the UFM window
Error window (Error tab) in the Manage Devices tab
New Monitoring Session window in the Monitor tab (click Create New Session)
Event Log in the Log tab (click Show Event Log)
The following tables list and describe the port counters and events currently supported:
InfiniBand Port Counters
Calculated Port Counters
InfiniBand Port Counters | |
Counter | Description |
Xmit Data (in bytes) | Total number of data octets, divided by 4, transmitted on all VLs from the port, including all octets between (and not including) the start of packet delimiter and the VCRC, and may include packets containing errors. All link packets are excluded. Results are reported as a multiple of four octets. |
Rcv Data (in bytes) | Total number of data octets, divided by 4, received on all VLs at the port. All octets between (and not including) the start of packet delimiter and the VCRC are included, and the count may include packets containing errors. All link packets are excluded. When the received packet length exceeds the maximum allowed packet length specified in C7-45, the counter may include all data octets exceeding this limit. Results are reported as a multiple of four octets. |
Xmit Packets | Total number of packets transmitted on all VLs from the port, including packets with errors and excluding link packets. |
Rcv Packets | Total number of packets, including packets containing errors and excluding link packets, received from all VLs on the port. |
Rcv Errors | Total number of packets containing errors that were received on the port including:
|
Xmit Discards | Total number of outbound packets discarded by the port when the port is down or congested for the following reasons:
|
Symbol Errors | Total number of minor link errors detected on one or more physical lanes. |
Link Error Recovery | Total number of times the Port Training state machine has successfully completed the link error recovery process. |
Link Error Downed | Total number of times the Port Training state machine has failed the link error recovery process and downed the link. |
Local Integrity Error | The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors. |
Rcv Remote Physical Error | Total number of packets marked with the EBP delimiter received on the port. |
Xmit Constraint Error | Total number of packets not transmitted from the switch physical port for the following reasons:
|
Rcv Constraint Error | Total number of packets received on the switch physical port that are discarded for the following reasons:
|
Excess Buffer Overrun Error | The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error |
Rcv Switch Relay Error | Total number of packets received on the port that were discarded when they could not be forwarded by the switch relay for the following reasons:
|
VL15 Dropped | Number of incoming VL15 packets dropped because of resource limitations (e.g., lack of buffers) in the port |
XmitWait | The number of ticks during which the port selected by PortSelect had data to transmit but no data was sent during the entire tick because of insufficient credits or lack of arbitration. |
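Because Xmit Data and Rcv Data are described as octet counts divided by 4, a raw counter value needs to be multiplied by 4 to obtain bytes. The following is a minimal sketch, assuming the value at hand is the raw counter and has not already been converted by UFM. The function name `data_counter_to_bytes` is hypothetical and is not part of any UFM API.

```python
OCTETS_PER_COUNT = 4  # per the table: data octets divided by 4


def data_counter_to_bytes(raw_count: int) -> int:
    """Convert a raw Xmit Data / Rcv Data count to bytes.

    Assumption: raw_count is the unconverted counter value.
    """
    if raw_count < 0:
        raise ValueError("counter values cannot be negative")
    return raw_count * OCTETS_PER_COUNT


print(data_counter_to_bytes(250_000))  # 1000000
```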
InfiniBand Calculated Port Counters | |
Counter | Description |
Normalized XmitData | Effective port bandwidth utilization in %. Formula: XmitData incremental / Link Capacity |
Normalized Congested Bandwidth | Amount of bandwidth that was suppressed due to congestion. Formula: (XmitWait incremental / Time) * Link Capacity. Separate counters are used for Tier 4 ports and for the rest of the ports. |
Normalized XmitWait | Congestion in relation to packets transmitted over the link. Formula: XmitWait incremental / XmitPackets incremental. This event is calculated only for the port directly connected to receiving hosts. Separate counters are used for Tier 4 ports and for the rest of the ports. |
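The three calculated counters can be illustrated with a short sketch. This is not UFM's implementation. All function and parameter names are hypothetical, and the units are assumptions: Link Capacity is taken to be the maximum amount of data the link can carry during the sampling interval (in the same unit as the XmitData increment), and Time is taken to be the sampling interval expressed in the same tick unit as XmitWait.

```python
def normalized_xmit_data(xmit_data_delta: float, link_capacity: float) -> float:
    """Bandwidth utilization in percent: XmitData incremental / Link Capacity."""
    return 100.0 * xmit_data_delta / link_capacity


def normalized_congested_bandwidth(
    xmit_wait_delta: float, elapsed_ticks: float, link_capacity: float
) -> float:
    """(XmitWait incremental / Time) * Link Capacity."""
    return (xmit_wait_delta / elapsed_ticks) * link_capacity


def normalized_xmit_wait(xmit_wait_delta: float, xmit_packets_delta: float) -> float:
    """XmitWait incremental / XmitPackets incremental.

    Assumption: report 0.0 when no packets were sent in the interval.
    """
    if xmit_packets_delta == 0:
        return 0.0
    return xmit_wait_delta / xmit_packets_delta


print(normalized_xmit_data(2_500_000_000, 10_000_000_000))  # 25.0
print(normalized_congested_bandwidth(1000, 4000, 100.0))    # 25.0
print(normalized_xmit_wait(1000, 500))                      # 2.0
```

Each "incremental" value is the difference between two consecutive readings of the corresponding port counter.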
Device events are listed as VDM or CDM in the Source column of the Events table in the UFM GUI. For information about defining event policy, see Configuring Event Management.
Alarm ID | Alarm Name | To Log | Alarm | Default Severity | Default Threshold | Default TTL | Related Object | Category | Description/Message |
116 | Port Xmit Discards | 1 | 1 | Minor | 200 | 300 | Port | Communication Error | Total number of outbound packets discarded by the port when the port is down or congested. Reasons include:
|
117 | Port Xmit Constraint Errors | 1 | 1 | Minor | 200 | 300 | Port | Communication Error | Total number of packets not transmitted from the switch physical port for the following reasons:
|
120 | Excessive Buffer Overrun Errors | 1 | 1 | Minor | 100 | 300 | Port | Communication Error | The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error. Message: ExcessiveBufferOverrunErrors counter threshold exceeded. Threshold is %d, received value is %d. |
121 | VL15 Dropped | 1 | 1 | Minor | 50 | 300 | Port | Communication Error | Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port. Message: VL15Dropped counter threshold exceeded. Threshold is %d, received value is %d. |
118 | Port Receive Constraint Errors | 1 | 1 | Minor | 200 | 300 | Port | Communication Error | Total number of packets received on the switch physical port that are discarded for the following reasons:
|
145 | System Image GUID changed | 1 | 0 | Info | 1 | 300 | Port | Communication Error | System GUID is changed for the specific LID |
115 | Port Receive Switch Relay Errors | 1 | 1 | Minor | 9999 | 300 | Port | Fabric Configuration | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay. Reasons for this include:
|
256 | Bad M_Key | 1 | 0 | Minor | 1 | 300 | Port | Fabric Configuration | Found bad M_Key. Check your HCA driver or partition settings. SM Trap. Management Key (M_Key): Enforces the control of a master subnet manager. Administered by the subnet manager and used in certain subnet management packets. Message: Bad M_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d) |
257 | Bad P_Key | 1 | 0 | Minor | 1 | 300 | Port | Fabric Configuration | Found a bad P_Key. Check your partitioning settings. SM Trap. Partition Key (P_Key): Enforces membership. Administered through the subnet manager by the partition manager (PM). Message: Bad P_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d) |
258 | Bad Q_Key | 1 | 0 | Minor | 1 | 300 | Port | Fabric Configuration | Found bad Q_Key. Security error. SM Trap. Queue Key (Q_Key): Enforces access rights for reliable and unreliable datagram service (RAW datagram service type not included). Message: Bad Q_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d) |
259 | Bad P_Key Switch External Port | 1 | 0 | Critical | 1 | 300 | Port | Fabric Configuration | Found a bad P_Key. Check your partitioning settings. SM Trap. Partition Key (P_Key): Enforces membership. Administered through the subnet manager by the partition manager (PM). Message: Bad P_Key switch external port: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d) |
64 | GID Address In Service | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | New GID is connected to the Fabric |
65 | GID Address Out of Service | 1 | 0 | Warning | 1 | 300 | Port | Fabric Notification | Existing GID is disconnected from the Fabric |
66 | New MCast Group Created | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | New Multicast Group is created in SM |
67 | MCast Group Deleted | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | Multicast Group is removed from SM. |
328 | Link is Up | 1 | 0 | Info | 1 | 0 | Link | Fabric Topology | Event is sent upon discovery of a new link |
329 | Link is Down | 1 | 0 | Warning | 1 | 0 | Link | Fabric Topology | Event is sent when an existing link is removed |
144 | Capability Mask Modified | 0 | 0 | Info | 1 | 300 | Port | Fabric Notification | Capability Mask of the specific LID is modified |
602 | UFM Server Failover | 1 | 1 | Critical | 1 | 0 | Site | Fabric Notification | Failover in UFM Server (in HA mode) |
391 | Switch Module Removed | 1 | 0 | Info | 1 | 0 | Switch | Fabric Notification | Module (line card, FAN or PS) is removed from the switch |
331 | Node is Down | 1 | 0 | Warning | 1 | 0 | Site | Fabric Topology | Node is disconnected or down |
332 | Node is Up | 1 | 0 | Info | 1 | 300 | Site | Fabric Topology | Node is connected or up |
907 | Switch is Down | 1 | 1 | Critical | 1 | 0 | Site | Fabric Topology | Switch is disconnected or down |
908 | Switch is Up | 1 | 1 | Info | 1 | 300 | Site | Fabric Topology | Switch is connected or up |
370 | Gateway Ethernet Link State Changed | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | Gateway Ethernet Physical link has changed state |
371 | Gateway Reregister Event Received | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | 10GbE Gateway received a re-register event from the SM. |
372 | Number of Gateways Changed | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | Change in the number of 10GbE Gateways has been detected |
373 | Gateway will be Rebooted | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | 10GbE Gateway is about to reboot |
374 | Gateway Reloading Finished | 1 | 0 | Info | 1 | 0 | Gateway | Gateway | 10GbE Gateway has finished reloading. |
110 | Symbol Error | 1 | 1 | Warning | 200 | 300 | Port | Hardware | Total number of minor link errors detected on one or more physical lanes |
111 | Link Error Recovery | 1 | 1 | Minor | 1 | 300 | Port | Hardware | Total number of times the Port Training state machine has successfully completed the link error recovery process |
112 | Link Downed | 1 | 1 | Critical | 1 | 300 | Port | Hardware | Total number of times the Port Training state machine has failed the link error recovery process and downed the link. |
113 | Port Receive Errors | 1 | 1 | Minor | 5 | 300 | Port | Hardware | Total number of packets containing an error that were received on a port. These errors include:
|
114 | Port Receive Remote Physical Errors | 1 | 0 | Minor | 5 | 300 | Port | Hardware | Total number of packets marked with the EBP delimiter received on the port |
119 | Local Link Integrity Errors | 1 | 1 | Minor | 5 | 300 | Port | Hardware | The number of times that the frequency of packets containing local physical errors has exceeded LocalPhyErrors. Message: LocalLinkIntegrityErrors counter threshold exceeded. Threshold is %d, received value is %d |
122 | Congested Bandwidth (%) Threshold Reached | 1 | 1 | Minor | 10 | 300 | Port | Hardware | Percent of Congested Bandwidth has exceeded defined threshold. Note: a different threshold can be set specifically for Tier 4 ports. |
131 | Non-optimal link width (1X instead of 4X) | 1 | 1 | Minor | 1 | 0 | Port | Hardware | 4X link operates as 1X link |
132 | Non-optimal link width (1X or 4X instead of 12X) | 1 | 1 | Minor | 1 | 0 | Port | Hardware | 12X link operates as a 4X or 1X link |
701 | Non-optimal Link Speed | 1 | 1 | Minor | 1 | 0 | Port | Hardware | Link operates below its supported speed: a DDR link operates as SDR; a QDR link operates as DDR or SDR; an FDR link operates as QDR, DDR, or SDR; or an EDR link operates as FDR, QDR, DDR, or SDR |
140 | Excessive Buffer Overrun Threshold Reached | 1 | 0 | Minor | 1 | 300 | Port | Hardware | SM Trap. This error is detected when the number of consecutive flow control update periods with at least one overrun error in each period exceeds the OverrunErrors threshold given in the PortInfo attribute. Message: Excessive Buffer Overrun Threshold is reached: lid %(lid)d, port #%(portn)d |
141 | Flow Control Update Watchdog Timer Expired | 1 | 0 | Warning | 1 | 300 | Port | Hardware | SM Trap. The error indicates a failure of the flow control machine at the other end of the link. If the timer expires without receiving an update, a flow control update error has occurred. Message: Flow Control Update watchdog timer has expired: lid %(lid)d, port #%(portn)d |
392 | Module Temperature Threshold Reached | 1 | 0 | Info | 40 | 0 | Module | Hardware | Temperature detected by module sensor is too high and has exceeded the defined threshold. |
350 | Environment Added | 1 | 0 | Info | 1 | 0 | Env | Logical Model | New Logical Environment is created |
351 | Environment Removed | 1 | 0 | Info | 1 | 0 | Env | Logical Model | Logical Environment is deleted |
306 | Logical Server Added | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | New Logical Server or Logical Servers Group is created |
307 | Logical Server Removed | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | Logical Server or Logical Servers Group is deleted |
352 | Network Added | 1 | 0 | Info | 1 | 0 | Network | Logical Model | New Network is created |
353 | Network Removed | 1 | 0 | Info | 1 | 0 | Network | Logical Model | Network is deleted |
340 | Network Interface Added | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | New Network Interface is created |
341 | Network Interface Removed | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | Network Interface is deleted |
313 | Compute Resource Allocated | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | A resource is allocated to the Logical Server |
312 | Compute Resource Released | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | A resource is released from the Logical Server |
317 | Logical Server Compute Resource is Up | 1 | 1 | Warning | 1 | 0 | Logical Server | Logical Model | An allocated resource is Up or Connected back |
316 | Logical Server Compute Resource is Down | 1 | 1 | Critical | 1 | 0 | Logical Server | Logical Model | An allocated resource is Down or Disconnected |
301 | Logical Server State Changed | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | Logical Server state is changed |
302 | Logical Server State Change Failed | 1 | 0 | Minor | 1 | 0 | Logical Server | Logical Model | Logical Server has failed to change the state. RM (Resource Manager) Event. Indicates error in Logical Server state change. This error might be caused by any error condition related to the Logical Server resources allocation. Message: Logical Server changed state from %s to %s |
308 | Logical Server Resources Allocated | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | New resources are allocated to the Logical Server |
314 | Logical Server Additional Resources Allocated | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | Additional resources are allocated to the Logical Server |
315 | Logical Server Resources Released | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | Resources were released from the Logical Server |
336 | Port Action Succeeded | 1 | 0 | Info | 1 | 0 | Port | Maintenance | Port Management Action (reset, disable) succeeded |
337 | Port Action Failed | 1 | 0 | Minor | 1 | 0 | Port | Maintenance | Port Management Action (reset, disable) failed |
338 | Device Action Succeeded | 1 | 0 | Info | 1 | 0 | Port | Maintenance | Device Management Action succeeded |
339 | Device Action Failed | 1 | 0 | Minor | 1 | 0 | Port | Maintenance | Device Management Action failed |
385 | Switch FW Upgrade Started | 1 | 0 | Info | 1 | 0 | Switch | Maintenance | Switch FW Upgrade process has started |
386 | Switch SW Upgrade Started | 1 | 0 | Info | 1 | 0 | Switch | Maintenance | Switch SW Upgrade process has started |
381 | Switch Upgrade Failed | 1 | 0 | Info | 1 | 0 | Switch | Maintenance | Switch SW or FW Upgrade process failed |
388 | Host FW Upgrade Started | 1 | 0 | Info | 1 | 0 | Computer | Maintenance | Host FW Upgrade process has started |
389 | Host SW Upgrade Started | 1 | 0 | Info | 1 | 0 | Computer | Maintenance | Host SW Upgrade process has started |
383 | Host Upgrade Failed | 1 | 0 | Info | 1 | 0 | Computer | Maintenance | Host SW or FW Upgrade process failed |
502 | Device Upgrade Finished | 1 | 0 | Info | 1 | 300 | Device | Maintenance | Device SW or FW Upgrade has finished |
909 | Director Switch is Down | 1 | 1 | Critical | 1 | 300 | Site | Fabric Topology | Director Switch is disconnected or down |
910 | Director Switch is Up | 1 | 1 | Info | 1 | 0 | Site | Fabric Topology | Director Switch is connected or up |
911 | Module Temperature Low Threshold Reached | 1 | 1 | Warning | 60 | 300 | Module | Hardware | Temperature detected by module sensor has exceeded the low threshold |
912 | Module Temperature High Threshold Reached | 1 | 1 | Critical | 60 | 300 | Module | Hardware | Temperature detected by module sensor is too high and has exceeded the high threshold |
913 | Module High Voltage | 1 | 1 | Warning | 10 | 420 | Switch | Module Status | Sensor Voltage Threshold Exceeded |
914 | Module High Current | 1 | 1 | Warning | 10 | 420 | Switch | Module Status | Sensor Current Threshold Exceeded |
394 | Module Status FAULT | 1 | 1 | Critical | 1 | 420 | Switch | Module Status | Module Status FAULT |
545 | SM is not responding | 1 | 1 | Critical | 1 | 300 | Grid | Maintenance | SM is not responding |
915 | BER_ERROR | 1 | 1 | Critical | 1e-8 | 420 | Port | Hardware | Effective BER Error on port exceeded the threshold |
916 | BER_WARNING | 1 | 1 | Warning | 1e-13 | 420 | Port | Hardware | Effective BER Warning on port exceeded the threshold |
1300 | SM_SAKEY_VIOLATION | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | "SA Key Violation Committed" |
1301 | SM_SGID_SPOOFED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | "SGID spoofed by VPort/port" |
1302 | SM_RATE_LIMIT_EXCEEDED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | "Rate Limit Exceeded" |
1303 | SM_MULTICAST_GROUPS_LIMIT_EXCEEDED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | "Multicast Groups Limit Exceeded" |
1304 | SM_SERVICES_LIMIT_EXCEEDED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | "Services, Limit Exceeded" |
1305 | SM_EVENT_SUBSCRIPTION_LIMIT_EXCEEDED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | "Event Subscription Limit Exceeded" |
1500 | New cable detected | 1 | 0 | Info | 1 | 0 | Link | Hardware | "New cable was detected" |
1502 | Cable detected in a new location | 1 | 0 | Warning | 1 | 0 | Link | Hardware | "Cable detected in a new location" |
1503 | Duplicate Cable Detected | 1 | 0 | Critical | 1 | 0 | Link | Hardware | "Duplicate cable S/N" |
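The message templates in the events table use placeholders such as `%d` and `%(lid)d`, which resemble Python printf-style formatting. The sketch below shows how such templates could be filled in. It is an illustration only: whether UFM renders its messages this way is an assumption, the function name `render_counter_alarm` is hypothetical, and the strict greater-than comparison against the threshold is also an assumption.

```python
# Templates copied from the Description/Message column of the events table.
COUNTER_TEMPLATE = (
    "VL15Dropped counter threshold exceeded. "
    "Threshold is %d, received value is %d."
)
TRAP_TEMPLATE = (
    "Excessive Buffer Overrun Threshold is reached: "
    "lid %(lid)d, port #%(portn)d"
)


def render_counter_alarm(value, threshold):
    """Return the formatted message if value exceeds threshold, else None.

    Assumption: an alarm is raised only when value > threshold.
    """
    if value > threshold:
        return COUNTER_TEMPLATE % (threshold, value)
    return None


# Event 121 (VL15 Dropped) has a default threshold of 50.
print(render_counter_alarm(75, 50))
# Event 140 is an SM trap identified by LID and port number.
print(TRAP_TEMPLATE % {"lid": 12, "portn": 3})
```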