Appendix: Supported Port Counters and Events
Port counters and events are available in the following views:
Events and Port Counters area, at the bottom of the UFM window
Error window (Error tab) in the Manage Devices tab
New Monitoring Session window, opened by clicking Create New Session in the Monitor tab
Event Log in the Log tab (click Show Event Log)
The following tables list and describe the port counters and events currently supported:
InfiniBand Port Counters
Calculated Port Counters
Table 57: InfiniBand Port Counters
Counter
Description
Xmit Data (in bytes)
Total number of data octets, divided by 4, transmitted on all VLs from the port, including all octets between (and not including) the start of packet delimiter and the VCRC, and may include packets containing errors. All link packets are excluded. Results are reported as a multiple of four octets.
Rcv Data (in bytes)
Total number of data octets, divided by 4, received on all VLs at the port.
All octets between (and not including) the start of packet delimiter and the VCRC are included, and the count may include packets containing errors.
All link packets are excluded. When the received packet length exceeds the maximum allowed packet length specified in C7-45, the counter may include all data octets exceeding this limit. Results are reported as a multiple of four octets.
Xmit Packets
Total number of packets transmitted on all VLs from the port, including packets with errors and excluding link packets.
Rcv Packets
Total number of packets, including packets containing errors and excluding link packets, received from all VLs on the port.
Rcv Errors
Total number of packets containing errors that were received on the port including:
Xmit Discards
Total number of outbound packets discarded by the port when the port is down or congested for the following reasons:
Symbol Errors
Total number of minor link errors detected on one or more physical lanes.
Link Error Recovery
Total number of times the Port Training state machine has successfully completed the link error recovery process.
Link Error Downed
Total number of times the Port Training state machine has failed the link error recovery process and downed the link.
Local Integrity Error
The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors.
Rcv Remote Physical Error
Total number of packets marked with the EBP delimiter received on the port.
Xmit Constraint Error
Total number of packets not transmitted from the switch physical port for the following reasons:
Rcv Constraint Error
Total number of packets received on the switch physical port that are discarded for the following reasons:
Excess Buffer Overrun Error
The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error
Rcv Switch Relay Error
Total number of packets received on the port that were discarded when they could not be forwarded by the switch relay for the following reasons:
VL15 Dropped
Number of incoming VL15 packets dropped because of resource limitations (e.g., lack of buffers) in the port
XmitWait
The number of ticks during which the port selected by PortSelect had data to transmit but no data was sent during the entire tick because of insufficient credits or lack of arbitration.
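The Xmit Data and Rcv Data descriptions above define the counters as data octets divided by 4. The following is a minimal sketch of converting such a raw value back to octets and deriving an average rate between two samples. It assumes you are working with raw counter values as defined in the Description column, not with values that have already been converted to bytes for display. The function names and the sampling interval parameter are illustrative and are not part of UFM.

```python
OCTETS_PER_COUNTER_UNIT = 4  # the counters report data octets divided by 4


def counter_to_octets(raw_value: int) -> int:
    """Convert a raw Xmit Data / Rcv Data counter value to data octets."""
    if raw_value < 0:
        raise ValueError("counter values are non-negative")
    return raw_value * OCTETS_PER_COUNTER_UNIT


def octets_per_second(previous: int, current: int, interval_s: float) -> float:
    """Average data rate between two raw samples taken interval_s apart.

    Assumption: the counter did not reset or saturate between the samples.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    if current < previous:
        raise ValueError("counter went backwards; it may have been reset")
    return counter_to_octets(current - previous) / interval_s


if __name__ == "__main__":
    print(counter_to_octets(250))
    print(octets_per_second(100, 350, 5.0))
```

Link packets are excluded from these counters, so a rate derived this way reflects data packets only.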
Table 58: InfiniBand Calculated Port Counters
Counter
Description
Normalized XmitData
Effective port bandwidth utilization in %
XmitData incremental / Link Capacity
Normalized Congested Bandwidth
Amount of bandwidth that was suppressed due to congestion
(XmitWait incremental / Time) * Link Capacity
Separate counters are used for Tier 4 ports and for the rest of the ports.
Normalized XmitWait
Congestion in relation to packets transmitted over the link
XmitWait incremental / XmitPackets incremental
This counter is calculated only for ports directly connected to receiving hosts.
Separate counters are used for Tier 4 ports and for the rest of the ports.
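The formulas in Table 58 can be sketched in a few lines of code. The table does not state units, so this sketch makes the following assumptions: "incremental" means the difference between two consecutive samples, the XmitData increment is already expressed in bytes, Link Capacity is in bytes per second for the utilization formula, and Time is expressed in the same tick units as XmitWait. All function and parameter names are illustrative and are not part of UFM.

```python
def normalized_xmit_data(xmit_data_delta_bytes: float,
                         interval_s: float,
                         link_capacity_bytes_per_s: float) -> float:
    """Effective port bandwidth utilization in percent.

    Assumption: the increment is divided by the sampling interval before
    it is compared with the link capacity.
    """
    return 100.0 * xmit_data_delta_bytes / (interval_s * link_capacity_bytes_per_s)


def normalized_congested_bandwidth(xmit_wait_delta_ticks: float,
                                   interval_ticks: float,
                                   link_capacity: float) -> float:
    """(XmitWait incremental / Time) * Link Capacity.

    The result is in the same unit as link_capacity.
    """
    return (xmit_wait_delta_ticks / interval_ticks) * link_capacity


def normalized_xmit_wait(xmit_wait_delta: float,
                         xmit_packets_delta: float) -> float:
    """XmitWait incremental / XmitPackets incremental.

    Returns 0.0 when no packets were sent in the interval; this is a
    choice made for the sketch, not documented behavior.
    """
    if xmit_packets_delta == 0:
        return 0.0
    return xmit_wait_delta / xmit_packets_delta


if __name__ == "__main__":
    print(normalized_xmit_data(500, 10, 100))
    print(normalized_congested_bandwidth(250, 1000, 40.0))
    print(normalized_xmit_wait(300, 100))
```

Because separate counters are kept for Tier 4 ports, a caller would typically compute these values per port group rather than across the whole fabric.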
Device events are listed as VDM or CDM in the Source column of the Events table in the UFM GUI. For information about defining event policy, see Configuring Event Management.
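Each row of the events table that follows carries ten fields in the order given by the column headings. As a reading aid, the sketch below models one row as a record and shows how the message templates in the Description/Message column could be filled in. It is an illustration only: the class and function names are hypothetical, the interpretation of To Log and Alarm as on/off flags is an assumption, "exceeded" is assumed to mean strictly greater than the threshold, and the templates are treated as printf-style strings because that is how they appear in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EventPolicy:
    """One row of the events table, in column order."""
    alarm_id: int
    name: str
    to_log: bool         # "To Log" column: 1 or 0
    alarm: bool          # "Alarm" column: 1 or 0
    severity: str        # "Default Severity" column
    threshold: float     # "Default Threshold" column
    ttl: int             # "Default TTL" column (units are not stated in the table)
    related_object: str
    category: str
    message: str         # template from the Description/Message column


# Values copied from the "VL15 Dropped" row (alarm ID 121).
VL15_DROPPED = EventPolicy(
    121, "VL15 Dropped", True, True, "Minor", 50, 300, "Port",
    "Communication Error",
    "VL15Dropped counter threshold exceeded. "
    "Threshold is %d, received value is %d.",
)


def threshold_message(policy: EventPolicy, value: int) -> Optional[str]:
    """Return the rendered message if value exceeds the threshold, else None."""
    if value > policy.threshold:
        return policy.message % (policy.threshold, value)
    return None


def render_trap(template: str, **fields: int) -> str:
    """Fill a template that uses named placeholders such as %(lid)d."""
    return template % fields


if __name__ == "__main__":
    print(threshold_message(VL15_DROPPED, 73))
    print(render_trap(
        "Excessive Buffer Overrun Threshold is reached: "
        "lid %(lid)d, port #%(portn)d",
        lid=12, portn=3,
    ))
```

The LID and port number used in the usage lines are made-up sample values.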
Alarm ID
Alarm Name
To Log
Alarm
Default Severity
Default Threshold
Default TTL
Related Object
Category
Description/Message
116
Port Xmit Discards
1
1
Minor
200
300
Port
Communication Error
Total number of outbound packets discarded by the port when the port is down or congested. Reasons include:
117
Port Xmit Constraint Errors
1
1
Minor
200
300
Port
Communication Error
Total number of packets not transmitted from the switch physical port for the following reasons:
120
Excessive Buffer Overrun Errors
1
1
Minor
100
300
Port
Communication Error
The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error.
Message: ExcessiveBufferOverrunErrors counter threshold exceeded. Threshold is %d, received value is %d.
121
VL15 Dropped
1
1
Minor
50
300
Port
Communication Error
Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port.
Message: VL15Dropped counter threshold exceeded. Threshold is %d, received value is %d.
118
Port Receive Constraint Errors
1
1
Minor
200
300
Port
Communication Error
Total number of packets received on the switch physical port that are discarded for the following reasons:
145
System Image GUID changed
1
0
Info
1
300
Port
Communication Error
System GUID is changed for the specific LID
115
Port Receive Switch Relay Errors
1
1
Minor
9999
300
Port
Fabric Configuration
Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay.
Reasons for this include:
256
Bad M_Key
1
0
Minor
1
300
Port
Fabric Configuration
Found bad M_Key. Check your HCA driver or partition settings.
SM Trap. Management Key (M_Key): Enforces the control of a master subnet manager. Administered by the subnet manager and used in certain subnet management packets.
Message: Bad M_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)
257
Bad P_Key
1
0
Minor
1
300
Port
Fabric Configuration
Found a bad P_Key. Check your partitioning settings.
SM Trap. Partition Key (P_Key): Enforces membership. Administered through the subnet manager by the partition manager (PM).
Message: Bad P_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)
258
Bad Q_Key
1
0
Minor
1
300
Port
Fabric Configuration
Found bad Q_Key. Security error.
SM Trap. Queue Key (Q_Key): Enforces access rights for reliable and unreliable datagram service (RAW datagram service type not included).
Message: Bad Q_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)
259
Bad P_Key Switch External Port
1
0
Critical
1
300
Port
Fabric Configuration
Found a bad P_Key. Check your partitioning settings.
SM Trap. Partition Key (P_Key): Enforces membership. Administered through the subnet manager by the partition manager (PM).
Message: Bad P_Key switch external port: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)
64
GID Address In Service
1
0
Info
1
300
Port
Fabric Notification
New GID is connected to the Fabric
65
GID Address Out of Service
1
0
Warning
1
300
Port
Fabric Notification
Existing GID is disconnected from the Fabric
66
New MCast Group Created
1
0
Info
1
300
Port
Fabric Notification
New Multicast Group is created in SM
67
MCast Group Deleted
1
0
Info
1
300
Port
Fabric Notification
Multicast Group is removed from SM.
328
Link is Up
1
0
Info
1
0
Link
Fabric Topology
Event is sent upon discovery of a new link
329
Link is Down
1
0
Warning
1
0
Link
Fabric Topology
Event is sent when an existing link is removed
144
Capability Mask Modified
0
0
Info
1
300
Port
Fabric Notification
Capability Mask of the specific LID is modified
602
UFM Server Failover
1
1
Critical
1
0
Site
Fabric Notification
Failover in UFM Server (in HA mode)
391
Switch Module Removed
1
0
Info
1
0
Switch
Fabric Notification
Module (line card, FAN or PS) is removed from the switch
331
Node is Down
1
0
Warning
1
0
Site
Fabric Topology
Node is disconnected or down
332
Node is Up
1
0
Info
1
300
Site
Fabric Topology
Node is connected or up
907
Switch is Down
1
1
Critical
1
0
Site
Fabric Topology
Switch is disconnected or down
908
Switch is Up
1
1
Info
1
300
Site
Fabric Topology
Switch is connected or up
370
Gateway Ethernet Link State Changed
1
0
Warning
1
0
Gateway
Gateway
Gateway Ethernet Physical link has changed state
371
Gateway Re-register Event Received
1
0
Warning
1
0
Gateway
Gateway
10GbE Gateway received a re-register event from the SM.
372
Number of Gateways is Changed
1
0
Warning
1
0
Gateway
Gateway
Change in the number of 10GbE Gateways has been detected
373
Gateway will be Rebooted
1
0
Warning
1
0
Gateway
Gateway
10GbE Gateway is about to reboot
374
Gateway Reloading Finished
1
0
Info
1
0
Gateway
Gateway
10GbE Gateway has finished reloading.
110
Symbol Error
1
1
Warning
200
300
Port
Hardware
Total number of minor link errors detected on one or more physical lanes
111
Link Error Recovery
1
1
Minor
1
300
Port
Hardware
Total number of times the Port Training state machine has successfully completed the link error recovery process
112
Link Downed
1
1
Critical
1
300
Port
Hardware
Total number of times the Port Training state machine has failed the link error recovery process and downed the link.
113
Port Receive Errors
1
1
Minor
5
300
Port
Hardware
Total number of packets containing an error that were received on a port. These errors include:
114
Port Receive Remote Physical Errors
1
0
Minor
5
300
Port
Hardware
Total number of packets marked with the EBP delimiter received on the port
119
Local Link Integrity Errors
1
1
Minor
5
300
Port
Hardware
The number of times that the frequency of packets containing local physical errors has exceeded LocalPhyErrors.
Message: LocalLinkIntegrityErrors counter threshold exceeded. Threshold is %d, received value is %d
122
Congested Bandwidth (%) Threshold Reached
1
1
Minor
10
300
Port
Hardware
Percent of Congested Bandwidth has exceeded the defined threshold.
Note: a different threshold can be set specifically for Tier 4 ports.
131
Non-optimal link width (1X instead of 4X)
1
1
Minor
1
0
Port
Hardware
4X link operates as 1X link
132
Non-optimal link width (1X or 4X instead of 12X)
1
1
Minor
1
0
Port
Hardware
12X link operates as 4X or 1X link
701
Non-optimal Link Speed
1
1
Minor
1
0
Port
Hardware
DDR link operates as SDR
or
QDR link operates as DDR or SDR
or
EDR link operates as FDR, QDR, DDR or SDR
or
FDR link operates as QDR, DDR or SDR
140
Excessive Buffer Overrun Threshold Reached
1
0
Minor
1
300
Port
Hardware
SM Trap. This error is detected when the number of consecutive flow control update periods with at least one overrun error in each period exceeds the OverrunErrors threshold given in the PortInfo attribute.
Message: Excessive Buffer Overrun Threshold is reached: lid %(lid)d, port #%(portn)d
141
Flow Control Update Watchdog Timer Expired
1
0
Warning
1
300
Port
Hardware
SM Trap. The error indicates a failure of the flow control machine at the other end of the link. If the timer expires without receiving an update, a flow control update error has occurred.
Message: Flow Control Update watchdog timer has expired: lid %(lid)d, port #%(portn)d
392
Module Temperature Threshold Reached
1
0
Info
40
0
Module
Hardware
Temperature detected by the module sensor is too high and has exceeded the defined threshold.
350
Environment Added
1
0
Info
1
0
Env
Logical Model
New Logical Environment is created
351
Environment Removed
1
0
Info
1
0
Env
Logical Model
Logical Environment is deleted
306
Logical Server Added
1
0
Info
1
0
Logical Server
Logical Model
New Logical Server or Logical Servers Group is created
307
Logical Server Removed
1
0
Info
1
0
Logical Server
Logical Model
Logical Server or Logical Servers Group is deleted
352
Network Added
1
0
Info
1
0
Network
Logical Model
New Network is created
353
Network Removed
1
0
Info
1
0
Network
Logical Model
Network is deleted
340
Network Interface Added
1
0
Info
1
0
Logical Server
Logical Model
New Network Interface is created
341
Network Interface Removed
1
0
Info
1
0
Logical Server
Logical Model
Network Interface is deleted
313
Compute Resource Allocated
1
0
Info
1
0
Logical Server
Logical Model
A resource is allocated to the Logical Server
312
Compute Resource Released
1
0
Info
1
0
Logical Server
Logical Model
A resource is released from the Logical Server
317
Logical Server Compute Resource is Up
1
1
Warning
1
0
Logical Server
Logical Model
An allocated resource is Up or connected back
316
Logical Server Compute Resource is Down
1
1
Critical
1
0
Logical Server
Logical Model
An allocated resource is Down or disconnected
301
Logical Server State Changed
1
0
Info
1
0
Logical Server
Logical Model
Logical Server state is changed
302
Logical Server State Change Failed
1
0
Minor
1
0
Logical Server
Logical Model
Logical Server has failed to change the state.
RM (Resource Manager) Event. Indicates error in Logical Server state change. This error might be caused by any error condition related to the Logical Server resources allocation.
Message: Logical Server changed state from %s to %s
308
Logical Server Resources Allocated
1
0
Info
1
0
Logical Server
Logical Model
New resources are allocated to the Logical Server
314
Logical Server Additional Resources Allocated
1
0
Info
1
0
Logical Server
Logical Model
Additional resources are allocated to the Logical Server
315
Logical Server Resources Released
1
0
Info
1
0
Logical Server
Logical Model
Resources were released from the Logical Server
336
Port Action Succeeded
1
0
Info
1
0
Port
Maintenance
Port Management Action (reset, disable) succeeded
337
Port Action Failed
1
0
Minor
1
0
Port
Maintenance
Port Management Action (reset, disable) failed
338
Device Action Succeeded
1
0
Info
1
0
Port
Maintenance
Device Management Action succeeded
339
Device Action Failed
1
0
Minor
1
0
Port
Maintenance
Device Management Action failed
385
Switch FW Upgrade Started
1
0
Info
1
0
Switch
Maintenance
Switch FW Upgrade process has started
386
Switch SW Upgrade Started
1
0
Info
1
0
Switch
Maintenance
Switch SW Upgrade process has started
381
Switch Upgrade Failed
1
0
Info
1
0
Switch
Maintenance
Switch SW or FW Upgrade process failed
388
Host FW Upgrade Started
1
0
Info
1
0
Computer
Maintenance
Host FW Upgrade process has started
389
Host SW Upgrade Started
1
0
Info
1
0
Computer
Maintenance
Host SW Upgrade process has started
383
Host Upgrade Failed
1
0
Info
1
0
Computer
Maintenance
Host SW or FW Upgrade process failed
502
Device Upgrade Finished
1
0
Info
1
300
Device
Maintenance
Device SW or FW Upgrade has finished
909
Director Switch is Down
1
1
Critical
1
300
Site
Fabric Topology
Director Switch is disconnected or down
910
Director Switch is Up
1
1
Info
1
0
Site
Fabric Topology
Director Switch is connected or up
911
Module Temperature Low Threshold Reached
1
1
Warning
60
300
Module
Hardware
Temperature detected by the module sensor is too high and has exceeded the low threshold
912
Module Temperature High Threshold Reached
1
1
Critical
60
300
Module
Hardware
Temperature detected by the module sensor is too high and has exceeded the high threshold
913
Module High Voltage
1
1
Warning
10
420
Switch
Module Status
Sensor Voltage Threshold Exceeded
914
Module High Current
1
1
Warning
10
420
Switch
Module Status
Sensor Current Threshold Exceeded
394
Module Status FAULT
1
1
Critical
1
420
Switch
Module Status
Module Status FAULT
545
SM is not responding
1
1
Critical
1
300
Grid
Maintenance
SM is not responding
915
BER_ERROR
1
1
Critical
1e-8
420
Port
Hardware
Effective BER Error on port exceeded the threshold
916
BER_WARNING
1
1
Warning
1e-13
420
Port
Hardware
Effective BER Warning on port exceeded the threshold
1300
SM_SAKEY_VIOLATION
1
1
Warning
5
300
Port
Fabric Notification
"SA Key Violation Committed"
1301
SM_SGID_SPOOFED
1
1
Warning
5
300
Port
Fabric Notification
"SGID spoofed by VPort/port"
1302
SM_RATE_LIMIT_EXCEEDED
1
1
Warning
5
300
Port
Fabric Notification
"Rate Limit Exceeded"
1303
SM_MULTICAST_GROUPS_LIMIT_EXCEEDED
1
1
Warning
5
300
Port
Fabric Notification
"Multicast Groups Limit Exceeded"
1304
SM_SERVICES_LIMIT_EXCEEDED
1
1
Warning
5
300
Port
Fabric Notification
"Services, Limit Exceeded"
1305
SM_EVENT_SUBSCRIPTION_LIMIT_EXCEEDED
1
1
Warning
5
300
Port
Fabric Notification
"Event Subscription Limit Exceeded"
1500
New cable detected
1
0
Info
1
0
Link
Hardware
"New cable was detected"
1502
Cable detected in a new location
1
0
Warning
1
0
Link
Hardware
"Cable detected in a new location"
1503
Duplicate Cable Detected
1
0
Critical
1
0
Link
Hardware
"Duplicate cable S/N"