NVIDIA UFM Enterprise User Manual v6.15.8

Appendix – Supported Port Counters and Events

Port counters and events are available in the following views:

  • Events and Port Counters area, at the bottom of the UFM window

  • Error window (Error tab) in the Manage Devices tab

  • New Monitoring Session window, opened by clicking Create New Session in the Monitor tab

  • Event Log in the Log tab (click Show Event Log)

The following tables list and describe the port counters and events currently supported:

  • InfiniBand Port Counters

  • Calculated Port Counters

InfiniBand Port Counters

Counter

Description

Xmit Data (in bytes)

Total number of data octets, divided by 4, transmitted on all VLs from the port, including all octets between (and not including) the start of packet delimiter and the VCRC, and may include packets containing errors. All link packets are excluded. Results are reported as a multiple of four octets.

Rcv Data (in bytes)

Total number of data octets, divided by 4, received on all VLs at the port.

All octets between (and not including) the start of packet delimiter and the VCRC are included, and the count may include packets containing errors.

All link packets are excluded. When the received packet length exceeds the maximum allowed packet length specified in C7-45, the counter may include all data octets exceeding this limit. Results are reported as a multiple of four octets.

Xmit Packets

Total number of packets transmitted on all VLs from the port, including packets with errors and excluding link packets.

Rcv Packets

Total number of packets, including packets containing errors and excluding link packets, received from all VLs on the port.

Rcv Errors

Total number of packets containing errors that were received on the port including:

  • Local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine)

  • Malformed data packet errors (LVer, length, VL)

  • Malformed link packet errors (operand, length, VL)

  • Packets discarded due to buffer overrun (overflow)

Xmit Discards

Total number of outbound packets discarded by the port when the port is down or congested for the following reasons:

  • Output port is not in the active state

  • Packet length has exceeded NeighborMTU

  • Switch Lifetime Limit exceeded

  • Switch HOQ Lifetime Limit exceeded, including packets discarded while in VLStalled State.

Symbol Errors

Total number of minor link errors detected on one or more physical lanes.

Link Error Recovery

Total number of times the Port Training state machine has successfully completed the link error recovery process.

Link Error Downed

Total number of times the Port Training state machine has failed the link error recovery process and downed the link.

Local Integrity Error

The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors.

Rcv Remote Physical Error

Total number of packets marked with the EBP delimiter received on the port.

Xmit Constraint Error

Total number of packets not transmitted from the switch physical port for the following reasons:

  • FilterRawOutbound is true and packet is raw

  • PartitionEnforcementOutbound is true and packet fails partition key check or IP version check

Rcv Constraint Error

Total number of packets received on the switch physical port that are discarded for the following reasons:

  • FilterRawInbound is true and packet is raw

  • PartitionEnforcementInbound is true and packet fails partition key check or IP version check

Excess Buffer Overrun Error

The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error

Rcv Switch Relay Error

Total number of packets received on the port that were discarded when they could not be forwarded by the switch relay for the following reasons:

  • DLID mapping

  • VL mapping

  • Looping (output port = input port)

VL15 Dropped

Number of incoming VL15 packets dropped because of resource limitations (e.g., lack of buffers) in the port

XmitWait

The number of ticks during which the port selected by PortSelect had data to transmit but no data was sent during the entire tick because of insufficient credits or lack of arbitration.
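Because Xmit Data and Rcv Data are reported as multiples of four octets, a raw counter value must be multiplied by 4 to obtain bytes. The following is a minimal sketch of that conversion and of a throughput estimate from two samples. The function names and sample values are hypothetical and are not part of UFM, and the sketch assumes the counter did not wrap or reset between samples.

```python
OCTETS_PER_COUNT = 4  # Xmit Data / Rcv Data are reported as multiples of four octets


def data_counter_to_bytes(counter_value: int) -> int:
    """Convert a raw Xmit Data or Rcv Data counter value to bytes."""
    return counter_value * OCTETS_PER_COUNT


def throughput_bits_per_second(prev_value: int, curr_value: int,
                               interval_seconds: float) -> float:
    """Estimate average throughput between two samples of the same counter.

    Assumption: the counter did not wrap or reset between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_bytes = data_counter_to_bytes(curr_value - prev_value)
    return delta_bytes * 8 / interval_seconds


# Hypothetical samples taken 10 seconds apart.
print(data_counter_to_bytes(250_000))                          # 1000000
print(throughput_bits_per_second(1_000_000, 3_500_000, 10.0))  # 8000000.0
```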

InfiniBand Calculated Port Counters

Counter

Description

Normalized XmitData

Effective port bandwidth utilization in %

XmitData incremental / Link Capacity

Normalized Congested Bandwidth

Amount of bandwidth that was suppressed due to congestion

(XmitWait incremental / Time) * Link Capacity

Separate counters are used for Tier 4 ports and for the rest of the ports.

Normalized XmitWait

Congestion in relation to packets transmitted over the link

XmitWait incremental / XmitPackets incremental

This counter is calculated only for ports directly connected to receiving hosts.

Separate counters are used for Tier 4 ports and for the rest of the ports.
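The three formulas above can be sketched as follows. This is an illustration under stated assumptions, not UFM's implementation: the source does not give the units, so the sketch assumes that the XmitData delta is in units of four octets, that link capacity is in bits per second, and that XmitWait ticks can be turned into a fraction of elapsed time through a caller-supplied `ticks_per_second` value (a hypothetical parameter). All function names are hypothetical.

```python
def normalized_xmit_data(xmit_data_delta: int, interval_s: float,
                         link_capacity_bps: float) -> float:
    """Effective bandwidth utilization in percent.

    Formula from the table: XmitData incremental / Link Capacity.
    Assumption: the delta is in units of four octets and is converted
    to bits per second before being compared with the link capacity.
    """
    bits_per_second = xmit_data_delta * 4 * 8 / interval_s
    return 100.0 * bits_per_second / link_capacity_bps


def normalized_congested_bandwidth(xmit_wait_delta: int, interval_s: float,
                                   ticks_per_second: float,
                                   link_capacity_bps: float) -> float:
    """Bandwidth suppressed by congestion, in bits per second.

    Formula from the table: (XmitWait incremental / Time) * Link Capacity.
    Assumption: Time is expressed in ticks, so the ratio is the fraction
    of the interval during which the port was waiting to transmit.
    """
    waiting_fraction = xmit_wait_delta / (interval_s * ticks_per_second)
    return waiting_fraction * link_capacity_bps


def normalized_xmit_wait(xmit_wait_delta: int, xmit_packets_delta: int) -> float:
    """Wait ticks per transmitted packet.

    Formula from the table: XmitWait incremental / XmitPackets incremental.
    Returns 0.0 when no packets were sent, to avoid dividing by zero.
    """
    if xmit_packets_delta == 0:
        return 0.0
    return xmit_wait_delta / xmit_packets_delta


# Hypothetical one-second sample on a 100 Gb/s link.
print(normalized_xmit_data(312_500_000, 1.0, 100e9))                  # 10.0
print(normalized_congested_bandwidth(250_000, 1.0, 1_000_000, 100e9)) # 25000000000.0
print(normalized_xmit_wait(5_000, 20_000))                            # 0.25
```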

Device events are listed as VDM or CDM in the Source column of the Events table in the UFM GUI. For information about defining event policy, see Configuring Event Management.
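Several entries in the events table carry a Message template with printf-style placeholders, either positional (`%d`) or named (`%(lid)d`). As an illustration only, Python's `%` operator can render both forms. The helper names and field values below are hypothetical, and the strict "greater than" comparison is an assumption, since the table does not state how UFM compares a counter with its threshold.

```python
from typing import Optional

# Templates copied from the VL15 Dropped and Bad P_Key entries.
COUNTER_TEMPLATE = ("VL15Dropped counter threshold exceeded. "
                    "Threshold is %d, received value is %d.")
TRAP_TEMPLATE = ("Bad P_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, "
                 "port2(lid%(lid2)d #%(portn2)d)")


def counter_message(threshold: int, value: int) -> Optional[str]:
    """Render the counter message if the value exceeds the threshold.

    Assumption: "exceeded" means strictly greater than the threshold.
    """
    if value > threshold:
        return COUNTER_TEMPLATE % (threshold, value)
    return None


def trap_message(fields: dict) -> str:
    """Render the trap message from a mapping of named fields."""
    return TRAP_TEMPLATE % fields


# 50 is the default threshold listed for VL15 Dropped; 73 is a made-up reading.
print(counter_message(50, 73))
# VL15Dropped counter threshold exceeded. Threshold is 50, received value is 73.

# Made-up LIDs, port numbers, and key value.
print(trap_message({"lid": 12, "portn": 3, "pkey": 0x7FFF, "lid2": 40, "portn2": 1}))
# Bad P_Key: port1(lid 12, #3) 00007fff, port2(lid40 #1)
```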

Alarm ID

Alarm Name

To Log

Alarm

Default Severity

Default Threshold

Default TTL

Related Object

Category

Description/Message

116

Port Xmit Discards

1

1

Minor

200

300

Port

Communication Error

Total number of outbound packets discarded by the port when the port is down or congested. Reasons include:

  • Output port is not in the active state

  • Packet length exceeded NeighborMTU

  • Switch Lifetime Limit exceeded

  • Switch HOQ Lifetime Limit exceeded

  • Packets discarded while in VLStalled State

117

Port Xmit Constraint Errors

1

1

Minor

200

300

Port

Communication Error

Total number of packets not transmitted from the switch physical port for the following reasons:

  • FilterRawOutbound is true and packet is raw

  • PartitionEnforcementOutbound is true and packet fails partition key check or IP version check

120

Excessive Buffer Overrun Errors

1

1

Minor

100

300

Port

Communication Error

The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error.

Message: ExcessiveBufferOverrunErrors counter threshold exceeded. Threshold is %d, received value is %d.

121

VL15 Dropped

1

1

Minor

50

300

Port

Communication Error

Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port.

Message: VL15Dropped counter threshold exceeded. Threshold is %d, received value is %d.

118

Port Receive Constraint Errors

1

1

Minor

200

300

Port

Communication Error

Total number of packets received on the switch physical port that are discarded for the following reasons:

  • FilterRawInbound is true and packet is raw

  • PartitionEnforcementInbound is true and packet fails partition key check or IP version check

145

System Image GUID changed

1

0

Info

1

300

Port

Communication Error

System GUID is changed for the specific LID

115

Port Receive Switch Relay Errors

1

1

Minor

9999

300

Port

Fabric Configuration

Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay.

Reasons for this include:

  • DLID mapping

  • VL mapping

  • Looping (output port = input port)

256

Bad M_Key

1

0

Minor

1

300

Port

Fabric Configuration

Found bad M_Key. Check your HCA driver or partition settings.

SM Trap. Management Key (M_Key): Enforces the control of a master subnet manager. Administered by the subnet manager and used in certain subnet management packets.

Message: Bad M_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)

257

Bad P_Key

1

0

Minor

1

300

Port

Fabric Configuration

Found a bad P_Key. Check your partitioning settings.

SM Trap. Partition Key (P_Key): Enforces membership. Administered through the subnet manager by the partition manager (PM).

Message: Bad P_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)

258

Bad Q_Key

1

0

Minor

1

300

Port

Fabric Configuration

Found bad Q_Key. Security error.

SM Trap. Queue Key (Q_Key): Enforces access rights for reliable and unreliable datagram service (RAW datagram service type not included).

Message: Bad Q_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)

259

Bad P_Key Switch External Port

1

0

Critical

1

300

Port

Fabric Configuration

Found a bad P_Key. Check your partitioning settings.

SM Trap. Partition Key (P_Key): Enforces membership. Administered through the subnet manager by the partition manager (PM).

Message: Bad P_Key switch external port: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)

64

GID Address In Service

1

0

Info

1

300

Port

Fabric Notification

New GID is connected to the Fabric

65

GID Address Out of Service

1

0

Warning

1

300

Port

Fabric Notification

Existing GID is disconnected from the Fabric

66

New MCast Group Created

1

0

Info

1

300

Port

Fabric Notification

New Multicast Group is created in SM

67

MCast Group Deleted

1

0

Info

1

300

Port

Fabric Notification

Multicast Group is removed from SM.

328

Link is Up

1

0

Info

1

0

Link

Fabric Topology

Event is sent upon discovery of a new link

329

Link is Down

1

0

Warning

1

0

Link

Fabric Topology

Event is sent when an existing link is removed

144

Capability Mask Modified

0

0

Info

1

300

Port

Fabric Notification

Capability Mask of the specific LID is modified

602

UFM Server Failover

1

1

Critical

1

0

Site

Fabric Notification

Failover in UFM Server (in HA mode)

391

Switch Module Removed

1

0

Info

1

0

Switch

Fabric Notification

Module (line card, FAN or PS) is removed from the switch

331

Node is Down

1

0

Warning

1

0

Site

Fabric Topology

Node is disconnected or down

332

Node is Up

1

0

Info

1

300

Site

Fabric Topology

Node is connected or up

907

Switch is Down

1

1

Critical

1

0

Site

Fabric Topology

Switch is disconnected or down

908

Switch is Up

1

1

Info

1

300

Site

Fabric Topology

Switch is connected or up

370

Gateway Ethernet Link State Changed

1

0

Warning

1

0

Gateway

Gateway

Gateway Ethernet Physical link has changed state

371

Gateway Reregister Event Received

1

0

Warning

1

0

Gateway

Gateway

10GbE Gateway received a re-register event from the SM.

372

Number of Gateways Changed

1

0

Warning

1

0

Gateway

Gateway

Change in the number of 10GbE Gateways has been detected

373

Gateway will be Rebooted

1

0

Warning

1

0

Gateway

Gateway

10GbE Gateway is about to reboot

374

Gateway Reloading Finished

1

0

Info

1

0

Gateway

Gateway

10GbE Gateway has finished reloading.

110

Symbol Error

1

1

Warning

200

300

Port

Hardware

Total number of minor link errors detected on one or more physical lanes

111

Link Error Recovery

1

1

Minor

1

300

Port

Hardware

Total number of times the Port Training state machine has successfully completed the link error recovery process

112

Link Downed

1

1

Critical

1

300

Port

Hardware

Total number of times the Port Training state machine has failed the link error recovery process and downed the link.

113

Port Receive Errors

1

1

Minor

5

300

Port

Hardware

Total number of packets containing an error that were received on a port. These errors include:

  • Local physical errors (ICRC, VCRC, FCCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine)

  • Data packet errors

  • Link packet errors

  • Packets discarded due to buffer overrun

114

Port Receive Remote Physical Errors

1

0

Minor

5

300

Port

Hardware

Total number of packets marked with the EBP delimiter received on the port

119

Local Link Integrity Errors

1

1

Minor

5

300

Port

Hardware

The number of times that the frequency of packets containing local physical errors has exceeded LocalPhyErrors.

Message: LocalLinkIntegrityErrors counter threshold exceeded. Threshold is %d, received value is %d

122

Congested Bandwidth (%) Threshold Reached

1

1

Minor

10

300

Port

Hardware

The percentage of congested bandwidth has exceeded the defined threshold.

Note: a different threshold can be set specifically for Tier 4 ports.

131

Non-optimal link width (1X instead of 4X)

1

1

Minor

1

0

Port

Hardware

4X link operates as 1X link

132

Non-optimal link width (1X or 4X instead of 12X)

1

1

Minor

1

0

Port

Hardware

12X link operates as a 4X or 1X link

701

Non-optimal Link Speed

1

1

Minor

1

0

Port

Hardware

A link operates below its nominal speed:

  • DDR link operates as SDR

  • QDR link operates as DDR or SDR

  • FDR link operates as QDR, DDR, or SDR

  • EDR link operates as FDR, QDR, DDR, or SDR

140

Excessive Buffer Overrun Threshold Reached

1

0

Minor

1

300

Port

Hardware

SM Trap. This error is detected when the number of consecutive flow control update periods with at least one overrun error in each period exceeds the OverrunErrors threshold given in the PortInfo attribute.

Message: Excessive Buffer Overrun Threshold is reached: lid %(lid)d, port #%(portn)d

141

Flow Control Update Watchdog Timer Expired

1

0

Warning

1

300

Port

Hardware

SM Trap. The error indicates a failure of the flow control machine at the other end of the link. If the timer expires without receiving an update, a flow control update error has occurred.

Message: Flow Control Update watchdog timer has expired: lid %(lid)d, port #%(portn)d

392

Module Temperature Threshold Reached

1

0

Info

40

0

Module

Hardware

Temperature detected by the module sensor is too high; it has exceeded the defined threshold.

350

Environment Added

1

0

Info

1

0

Env

Logical Model

New Logical Environment is created

351

Environment Removed

1

0

Info

1

0

Env

Logical Model

Logical Environment is deleted

306

Logical Server Added

1

0

Info

1

0

Logical Server

Logical Model

New Logical Server or Logical Servers Group is created

307

Logical Server Removed

1

0

Info

1

0

Logical Server

Logical Model

Logical Server or Logical Servers Group is deleted

352

Network Added

1

0

Info

1

0

Network

Logical Model

New Network is created

353

Network Removed

1

0

Info

1

0

Network

Logical Model

Network is deleted

340

Network Interface Added

1

0

Info

1

0

Logical Server

Logical Model

New Network Interface is created

341

Network Interface Removed

1

0

Info

1

0

Logical Server

Logical Model

Network Interface is deleted

313

Compute Resource Allocated

1

0

Info

1

0

Logical Server

Logical Model

A resource is allocated to the Logical Server

312

Compute Resource Released

1

0

Info

1

0

Logical Server

Logical Model

A resource is released from the Logical Server

317

Logical Server Compute Resource is Up

1

1

Warning

1

0

Logical Server

Logical Model

An allocated resource is Up or Connected back

316

Logical Server Compute Resource is Down

1

1

Critical

1

0

Logical Server

Logical Model

An allocated resource is Down or Disconnected

301

Logical Server State Changed

1

0

Info

1

0

Logical Server

Logical Model

Logical Server state is changed

302

Logical Server State Change Failed

1

0

Minor

1

0

Logical Server

Logical Model

Logical Server has failed to change the state.

RM (Resource Manager) Event. Indicates error in Logical Server state change. This error might be caused by any error condition related to the Logical Server resources allocation.

Message: Logical Server changed state from %s to %s

308

Logical Server Resources Allocated

1

0

Info

1

0

Logical Server

Logical Model

New resources are allocated to the Logical Server

314

Logical Server Additional Resources Allocated

1

0

Info

1

0

Logical Server

Logical Model

Additional resources are allocated to the Logical Server

315

Logical Server Resources Released

1

0

Info

1

0

Logical Server

Logical Model

Resources were released from the Logical Server

336

Port Action Succeeded

1

0

Info

1

0

Port

Maintenance

Port Management Action (reset, disable) succeeded

337

Port Action Failed

1

0

Minor

1

0

Port

Maintenance

Port Management Action (reset, disable) failed

338

Device Action Succeeded

1

0

Info

1

0

Port

Maintenance

Device Management Action succeeded

339

Device Action Failed

1

0

Minor

1

0

Port

Maintenance

Device Management Action failed

385

Switch FW Upgrade Started

1

0

Info

1

0

Switch

Maintenance

Switch FW Upgrade process has started

386

Switch SW Upgrade Started

1

0

Info

1

0

Switch

Maintenance

Switch SW Upgrade process has started

381

Switch Upgrade Failed

1

0

Info

1

0

Switch

Maintenance

Switch SW or FW Upgrade process failed

388

Host FW Upgrade Started

1

0

Info

1

0

Computer

Maintenance

Host FW Upgrade process has started

389

Host SW Upgrade Started

1

0

Info

1

0

Computer

Maintenance

Host SW Upgrade process has started

383

Host Upgrade Failed

1

0

Info

1

0

Computer

Maintenance

Host SW or FW Upgrade process failed

502

Device Upgrade Finished

1

0

Info

1

300

Device

Maintenance

Device SW or FW Upgrade has finished

909

Director Switch is Down

1

1

Critical

1

300

Site

Fabric Topology

Director Switch is disconnected or down

910

Director Switch is Up

1

1

Info

1

0

Site

Fabric Topology

Director Switch is connected or up

911

Module Temperature Low Threshold Reached

1

1

Warning

60

300

Module

Hardware

Temperature detected by the module sensor is too high; it has exceeded the low threshold

912

Module Temperature High Threshold Reached

1

1

Critical

60

300

Module

Hardware

Temperature detected by the module sensor is too high; it has exceeded the high threshold

913

Module High Voltage

1

1

Warning

10

420

Switch

Module Status

Sensor Voltage Threshold Exceeded

914

Module High Current

1

1

Warning

10

420

Switch

Module Status

Sensor Current Threshold Exceeded

394

Module Status FAULT

1

1

Critical

1

420

Switch

Module Status

Module Status FAULT

545

SM is not responding

1

1

Critical

1

300

Grid

Maintenance

SM is not responding

915

BER_ERROR

1

1

Critical

1e-8

420

Port

Hardware

Effective BER Error on port exceeded the threshold

916

BER_WARNING

1

1

Warning

1e-13

420

Port

Hardware

Effective BER Warning on port exceeded the threshold

1300

SM_SAKEY_VIOLATION

1

1

Warning

5

300

Port

Fabric Notification

"SA Key Violation Committed"

1301

SM_SGID_SPOOFED

1

1

Warning

5

300

Port

Fabric Notification

"SGID spoofed by VPort/port"

1302

SM_RATE_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Fabric Notification

"Rate Limit Exceeded"

1303

SM_MULTICAST_GROUPS_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Fabric Notification

"Multicast Groups Limit Exceeded"

1304

SM_SERVICES_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Fabric Notification

"Services, Limit Exceeded"

1305

SM_EVENT_SUBSCRIPTION_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Fabric Notification

"Event Subscription Limit Exceeded"

1500

New cable detected

1

0

Info

1

0

Link

Hardware

"New cable was detected"

1502

Cable detected in a new location

1

0

Warning

1

0

Link

Hardware

"Cable detected in a new location"

1503

Duplicate Cable Detected

1

0

Critical

1

0

Link

Hardware

"Duplicate cable S/N"

© Copyright 2025, NVIDIA. Last updated on Jun 30, 2025.