NVIDIA UFM-SDN Appliance User Manual v4.10.0

Appendix - Supported Port Counters and Events

Port counters and events are available in the following views:

  • Events and Port Counters area, at the bottom of the UFM window

  • Error window (Error tab) in the Manage Devices tab

  • New Monitoring Session window, opened by clicking Create New Session in the Monitor tab

  • Event Log in the Log tab (click Show Event Log)

The following tables list and describe the port counters and events currently supported:

  • InfiniBand Port Counters

  • Calculated Port Counters

InfiniBand Port Counters

Counter

Description

Xmit Data (in bytes)

Total number of data octets, divided by 4, transmitted on all VLs from the port. This includes all octets between (and not including) the start of packet delimiter and the VCRC, and may include packets containing errors. All link packets are excluded. Results are reported as a multiple of four octets.

Rcv Data (in bytes)

Total number of data octets, divided by 4, received on all VLs at the port.

All octets between (and not including) the start of packet delimiter and the VCRC are included, and the count may include packets containing errors. All link packets are excluded. When the received packet length exceeds the maximum allowed packet length specified in C7-45, the counter may include all data octets exceeding this limit.

Results are reported as a multiple of four octets.
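
Because both data counters are reported in units of four octets, a raw reading has to be multiplied by 4 to obtain bytes. The following is a minimal sketch of that conversion, assuming two raw counter samples taken a known number of seconds apart. The function names and sample values are illustrative only and are not part of UFM.

```python
OCTETS_PER_COUNT = 4  # Xmit Data / Rcv Data are reported in units of four octets


def counter_to_bytes(raw_count: int) -> int:
    """Convert a raw Xmit Data or Rcv Data reading to bytes."""
    return raw_count * OCTETS_PER_COUNT


def throughput_bytes_per_sec(prev_raw: int, curr_raw: int, interval_sec: float) -> float:
    """Average throughput between two samples (assumes no counter wrap or reset)."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    return counter_to_bytes(curr_raw - prev_raw) / interval_sec


# Hypothetical samples taken 10 seconds apart.
print(counter_to_bytes(250_000))                           # 1000000
print(throughput_bytes_per_sec(1_000_000, 3_500_000, 10))  # 1000000.0
```

The sketch ignores counter wrap-around and resets; a real collector would need to handle both.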

Xmit Packets

Total number of packets transmitted on all VLs from the port, including packets with errors and excluding link packets.

Rcv Packets

Total number of packets, including packets containing errors and excluding link packets, received from all VLs on the port.

Rcv Errors

Total number of packets containing errors that were received on the port including:

  • Local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine)

  • Malformed data packet errors (LVer, length, VL)

  • Malformed link packet errors (operand, length, VL)

  • Packets discarded due to buffer overrun (overflow)

Xmit Discards

Total number of outbound packets discarded by the port when the port is down or congested for the following reasons:

  • Output port is not in the active state

  • Packet length has exceeded NeighborMTU

  • Switch Lifetime Limit exceeded

  • Switch HOQ Lifetime Limit exceeded, including packets discarded while in VLStalled State

Symbol Errors

Total number of minor link errors detected on one or more physical lanes.

Link Error Recovery

Total number of times the Port Training state machine has successfully completed the link error recovery process.

Link Error Downed

Total number of times the Port Training state machine has failed the link error recovery process and downed the link.

Local Integrity Error

The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors.

Rcv Remote Physical Error

Total number of packets marked with the EBP delimiter received on the port.

Xmit Constraint Error

Total number of packets not transmitted from the switch physical port for the following reasons:

  • FilterRawOutbound is true and packet is raw

  • PartitionEnforcementOutbound is true and packet fails partition key check or IP version check

Rcv Constraint Error

Total number of packets received on the switch physical port that are discarded for the following reasons:

  • FilterRawInbound is true and packet is raw

  • PartitionEnforcementInbound is true and packet fails partition key check or IP version check

Excess Buffer Overrun Error

The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error.

Rcv Switch Relay Error

Total number of packets received on the port that were discarded when they could not be forwarded by the switch relay for the following reasons:

  • DLID mapping

  • VL mapping

  • Looping (output port = input port)

VL15 Dropped

Number of incoming VL15 packets dropped because of resource limitations (e.g., lack of buffers) in the port.

XmitWait

The number of ticks during which the port selected by PortSelect had data to transmit but no data was sent during the entire tick, because of either insufficient credits or lack of arbitration.

InfiniBand Calculated Port Counters

Counter

Description

Normalized XmitData

Effective port bandwidth utilization, in percent:
XmitData incremental / Link Capacity

Normalized Congested Bandwidth

Amount of bandwidth that was suppressed due to congestion:
(XmitWait incremental / Time) * Link Capacity
Separate counters are used for Tier 4 ports and for the rest of the ports.
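
The two formulas can be sketched in code as follows. This is an illustration under stated assumptions, not the UFM implementation: it assumes the XmitData increment is in 4-octet units and is measured over a known interval in seconds, that link capacity is given in bits per second, and that the XmitWait interval is expressed in the same tick units as the counter. All function and parameter names are hypothetical.

```python
def normalized_xmit_data(xmit_data_delta: int, interval_sec: float,
                         link_capacity_bps: float) -> float:
    """Port bandwidth utilization in percent.

    Assumes xmit_data_delta is the Xmit Data increment in 4-octet units
    over interval_sec, and link_capacity_bps is in bits per second.
    """
    bits_sent = xmit_data_delta * 4 * 8
    return 100.0 * bits_sent / (interval_sec * link_capacity_bps)


def congested_bandwidth(xmit_wait_delta: int, interval_ticks: int,
                        link_capacity_bps: float) -> float:
    """(XmitWait incremental / Time) * Link Capacity, in bits per second.

    Assumes the interval is expressed in the same tick units as XmitWait.
    """
    return (xmit_wait_delta / interval_ticks) * link_capacity_bps


# Hypothetical 100 Gb/s link sampled over 1 second.
print(normalized_xmit_data(1_562_500_000, 1.0, 100e9))  # 50.0
print(congested_bandwidth(250, 1000, 100e9))            # 25000000000.0
```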

Device events are listed as VDM or CDM in the Source column of the Events table in the UFM GUI. For information about defining event policy, see Configuring Event Management.

| Alarm ID | Alarm Name | To Log | Alarm | Default Severity | Default Threshold | Default TTL | Related Object | Category | Description/Message |
|---|---|---|---|---|---|---|---|---|---|
| 64 | GID Address In Service | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | |
| 65 | GID Address Out of Service | 1 | 0 | Warning | 1 | 300 | Port | Fabric Notification | |
| 66 | New MCast Group Created | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | |
| 67 | MCast Group Deleted | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | |
| 110 | Symbol Error | 1 | 1 | Warning | 200 | 300 | Port | Hardware | |
| 111 | Link Error Recovery | 1 | 1 | Minor | 1 | 300 | Port | Hardware | |
| 112 | Link Downed | 1 | 1 | Critical | 1 | 300 | Port | Hardware | |
| 113 | Port Receive Errors | 1 | 1 | Minor | 5 | 300 | Port | Hardware | |
| 114 | Port Receive Remote Physical Errors | 0 | 0 | Minor | 5 | 300 | Port | Hardware | |
| 115 | Port Receive Switch Relay Errors | 1 | 1 | Minor | 999 | 300 | Port | Fabric Configuration | |
| 116 | Port Xmit Discards | 1 | 1 | Minor | 200 | 300 | Port | Communication Error | |
| 117 | Port Xmit Constraint Errors | 1 | 1 | Minor | 200 | 300 | Port | Communication Error | |
| 118 | Port Receive Constraint Errors | 1 | 1 | Minor | 200 | 300 | Port | Communication Error | |
| 119 | Local Link Integrity Errors | 1 | 1 | Minor | 5 | 300 | Port | Hardware | |
| 120 | Excessive Buffer Overrun Errors | 1 | 1 | Minor | 100 | 300 | Port | Communication Error | |
| 121 | VL15 Dropped | 1 | 1 | Minor | 50 | 300 | Port | Communication Error | |
| 122 | Congested Bandwidth (%) Threshold Reached | 1 | 1 | Minor | 10 | 300 | Port | Hardware | |
| 131 | Non-optimal link width (1X instead of 4X) | 1 | 1 | Minor | 1 | 0 | Port | Hardware | |
| 132 | Non-optimal link width (1X or 4X instead of 12X) | 1 | 1 | Minor | 1 | 0 | Port | Hardware | |
| 140 | Excessive Buffer Overrun Threshold Reached | 1 | 0 | Minor | 11 | 300 | Port | Hardware | |
| 141 | Flow Control Update Watchdog Timer Expired | 1 | 0 | Warning | 1 | 300 | Port | Hardware | |
| 144 | Capability Mask Modified | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | |
| 145 | System Image GUID changed | 1 | 0 | Info | 1 | 300 | Port | Communication Error | |
| 256 | Bad M_Key | 1 | 0 | Minor | 1 | 300 | Port | Fabric Configuration | |
| 257 | Bad P_Key | 1 | 0 | Minor | 1 | 300 | Port | Fabric Configuration | |
| 258 | Bad Q_Key | 1 | 0 | Minor | 1 | 300 | Port | Fabric Configuration | |
| 259 | Bad P_Key Switch External Port | 1 | 0 | Critical | 1 | 300 | Port | Fabric Configuration | |
| 301 | Logical Server State Changed | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 302 | Logical Server State Change Failed | 1 | 0 | Minor | 1 | 0 | Logical Server | Logical Model | |
| 306 | Logical Server Added | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 307 | Logical Server Removed | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 308 | Logical Server Resources Allocated | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 312 | Compute Resource Released | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 313 | Compute Resource Allocated | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 314 | Logical Server Additional Resources Allocated | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 315 | Logical Server Resources Released | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 316 | Logical Server Compute Resource is Down | 1 | 1 | Critical | 1 | 0 | Logical Server | Logical Model | |
| 317 | Logical Server Compute Resource is Up | 1 | 1 | Warning | 1 | 0 | Logical Server | Logical Model | |
| 328 | Link is Up | 1 | 0 | Info | 1 | 0 | Link | Fabric Topology | |
| 329 | Link is Down | 1 | 0 | Warning | 1 | 0 | Link | Fabric Topology | |
| 331 | Node is Down | 1 | 0 | Warning | 1 | 0 | Site | Fabric Topology | |
| 332 | Node is Up | 1 | 0 | Info | 1 | 300 | Site | Fabric Topology | |
| 336 | Port Action Succeeded | 1 | 0 | Info | 1 | 0 | Port | Maintenance | |
| 337 | Port Action Failed | 1 | 0 | Minor | 1 | 0 | Port | Maintenance | |
| 338 | Device Action Succeeded | 1 | 0 | Info | 1 | 0 | Port | Maintenance | |
| 339 | Device Action Failed | 1 | 0 | Minor | 1 | 0 | Port | Maintenance | |
| 340 | Network Interface Added | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 341 | Network Interface Removed | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 350 | Environment Added | 1 | 0 | Info | 1 | 0 | Env | Logical Model | |
| 351 | Environment Removed | 1 | 0 | Info | 1 | 0 | Env | Logical Model | |
| 352 | Network Added | 1 | 0 | Info | 1 | 0 | Network | Logical Model | |
| 353 | Network Removed | 1 | 0 | Info | 1 | 0 | Network | Logical Model | |
| 370 | Gateway Ethernet Link State Changed | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | |
| 371 | Gateway Reregister Event Received | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | |
| 372 | Number of Gateways Changed | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | |
| 373 | Gateway will be Rebooted | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | |
| 374 | Gateway Reloading Finished | 1 | 0 | Info | 1 | 0 | Gateway | Gateway | |
| 381 | Switch Upgrade Failed | 1 | 0 | Info | 1 | 0 | Switch | Maintenance | |
| 383 | Host Upgrade Failed | 1 | 0 | Info | 1 | 0 | Computer | Maintenance | |
| 385 | Switch FW Upgrade Started | 1 | 0 | Info | 1 | 0 | Switch | Maintenance | |
| 386 | Switch SW Upgrade Started | 1 | 0 | Info | 1 | 0 | Switch | Maintenance | |
| 388 | Host FW Upgrade Started | 1 | 0 | Info | 1 | 0 | Computer | Maintenance | |
| 389 | Host SW Upgrade Started | 1 | 0 | Info | 1 | 0 | Computer | Maintenance | |
| 391 | Switch Module Removed | 1 | 0 | Info | 1 | 0 | Switch | Fabric Notification | |
| 392 | Module Temperature Threshold Reached | 1 | 0 | Info | 40 | 0 | Module | Hardware | |
| 394 | Module Status FAULT | 1 | 1 | Critical | 1 | 420 | Switch | Module Status | |
| 502 | Device Upgrade Finished | 1 | 0 | Info | 1 | 300 | Device | Maintenance | |
| 545 | SM is not responding | 1 | 1 | Critical | 1 | 300 | Grid | Maintenance | |
| 602 | UFM Server Failover | 1 | 1 | Critical | 1 | 0 | Site | Fabric Notification | |
| 701 | Non-optimal Link Speed | 1 | 1 | Minor | 1 | 0 | Port | Hardware | |
| 907 | Switch is Down | 1 | 1 | Critical | 1 | 0 | Site | Fabric Topology | |
| 908 | Switch is Up | 1 | 1 | Info | 1 | 300 | Site | Fabric Topology | |
| 909 | Director Switch is Down | 1 | 1 | Critical | 1 | 300 | Site | Fabric Topology | |
| 910 | Director Switch is Up | 1 | 1 | Info | 1 | 0 | Site | Fabric Topology | |
| 911 | Module Temperature Low Threshold Reached | 1 | 1 | Warning | 60 | 300 | Module | Hardware | |
| 912 | Module Temperature High Threshold Reached | 1 | 1 | Critical | 60 | 300 | Module | Hardware | |
| 913 | Module High Voltage | 1 | 1 | Warning | 10 | 420 | Switch | Module Status | |
| 914 | Module High Current | 1 | 1 | Warning | 10 | 420 | Switch | Module Status | |
| 915 | BER_ERROR | 1 | 1 | Critical | 1e-8 | 420 | Port | Hardware | |
| 916 | BER_WARNING | 1 | 1 | Warning | 1e-13 | 420 | Port | Hardware | |
| 917 | SYMBOL_BER_ERROR | 1 | 1 | Critical | | 420 | Port | Hardware | |
| 1300 | SM_SAKEY_VIOLATION | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | |
| 1301 | SM_SGID_SPOOFED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | |
| 1302 | SM_RATE_LIMIT_EXCEEDED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | |
| 1303 | SM_MULTICAST_GROUPS_LIMIT_EXCEEDED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | |
| 1304 | SM_SERVICES_LIMIT_EXCEEDED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | |
| 1305 | SM_EVENT_SUBSCRIPTION_LIMIT_EXCEEDED | 1 | 1 | Warning | 5 | 300 | Port | Fabric Notification | |
| 1500 | New cable detected | 1 | 0 | Info | 1 | 0 | Link | Hardware | |
| 1502 | Cable detected in a new location | 1 | 0 | Warning | 1 | 0 | Link | Hardware | |
| 1503 | Duplicate Cable Detected | 1 | 0 | Critical | 1 | 0 | Link | Hardware | |
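
When the default values need to be processed in a script, the rows of the events table can be held in a small data structure and filtered. The sketch below copies a few rows from the table and lists the events that raise an alarm by default at a given severity. It assumes that the To Log and Alarm columns are 1/0 flags meaning enabled/disabled. The `EventDefault` class and `alarming` function are hypothetical names and are not part of any UFM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventDefault:
    alarm_id: int
    name: str
    to_log: bool        # assumed: 1 = written to the event log
    alarm: bool         # assumed: 1 = raises an alarm
    severity: str
    threshold: float
    ttl: int
    related_object: str
    category: str


# A few rows copied from the events table.
DEFAULTS = [
    EventDefault(64, "GID Address In Service", True, False, "Info", 1, 300, "Port", "Fabric Notification"),
    EventDefault(110, "Symbol Error", True, True, "Warning", 200, 300, "Port", "Hardware"),
    EventDefault(112, "Link Downed", True, True, "Critical", 1, 300, "Port", "Hardware"),
    EventDefault(907, "Switch is Down", True, True, "Critical", 1, 0, "Site", "Fabric Topology"),
]


def alarming(events, severity):
    """Names of events that raise an alarm by default at the given severity."""
    return [e.name for e in events if e.alarm and e.severity == severity]


print(alarming(DEFAULTS, "Critical"))  # ['Link Downed', 'Switch is Down']
```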

For a list of AHX-related events, refer to "AHX Monitoring Events".

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.