Appendix - Supported Port Counters and Events

Port counters and events are available in the following views:

  • Events and Port Counters area, at the bottom of the UFM window

  • Error window (Error tab) in the Manage Devices tab

  • New Monitoring Session window, in the Monitor tab (click Create New Session)

  • Event Log in the Log tab (click Show Event Log)

The following tables list and describe the port counters and events currently supported:

  • InfiniBand Port Counters

  • Calculated Port Counters

InfiniBand Port Counters

Counter

Description

Xmit Data (in bytes)

Total number of data octets, divided by 4, transmitted on all VLs from the port. This includes all octets between (and not including) the start of packet delimiter and the VCRC, and may include packets containing errors. All link packets are excluded. Results are reported as a multiple of four octets.

Rcv Data (in bytes)

Total number of data octets, divided by 4, received on all VLs at the port.

All octets between (and not including) the start of packet delimiter and the VCRC are included, and the count may include packets containing errors. All link packets are excluded. When the received packet length exceeds the maximum allowed packet length specified in C7-45, the counter may include all data octets exceeding this limit.

Results are reported as a multiple of four octets.

Xmit Packets

Total number of packets transmitted on all VLs from the port, including packets with errors and excluding link packets.

Rcv Packets

Total number of packets, including packets containing errors and excluding link packets, received from all VLs on the port.

Rcv Errors

Total number of packets containing errors that were received on the port, including:

  • Local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine)

  • Malformed data packet errors (LVer, length, VL)

  • Malformed link packet errors (operand, length, VL)

  • Packets discarded due to buffer overrun (overflow)

Xmit Discards

Total number of outbound packets discarded by the port when the port is down or congested for the following reasons:

  • Output port is not in the active state

  • Packet length has exceeded NeighborMTU

  • Switch Lifetime Limit exceeded

  • Switch HOQ Lifetime Limit exceeded, including packets discarded while in VLStalled State.

Symbol Errors

Total number of minor link errors detected on one or more physical lanes.

Link Error Recovery

Total number of times the Port Training state machine has successfully completed the link error recovery process.

Link Error Downed

Total number of times the Port Training state machine has failed the link error recovery process and downed the link.

Local Integrity Error

The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors.

Rcv Remote Physical Error

Total number of packets marked with the EBP delimiter received on the port.

Xmit Constraint Error

Total number of packets not transmitted from the switch physical port for the following reasons:

  • FilterRawOutbound is true and packet is raw

  • PartitionEnforcementOutbound is true and packet fails partition key check or IP version check

Rcv Constraint Error

Total number of packets received on the switch physical port that are discarded for the following reasons:

  • FilterRawInbound is true and packet is raw

  • PartitionEnforcementInbound is true and packet fails partition key check or IP version check

Excess Buffer Overrun Error

The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error.

Rcv Switch Relay Error

Total number of packets received on the port that were discarded when they could not be forwarded by the switch relay for the following reasons:

  • DLID mapping

  • VL mapping

  • Looping (output port = input port)

VL15 Dropped

Number of incoming VL15 packets dropped because of resource limitations (e.g., lack of buffers) in the port.

XmitWait

The number of ticks during which the port selected by PortSelect had data to transmit but no data was sent during the entire tick because of insufficient credits or lack of arbitration.
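
The Xmit Data and Rcv Data descriptions above state that results are reported as a multiple of four octets. As a minimal sketch, assuming the raw counter value is in those four-octet units and that the counter did not wrap or reset between two readings, converting to bytes and to an average byte rate could look like the following. The function names and the sampling-interval parameter are illustrative and are not part of UFM.

```python
# Assumption: raw Xmit Data / Rcv Data readings count four-octet units,
# as the counter descriptions state. Skip the conversion if the value
# you read has already been converted to bytes.
OCTETS_PER_UNIT = 4


def counter_to_bytes(raw_value: int) -> int:
    """Convert a raw Xmit Data or Rcv Data reading to bytes."""
    return raw_value * OCTETS_PER_UNIT


def throughput_bytes_per_sec(prev_raw: int, curr_raw: int, interval_sec: float) -> float:
    """Average byte rate between two readings taken interval_sec apart.

    Assumes the counter increased monotonically between the readings
    (no wrap-around and no reset).
    """
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    return counter_to_bytes(curr_raw - prev_raw) / interval_sec
```

For example, two hypothetical readings of 1000 and 3500 taken five seconds apart correspond to 2500 four-octet units, or 10000 bytes, over that interval.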

InfiniBand Calculated Port Counters

Counter

Description

Normalized XmitData

Effective port bandwidth utilization in %

XmitData incremental / Link Capacity

Normalized Congested Bandwidth

Amount of bandwidth that was suppressed due to congestion

(XmitWait incremental / Time) * Link Capacity

Separate counters are used for Tier 4 ports and for the rest of the ports.
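
The two calculated counters above are given only as formulas. The following sketch shows one way to evaluate them. It rests on assumptions the table does not state: that "incremental" means the change in the counter over one sampling interval, that Link Capacity is expressed in the same units as the XmitData increment for that interval, and that Time is measured in the same ticks as XmitWait. The function and parameter names are hypothetical.

```python
def normalized_xmit_data(xmit_data_delta: float, link_capacity: float) -> float:
    """Effective port bandwidth utilization in percent.

    Assumption: link_capacity is the amount of data the link could carry
    during the sampling interval, in the same units as xmit_data_delta.
    """
    if link_capacity <= 0:
        raise ValueError("link_capacity must be positive")
    return 100.0 * xmit_data_delta / link_capacity


def normalized_congested_bandwidth(
    xmit_wait_delta: float, interval_ticks: float, link_capacity: float
) -> float:
    """Bandwidth suppressed due to congestion: (XmitWait incremental / Time) * Link Capacity.

    Assumption: interval_ticks is the sampling interval in XmitWait ticks,
    so the ratio is the fraction of the interval spent waiting.
    """
    if interval_ticks <= 0:
        raise ValueError("interval_ticks must be positive")
    return (xmit_wait_delta / interval_ticks) * link_capacity
```

Under these assumptions, a port that waited for a quarter of the ticks in an interval has a quarter of its link capacity reported as congested bandwidth.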

Device events are listed as VDM or CDM in the Source column of the Events table in the UFM GUI. For information about defining event policy, see Configuring Event Management.

Alarm ID

Alarm Name

To Log

Alarm

Default Severity

Default Threshold

Default TTL

Related Object

Category

Source

64

GID Address In Service

1

0

Info

1

300

Port

Fabric Notification

SM

65

GID Address Out of Service

1

0

Warning

1

300

Port

Fabric Notification

SM

66

New MCast Group Created

1

0

Info

1

300

Port

Fabric Notification

SM

67

MCast Group Deleted

1

0

Info

1

300

Port

Fabric Notification

SM

110

Symbol Error

1

1

Warning

200

300

Port

Hardware

Telemetry

111

Link Error Recovery

1

1

Minor

1

300

Port

Hardware

Telemetry

112

Link Downed

1

1

Critical

1

300

Port

Hardware

Telemetry

113

Port Receive Errors

1

1

Minor

5

300

Port

Hardware

Telemetry

114

Port Receive Remote Physical Errors

0

0

Minor

5

300

Port

Hardware

Telemetry

115

Port Receive Switch Relay Errors

1

1

Minor

999

300

Port

Fabric Configuration

Telemetry

116

Port Xmit Discards

1

1

Minor

200

300

Port

Communication Error

Telemetry

117

Port Xmit Constraint Errors

1

1

Minor

200

300

Port

Communication Error

Telemetry

118

Port Receive Constraint Errors

1

1

Minor

200

300

Port

Communication Error

Telemetry

119

Local Link Integrity Errors

1

1

Minor

5

300

Port

Hardware

Telemetry

120

Excessive Buffer Overrun Errors

1

1

Minor

100

300

Port

Communication Error

Telemetry

121

VL15 Dropped

1

1

Minor

50

300

Port

Communication Error

Telemetry

122

Congested Bandwidth (%) Threshold Reached

1

1

Minor

10

300

Port

Hardware

Telemetry

123

Port Bandwidth (%) Threshold Reached

1

1

Minor

95

300

Port

Communication Error

Telemetry

130

Non-optimal link width

1

1

Minor

1

0

Port

Hardware

SM

134

T4 Port Congested Bandwidth

1

1

Warning

10

300

Port

Communication Error

Telemetry

141

Flow Control Update Watchdog Timer Expired

1

0

Warning

1

300

Port

Hardware

SM

144

Capability Mask Modified

1

0

Info

1

300

Port

Fabric Notification

SM

145

System Image GUID changed

1

0

Info

1

300

Port

Communication Error

SM

156

Link Speed Enforcement Disabled

1

0

Critical

0

300

Site

Fabric Notification

SM

250

Running in Limited Mode

1

1

Critical

1

0

Grid

Maintenance

Licensing

251

Switching to Limited Mode

1

1

Critical

1

0

Grid

Maintenance

Licensing

252

License Expired

1

1

Warning

1

0

Grid

Maintenance

Licensing

253

Duplicated licenses

1

0

Critical

1

0

Grid

Maintenance

Licensing

254

License Limit Exceeded

1

0

Critical

1

0

Grid

Maintenance

Licensing

255

License is About to Expire

1

0

Warning

1

0

Grid

Maintenance

Licensing

256

Bad M_Key

1

0

Minor

1

300

Port

Security

SM

257

Bad P_Key

1

0

Minor

1

300

Port

Security

SM

258

Bad Q_Key

1

0

Minor

1

300

Port

Security

SM

259

Bad P_Key Switch External Port

1

0

Critical

1

300

Port

Security

SM

328

Link is Up

1

0

Info

1

0

Link

Fabric Topology

SM

329

Link is Down

1

0

Warning

1

0

Site

Fabric Topology

SM

331

Node is Down

1

0

Warning

1

0

Site

Fabric Topology

SM

332

Node is Up

1

0

Info

1

300

Site

Fabric Topology

SM

336

Port Action Succeeded

1

0

Info

1

0

Port

Maintenance

UFM

337

Port Action Failed

1

0

Minor

1

0

Port

Maintenance

UFM

338

Device Action Succeeded

1

0

Info

1

0

Port

Maintenance

UFM

339

Device Action Failed

1

0

Minor

1

0

Port

Maintenance

UFM

344

Partial Switch ASIC Failure

1

1

Critical

1

0

Switch

Maintenance

UFM

370

Gateway Ethernet Link State Changed

1

0

Warning

1

0

Gateway

Gateway

SM

371

Gateway Reregister Event Received

1

0

Warning

1

0

Gateway

Gateway

SM

372

Number of Gateways Changed

1

0

Warning

1

0

Gateway

Gateway

SM

373

Gateway will be Rebooted

1

0

Warning

1

0

Gateway

Gateway

SM

374

Gateway Reloading Finished

1

0

Info

1

0

Gateway

Gateway

SM

380

Switch Upgrade Error

1

1

Critical

1

0

Switch

Maintenance

UFM

381

Switch Upgrade Failed

1

0

Info

1

0

Switch

Maintenance

UFM

382

Module status NOT PRESENT

1

1

Warning

1

420

Switch

Module Status

UFM

383

Host Upgrade Failed

1

0

Info

1

0

Computer

Maintenance

UFM

384

Switch Module Powered Off

1

1

Info

1

420

Switch

Module Status

UFM

385

Switch FW Upgrade Started

1

0

Info

1

0

Switch

Maintenance

UFM

386

Switch SW Upgrade Started

1

0

Info

1

0

Switch

Maintenance

UFM

387

Switch Upgrade Finished

1

0

Info

1

0

Switch

Maintenance

UFM

388

Host FW Upgrade Started

1

0

Info

1

0

Computer

Maintenance

UFM

389

Host SW Upgrade Started

1

0

Info

1

0

Computer

Maintenance

UFM

391

Switch Module Removed

1

0

Info

1

0

Switch

Fabric Notification

Switch

392

Module Temperature Threshold Reached

1

0

Info

40

0

Module

Hardware

Switch

393

Switch Module Added

1

0

Info

1

0

Switch

Fabric Notification

Switch

394

Module Status FAULT

1

1

Critical

1

420

Switch

Module Status

Switch

395

Device Action Started

1

0

Info

1

0

Port

Maintenance

UFM

396

Site Action Started

1

0

Info

1

0

Port

Maintenance

UFM

397

Site Action Failed

1

0

Minor

1

0

Port

Maintenance

UFM

398

Switch Chip Added

1

0

Info

1

0

Switch

Fabric Notification

Switch

399

Switch Chip Removed

1

0

Critical

1

0

Switch

Fabric Notification

Switch

403

Device Pending Reboot

1

1

Warning

0

300

Device

Maintenance

UFM

404

System Information is missing

1

1

Warning

1

300

Switch

Communication Error

UFM

405

Switch Identity Validation Failed

1

1

Warning

1

300

Switch

Communication Error

UFM

406

Switch System Information is missing

1

1

Warning

1

300

Switch

Communication Error

UFM

407

COMEX Ambient Temperature Threshold Reached

1

1

Minor

60

300

Switch

Hardware

Switch

408

Switch is Unresponsive

1

1

Critical

1

300

Switch

Communication Error

UFM

502

Device Upgrade Finished

1

0

Info

1

300

Device

Maintenance

UFM

506

Device Upgrade Finished

1

0

Info

1

300

Device

Maintenance

UFM

508

Core Dump Created

1

1

Info

1

300

Grid

Maintenance

UFM

510

SM Failover

0

1

Critical

1

300

Grid

Fabric Notification

SM

511

SM State Change

0

1

Info

1

300

Grid

Fabric Notification

SM

512

SM UP

0

1

Info

1

300

Grid

Fabric Notification

SM

513

SM System Log Message

0

1

Minor

1

300

Grid

Fabric Notification

SM

514

SM LID Change

0

1

Warning

1

300

Grid

Fabric Notification

SM

515

Fabric Health Report Info

1

1

Info

1

300

Grid

Fabric Notification

UFM

516

Fabric Health Report Warning

1

1

Warning

1

300

Grid

Fabric Notification

UFM

517

Fabric Health Report Error

1

1

Critical

1

300

Grid

Fabric Notification

UFM

518

UFM-related process is down

1

1

Critical

1

300

Grid

Maintenance

UFM

519

Logs purge failure

1

1

Minor

1

300

Grid

Maintenance

UFM

520

Restart of UFM-related process succeeded

1

1

Info

1

300

Grid

Maintenance

UFM

521

UFM is being stopped

1

1

Critical

1

300

Grid

Maintenance

UFM

522

UFM is being restarted

1

1

Critical

1

300

Grid

Maintenance

UFM

523

UFM failover is being attempted

1

1

Info

1

300

Grid

Maintenance

UFM

524

UFM cannot connect to DB

1

1

Critical

1

300

Grid

Maintenance

UFM

525

Disk utilization threshold reached

1

1

Critical

1

300

Grid

Maintenance

UFM

526

Memory utilization threshold reached

1

1

Critical

1

300

Grid

Maintenance

UFM

527

CPU utilization threshold reached

1

1

Critical

1

300

Grid

Maintenance

UFM

528

Fabric interface is down

1

1

Critical

1

300

Grid

Maintenance

UFM

529

UFM standby server problem

1

1

Critical

1

300

Grid

Maintenance

UFM

530

SM is down

1

1

Critical

1

300

Grid

Maintenance

UFM

531

DRBD Bad Condition

1

1

Critical

1

300

Grid

Maintenance

UFM

532

Remote UFM-SM Sync

1

1

Info

1

0

Grid

Maintenance

UFM

533

Remote UFM-SM problem

1

1

Critical

1

0

Site

Maintenance

UFM

535

MH Purge Failed

1

1

Warning

1

300

Grid

Maintenance

UFM

536

UFM Health Watchdog Info

1

1

Info

1

300

Grid

Maintenance

UFM

537

UFM Health Watchdog Critical

1

1

Critical

1

300

Grid

Maintenance

UFM

538

Time Diff Between HA Servers

1

1

Warning

1

300

Grid

Maintenance

UFM

539

DRBD TCP Connection Performance

1

1

Warning

1

900

Grid

Maintenance

UFM

540

Daily Report Completed successfully

1

0

Info

1

300

Grid

Maintenance

UFM

541

Daily Report Completed with Error

1

0

Minor

1

300

Grid

Maintenance

UFM

542

Daily Report Failed

1

0

Critical

1

300

Grid

Maintenance

UFM

543

Daily Report Mail Sent successfully

1

0

Info

1

300

Grid

Maintenance

UFM

544

Daily Report Mail Sent Failed

1

0

Minor

1

300

Grid

Maintenance

UFM

545

SM is not responding

1

1

Critical

1

300

Grid

Maintenance

UFM

560

User Connected

Security

UFM

561

User Disconnected

Security

UFM

602

UFM Server Failover

1

1

Critical

1

0

Site

Fabric Notification

UFM

603

Events Suppression

1

0

Critical

0

300

Site

Maintenance

UFM

604

Report Succeeded

1

1

Info

1

300

Grid

Maintenance

UFM

605

Report Failed

1

1

Critical

1

300

Grid

Maintenance

UFM

606

Correction Attempts Paused

1

0

Warning

1

0

Site

Fabric Notification

UFM

701

Non-optimal Link Speed

1

1

Minor

1

0

Port

Hardware

UFM

702

Unhealthy IB Port

1

1

Warning

1

0

Port

Hardware

SM

703

Fabric Collector Connected

1

0

Info

1

0

Grid

Maintenance

UFM

704

Fabric Collector Disconnected

1

1

Critical

1

0

Grid

Maintenance

UFM

750

High data retransmission count on port

1

1

Warning

500

1

Port

Hardware

SM

901

Fabric Configuration Started

0

1

Info

1

0

Grid

Fabric Notification

UFM

902

Fabric Configuration Completed

0

1

Info

1

0

Grid

Fabric Notification

UFM

903

Fabric Configuration Failed

0

1

Critical

1

0

Grid

Fabric Notification

UFM

904

Device Configuration Failure

0

1

Critical

1

0

Device

Fabric Notification

UFM

905

Device Configuration Timeout

0

1

Critical

1

0

Device

Fabric Notification

UFM

906

Provisioning Validation Failure

0

1

Critical

1

0

Grid

Fabric Notification

UFM

907

Switch is Down

1

1

Critical

1

0

Site

Fabric Topology

UFM

908

Switch is Up

1

1

Info

1

300

Site

Fabric Topology

UFM

909

Director Switch is Down

1

1

Critical

1

300

Site

Fabric Topology

UFM

910

Director Switch is Up

1

1

Info

1

0

Site

Fabric Topology

UFM

911

Module Temperature Low Threshold Reached

1

1

Warning

60

300

Module

Hardware

Telemetry

912

Module Temperature High Threshold Reached

1

1

Critical

60

300

Module

Hardware

Telemetry

913

Module High Voltage

1

1

Warning

10

420

Switch

Module Status

Telemetry

914

Module High Current

1

1

Warning

10

420

Switch

Module Status

Telemetry

915

BER_ERROR

1

1

Critical

1e-8

420

Port

Hardware

Telemetry

916

BER_WARNING

1

1

Warning

1e-13

420

Port

Hardware

Telemetry

917

SYMBOL_BER_ERROR

1

1

Critical

10

420

Port

Hardware

Telemetry

918

High Symbol BER reported

1

1

Warning

10

420

Port

Hardware

Telemetry

919

Cable Temperature High

1

1

Critical

0

0

Port

Hardware

Telemetry

920

Cable Temperature Low

1

1

Critical

0

0

Port

Hardware

Telemetry

1300

SM_SAKEY_VIOLATION

1

1

Warning

5

300

Port

Security

SM

1301

SM_SGID_SPOOFED

1

1

Warning

5

300

Port

Security

SM

1302

SM_RATE_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Security

SM

1303

SM_MULTICAST_GROUPS_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Security

SM

1304

SM_SERVICES_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Security

SM

1305

SM_EVENT_SUBSCRIPTION_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Security

SM

1306

Unallowed SM was detected in the fabric

1

1

Warning

0

300

Port

Fabric Notification

SM

1307

SMInfo SET request was received from unallowed SM

1

1

Warning

0

300

Port

Fabric Notification

SM

1309

SM was detected with non-matching SMKey

1

1

Warning

0

300

Port

Fabric Notification

SM

1310

Duplicated node GUID was detected

1

1

Critical

1

0

Device

Fabric Notification

SM

1311

Duplicated port GUID was detected

1

1

Critical

1

0

Port

Fabric Notification

SM

1312

Switch was Rebooted

1

1

Info

1

0

Device

Fabric Notification

UFM

1315

Topo Config File Error

1

1

Critical

1

0

Grid

Fabric Notification

UFM

1316

Topo Config Subnet Mismatch

1

1

Critical

1

0

Grid

Fabric Notification

Topodiff

1400

High Ambient Temperature

1

1

Warning

0

86400

Switch

Hardware

Switch

1401

High Fluid Temperature

1

1

Warning

0

86400

Switch

Hardware

Switch

1402

Low Fluid Level

1

1

Warning

0

86400

Switch

Hardware

Switch

1403

Low Supply Pressure

1

1

Warning

0

86400

Switch

Hardware

Switch

1404

High Supply Pressure

1

1

Warning

0

86400

Switch

Hardware

Switch

1405

Low Return Pressure

1

1

Warning

0

86400

Switch

Hardware

Switch

1406

High Return Pressure

1

1

Warning

0

86400

Switch

Hardware

Switch

1407

High Differential Pressure

1

1

Warning

0

86400

Switch

Hardware

Switch

1408

Low Differential Pressure

1

1

Warning

0

86400

Switch

Hardware

Switch

1409

System Fail Safe

1

1

Warning

0

86400

Switch

Hardware

Switch

1410

Fault Critical

1

1

Critical

0

86400

Switch

Hardware

Switch

1411

Fault Pump1

1

1

Critical

0

86400

Switch

Hardware

Switch

1412

Fault Pump2

1

1

Critical

0

86400

Switch

Hardware

Switch

1413

Fault Fluid Level Critical

1

1

Critical

0

86400

Switch

Hardware

Switch

1414

Fault Fluid Over Temperature

1

1

Critical

0

86400

Switch

Hardware

Switch

1415

Fault Primary DC

1

1

Critical

0

86400

Switch

Hardware

Switch

1416

Fault Redundant DC

1

1

Critical

0

86400

Switch

Hardware

Switch

1417

Fault Fluid Leak

1

1

Critical

0

86400

Switch

Hardware

Switch

1418

Fault Sensor Failure

1

1

Critical

0

86400

Switch

Hardware

Switch

1419

Cooling Device Monitoring Error

1

0

Critical

0

1

Grid

Hardware

Switch

1420

Cooling Device Communication Error

1

1

Critical

0

86400

Switch

Hardware

Switch

1500

New cable detected

1

0

Info

1

0

Link

Security

UFM

1502

Cable detected in a new location

1

0

Warning

1

0

Link

Security

UFM

1503

Duplicate Cable Detected

1

0

Critical

1

0

Link

Security

UFM

1504

SHARP Allocation Succeeded

1

1

Info

1

0

Grid

SHARP

SHARP

1505

SHARP Allocation Failed

1

0

Warning

1

0

Grid

SHARP

SHARP

1506

SHARP Deallocation Succeeded

1

0

Info

1

0

Grid

SHARP

SHARP

1507

SHARP Deallocation Failed

1

0

Warning

1

0

Grid

SHARP

SHARP

1508

Device Collect System Dump Started

1

0

Info

1

300

Device

Maintenance

UFM

1509

Device Collect System Dump Finished

1

0

Info

1

300

Device

Maintenance

UFM

1510

Device Collect System Dump Error

1

0

Critical

1

300

Device

Maintenance

UFM

1511

Virtual Port Added

1

0

Info

1

0

Port

Fabric Notification

SM

1512

Virtual Port Removed

1

0

Warning

1

0

Port

Fabric Notification

SM

1513

Burn Cables Transceivers Started

1

0

Info

1

0

Device

Maintenance

UFM

1514

Burn Cables Transceivers Finished

1

0

Info

1

0

Device

Maintenance

UFM

1515

Burn Cables Transceivers Failed

1

0

Warning

1

0

Device

Maintenance

UFM

1516

Activate Cables Transceivers FW Finished

1

0

Info

1

0

Device

Maintenance

UFM

1517

Activate Cables Transceivers FW Failed

1

0

Warning

1

0

Device

Maintenance

UFM

1520

Aggregation Node Discovery Failed

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1521

Job Started

1

0

Info

1

0

SHARP AM

SHARP

SHARP

1522

Job Ended

1

0

Info

1

0

SHARP AM

SHARP

SHARP

1523

Job Start Failed

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1524

Job Error

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1525

Trap QP Error

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1526

Trap Invalid Request

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1527

Trap Sharp Error

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1528

Trap QP Alloc timeout

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1529

Trap AMKey Violation

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1530

Unsupported Trap

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1531

Reservation Updated

1

0

Info

1

0

SHARP AM

SHARP

SHARP

1532

Sharp is not Responding

1

0

Critical

1

0

SHARP AM

SHARP

SHARP

1533

Agg Node Active

1

0

Info

1

0

SHARP AM

SHARP

SHARP

1534

Agg Node Inactive

1

0

Warning

1

0

SHARP AM

SHARP

SHARP

1535

Trap AMKey Violation Triggered by AM

1

0

Warning

1

0

SHARP AM

SHARP

SHARP

1550

Guids Were Added to Pkey

1

0

Info

1

0

Port

Fabric Notification

UFM

1551

Guids Were Removed from Pkey

1

0

Info

1

0

Port

Fabric Notification

UFM

1600

VS/CC Classes Key Violation

Security

SM

1602

PCI Speed Degradation Warning

1

1

Warning

1

0

Port

Fabric Notification

UFM

1603

PCI Width Degradation Warning

1

1

Warning

1

0

Port

Fabric Notification

UFM
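
Each complete row of the events table above has ten fields, in the order given by the table header (Alarm ID through Source). As a rough sketch of how such rows could be held in code and filtered by severity, the example below groups a flat list of cell values into records. The EventPolicy class, the parse_rows helper, and the field names are hypothetical and do not correspond to any UFM API; the sample cells are copied from the rows for alarm IDs 64, 110, and 112. Rows that list fewer than ten cells (for example, 560 and 561) would need separate handling.

```python
from dataclasses import dataclass

FIELDS_PER_ROW = 10  # number of columns in the events table header


@dataclass(frozen=True)
class EventPolicy:
    """One row of the events table (hypothetical field names)."""
    alarm_id: int
    name: str
    to_log: bool
    alarm: bool
    severity: str
    threshold: float  # float because some rows use values such as 1e-8
    ttl: int
    related_object: str
    category: str
    source: str


def parse_rows(cells):
    """Group a flat list of cell strings into EventPolicy records."""
    if len(cells) % FIELDS_PER_ROW != 0:
        raise ValueError("cell count is not a multiple of the column count")
    rows = []
    for i in range(0, len(cells), FIELDS_PER_ROW):
        c = cells[i:i + FIELDS_PER_ROW]
        rows.append(EventPolicy(
            int(c[0]), c[1], c[2] == "1", c[3] == "1", c[4],
            float(c[5]), int(c[6]), c[7], c[8], c[9],
        ))
    return rows


SAMPLE_CELLS = [
    "64", "GID Address In Service", "1", "0", "Info", "1", "300",
    "Port", "Fabric Notification", "SM",
    "110", "Symbol Error", "1", "1", "Warning", "200", "300",
    "Port", "Hardware", "Telemetry",
    "112", "Link Downed", "1", "1", "Critical", "1", "300",
    "Port", "Hardware", "Telemetry",
]

events = parse_rows(SAMPLE_CELLS)
critical_names = [e.name for e in events if e.severity == "Critical"]
```

With the three sample rows, critical_names contains only "Link Downed".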

© Copyright 2023, NVIDIA. Last updated on Mar 12, 2024.