NVIDIA UFM-SDN Appliance User Manual v4.18.4

Supported Traps and Events

Device events are listed as VDM or CDM in the Source column of the Events table in the UFM Web UI. For information about defining event policy, see Events Policy.

Each entry below lists the following fields, in order: Alarm ID, Alarm Name, To Log, Alarm, Default Severity, Default Threshold, Default TTL, Related Object, Category, Description/Message.

116

Port Xmit Discards

1

1

Minor

200

300

Port

Communication Error

Total number of outbound packets discarded by the port when the port is down or congested. Reasons include:

  • Output port is not in the active state

  • Packet length exceeded NeighborMTU

  • Switch Lifetime Limit exceeded

  • Switch HOQ Lifetime Limit exceeded

  • Packets discarded while in VLStalled State

117

Port Xmit Constraint Errors

1

1

Minor

200

300

Port

Communication Error

Total number of packets not transmitted from the switch physical port for the following reasons:

  • FilterRawOutbound is true and packet is raw

  • PartitionEnforcementOutbound is true and packet fails partition key check or IP version check

120

Excessive Buffer Overrun Errors

1

1

Minor

100

300

Port

Communication Error

The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error.

Message: ExcessiveBufferOverrunErrors counter threshold exceeded. Threshold is %d, received value is %d.

121

VL15 Dropped

1

1

Minor

50

300

Port

Communication Error

Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port.

Message: VL15Dropped counter threshold exceeded. Threshold is %d, received value is %d.
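The counter events above share a message of the form "<counter> counter threshold exceeded. Threshold is %d, received value is %d." The following is a minimal sketch of how such a message could be produced from a counter reading. The helper name `check_counter` and the strict greater-than comparison are assumptions made for illustration, not documented UFM behavior; the threshold of 50 is the VL15 Dropped default from the entry above.

```python
from typing import Optional

# Message text follows the VL15 Dropped / ExcessiveBufferOverrunErrors entries.
TEMPLATE = "%s counter threshold exceeded. Threshold is %d, received value is %d."


def check_counter(name: str, value: int, threshold: int) -> Optional[str]:
    """Return the event message if value exceeds threshold, else None.

    Hypothetical helper: the strict '>' comparison is an assumption.
    """
    if value > threshold:
        return TEMPLATE % (name, threshold, value)
    return None


if __name__ == "__main__":
    # 73 dropped VL15 packets against the default threshold of 50.
    print(check_counter("VL15Dropped", 73, 50))
```

A reading at or below the threshold returns `None` in this sketch, so no event text is generated.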

118

Port Receive Constraint Errors

1

1

Minor

200

300

Port

Communication Error

Total number of packets received on the switch physical port that are discarded for the following reasons:

  • FilterRawInbound is true and packet is raw

  • PartitionEnforcementInbound is true and packet fails partition key check or IP version check

145

System Image GUID changed

1

0

Info

1

300

Port

Communication Error

System GUID is changed for the specific LID

115

Port Receive Switch Relay Errors

1

1

Minor

9999

300

Port

Fabric Configuration

Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay.

Reasons for this include:

  • DLID mapping

  • VL mapping

  • Looping (output port = input port)

256

Bad M_Key

1

0

Minor

1

300

Port

Fabric Configuration

Found bad M_Key. Check your HCA driver or partition settings.

SM Trap. Management Key (M_Key): Enforces the control of a master subnet manager. Administered by the subnet manager and used in certain subnet management packets.

Message: Bad M_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)

257

Bad P_Key

1

0

Minor

1

300

Port

Fabric Configuration

Found a bad P_Key. Check your partitioning settings.

SM Trap. Partition Key (P_Key): Enforces membership. Administered through the subnet manager by the partition manager (PM).

Message: Bad P_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)

258

Bad Q_Key

1

0

Minor

1

300

Port

Fabric Configuration

Found bad Q_Key. Security error.

SM Trap. Queue Key (Q_Key): Enforces access rights for reliable and unreliable datagram service (RAW datagram service type not included).

Message: Bad Q_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)

259

Bad P_Key Switch External Port

1

0

Critical

1

300

Port

Fabric Configuration

Found a bad P_Key. Check your partitioning settings.

SM Trap. Partition Key (P_Key): Enforces membership. Administered through the subnet manager by the partition manager (PM).

Message: Bad P_Key switch external port: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, port2(lid%(lid2)d #%(portn2)d)
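The key-violation messages above use named printf-style placeholders such as `%(lid)d` and `%(pkey)08x`. This syntax is compatible with Python's `%` operator applied to a mapping, so the sketch below shows how the Bad P_Key template would render. All field values are made up for illustration, and whether UFM itself renders the message this way is an assumption.

```python
# Template copied from the Bad P_Key entry (alarm ID 257).
BAD_PKEY_TEMPLATE = (
    "Bad P_Key: port1(lid %(lid)d, #%(portn)d) %(pkey)08x, "
    "port2(lid%(lid2)d #%(portn2)d)"
)

# Hypothetical trap fields; the names match the placeholders in the template.
fields = {
    "lid": 12,       # LID of the first port
    "portn": 3,      # port number on the first device
    "pkey": 0x7FFF,  # offending key, rendered as 8 zero-padded hex digits
    "lid2": 27,      # LID of the second port
    "portn2": 1,     # port number on the second device
}

message = BAD_PKEY_TEMPLATE % fields
print(message)
```

The `08x` conversion pads the key to eight lowercase hexadecimal digits, which is why `0x7FFF` appears as `00007fff` in the rendered text.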

64

GID Address In Service

1

0

Info

1

300

Port

Fabric Notification

New GID is connected to the Fabric

65

GID Address Out of Service

1

0

Warning

1

300

Port

Fabric Notification

Existing GID is disconnected from the Fabric

66

New MCast Group Created

1

0

Info

1

300

Port

Fabric Notification

New Multicast Group is created in SM

67

MCast Group Deleted

1

0

Info

1

300

Port

Fabric Notification

Multicast Group is removed from SM.

328

Link is Up

1

0

Info

1

0

Link

Fabric Topology

Event is sent upon discovery of a new link

329

Link is Down

1

0

Warning

1

0

Link

Fabric Topology

Event is sent when an existing link is removed

144

Capability Mask Modified

0

0

Info

1

300

Port

Fabric Notification

Capability Mask of the specific LID is modified

602

UFM Server Failover

1

1

Critical

1

0

Site

Fabric Notification

Failover in UFM Server (in HA mode)

391

Switch Module Removed

1

0

Info

1

0

Switch

Fabric Notification

Module (line card, FAN or PS) is removed from the switch

331

Node is Down

1

0

Warning

1

0

Site

Fabric Topology

Node is disconnected or down

332

Node is Up

1

0

Info

1

300

Site

Fabric Topology

Node is connected or up

907

Switch is Down

1

1

Critical

1

0

Site

Fabric Topology

Switch is disconnected or down

908

Switch is Up

1

1

Info

1

300

Site

Fabric Topology

Switch is connected or up

370

Gateway Ethernet Link State Changed

1

0

Warning

1

0

Gateway

Gateway

Gateway Ethernet Physical link has changed state

371

Gateway Re-register Event Received

1

0

Warning

1

0

Gateway

Gateway

10GbE Gateway received a re-register event from the SM.

372

Number of Gateways is Changed

1

0

Warning

1

0

Gateway

Gateway

Change in the number of 10GbE Gateways has been detected

373

Gateway will be Rebooted

1

0

Warning

1

0

Gateway

Gateway

10GbE Gateway is about to reboot

374

Gateway Reloading Finished

1

0

Info

1

0

Gateway

Gateway

10GbE Gateway has finished reloading.

110

Symbol Error

1

1

Warning

200

300

Port

Hardware

Total number of minor link errors detected on one or more physical lanes

111

Link Error Recovery

1

1

Minor

1

300

Port

Hardware

Total number of times the Port Training state machine has successfully completed the link error recovery process

112

Link Downed

1

1

Critical

1

300

Port

Hardware

Total number of times the Port Training state machine has failed the link error recovery process and downed the link.

113

Port Receive Errors

1

1

Minor

5

300

Port

Hardware

Total number of packets containing an error that were received on a port. These errors include:

  • Local physical errors (CRC, VCRC, FCCRC and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine)

  • Data packet errors

  • Link packet errors

  • Packets discarded due to buffer overrun

114

Port Receive Remote Physical Errors

1

0

Minor

5

300

Port

Hardware

Total number of packets marked with the EBP delimiter received on the port

119

Local Link Integrity Errors

1

1

Minor

5

300

Port

Hardware

The number of times that the frequency of packets containing local physical errors has exceeded LocalPhyErrors.

Message: LocalLinkIntegrityErrors counter threshold exceeded. Threshold is %d, received value is %d

122

Congested Bandwidth (%) Threshold Reached

1

1

Minor

10

300

Port

Hardware

Percent of Congested Bandwidth has exceeded defined threshold.

Note: a different threshold can be set specifically for Tier 4 ports.

131

Non-optimal link width (1X instead of 4X)

1

1

Minor

1

0

Port

Hardware

4X link operates as 1X link

132

Non-optimal link width (1X or 4X instead of 12X)

1

1

Minor

1

0

Port

Hardware

12X link operates as 4X or 1X link

701

Non-optimal Link Speed

1

1

Minor

1

0

Port

Hardware

The link operates at a lower speed than it supports:

  • DDR link operates as SDR

  • QDR link operates as DDR or SDR

  • FDR link operates as QDR, DDR, or SDR

  • EDR link operates as FDR, QDR, DDR, or SDR
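The three non-optimal link events above (131, 132, and 701) all describe a link running below what it supports. A minimal sketch of that comparison follows. The function names and lookup tables are hypothetical, and only the widths and speeds named in these entries are covered.

```python
# Link speeds named in the Non-optimal Link Speed entry, slowest to fastest.
SPEED_ORDER = ["SDR", "DDR", "QDR", "FDR", "EDR"]

# Lane counts for the link widths named in events 131 and 132.
WIDTH_LANES = {"1X": 1, "4X": 4, "12X": 12}


def is_speed_degraded(supported: str, active: str) -> bool:
    """True if the active speed is slower than the supported speed."""
    return SPEED_ORDER.index(active) < SPEED_ORDER.index(supported)


def is_width_degraded(supported: str, active: str) -> bool:
    """True if the active width uses fewer lanes than the supported width."""
    return WIDTH_LANES[active] < WIDTH_LANES[supported]


if __name__ == "__main__":
    print(is_speed_degraded("EDR", "FDR"))  # EDR link running as FDR
    print(is_width_degraded("4X", "1X"))    # 4X link running as 1X
    print(is_speed_degraded("QDR", "QDR"))  # link at its supported speed
```

Both checks are strict comparisons, so a link running at exactly its supported speed or width is not reported as degraded.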

140

Excessive Buffer Overrun Threshold Reached

1

0

Minor

1

300

Port

Hardware

SM Trap. This error is detected when the number of consecutive flow control update periods with at least one overrun error in each period exceeds the OverrunErrors threshold given in the PortInfo attribute.

Message: Excessive Buffer Overrun Threshold is reached: lid %(lid)d, port #%(portn)d

141

Flow Control Update Watchdog Timer Expired

1

0

Warning

1

300

Port

Hardware

SM Trap. The error indicates a failure of the flow control machine at the other end of the link. If the timer expires without receiving an update, a flow control update error has occurred.

Message: Flow Control Update watchdog timer has expired: lid %(lid)d, port #%(portn)d

392

Module Temperature Threshold Reached

1

0

Info

40

0

Module

Hardware

Temperature detected by the module sensor is too high and has exceeded the defined threshold.

350

Environment Added

1

0

Info

1

0

Env

Logical Model

New Logical Environment is created

351

Environment Removed

1

0

Info

1

0

Env

Logical Model

Logical Environment is deleted

306

Logical Server Added

1

0

Info

1

0

Logical Server

Logical Model

New Logical Server or Logical Servers Group is created

307

Logical Server Removed

1

0

Info

1

0

Logical Server

Logical Model

Logical Server or Logical Servers Group is deleted

352

Network Added

1

0

Info

1

0

Network

Logical Model

New Network is created

353

Network Removed

1

0

Info

1

0

Network

Logical Model

Network is deleted

340

Network Interface Added

1

0

Info

1

0

Logical Server

Logical Model

New Network Interface is created

341

Network Interface Removed

1

0

Info

1

0

Logical Server

Logical Model

Network Interface is deleted

313

Compute Resource Allocated

1

0

Info

1

0

Logical Server

Logical Model

A resource is allocated to the Logical Server

312

Compute Resource Released

1

0

Info

1

0

Logical Server

Logical Model

A resource is released from the Logical Server

317

Logical Server Compute Resource is Up

1

1

Warning

1

0

Logical Server

Logical Model

An allocated resource is Up or Connected back

316

Logical Server Compute Resource is Down

1

1

Critical

1

0

Logical Server

Logical Model

An allocated resource is Down or Disconnected

301

Logical Server State Changed

1

0

Info

1

0

Logical Server

Logical Model

Logical Server state is changed

302

Logical Server State Change Failed

1

0

Minor

1

0

Logical Server

Logical Model

Logical Server has failed to change the state.

RM (Resource Manager) Event. Indicates error in Logical Server state change. This error might be caused by any error condition related to the Logical Server resources allocation.

Message: Logical Server changed state from %s to %s

308

Logical Server Resources Allocated

1

0

Info

1

0

Logical Server

Logical Model

New resources are allocated to the Logical Server

314

Logical Server Additional Resources Allocated

1

0

Info

1

0

Logical Server

Logical Model

Additional resources are allocated to the Logical Server

315

Logical Server Resources Released

1

0

Info

1

0

Logical Server

Logical Model

Resources were released from the Logical Server

336

Port Action Succeeded

1

0

Info

1

0

Port

Maintenance

Port Management Action (reset, disable) succeeded

337

Port Action Failed

1

0

Minor

1

0

Port

Maintenance

Port Management Action (reset, disable) failed

338

Device Action Succeeded

1

0

Info

1

0

Port

Maintenance

Device Management Action succeeded

339

Device Action Failed

1

0

Minor

1

0

Port

Maintenance

Device Management Action failed

385

Switch FW Upgrade Started

1

0

Info

1

0

Switch

Maintenance

Switch FW Upgrade process has started

386

Switch SW Upgrade Started

1

0

Info

1

0

Switch

Maintenance

Switch SW Upgrade process has started

381

Switch Upgrade Failed

1

0

Info

1

0

Switch

Maintenance

Switch SW or FW Upgrade process failed

388

Host FW Upgrade Started

1

0

Info

1

0

Computer

Maintenance

Host FW Upgrade process has started

389

Host SW Upgrade Started

1

0

Info

1

0

Computer

Maintenance

Host SW Upgrade process has started

383

Host Upgrade Failed

1

0

Info

1

0

Computer

Maintenance

Host SW or FW Upgrade process failed

502

Device Upgrade Finished

1

0

Info

1

300

Device

Maintenance

Device SW or FW Upgrade has finished

909

Director Switch is Down

1

1

Critical

1

300

Site

Fabric Topology

Director Switch is disconnected or down

910

Director Switch is Up

1

1

Info

1

0

Site

Fabric Topology

Director Switch is connected or up

911

Module Temperature Low Threshold Reached

1

1

Warning

60

300

Module

Hardware

Temperature detected by the module sensor is too high and has exceeded the low threshold

912

Module Temperature High Threshold Reached

1

1

Critical

60

300

Module

Hardware

Temperature detected by the module sensor is too high and has exceeded the high threshold

913

Module High Voltage

1

1

Warning

10

420

Switch

Module Status

Sensor Voltage Threshold Exceeded

914

Module High Current

1

1

Warning

10

420

Switch

Module Status

Sensor Current Threshold Exceeded

394

Module Status FAULT

1

1

Critical

1

420

Switch

Module Status

Module Status FAULT

545

SM is not responding

1

1

Critical

1

300

Grid

Maintenance

SM is not responding

915

BER_ERROR

1

1

Critical

1e-8

420

Port

Hardware

Effective BER Error on port exceeded the threshold

916

BER_WARNING

1

1

Warning

1e-13

420

Port

Hardware

Effective BER Warning on port exceeded the threshold
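The two BER events above differ only in their default thresholds: 1e-8 for BER_ERROR and 1e-13 for BER_WARNING. The sketch below shows one way a measured effective BER could be mapped to these events. The function name `classify_ber` and the strict greater-than comparison are assumptions for illustration, not documented UFM logic.

```python
from typing import Optional

BER_ERROR_THRESHOLD = 1e-8     # default threshold of event 915 (Critical)
BER_WARNING_THRESHOLD = 1e-13  # default threshold of event 916 (Warning)


def classify_ber(effective_ber: float) -> Optional[str]:
    """Map an effective BER to an event name, or None if below both thresholds.

    Hypothetical helper: the error threshold is checked first so that a very
    high BER reports only the more severe event.
    """
    if effective_ber > BER_ERROR_THRESHOLD:
        return "BER_ERROR"
    if effective_ber > BER_WARNING_THRESHOLD:
        return "BER_WARNING"
    return None


if __name__ == "__main__":
    for ber in (1e-6, 1e-10, 1e-15):
        print(ber, classify_ber(ber))
```

Checking the error threshold first means a single reading yields at most one event in this sketch.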

1300

SM_SAKEY_VIOLATION

1

1

Warning

5

300

Port

Fabric Notification

"SA Key Violation Committed"

1301

SM_SGID_SPOOFED

1

1

Warning

5

300

Port

Fabric Notification

"SGID spoofed by VPort/port"

1302

SM_RATE_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Fabric Notification

"Rate Limit Exceeded"

1303

SM_MULTICAST_GROUPS_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Fabric Notification

"Multicast Groups Limit Exceeded"

1304

SM_SERVICES_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Fabric Notification

"Services Limit Exceeded"

1305

SM_EVENT_SUBSCRIPTION_LIMIT_EXCEEDED

1

1

Warning

5

300

Port

Fabric Notification

"Event Subscription Limit Exceeded"

© Copyright 2025, NVIDIA. Last updated on Jul 3, 2025.