NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.10.2

NVIDIA SHARP Traps

SHARP traps are notifications sent by the switches to sharp_am. The traps alert sharp_am about various SHARP-related events. Some traps can indicate a possible error in the client application logic; others can indicate a potential error in the switch and require investigation by NVIDIA experts.

All traps received by sharp_am are displayed in the sharp_am log file and in UFM events.

All traps are reported with the relevant switch LID. Some traps provide additional information, such as the relevant job, relevant QPs and a syndrome that gives a more precise reason for the trap.
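
The information carried by a trap can be modeled as a small record, which may help when writing tooling around the events. The following is a minimal sketch only: the class name `TrapRecord` and all field names are hypothetical and are not taken from sharp_am or UFM. It only encodes the statement above that the switch LID is always present while the job, QPs, and syndrome are optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrapRecord:
    """Hypothetical in-memory model of one SHARP trap (not a sharp_am type)."""

    trap_number: int
    switch_lid: int                                # reported with every trap
    job_id: Optional[int] = None                   # present only for some traps
    qps: List[int] = field(default_factory=list)   # present only for some traps
    syndrome: Optional[int] = None                 # more precise reason, if given

    def has_details(self) -> bool:
        """Return True if the trap carries anything beyond the switch LID."""
        return (
            self.job_id is not None
            or bool(self.qps)
            or self.syndrome is not None
        )
```

For example, `TrapRecord(trap_number=134, switch_lid=12, syndrome=3).has_details()` evaluates to `True`, while a record holding only a trap number and a LID evaluates to `False`.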

List of traps

Trap Name

Trap Number

Description

Possible root cause

AMKeyViolation

257

Indicates that a MAD was sent with the wrong AM-KEY. This is a security warning: libsharp uses the correct AM-KEY and Job-Key, so the sender of the MADs should be checked.

QPError

132

QPAllocationTimeout

133

Indicates that the switch received a QP Allocation request but did not receive the corresponding QP Allocation Confirmation before the timeout expired.

If the client application terminates abnormally during the initialization phase, there is a small chance that it sent the QP Allocation MAD and then terminated before sending the Confirmation MAD.

SharpInvalidRequest

134

Indicates that an error occurred while trying to aggregate the received data.

Depending on the syndrome, this error can result from faulty logic in the client application. Syndrome values that can hint at a problem in the client application:

2 - Invalid Opcode: The clients of the application are not using the same aggregation logic.

3 - Invalid Vector Size / Invalid Payload Size: The clients of the application are not using the same buffer size.

8 - Child not in group: A request arrived from a non-member of the group.

9 - Bad Target HDR: TBD?

14 - Sharp Payload not Aligned: TBD?

15 - ANDR request on non-SAT: A request was made to perform ReduceScatter, but the SHARP job was requested without SAT support.

16 - Group context doesn't exist: TBD?

SharpError

135

FlushComplete

136
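
The trap numbers and SharpInvalidRequest syndromes listed above can be captured in a lookup table for use in monitoring scripts. This is a minimal sketch: the names `TRAP_NAMES`, `CLIENT_HINT_SYNDROMES`, and `describe_trap` are hypothetical and are not part of sharp_am or UFM. The numeric values and labels are copied from the table, and the sketch makes no assumption about how a trap appears in the sharp_am log.

```python
# Trap number -> trap name, as listed in the table above.
TRAP_NAMES = {
    132: "QPError",
    133: "QPAllocationTimeout",
    134: "SharpInvalidRequest",
    135: "SharpError",
    136: "FlushComplete",
    257: "AMKeyViolation",
}

# SharpInvalidRequest (trap 134) syndromes that the table flags as possible
# hints of a problem in the client application.
CLIENT_HINT_SYNDROMES = {
    2: "Invalid Opcode",
    3: "Invalid Vector Size / Invalid Payload Size",
    8: "Child not in group",
    9: "Bad Target HDR",
    14: "Sharp Payload not Aligned",
    15: "ANDR request on non-SAT",
    16: "Group context doesn't exist",
}


def describe_trap(trap_number, syndrome=None):
    """Return a short human-readable label for a trap (hypothetical helper)."""
    name = TRAP_NAMES.get(trap_number, f"Unknown trap {trap_number}")
    if trap_number == 134 and syndrome in CLIENT_HINT_SYNDROMES:
        label = CLIENT_HINT_SYNDROMES[syndrome]
        return f"{name} (syndrome {syndrome}: {label}; check client application logic)"
    if syndrome is not None:
        return f"{name} (syndrome {syndrome})"
    return name
```

A call such as `describe_trap(134, 3)` yields a label that names the Invalid Vector Size / Invalid Payload Size syndrome, whereas trap numbers missing from the table fall back to an "Unknown trap" label.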

© Copyright 2025, NVIDIA. Last updated on Dec 23, 2024.