NVIDIA SHARP Traps
SHARP traps are notifications sent by the switches to sharp_am. The traps alert sharp_am to various SHARP-related events. Some traps can indicate a possible error in the client application logic; others can indicate a potential error in the switch and require investigation by NVIDIA experts.
All traps received by sharp_am are displayed in the sharp_am log file and in UFM events.
All traps are reported with the relevant switch LID. Some traps provide additional information, such as the relevant job, relevant QPs and a syndrome that gives a more precise reason for the trap.
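As a rough illustration of the information a trap carries, the sketch below models a trap as a record with a mandatory switch LID and optional job, QP, and syndrome fields. The trap numbers, trap names, and syndrome labels are taken from the list of traps in this document. The `SharpTrap` class, its field names, and the `describe` helper are hypothetical and do not reflect the actual sharp_am log format or any NVIDIA API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Trap numbers and names as listed in this document's trap table.
TRAP_NAMES = {
    132: "QPError",
    133: "QPAllocationTimeout",
    134: "SharpInvalidRequest",
    135: "SharpError",
    136: "FlushComplete",
    257: "AMKeyViolation",
}

# SharpInvalidRequest (134) syndromes that the table lists as possible
# hints of a problem in the client application.
CLIENT_HINT_SYNDROMES = {
    2: "Invalid Opcode",
    3: "Invalid Vector Size / Invalid Payload Size",
    8: "Child not in group",
    9: "Bad Target HDR",
    14: "Sharp Payload not Aligned",
    15: "ANDR request on non-SAT",
    16: "Group context doesn't exist",
}


@dataclass(frozen=True)
class SharpTrap:
    """Hypothetical in-memory record for one trap (not the sharp_am format)."""
    trap_number: int
    switch_lid: int                 # always reported
    job_id: Optional[int] = None    # only some traps carry a job
    qps: Tuple[int, ...] = ()       # only some traps carry QPs
    syndrome: Optional[int] = None  # only some traps carry a syndrome

    @property
    def name(self) -> str:
        return TRAP_NAMES.get(self.trap_number, f"Unknown({self.trap_number})")

    def hints_client_error(self) -> bool:
        """True when the syndrome is one the table ties to client-side logic."""
        return self.trap_number == 134 and self.syndrome in CLIENT_HINT_SYNDROMES


def describe(trap: SharpTrap) -> str:
    """Build a one-line summary, including only the fields that are present."""
    parts = [f"{trap.name} (trap {trap.trap_number}) from switch LID {trap.switch_lid}"]
    if trap.job_id is not None:
        parts.append(f"job {trap.job_id}")
    if trap.qps:
        parts.append("QPs " + ", ".join(str(q) for q in trap.qps))
    if trap.syndrome is not None:
        text = f"syndrome {trap.syndrome}"
        if trap.hints_client_error():
            label = CLIENT_HINT_SYNDROMES[trap.syndrome]
            text += f" ({label}, possible client application issue)"
        parts.append(text)
    return "; ".join(parts)
```

For example, `describe(SharpTrap(trap_number=257, switch_lid=5))` summarizes a trap that carries only the switch LID, while a `SharpInvalidRequest` trap with a syndrome from the table is additionally flagged as a possible client application issue.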
List of traps
Trap Name |
Trap Number |
Description |
Possible root cause |
AMKeyViolation |
257 |
Indicates that a MAD was sent with the wrong AM-KEY. This is a security warning: libsharp uses the correct AM-KEY and Job-Key, so the sender of the MADs should be checked. |
|
QPError |
132 |
|
|
QPAllocationTimeout |
133 |
Indicates that the switch received a QP Allocation request but did not receive the matching QP Allocation Confirmation before the confirmation timeout expired. |
If the client application terminated abnormally during the initialization phase, there is a small chance that it sent the QP Allocation MAD and terminated before sending the Confirmation MAD. |
SharpInvalidRequest |
134 |
Indicates that an error occurred while trying to aggregate the received data. |
Depending on the syndrome, this error can result from faulty logic in the client application. Syndrome values that can hint at a problem in the client application:
2 - Invalid Opcode: The clients of the application are not using the same aggregation logic.
3 - Invalid Vector Size / Invalid Payload Size: The clients of the application are not using the same buffer size.
8 - Child not in group: A request arrived from a non-member of the group.
9 - Bad Target HDR: TBD?
14 - Sharp Payload not Aligned: TBD?
15 - ANDR request on non-SAT: A request was made to perform ReduceScatter, but the SHARP job was requested without SAT support.
16 - Group context doesn't exist: TBD? |
SharpError |
135 |
|
|
FlushComplete |
136 |