NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.11.0

NVIDIA SHARP Traps

SHARP traps are notifications that the switches send to sharp_am to alert it about SHARP-related events. Some traps indicate a possible error in the client application logic. Others indicate a potential error in the switch or in the SHARP software and require investigation by NVIDIA experts.

All traps received by sharp_am are recorded in the sharp_am log file and in UFM events.

All traps are reported with the relevant switch LID. Some traps provide additional information, such as the relevant job, relevant QPs and a syndrome that gives a more precise reason for the trap.
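As an illustration of the information a trap carries, the following is a minimal Python sketch of a record type. The class name `TrapRecord` and all of its field names are hypothetical and are not part of sharp_am or any SHARP API. Only the trap numbers and names are taken from the table in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Trap numbers and names as listed in the table in this section.
TRAP_NAMES = {
    132: "QPError",
    133: "QPAllocation-Timeout",
    134: "SharpInvalidRequest",
    135: "SharpError",
    257: "AMKeyViolation",
}


@dataclass
class TrapRecord:
    """Hypothetical container for one trap; field names are illustrative only."""
    trap_number: int
    switch_lid: int                                 # reported with every trap
    job_id: Optional[int] = None                    # present only for some traps
    qps: List[int] = field(default_factory=list)    # present only for some traps
    syndrome: Optional[int] = None                  # present only for some traps

    @property
    def name(self) -> str:
        # Fall back to a generic label for numbers not in the table.
        return TRAP_NAMES.get(self.trap_number, f"Unknown({self.trap_number})")
```

The optional fields default to empty values because, as noted above, only the switch LID is guaranteed to be present.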

In many cases when traps are sent to sharp_am, an error is also detected by the client app, and the following error is displayed in the client log file/console: "ERROR SHArP Error detected. err code:<> type:<> desc: <>".
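When scanning client logs for this message, a small parser can help. The sketch below is written in Python and assumes that the `<>` placeholders are filled with values that contain no whitespace for the code and type fields, and that the description runs to the end of the line. The function name `parse_client_error` is hypothetical, and the real layout of the filled-in message may differ.

```python
import re
from typing import Dict, Optional

# Pattern built from the message template quoted above. The assumption that
# the code and type values contain no whitespace is ours, not the source's.
_CLIENT_ERROR = re.compile(
    r"SHArP Error detected\.\s*"
    r"err code:\s*(?P<code>\S+)\s+"
    r"type:\s*(?P<type>\S+)\s+"
    r"desc:\s*(?P<desc>.*)"
)


def parse_client_error(line: str) -> Optional[Dict[str, str]]:
    """Return the code, type and desc fields, or None if the line does not match."""
    match = _CLIENT_ERROR.search(line)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}
```

Lines that do not contain the message return `None`, so the function can be applied to every line of a log file and the non-matching results discarded.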

The following table lists all SHARP traps and their properties.

Trap Name | Trap Number | Description | Possible Root Cause

AMKeyViolation

257

Indicates that a MAD was sent with a wrong AM-KEY. This is a security warning: libsharp uses the correct AM-KEY and Job-Key, so the sender of the MADs should be checked.

Check for an unauthorized application in the system.

In a public cloud system, this can suggest that one of the tenants is trying to hack the system.

QPError

132

This trap indicates that the QP has entered an error state.

There are two syndromes:

3 - The QP received more RNR (Receiver Not Ready) responses than it was configured to allow.

4 - The QP did not receive an ACK for a packet and reached the maximum number of retries configured for it.

This trap can occur when there is a physical link issue.

QPAllocation-Timeout

133

This trap indicates that the switch received a QP Allocation request but did not receive the subsequent QP Allocation Confirmation before the timeout expired.

If the client application terminates abnormally during the initialization phase, there is a small chance that it sent the QP Allocation MAD but terminated before sending the Confirmation MAD.

SharpInvalidRequest

134

This trap indicates that an error occurred while trying to aggregate the received data.

Syndromes:

0 - Quota Violation: Occurs when a job exceeds its allocated quota limits. Indicates a resource allocation violation in the SHARP protocol.

1 - Invalid Version: Indicates that the SHARP protocol version in the request is not supported. Used for protocol version compatibility checking.

2 - Invalid Opcode: Indicates that the operation code in the request is not valid. Used to validate SHARP operation types.

3 - Invalid Vector/Payload Size: Indicates that there is a mismatch between the header information and the actual payload size.

4 - Job ID Violation: Occurs when there's a mismatch between the job ID in the QPC and the header tuple.

5 - Invalid Tree ID: Indicates that the specified tree is not active. Used for tree state validation.

6 - Tree Violation: Indicates that there is a mismatch between the tree ID in the QPC and the header tuple.

7 - OST Mismatch: Indicates that at least two packets in a transaction have mismatched properties (for example, different payload sizes).

8 - Child Not In Group: Occurs when a request comes from a non-member of the group. Used for group membership validation.

9 - Bad Target Header: Indicates an invalid target header in the request.

10 - Job Lock Violation: SAT lock for a job failed, because the port semaphore is already locked for another job.

11 - SAT Operation Without Lock: Occurs when a SAT operation is received without a proper lock. Ensures a proper locking sequence.

12 - Unsupported Job ID: The job ID is above the maximum allowed number.

13 - Bad SAT Lock Operation: Indicates that a job tried to lock the same semaphore twice (can happen due to SLB packet retries).

14 - (LLT Only) Sharp Payload Not Aligned: Indicates that the SHARP payload is not properly aligned. Used for memory alignment validation.

15 - ANDR Request On Non SAT: Occurs when an ANDR request is made on a non-SAT operation. Used for operation type validation.

16 - Group Context Does Not Exist: Indicates that the requested group context does not exist. Used for group context validation.

17 - Bad SAT Unlock Operation: Indicates that a SAT unlock operation failed, because the switch is still handling packets related to the job.

18 - No Free Buffers: Indicates that there are no available memory buffers. This can point to a memory leak.

255 - General Error: Triggered by default case in error handling, used when the error cannot be associated with any specific syndrome.

Depending on the syndrome, this error can result from incorrect logic in the client application or from an internal SHARP issue. The following syndromes point to a problem in the client application:

2 - Invalid Opcode: The different ranks of the application are not using the same aggregation/reduction logic.

17 - Bad SAT Unlock Operation: Can occur when client application terminates abnormally, and there are still pending operations related to this application.

All other syndromes point to an issue in the SHARP logic.
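The SharpInvalidRequest syndromes and their likely origin can be captured in a lookup table for triage scripts. The following Python sketch copies the syndrome numbers and names from the list above. The dictionary and function names are hypothetical, and nothing here is part of the SHARP software.

```python
# Syndromes of trap 134 (SharpInvalidRequest), copied from the list above.
INVALID_REQUEST_SYNDROMES = {
    0: "Quota Violation",
    1: "Invalid Version",
    2: "Invalid Opcode",
    3: "Invalid Vector/Payload Size",
    4: "Job ID Violation",
    5: "Invalid Tree ID",
    6: "Tree Violation",
    7: "OST Mismatch",
    8: "Child Not In Group",
    9: "Bad Target Header",
    10: "Job Lock Violation",
    11: "SAT Operation Without Lock",
    12: "Unsupported Job ID",
    13: "Bad SAT Lock Operation",
    14: "Sharp Payload Not Aligned (LLT only)",
    15: "ANDR Request On Non SAT",
    16: "Group Context Does Not Exist",
    17: "Bad SAT Unlock Operation",
    18: "No Free Buffers",
    255: "General Error",
}

# Syndromes that the text above attributes to the client application.
CLIENT_APP_SYNDROMES_134 = {2, 17}


def describe_invalid_request(syndrome: int) -> str:
    """Return a one-line triage summary for a trap 134 syndrome."""
    name = INVALID_REQUEST_SYNDROMES.get(syndrome)
    if name is None:
        return f"syndrome {syndrome}: not listed for trap 134"
    if syndrome in CLIENT_APP_SYNDROMES_134:
        origin = "client application"
    else:
        origin = "SHARP logic"
    return f"syndrome {syndrome} ({name}): likely origin is {origin}"
```

For example, `describe_invalid_request(2)` classifies Invalid Opcode as a client application problem, while `describe_invalid_request(18)` classifies No Free Buffers as a SHARP logic issue.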

SharpError

135

Reports errors in streaming aggregation (SAT) operations.

Syndromes:

0 - Lock Semaphore Timeout: A lock was acquired, but was not used before timeout expired.

1 - OST No Progress Timeout: Started to receive a message, but not all packets of the message arrived before timeout expired.

2 - Bad SAT Request Classification: A transaction arrived with a request to perform an operation that is not supported by SAT.

3 - SAT ANDR Got Multiple Targets: Ambiguous routing for Reduce Scatter. The reduce operation has more than 1 target, but should have only 1.

4 - SAT Bad Target Header: A SAT operation arrived with invalid SAT properties in the Target Header.

5 - SAT No Destination On Root: ANDR request with no destination on root switch.

6 - SAT Exceeds Number Of Outstanding Operations: The maximum number of in-flight transactions per tuple is 64, and this limit was exceeded.

7 - SAT Unbalanced FIFO Data: Unbalanced payload issue.

8 - SAT Unbalanced OST Address: Out-of-sequence issue.

9 - SAT Data Corruption: Data integrity issue.

10 - SAT Bad Data Granularity: Data size issue.

Depending on the syndrome, this error can result from incorrect logic in the client application or from an internal SHARP issue. The following syndromes point to a problem in the client application:

1 - OST No Progress Timeout: Can occur due to physical link issues, or when a client application stalls for a long time while transmitting data, for example when it gets no CPU time because another application is loading the CPU.

7 - SAT Unbalanced FIFO Data: The different ranks of the application are not using the same aggregation/reduction length.

10 - SAT Bad Data Granularity: Can occur if the application sent truncated data.

All other syndromes point to an issue in the SHARP logic.
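The SharpError syndromes can be handled the same way, this time returning the root-cause hint given above for the client-side syndromes. This is a minimal Python sketch. The names `SHARP_ERROR_SYNDROMES`, `CLIENT_HINTS_135` and `triage_sharp_error` are hypothetical. The syndrome names are copied from the list above, and the hints are shortened from it.

```python
from typing import Tuple

# Syndromes of trap 135 (SharpError), copied from the list above.
SHARP_ERROR_SYNDROMES = {
    0: "Lock Semaphore Timeout",
    1: "OST No Progress Timeout",
    2: "Bad SAT Request Classification",
    3: "SAT ANDR Got Multiple Targets",
    4: "SAT Bad Target Header",
    5: "SAT No Destination On Root",
    6: "SAT Exceeds Number Of Outstanding Operations",
    7: "SAT Unbalanced FIFO Data",
    8: "SAT Unbalanced OST Address",
    9: "SAT Data Corruption",
    10: "SAT Bad Data Granularity",
}

# Hints for the syndromes that the text above attributes to the client side.
CLIENT_HINTS_135 = {
    1: "physical link issue, or the client application stalled while transmitting",
    7: "ranks are not using the same aggregation/reduction length",
    10: "the application may have sent truncated data",
}

DEFAULT_HINT = "possible issue in SHARP logic"


def triage_sharp_error(syndrome: int) -> Tuple[str, str]:
    """Return (syndrome name, root-cause hint) for a trap 135 syndrome."""
    name = SHARP_ERROR_SYNDROMES.get(syndrome, "Unlisted syndrome")
    hint = CLIENT_HINTS_135.get(syndrome, DEFAULT_HINT)
    return name, hint
```

A script could call `triage_sharp_error` with the syndrome value from a trap and print both elements next to the switch LID to speed up a first assessment.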

© Copyright 2025, NVIDIA. Last updated on May 8, 2025.