NVIDIA SHARP Traps
SHARP traps are notifications that the switches send to sharp_am to alert it about various SHARP-related events. Some traps can indicate a possible error in the client application logic; others can indicate a potential error in the switch or in the SHARP software and require investigation by NVIDIA experts.
All traps received by sharp_am are displayed in the sharp_am log file and in UFM events.
All traps are reported with the relevant switch LID. Some traps provide additional information, such as the relevant job, relevant QPs and a syndrome that gives a more precise reason for the trap.
In many cases when traps are sent to sharp_am, an error is also detected by the client app, and the following error is displayed in the client log file/console: "ERROR SHArP Error detected. err code:<> type:<> desc: <>".
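As a rough illustration, the client-side message quoted above could be picked out of a log with a small script. This is a minimal sketch, not part of SHARP: the function name `find_sharp_errors` is hypothetical, and because the message template leaves the field values unspecified (`<>`), each field is captured as free text rather than assumed to be numeric.

```python
import re

# Pattern built from the message template quoted above. The contents of the
# three fields are not documented here, so each is captured as free text.
SHARP_ERROR_RE = re.compile(
    r"ERROR SHArP Error detected\.\s*"
    r"err code:\s*(?P<code>.*?)\s*"
    r"type:\s*(?P<type>.*?)\s*"
    r"desc:\s*(?P<desc>.*)$"
)


def find_sharp_errors(lines):
    """Return a list of (code, type, desc) tuples for matching log lines.

    Hypothetical helper for illustration; it is not a SHARP API.
    """
    found = []
    for line in lines:
        match = SHARP_ERROR_RE.search(line)
        if match:
            found.append(
                (match.group("code"), match.group("type"), match.group("desc").strip())
            )
    return found


if __name__ == "__main__":
    # Made-up sample lines; the field values are placeholders, not real output.
    sample = [
        "some unrelated log line",
        "ERROR SHArP Error detected. err code:X type:Y desc: example text",
    ]
    for code, err_type, desc in find_sharp_errors(sample):
        print(code, err_type, desc)
```

Matching such lines against the sharp_am log by timestamp is one way to tie a client-side failure to the trap that accompanied it.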
The following table lists all SHARP traps and their properties.
Trap Name | Trap Number | Description | Possible Root Cause |
AMKeyViolation | 257 | Indicates that someone tried to send a MAD with the wrong AM-KEY. This is a security warning: libsharp uses the correct AM-KEY and Job-Key, so the sender of the MADs should be checked. | Check for an unauthorized application in the system. In a public cloud system, this can suggest that one of the tenants is trying to hack the system. |
QPError | 132 | Indicates that the QP has entered an error state. There are two syndromes: 3 - The QP received more RNRs than it was configured to allow. 4 - The QP did not receive an ACK for a packet and reached the maximum number of retries configured for it. | This trap can occur when there is a physical link issue. |
QPAllocation-Timeout | 133 | Indicates that the switch received a QP Allocation request but did not receive the QP Allocation Confirmation before the confirmation timeout expired. | If the client application terminated abnormally during the initialization phase, there is a small probability that it sent the QP Allocation MAD and terminated before sending the Confirmation MAD. |
SharpInvalidRequest | 134 | Indicates an error while trying to aggregate the received data. Syndromes: 0 - Quota Violation: A job exceeded its allocated quota limits; indicates a resource allocation violation in the SHARP protocol. 1 - Invalid Version: The SHARP protocol version in the request is not supported; used for protocol version compatibility checking. 2 - Invalid Opcode: The operation code in the request is not valid; used to validate SHARP operation types. 3 - Invalid Vector/Payload Size: The header information does not match the actual payload size. 4 - Job ID Violation: The job ID in the QPC does not match the header tuple. 5 - Invalid Tree ID: The specified tree is not active; used for tree state validation. 6 - Tree Violation: The tree ID in the QPC does not match the header tuple. 7 - OST Mismatch: At least two packets in a transaction have mismatched properties (for example, different payload sizes). 8 - Child Not In Group: The request came from a non-member of the group; used for group membership validation. 9 - Bad Target Header: The request contains an invalid target header. 10 - Job Lock Violation: The SAT lock for a job failed because the port semaphore is already locked for another job. 11 - SAT Operation Without Lock: A SAT operation was received without the proper lock; ensures the proper locking sequence. 12 - Unsupported Job ID: The job ID is above the maximum allowed number. 13 - Bad SAT Lock Operation: A job tried to lock the same semaphore twice (can happen due to SLB packet retries). 14 - (LLT Only) Sharp Payload Not Aligned: The SHARP payload is not properly aligned; used for memory alignment validation. 15 - ANDR Request On Non SAT: An ANDR request was made on a non-SAT operation; used for operation type validation. 16 - Group Context Does Not Exist: The requested group context does not exist; used for group context validation. 17 - Bad SAT Unlock Operation: A SAT unlock operation failed because the switch is still handling packets related to the job. 18 - No Free Buffers: No memory buffers are available. This can hint at a memory leak. 255 - General Error: Triggered by the default case in error handling; used when the error cannot be associated with any specific syndrome. | Depending on the syndrome, this error can result from incorrect client application logic or from an internal SHARP issue. The following syndromes hint at a problem in the client application: 2 - Invalid Opcode: The different ranks of the application are not using the same aggregation/reduction logic. 17 - Bad SAT Unlock Operation: Can occur when the client application terminates abnormally while operations related to it are still pending. All other syndromes hint at an issue in the SHARP logic. |
SharpError | 135 | Reports streaming aggregation (SAT) errors. Syndromes: 0 - Lock Semaphore Timeout: A lock was acquired but was not used before the timeout expired. 1 - OST No Progress Timeout: A message started to arrive, but not all of its packets arrived before the timeout expired. 2 - Bad SAT Request Classification: A transaction arrived with a request to perform an operation that is not supported by SAT. 3 - SAT ANDR Got Multiple Targets: Ambiguous routing for Reduce Scatter. The reduce operation has more than one target but should have only one. 4 - SAT Bad Target Header: A SAT operation has invalid SAT properties in the Target Header. 5 - SAT No Destination On Root: An ANDR request has no destination on the root switch. 6 - SAT Exceeds Number Of Outstanding Operations: The maximum number of transactions 'in the air' per tuple is 64, and this limit was exceeded. 7 - SAT Unbalanced FIFO Data: Unbalanced payload issue. 8 - SAT Unbalanced OST Address: Out-of-sequence issue. 9 - SAT Data Corruption: Data integrity issue. 10 - SAT Bad Data Granularity: Data size issue. | Depending on the syndrome, this error can result from incorrect client application logic or from an internal SHARP issue. The following syndromes hint at a problem in the client application: 1 - OST No Progress Timeout: Can occur due to physical link issues, or when a client application stalls for a long time while transmitting data (for example, when it gets no CPU resources because of the high CPU load of another application). 7 - SAT Unbalanced FIFO Data: The different ranks of the application are not using the same aggregation/reduction length. 10 - SAT Bad Data Granularity: Can occur if the application sent truncated data. All other syndromes hint at an issue in the SHARP logic. |
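
For a first-pass triage, the "Possible Root Cause" column of the table above can be encoded as a small lookup. The following is a minimal sketch under stated assumptions: the helper `triage` and the short hint labels are hypothetical and not part of SHARP or UFM; only the trap numbers, trap names, and the syndromes flagged as client-application hints are taken from the table.

```python
# Trap numbers and names as listed in the table above.
TRAP_NAMES = {
    257: "AMKeyViolation",
    132: "QPError",
    133: "QPAllocation-Timeout",
    134: "SharpInvalidRequest",
    135: "SharpError",
}

# Syndromes that the table singles out as hinting at the client application.
CLIENT_APP_SYNDROMES = {
    134: {2, 17},
    135: {1, 7, 10},
}

# Traps whose root-cause hint in the table does not depend on a syndrome.
# The label strings are shorthand chosen for this sketch.
FIXED_HINTS = {
    257: "unauthorized sender",
    132: "physical link",
    133: "client application",
}


def triage(trap_number, syndrome=None):
    """Return (trap_name, hint) for a trap. Hypothetical helper, not a SHARP API."""
    name = TRAP_NAMES.get(trap_number)
    if name is None:
        return ("unknown", "unknown")
    if trap_number in FIXED_HINTS:
        return (name, FIXED_HINTS[trap_number])
    if syndrome is None:
        return (name, "syndrome required")
    if syndrome in CLIENT_APP_SYNDROMES[trap_number]:
        return (name, "client application")
    return (name, "SHARP logic")


if __name__ == "__main__":
    print(triage(134, 2))   # ('SharpInvalidRequest', 'client application')
    print(triage(135, 3))   # ('SharpError', 'SHARP logic')
```

The hint is only a starting point: as the table notes, syndrome 1 of SharpError can also be caused by a physical link issue, so a "client application" result does not rule out other causes.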