TensorRT-RTX 1.1.1
Helper for constructing an attention that consumes query, key and value tensors. More...
#include <NvInfer.h>
Public Member Functions

bool setNormalizationOperation (AttentionNormalizationOp op) noexcept
    Set the normalization operation for the attention.

AttentionNormalizationOp getNormalizationOperation () const noexcept
    Get the normalization operation for the attention.

bool setMask (ITensor &mask) noexcept
    Set whether a mask will be used for the normalization operation.

ITensor * getMask () noexcept
    Get the optional mask in attention.

bool setCausal (bool isCausal) noexcept
    Set whether the attention will run a causal inference. Cannot be used together with setMask().

bool getCausal () const noexcept
    Get whether the attention will run a causal inference.

bool setDecomposable (bool decomposable) noexcept
    Set whether the attention can be decomposed to use multiple kernels if no fused kernel is found.

bool getDecomposable () const noexcept
    Get whether the attention can be decomposed to use multiple kernels if no fused kernel is found.

bool setInput (int32_t index, ITensor &input) noexcept
    Append or replace an input of this layer with a specific tensor.

int32_t getNbInputs () const noexcept
    Get the number of inputs of IAttention. IAttention has three inputs.

ITensor * getInput (int32_t index) const noexcept
    Get the IAttention input corresponding to the given index.

int32_t getNbOutputs () const noexcept
    Get the number of outputs of the layer. IAttention has one output.

ITensor * getOutput (int32_t index) const noexcept
    Get the IAttention output corresponding to the given index. IAttention has only one output.

bool setName (char const *name) noexcept
    Set the name of the attention.

char const * getName () const noexcept
    Return the name of the attention.

Protected Member Functions

virtual ~IAttention () noexcept = default

Inherited from INoCopy:

INoCopy () = default
virtual ~INoCopy () = default
INoCopy (INoCopy const &other) = delete
INoCopy & operator= (INoCopy const &other) = delete
INoCopy (INoCopy &&other) = delete
INoCopy & operator= (INoCopy &&other) = delete

Protected Attributes

apiv::VAttention * mImpl
Detailed Description

Helper for constructing an attention that consumes query, key and value tensors.
An attention subgraph implicitly includes three main components: two MatrixMultiply layers, known as BMM1 and BMM2, and one normalization operation, which defaults to Softmax. By default, IAttention is not decomposable, and TensorRT will try to use a single fused kernel, which may be more efficient than expressing the subgraph without IAttention. Setting the IAttention to decomposable=true allows it to be decomposed into multiple kernels if no fused kernel is found.
Query    Key     Value   Mask (optional)
  |       |        |       |
  |   Transpose    |       |
  |       |        |       |
  +--BMM1-+        |       |
      |            |       |
      *------------|-------+
      |            |
Normalization      |
      |            |
      +----BMM2----+
            |
          Output
The attention has the following inputs, in order of input index:

Input 0 is the input query tensor.
Input 1 is the input key tensor.
Input 2 is the input value tensor.
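A minimal build sketch showing how the three inputs feed an attention. The factory method `INetworkDefinition::addAttention(query, key, value)` and the `AttentionNormalizationOp::kSOFTMAX` enumerator are assumptions here; verify both against the TensorRT-RTX `NvInfer.h` header before use.

```cpp
#include <NvInfer.h>

using namespace nvinfer1;

// Sketch: wire query/key/value network inputs into an IAttention.
// addAttention and kSOFTMAX are assumptions; verify against NvInfer.h.
IAttention* buildAttention(INetworkDefinition& network)
{
    // Illustrative [batch, heads, seqLen, headDim] half-precision shapes.
    Dims4 dims{1, 8, 128, 64};
    ITensor* q = network.addInput("query", DataType::kHALF, dims);
    ITensor* k = network.addInput("key",   DataType::kHALF, dims);
    ITensor* v = network.addInput("value", DataType::kHALF, dims);

    IAttention* attn = network.addAttention(*q, *k, *v);
    attn->setNormalizationOperation(AttentionNormalizationOp::kSOFTMAX); // the default
    attn->setDecomposable(true); // allow multi-kernel fallback if no fused kernel is found
    network.markOutput(*attn->getOutput(0));
    return attn;
}
```

The shapes and tensor names are illustrative only; any shapes acceptable to BMM1/BMM2 will do.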
~IAttention()

virtual ~IAttention () noexcept = default    (protected, virtual)
bool getCausal () const noexcept    (inline)
Get whether the attention will run a causal inference.
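For autoregressive decoding, causal inference replaces an explicit mask. A small sketch (the helper name is illustrative, not part of the API):

```cpp
#include <NvInfer.h>

// Enable causal attention: position i attends only to positions j <= i.
// setCausal is documented as mutually exclusive with setMask, so bail out
// if a mask is already attached. Helper name is illustrative.
bool enableCausal(nvinfer1::IAttention& attn)
{
    if (attn.getMask() != nullptr)
        return false; // a mask is set; setCausal cannot be combined with it
    return attn.setCausal(true) && attn.getCausal();
}
```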

bool getDecomposable () const noexcept    (inline)
Get whether the attention can be decomposed to use multiple kernels if no fused kernel is found.

ITensor * getInput (int32_t index) const noexcept    (inline)
Get the IAttention input corresponding to the given index.
Parameters:
    index: The index of the input tensor.
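A small sketch walking the layer's fixed inputs via getNbInputs() and getInput() (the helper name is illustrative):

```cpp
#include <NvInfer.h>
#include <cstdio>

// Walk the three fixed inputs (0 = query, 1 = key, 2 = value).
void printAttentionInputs(nvinfer1::IAttention const& attn)
{
    for (int32_t i = 0; i < attn.getNbInputs(); ++i) // always 3 for IAttention
    {
        nvinfer1::ITensor* t = attn.getInput(i);
        std::printf("input %d: %s\n", i, t ? t->getName() : "(null)");
    }
}
```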

ITensor * getMask () noexcept    (inline)
Get the optional mask in attention.

char const * getName () const noexcept    (inline)
Return the name of the attention.

int32_t getNbInputs () const noexcept    (inline)
Get the number of inputs of IAttention. IAttention has three inputs.

int32_t getNbOutputs () const noexcept    (inline)
Get the number of outputs of a layer. IAttention has one output.

AttentionNormalizationOp getNormalizationOperation () const noexcept    (inline)
Get the normalization operation for the attention.

ITensor * getOutput (int32_t index) const noexcept    (inline)
Get the IAttention output corresponding to the given index. IAttention has only one output.
Parameters:
    index: The index of the output tensor.

bool setCausal (bool isCausal) noexcept    (inline)
Set whether the attention will run a causal inference. Cannot be used together with setMask().

bool setDecomposable (bool decomposable) noexcept    (inline)
Set whether the attention can be decomposed to use multiple kernels if no fused kernel is found.

bool setInput (int32_t index, ITensor &input) noexcept    (inline)
Append or replace an input of this layer with a specific tensor.
Parameters:
    index: The index of the input to modify.
    input: The new input tensor.
The indices are as follows:
Input 0 is the input query tensor.
Input 1 is the input key tensor.
Input 2 is the input value tensor.
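For example, replacing the query input by index (the helper name and the origin of `newQuery` are illustrative):

```cpp
#include <NvInfer.h>

// Replace the query input (index 0) with another tensor of compatible shape.
// 'newQuery' is assumed to come from elsewhere in the same network.
bool swapQuery(nvinfer1::IAttention& attn, nvinfer1::ITensor& newQuery)
{
    return attn.setInput(0, newQuery); // 0 = query, 1 = key, 2 = value
}
```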

bool setMask (ITensor &mask) noexcept    (inline)
Set whether a mask will be used for the normalization operation.
Parameters:
    mask: The mask tensor, either of type kBOOL or of the same data type as the BMM1 output, with shape [batchSize, sequenceLengthQuery, sequenceLengthKeyValue]. For a kBOOL mask, a true value indicates that the corresponding position is allowed to attend. For other data types, the mask values are added to the BMM1 output (known as an add mask).
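A sketch attaching a boolean padding mask supplied as a network input (the helper name and dimension parameters are illustrative):

```cpp
#include <NvInfer.h>

// Attach a boolean padding mask; true marks positions that may be attended to.
// Shape must be [batchSize, sequenceLengthQuery, sequenceLengthKeyValue].
// Helper name and dimensions are illustrative.
bool attachMask(nvinfer1::INetworkDefinition& network, nvinfer1::IAttention& attn,
                int32_t batch, int32_t seqQ, int32_t seqKV)
{
    nvinfer1::ITensor* mask = network.addInput(
        "mask", nvinfer1::DataType::kBOOL, nvinfer1::Dims3{batch, seqQ, seqKV});
    // Note: a mask cannot be combined with setCausal(true).
    return mask != nullptr && attn.setMask(*mask);
}
```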

bool setName (char const *name) noexcept    (inline)
Set the name of the attention.
The name is used in error diagnostics. This method copies the name string.
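Because the name string is copied, the caller's buffer need not outlive the call; a small sketch (helper name is illustrative):

```cpp
#include <NvInfer.h>
#include <cassert>
#include <cstring>

// setName copies the string, so a short-lived buffer is safe to reuse.
void nameAttention(nvinfer1::IAttention& attn)
{
    char buf[] = "decoder_self_attention";
    attn.setName(buf);
    assert(std::strcmp(attn.getName(), buf) == 0); // name was copied, not aliased
}
```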

bool setNormalizationOperation (AttentionNormalizationOp op) noexcept    (inline)
Set the normalization operation for the attention.

apiv::VAttention * mImpl    (protected)
Copyright © 2024 NVIDIA Corporation