TensorRT 11.0.0
Helper for constructing an attention that consumes query, key and value tensors.
#include <NvInfer.h>
Public Member Functions
| Return type | Member function | Description |
| bool | setNormalizationOperation (AttentionNormalizationOp op) noexcept | Set the normalization operation for the attention. |
| AttentionNormalizationOp | getNormalizationOperation () const noexcept | Get the normalization operation for the attention. |
| bool | setMask (ITensor &mask) noexcept | Set whether a mask will be used for the normalization operation. |
| ITensor * | getMask () noexcept | Get the optional mask in attention. |
| TRT_DEPRECATED bool | setCausal (bool isCausal) noexcept | Set whether the attention will run a causal inference. Cannot be used together with setMask(). |
| TRT_DEPRECATED bool | getCausal () const noexcept | Get whether the attention will run a causal inference. |
| bool | setCausalKind (CausalMaskKind kind) noexcept | Set the causal mask alignment orientation for the attention. |
| CausalMaskKind | getCausalKind () const noexcept | Get the causal mask alignment orientation for the attention. |
| bool | setDecomposable (bool decomposable) noexcept | Set whether the attention can be decomposed to use multiple kernels if no fused kernel support is found. |
| bool | getDecomposable () const noexcept | Get whether the attention can be decomposed to use multiple kernels if no fused kernel support is found. |
| bool | setInput (int32_t index, ITensor &input) noexcept | Append or replace an input of this layer with a specific tensor. |
| int32_t | getNbInputs () const noexcept | Get the number of inputs of IAttention. IAttention has three inputs. |
| ITensor * | getInput (int32_t index) const noexcept | Get the IAttention input corresponding to the given index. |
| int32_t | getNbOutputs () const noexcept | Get the number of outputs of a layer. IAttention has one output. |
| ITensor * | getOutput (int32_t index) const noexcept | Get the IAttention output corresponding to the given index. IAttention has only one output. |
| bool | setName (char const *name) noexcept | Set the name of the attention. |
| char const * | getName () const noexcept | Return the name of the attention. |
| bool | setNormalizationQuantizeScale (ITensor &tensor) noexcept | Set the quantization scale for the attention normalization output. |
| ITensor * | getNormalizationQuantizeScale () const noexcept | Get the quantization scale for the attention normalization output. |
| bool | setNormalizationQuantizeToType (DataType type) noexcept | Set the datatype the attention normalization is quantized to. |
| DataType | getNormalizationQuantizeToType () const noexcept | Get the datatype the attention normalization is quantized to. |
| bool | setMetadata (char const *metadata) noexcept | Set the metadata for IAttention. |
| char const * | getMetadata () const noexcept | Get the metadata of IAttention. |
| bool | setNbRanks (int32_t nbRanks) noexcept | Set the number of ranks for multi-device attention execution. |
| int32_t | getNbRanks () const noexcept | Get the number of ranks for multi-device execution. |
| TRT_NODISCARD bool | setQueryForm (AttentionIOForm form) noexcept | Set the query form. |
| TRT_NODISCARD AttentionIOForm | getQueryForm () const noexcept | Get the query form. |
| TRT_NODISCARD bool | setKeyValueForm (AttentionIOForm form) noexcept | Set the key-value form. |
| TRT_NODISCARD AttentionIOForm | getKeyValueForm () const noexcept | Get the key-value form. |
| TRT_NODISCARD bool | setQueryLengths (ITensor *lengths) noexcept | Set the query lengths tensor. |
| TRT_NODISCARD ITensor * | getQueryLengths () const noexcept | Get the query lengths tensor. |
| TRT_NODISCARD bool | setKeyValueLengths (ITensor *lengths) noexcept | Set the key-value lengths tensor. |
| TRT_NODISCARD ITensor * | getKeyValueLengths () const noexcept | Get the key-value lengths tensor. |
Protected Member Functions
| virtual | ~IAttention () noexcept=0 |
Protected Member Functions inherited from nvinfer1::INoCopy
| | INoCopy ()=default |
| virtual | ~INoCopy ()=default |
| | INoCopy (INoCopy const &other)=delete |
| INoCopy & | operator= (INoCopy const &other)=delete |
| | INoCopy (INoCopy &&other)=delete |
| INoCopy & | operator= (INoCopy &&other)=delete |
Protected Attributes
| apiv::VAttention * | mImpl |
Helper for constructing an attention that consumes query, key and value tensors.
An attention subgraph implicitly includes three main components: two MatrixMultiply layers, known as BMM1 and BMM2, and one normalization operation, which defaults to Softmax. By default, IAttention is not decomposable, and TensorRT tries to use a single fused kernel, which may be more efficient than expressing the subgraph without IAttention. Calling setDecomposable(true) allows IAttention to be decomposed into multiple kernels if no fused kernel support is found.
  Query      Key        Value     Mask (optional)   NormalizationQuantizeScale (optional)
    |         |           |             |                        |
    |     Transpose       |             |                        |
    |         |           |             |                        |
    ----BMM1----          |             |                        |
         |                |             |                        |
         *-------------------------------                        |
         |                |                                      |
   Normalization          |                                      |
         |                |                                      |
         *--------------------------------------------------------
         |                |
         -------BMM2-------
                 |
              Output
The attention has the following inputs, in order of input index: the query tensor (input 0), the key tensor (input 1), and the value tensor (input 2).
The attention has one output.
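The data flow in the diagram above (BMM1 on the query and the transposed key, normalization, then BMM2 with the value) can be sketched for a single head in plain C++. This is only a reference illustration, not TensorRT code: `referenceAttention` and the `Matrix` alias are hypothetical names, Softmax is assumed as the normalization, and no mask, quantization, or scaling factor is applied because the diagram shows no scaling step.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Hypothetical reference helper, not part of the TensorRT API.
// q is [sQ x d], k is [sKV x d], v is [sKV x dV]; the result is [sQ x dV].
Matrix referenceAttention(Matrix const& q, Matrix const& k, Matrix const& v)
{
    assert(k.size() == v.size()); // One value row per key row.
    std::size_t const sQ = q.size();
    std::size_t const sKV = k.size();
    std::size_t const dV = v.empty() ? 0 : v[0].size();
    Matrix out(sQ, std::vector<float>(dV, 0.0F));
    if (sKV == 0)
    {
        return out; // Nothing to attend to.
    }
    for (std::size_t i = 0; i < sQ; ++i)
    {
        // BMM1: scores[j] = dot(q[i], k[j]), i.e. Q multiplied by transposed K.
        std::vector<float> scores(sKV, 0.0F);
        for (std::size_t j = 0; j < sKV; ++j)
        {
            for (std::size_t c = 0; c < q[i].size(); ++c)
            {
                scores[j] += q[i][c] * k[j][c];
            }
        }
        // Normalization: numerically stable Softmax over the key axis.
        float const maxScore = *std::max_element(scores.begin(), scores.end());
        float sum = 0.0F;
        for (float& s : scores)
        {
            s = std::exp(s - maxScore);
            sum += s;
        }
        // BMM2: probability-weighted sum of the value rows.
        for (std::size_t j = 0; j < sKV; ++j)
        {
            for (std::size_t c = 0; c < dV; ++c)
            {
                out[i][c] += (scores[j] / sum) * v[j][c];
            }
        }
    }
    return out;
}
```

With identical key rows, every key receives the same weight, so the output is the mean of the value rows.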
virtual IAttention::~IAttention () noexcept = 0 [inline, protected, pure virtual, default, noexcept]
TRT_DEPRECATED bool IAttention::getCausal () const noexcept [inline, noexcept]
Get whether the attention will run a causal inference.
CausalMaskKind IAttention::getCausalKind () const noexcept [inline, noexcept]
Get the causal mask alignment orientation for the attention.
bool IAttention::getDecomposable () const noexcept [inline, noexcept]
Get whether the attention can be decomposed to use multiple kernels if no fused kernel support is found.
ITensor * IAttention::getInput (int32_t index) const noexcept [inline, noexcept]
Get the IAttention input corresponding to the given index.
| index | The index of the input tensor. |
TRT_NODISCARD AttentionIOForm IAttention::getKeyValueForm () const noexcept [inline, noexcept]
Get the key-value form.
TRT_NODISCARD ITensor * IAttention::getKeyValueLengths () const noexcept [inline, noexcept]
Get the key-value lengths tensor.
ITensor * IAttention::getMask () noexcept [inline, noexcept]
Get the optional mask in attention.
char const * IAttention::getMetadata () const noexcept [inline, noexcept]
Get the metadata of IAttention.
char const * IAttention::getName () const noexcept [inline, noexcept]
Return the name of the attention.
int32_t IAttention::getNbInputs () const noexcept [inline, noexcept]
Get the number of inputs of IAttention. IAttention has three inputs.
int32_t IAttention::getNbOutputs () const noexcept [inline, noexcept]
Get the number of outputs of a layer. IAttention has one output.
int32_t IAttention::getNbRanks () const noexcept [inline, noexcept]
Get the number of ranks for multi-device execution.
AttentionNormalizationOp IAttention::getNormalizationOperation () const noexcept [inline, noexcept]
Get the normalization operation for the attention.
ITensor * IAttention::getNormalizationQuantizeScale () const noexcept [inline, noexcept]
Get the quantization scale for the attention normalization output.
DataType IAttention::getNormalizationQuantizeToType () const noexcept [inline, noexcept]
Get the datatype the attention normalization is quantized to.
ITensor * IAttention::getOutput (int32_t index) const noexcept [inline, noexcept]
Get the IAttention output corresponding to the given index. IAttention has only one output.
| index | The index of the output tensor. |
TRT_NODISCARD AttentionIOForm IAttention::getQueryForm () const noexcept [inline, noexcept]
Get the query form.
TRT_NODISCARD ITensor * IAttention::getQueryLengths () const noexcept [inline, noexcept]
Get the query lengths tensor.
TRT_DEPRECATED bool IAttention::setCausal (bool isCausal) noexcept [inline, noexcept]
Set whether the attention will run a causal inference. Cannot be used together with setMask().
| isCausal | True to enable causal masking with kUPPER_LEFT alignment, false to disable causal masking. |
bool IAttention::setCausalKind (CausalMaskKind kind) noexcept [inline, noexcept]
Set the causal mask alignment orientation for the attention.
When set to kUPPER_LEFT or kLOWER_RIGHT, an implicit causal mask is applied. When set to kNONE, no causal masking is applied.
Cannot be used together with setMask(). Building with both a mask tensor and a causal orientation other than kNONE will fail validation.
| kind | The causal mask alignment to apply. |
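The difference between the two alignments only shows up when the query and key-value sequence lengths differ. The sketch below assumes the common convention for these names: upper-left alignment anchors the causal diagonal at the top-left corner of the [sequenceLengthQuery, sequenceLengthKeyValue] score matrix, and lower-right alignment anchors it at the bottom-right corner. `CausalAlign` and `mayAttend` are hypothetical stand-ins for illustration, not TensorRT symbols, and the exact semantics should be confirmed against the CausalMaskKind documentation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in that mirrors the enumerator names used above.
enum class CausalAlign
{
    kNONE,
    kUPPER_LEFT,
    kLOWER_RIGHT
};

// Returns true if query position i may attend to key-value position j.
// sQ and sKV are the query and key-value sequence lengths.
bool mayAttend(CausalAlign kind, int32_t i, int32_t j, int32_t sQ, int32_t sKV)
{
    assert(i >= 0 && i < sQ && j >= 0 && j < sKV);
    switch (kind)
    {
    case CausalAlign::kUPPER_LEFT:
        // Diagonal starts at (0, 0): query i sees keys 0..i.
        return j <= i;
    case CausalAlign::kLOWER_RIGHT:
        // Diagonal ends at (sQ - 1, sKV - 1): the last query sees every key.
        return j <= i + (sKV - sQ);
    case CausalAlign::kNONE:
        return true; // No causal masking.
    }
    return true;
}
```

For sQ = 2 and sKV = 4, upper-left alignment lets query 0 see only key 0, while lower-right alignment lets query 0 see keys 0 through 2 and query 1 see all four keys. The two alignments coincide when sQ equals sKV.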
bool IAttention::setDecomposable (bool decomposable) noexcept [inline, noexcept]
Set whether the attention can be decomposed to use multiple kernels if no fused kernel support is found.
bool IAttention::setInput (int32_t index, ITensor &input) noexcept [inline, noexcept]
Append or replace an input of this layer with a specific tensor.
| index | the index of the input to modify. |
| input | the new input tensor. |
The indices are as follows:
Input 0 is the input query tensor. Input 1 is the input key tensor. Input 2 is the input value tensor.
TRT_NODISCARD bool IAttention::setKeyValueForm (AttentionIOForm form) noexcept [inline, noexcept]
Set the key-value form.
Default is kPADDED_BHND.
| form | The key-value form. |
TRT_NODISCARD bool IAttention::setKeyValueLengths (ITensor *lengths) noexcept [inline, noexcept]
Set the key-value lengths tensor.
An optional tensor to specify per-batch key-value lengths. The semantics depend on the key-value form:
| lengths | A 1D tensor of type kINT32. If nullptr, clears a previously set key-value lengths tensor. |
bool IAttention::setMask (ITensor &mask) noexcept [inline, noexcept]
Set whether a mask will be used for the normalization operation.
| mask | The mask tensor, of type kBOOL or of the same data type as the BMM1 output, with a 4D shape broadcastable to [batchSize, numHeadsQuery, sequenceLengthQuery, sequenceLengthKeyValue]. For a kBOOL mask, a True value indicates that the corresponding position is allowed to attend. For other data types, the mask values are added to the BMM1 output; this is known as an add mask. |
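The two mask flavours described above can be illustrated on a single row of BMM1 scores. This is a sketch under an assumption: a disallowed position in a boolean mask is modelled by setting its score to negative infinity, so it would receive zero weight after Softmax. The documentation above does not state how TensorRT implements this internally. `applyBoolMask` and `applyAddMask` are hypothetical helper names, not TensorRT functions.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Boolean mask: true means the position is allowed to attend.
// Disallowed positions are set to -infinity (an assumption of this sketch).
void applyBoolMask(std::vector<float>& scores, std::vector<bool> const& mask)
{
    assert(scores.size() == mask.size());
    for (std::size_t j = 0; j < scores.size(); ++j)
    {
        if (!mask[j])
        {
            scores[j] = -std::numeric_limits<float>::infinity();
        }
    }
}

// Add mask: the mask values are added element-wise to the BMM1 scores.
void applyAddMask(std::vector<float>& scores, std::vector<float> const& mask)
{
    assert(scores.size() == mask.size());
    for (std::size_t j = 0; j < scores.size(); ++j)
    {
        scores[j] += mask[j];
    }
}
```

An add mask with large negative entries approximates a boolean mask, while small finite entries act as a per-position bias on the scores.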
bool IAttention::setMetadata (char const *metadata) noexcept [inline, noexcept]
Set the metadata for IAttention.
The metadata is emitted in the JSON returned by IEngineInspector with ProfilingVerbosity set to kDETAILED.
| metadata | The per-layer metadata. |
bool IAttention::setName (char const *name) noexcept [inline, noexcept]
Set the name of the attention.
The name is used in error diagnostics. This method copies the name string.
bool IAttention::setNbRanks (int32_t nbRanks) noexcept [inline, noexcept]
Set the number of ranks for multi-device attention execution.
When nbRanks > 1, this hints that the attention should be executed across multiple devices.
| nbRanks | The number of ranks. Must be >= 1. |
bool IAttention::setNormalizationOperation (AttentionNormalizationOp op) noexcept [inline, noexcept]
Set the normalization operation for the attention.
bool IAttention::setNormalizationQuantizeScale (ITensor &tensor) noexcept [inline, noexcept]
Set the quantization scale for the attention normalization output.
| tensor | The tensor that holds the quantization scale. Its data type must be DataType::kFLOAT, DataType::kHALF or DataType::kBF16, and it must be a 0-D or 1-D tensor. |
bool IAttention::setNormalizationQuantizeToType (DataType type) noexcept [inline, noexcept]
Set the datatype the attention normalization is quantized to.
| type | the datatype the attention normalization is quantized to. Must be one of DataType::kFP8, DataType::kINT8. |
TRT_NODISCARD bool IAttention::setQueryForm (AttentionIOForm form) noexcept [inline, noexcept]
Set the query form.
Default is kPADDED_BHND.
| form | The query form. |
TRT_NODISCARD bool IAttention::setQueryLengths (ITensor *lengths) noexcept [inline, noexcept]
Set the query lengths tensor.
An optional tensor to specify the cumulative number of tokens per batch element. Must be set when query form is kPACKED_NHD. Ignored when query form is kPADDED_BHND. When set, contains cumulative token counts with shape [batchSize + 1]. The first element should be 0 and the last element equals totalTokens. The number of tokens for batch i is lengths[i + 1] - lengths[i]. The total_tokens dimension of the query tensor must be >= the last element of this tensor.
| lengths | A 1D tensor of type kINT32 with shape [batchSize + 1]. If nullptr, clears a previously set query lengths tensor. |
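The cumulative layout described above can be built on the host from per-batch token counts before the values are copied into the kINT32 lengths tensor. The following is a minimal sketch; `cumulativeQueryLengths` is a hypothetical helper name, and how the resulting values reach the ITensor (for example as a network input) is outside its scope.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the [batchSize + 1] cumulative token counts from per-batch token counts.
// Element 0 is 0 and the last element equals the total number of tokens.
std::vector<int32_t> cumulativeQueryLengths(std::vector<int32_t> const& tokensPerBatch)
{
    std::vector<int32_t> lengths(tokensPerBatch.size() + 1, 0);
    for (std::size_t i = 0; i < tokensPerBatch.size(); ++i)
    {
        assert(tokensPerBatch[i] >= 0); // Token counts cannot be negative.
        lengths[i + 1] = lengths[i] + tokensPerBatch[i];
    }
    return lengths;
}
```

For token counts {3, 1, 4}, the result is {0, 3, 4, 8}, so batch element 1 owns lengths[2] - lengths[1] = 1 token and the packed query tensor needs room for at least 8 tokens.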
apiv::VAttention * IAttention::mImpl [protected]
Copyright © 2024 NVIDIA Corporation