I/O Formats#

TensorRT optimizes a network using many different data formats. To allow efficient data passing between TensorRT and a client application, these underlying data formats are exposed at network I/O boundaries, that is, for tensors marked as network input or output, and when passing data to and from plugins. For other tensors, TensorRT picks formats that result in the fastest overall execution and can insert reformats to improve performance.

You can assemble an optimal data pipeline by profiling the available I/O formats in combination with the formats most efficient for the operations preceding and following TensorRT.

I/O formats are specified as a bitmask of one or more allowed formats.

The following example sets the input tensor format to TensorFormat::kHWC8. Note that this format only works for DataType::kHALF, so the data type must be set accordingly.

C++:

```cpp
// TensorFormat is a scoped enum, so cast it to an integer before shifting.
auto formats = 1U << static_cast<int32_t>(TensorFormat::kHWC8);
network->getInput(0)->setAllowedFormats(formats);
network->getInput(0)->setType(DataType::kHALF);
```

Python:

```python
formats = 1 << int(tensorrt.TensorFormat.HWC8)
network.get_input(0).allowed_formats = formats
network.get_input(0).dtype = tensorrt.DataType.HALF
```

Note that calling setAllowedFormats() or setType() on a tensor that is not a network input or output has no effect; TensorRT ignores such calls.

sampleIOFormats illustrates how to specify I/O formats using C++.

The following table shows the supported formats.

Table 9 Supported I/O Formats#

| Format | kINT32 | kFLOAT | kHALF | kINT8 | kBOOL | kUINT8 | kINT64 | BF16 | FP8 | FP4/INT4 |
|---|---|---|---|---|---|---|---|---|---|---|
| kLINEAR | Only for GPU | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| kCHW2 | No | No | Only for GPU | No | No | No | No | Yes | No | No |
| kCHW4 | No | No | No | Yes | No | No | No | No | No | No |
| kHWC8 | No | No | Only for GPU | No | No | No | No | Only for GPU | No | No |
| kCHW16 | No | No | Only for DLA | No | No | No | No | No | No | No |
| kCHW32 | No | Only for GPU | Only for GPU | Yes | No | No | No | No | No | No |
| kDHWC8 | No | No | Only for GPU | No | No | No | No | Only for GPU | No | No |
| kCDHW32 | No | No | Only for GPU | Only for GPU | No | No | No | No | No | No |
| kHWC | No | Only for GPU | No | No | No | Yes | No | No | No | No |
| kDLA_LINEAR | No | No | Only for DLA | Only for DLA | No | No | No | No | No | No |
| kDLA_HWC4 | No | No | Only for DLA | Only for DLA | No | No | No | No | No | No |
| kHWC16 | No | No | Only for NVIDIA Ampere GPUs and later | Only for GPU | No | No | No | No | Only for GPU | No |
| kDHWC | No | Only for GPU | No | No | No | No | No | No | No | No |

Note that for the vectorized formats, the channel dimension must be zero-padded to a multiple of the vector size. For example, if an input binding has dimensions [16,3,224,224], kHALF data type, and kHWC8 format, then the actual required size of the binding buffer is 16*8*224*224*sizeof(half) bytes, even though the engine->getBindingDimensions() API still returns the tensor dimensions as [16,3,224,224]. The values in the padded part (that is, where C=3,4,...,7 in this example) must be filled with zeros.
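The required buffer size for a vectorized format can be computed by rounding the channel dimension up before multiplying out the shape. A minimal sketch in plain Python (not the TensorRT API; the helper name is illustrative):

```python
import math

def vectorized_buffer_bytes(shape, vector_size, elem_bytes, channel_axis=1):
    """Bytes needed for a buffer whose channel axis is zero-padded up to a
    multiple of the vectorization width."""
    n = 1
    for axis, dim in enumerate(shape):
        if axis == channel_axis:
            # Round the channel count up to the next multiple of vector_size.
            dim = math.ceil(dim / vector_size) * vector_size
        n *= dim
    return n * elem_bytes

# [16,3,224,224] in kHWC8 (vector size 8) with kHALF (2 bytes per element):
print(vectorized_buffer_bytes((16, 3, 224, 224), 8, 2))  # 16*8*224*224*2 = 12845056
```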

Refer to the Data Format Descriptions section for how the data are laid out in memory for these formats.

Sparsity#

NVIDIA Ampere Architecture GPUs support Structured Sparsity. The weights must have at least 2 zeros in every four-entry vector to use this feature to achieve higher inference performance. For TensorRT, the requirements are:

  • For Convolution, for each output channel and each spatial pixel in the kernel weights, every four input channels must have at least two zeros. In other words, assuming that the kernel weights have the shape [K, C, R, S] and C % 4 == 0, then the requirement is verified using the following algorithm:

    ```python
    import numpy

    hasSparseWeights = True
    for k in range(0, K):
        for r in range(0, R):
            for s in range(0, S):
                for c_packed in range(0, C // 4):
                    if numpy.count_nonzero(weights[k, c_packed*4:(c_packed+1)*4, r, s]) > 2:
                        hasSparseWeights = False
    ```
  • For MatrixMultiply, where one input is produced by a Constant layer, every four elements along the reduction axis (K) must have at least two zeros.
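The convolution check above can be wrapped into a small self-contained helper for experimentation. This is a sketch using numpy, not part of TensorRT, and the function name is illustrative:

```python
import numpy as np

def has_sparse_weights(weights):
    """Check the 2:4 structured-sparsity requirement for convolution weights
    of shape [K, C, R, S] with C divisible by 4: every group of four input
    channels must contain at least two zeros."""
    K, C, R, S = weights.shape
    assert C % 4 == 0
    for k in range(K):
        for r in range(R):
            for s in range(S):
                for c in range(0, C, 4):
                    if np.count_nonzero(weights[k, c:c+4, r, s]) > 2:
                        return False
    return True

# A kernel where exactly two of every four input channels are nonzero:
w = np.zeros((2, 4, 3, 3), dtype=np.float32)
w[:, :2, :, :] = 1.0
print(has_sparse_weights(w))  # True
w[:, :3, :, :] = 1.0          # three nonzeros per group of four: violates 2:4
print(has_sparse_weights(w))  # False
```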

Polygraphy (polygraphy inspect sparsity) can detect whether the operation weights in an ONNX model follow the 2:4 structured sparsity pattern.

To enable the sparsity feature, set the kSPARSE_WEIGHTS flag in the builder config and make sure that kFP16 or kINT8 modes are enabled. For example:

C++:

```cpp
config->setFlag(BuilderFlag::kSPARSE_WEIGHTS);
config->setFlag(BuilderFlag::kFP16);
config->setFlag(BuilderFlag::kINT8);
```

Python:

```python
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.INT8)
```

At the end of the build logs, when the TensorRT engine is built, TensorRT reports which layers contain weights that meet the structured sparsity requirement and for which layers it selects tactics that use structured sparsity. Sometimes tactics with structured sparsity are slower than normal tactics, in which case TensorRT chooses the normal tactics. The following output shows an example of TensorRT logs with information about sparsity:

Note

This information is printed only when VERBOSE-level logging is enabled.

[03/23/2021-00:14:05] [V] [TRT] (Sparsity) Found 3 layer(s) eligible to use sparse tactics: conv1, conv2, conv3
[03/23/2021-00:14:05] [V] [TRT] (Sparsity) Chose 2 layer(s) using sparse tactics: conv2, conv3

Forcing kernel weights to have structured sparsity patterns can lead to accuracy loss. Refer to the Automatic Sparsity tool in PyTorch section to recover lost accuracy with further fine-tuning.

To measure inference performance with structured sparsity using trtexec, refer to the trtexec section.

Empty Tensors#

TensorRT supports empty tensors. A tensor is an empty tensor if it has no elements, that is, has one or more dimensions with a length of zero. Zero-length dimensions usually get no special treatment. If a rule works for a dimension of length L for an arbitrary positive value of L, it usually works for L=0, too.

For example, when concatenating two tensors with dimensions [x,y,z] and [x,y,w] along the last axis, the result has dimensions [x,y,z+w], regardless of whether x, y, z, or w is zero.
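The same shape rule can be observed with numpy (illustrative only; TensorRT applies the identical rule to its concatenation layer):

```python
import numpy as np

# Concatenate [2,3,0] and [2,3,4] along the last axis: the zero-length
# operand contributes nothing, and the result has shape [2,3,4].
a = np.zeros((2, 3, 0))
b = np.ones((2, 3, 4))
out = np.concatenate([a, b], axis=-1)
print(out.shape)  # (2, 3, 4)
```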

Implicit broadcast rules remain unchanged since only unit-length dimensions are special for broadcast. For example, given two tensors with dimensions [1,y,z] and [x,1,z], their sum computed by IElementWiseLayer has dimensions [x,y,z], regardless of whether x, y, or z is zero.
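numpy follows the same broadcast convention, which makes the zero-length case easy to verify (illustrative only):

```python
import numpy as np

# Only unit-length dimensions broadcast, so a zero-length dimension
# propagates into the result: [1,0,3] + [4,1,3] has shape [4,0,3],
# which is an empty tensor.
x = np.ones((1, 0, 3))
y = np.ones((4, 1, 3))
print((x + y).shape)  # (4, 0, 3)
```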

If an engine binding is an empty tensor, it still needs a non-null memory address, and different tensors should have different addresses. This is consistent with the C++ rule that every object has a unique address. For example, new float[0] returns a non-null pointer. If using a memory allocator that might return a null pointer for zero bytes, ask for at least one byte instead.
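One simple way to satisfy this with an allocator that may return null for zero bytes is to clamp the request to at least one byte. A minimal sketch (the helper name is illustrative, not a TensorRT API):

```python
def binding_alloc_size(num_elements, elem_bytes):
    """Size to request from an allocator for an engine binding: an empty
    tensor still needs a distinct, non-null address, so never request
    zero bytes."""
    return max(1, num_elements * elem_bytes)

print(binding_alloc_size(0, 4))  # 1  (empty tensor still gets one byte)
print(binding_alloc_size(6, 4))  # 24
```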

A 0-dimension tensor is a scalar. Since it has one element, it is never empty. Scalars cannot be broadcasted.
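The distinction between "zero dimensions" and "zero elements" can be seen with a numpy 0-d array (illustrative only):

```python
import numpy as np

s = np.array(3.5)        # 0-dimensional tensor: a scalar
print(s.ndim, s.size)    # 0 1 -> no dimensions, but one element, so not empty
```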

Refer to the TensorRT Operator documentation for any special handling of empty tensors on a per-layer basis.

Reusing Input Buffers#

TensorRT allows specifying a CUDA event to be signaled when the input buffers are free to be reused. This allows the application to immediately refill the input buffer region for the next inference in parallel with finishing the current inference. For example:

C++:

```cpp
context->setInputConsumedEvent(&inputReady);
```

Python:

```python
context.set_input_consumed_event(inputReady)
```