Building and Launching the Loadable

There are several ways to build and launch a DLA loadable, either embedded in a TensorRT engine or in standalone form.

Refer to the DLA Standalone Mode section to generate a standalone DLA loadable outside TensorRT.

Using trtexec

To allow trtexec to use the DLA, specify the --useDLACore flag. For example, to run the ResNet-50 network on DLA core 0 in FP16 mode, with GPU Fallback Mode for unsupported layers, run:

./trtexec --onnx=data/resnet50/ResNet50.onnx --useDLACore=0 --fp16 --allowGPUFallback

The trtexec tool has additional arguments for running networks on DLA. For more information, refer to The trtexec Command-Line Tool section.
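
When the same run is driven from a script, the command line can be assembled programmatically. The following is a minimal Python sketch, not part of trtexec itself: the helper name build_trtexec_command and its parameters are hypothetical, and it only emits the flags used in the example command in this section.

```python
import shlex


def build_trtexec_command(onnx_path, dla_core=0, fp16=True,
                          allow_gpu_fallback=True, trtexec="./trtexec"):
    """Assemble a trtexec argument list for running an ONNX model on DLA.

    Hypothetical helper: it only formats flags and does not launch anything.
    """
    if dla_core < 0:
        raise ValueError("dla_core must be a non-negative DLA core index")
    cmd = [trtexec, f"--onnx={onnx_path}", f"--useDLACore={dla_core}"]
    if fp16:
        cmd.append("--fp16")
    if allow_gpu_fallback:
        cmd.append("--allowGPUFallback")
    return cmd


cmd = build_trtexec_command("data/resnet50/ResNet50.onnx")
# Print the command as a single shell-ready string.
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues.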

Using the TensorRT API

You can use the TensorRT API to build and run inference with DLA and to enable DLA at the layer level. The relevant APIs and samples are provided in the following sections.

Running on DLA during TensorRT Inference

The TensorRT builder can be configured to enable inference on DLA. DLA support is currently limited to networks running in FP16 and INT8 mode. The DeviceType enumeration is used to specify the device on which the network or layer executes. The following API functions in the IBuilderConfig class can be used to configure the network to use DLA:

  • setDeviceType(ILayer* layer, DeviceType deviceType): This function sets the deviceType on which the layer must execute.

  • getDeviceType(const ILayer* layer): This function can be used to return the deviceType that this layer executes on. If the layer is executing on the GPU, this returns DeviceType::kGPU.

  • canRunOnDLA(const ILayer* layer): This function checks whether a layer can run on DLA.

  • setDefaultDeviceType(DeviceType deviceType): This function sets the builder’s default deviceType. When the default is DeviceType::kDLA, every layer that can run on DLA runs on DLA unless setDeviceType is used to override the deviceType for that layer.

  • getDefaultDeviceType(): This function returns the default deviceType set by setDefaultDeviceType.

  • isDeviceTypeSet(const ILayer* layer): This function checks whether the deviceType has been explicitly set for this layer.

  • resetDeviceType(ILayer* layer): This function resets the deviceType for this layer. The value is reset to the deviceType specified by setDefaultDeviceType or DeviceType::kGPU if none is specified.

  • allowGPUFallback(bool setFallBackMode): This function notifies the builder to use GPU if a layer that was supposed to run on DLA cannot run on DLA. For more information, refer to the GPU Fallback Mode section.

  • reset(): This function resets the IBuilderConfig state, which sets the deviceType for all layers to DeviceType::kGPU. After the reset, the builder can be reused to build another network with a different DLA configuration.
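
As an illustration of the query functions in this list, the following Python sketch reports where each layer is configured to run. It assumes network and config are an already populated INetworkDefinition and IBuilderConfig and uses the Python binding names get_device_type and is_device_type_set. The helper name summarize_device_placement is hypothetical.

```python
def summarize_device_placement(network, config):
    """Return (layer name, device type, explicitly set?) for every layer.

    `network` is expected to behave like an INetworkDefinition and `config`
    like an IBuilderConfig; both are supplied by the caller.
    """
    placement = []
    for index in range(network.num_layers):
        layer = network.get_layer(index)
        placement.append((
            layer.name,
            config.get_device_type(layer),      # device the layer is assigned to
            config.is_device_type_set(layer),   # True only for explicit overrides
        ))
    return placement


# Typical use (requires a TensorRT installation and a populated network):
#   for name, device, explicit in summarize_device_placement(network, config):
#       print(name, device, "explicit" if explicit else "default")
```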

The following API functions in the IBuilder class can be used to help configure the network for using the DLA:

  • getMaxDLABatchSize(): This function returns the maximum batch size DLA can support.

    Note

    For any tensor, the total volume of index dimensions combined with the requested batch size must not exceed the value returned by this function.

  • getNbDLACores(): This function returns the number of DLA cores available to the user.

If the builder is not accessible, such as when a plan file is being loaded online in an inference application, then the DLA to be used can be specified differently using DLA extensions to the IRuntime. The following API functions in the IRuntime class can be used to configure the network to use DLA:

  • getNbDLACores(): This function returns the number of DLA cores accessible to the user.

  • setDLACore(int dlaCore): This function sets the DLA core to execute on, where dlaCore is a value between 0 and getNbDLACores() - 1. The default value is 0.

  • getDLACore(): This function returns the DLA core to which the runtime execution is assigned. The default value is 0.
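
The range rule for setDLACore can be checked before a plan file is deserialized. The sketch below is a minimal illustration that assumes the Python bindings, where the runtime exposes these functions as the num_DLA_cores and DLA_core attributes. The helper name select_dla_core is hypothetical.

```python
def select_dla_core(runtime, dla_core=0):
    """Validate and set the DLA core on an IRuntime-like object.

    Raises instead of silently falling back, so a bad core index is caught
    before the engine is deserialized.
    """
    available = runtime.num_DLA_cores
    if available == 0:
        raise RuntimeError("no DLA cores are accessible on this device")
    if not 0 <= dla_core < available:
        raise ValueError(
            f"dla_core must be between 0 and {available - 1}, got {dla_core}"
        )
    runtime.DLA_core = dla_core
    return runtime.DLA_core


# Typical use (requires a TensorRT installation; plan_bytes is a serialized engine):
#   runtime = trt.Runtime(TRT_LOGGER)
#   select_dla_core(runtime, 0)
#   engine = runtime.deserialize_cuda_engine(plan_bytes)
```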

Example: Run Samples With DLA

This section details how to run a TensorRT sample with DLA enabled.

  1. Create the builder and configure the workspace memory pool. Batch dimensions are controlled via the input tensor shape and optimization profiles, not a builder-level batch size.

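    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(gLogger));
    if (!builder) return false;
    // Create the builder configuration that holds the memory and DLA settings.
    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if (!config) return false;
    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, 16_MB);

    builder = trt.Builder(TRT_LOGGER)
    if not builder:
        return False
    # Create the builder configuration that holds the memory and DLA settings.
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 16 << 20)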
    
  2. Enable GPU Fallback Mode. In TensorRT 11.0 and later, layer precisions (FP16 or INT8) are specified at the network level via a strongly typed network (created with NetworkDefinitionCreationFlag::kSTRONGLY_TYPED) or by importing a pre-quantized ONNX model. Refer to Migrating from TensorRT 10.x to 11.x for the precision conversion path.

    config->setFlag(BuilderFlag::kGPU_FALLBACK);

    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
    
  3. Enable execution on DLA, where dlaCore specifies the DLA core to execute on.

    config->setDefaultDeviceType(DeviceType::kDLA);
    config->setDLACore(dlaCore);

    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = dla_core
    
  4. With these additional changes, the sample is ready to execute on DLA. To run a sample on DLA core 0, append --useDLACore=0 to the sample command.
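
The following Python sketch shows one way a sample might map a --useDLACore option onto steps 2 and 3. It is an illustration rather than the code of any shipped sample: the function names are hypothetical, the default of -1 meaning "stay on the GPU" is an assumption, and the tensorrt module is passed in as the trt argument so that the helper does not import it itself.

```python
import argparse


def parse_sample_args(argv=None):
    """Parse the command line of a hypothetical DLA-aware sample."""
    parser = argparse.ArgumentParser(description="Hypothetical DLA-aware sample")
    parser.add_argument(
        "--useDLACore", type=int, default=-1,
        help="DLA core to run on; -1 (assumed default) keeps the GPU",
    )
    return parser.parse_args(argv)


def apply_dla_settings(config, trt, dla_core):
    """Apply steps 2 and 3 to an IBuilderConfig; `trt` is the tensorrt module.

    Returns True if DLA was enabled and False if the configuration was left
    untouched because dla_core is negative.
    """
    if dla_core < 0:
        return False
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)     # step 2
    config.default_device_type = trt.DeviceType.DLA   # step 3
    config.DLA_core = dla_core
    return True


# Typical use (requires a TensorRT installation):
#   import tensorrt as trt
#   args = parse_sample_args()
#   apply_dla_settings(config, trt, args.useDLACore)
```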

Example: Enable DLA Mode for a Layer during Network Creation

This example creates a simple network with Input, Convolution, and Output layers.

  1. Create the builder, builder configuration, and the network.

    IBuilder* builder = createInferBuilder(gLogger);
    IBuilderConfig* config = builder->createBuilderConfig();
    INetworkDefinition* network = builder->createNetworkV2(0U);

    builder = trt.Builder(TRT_LOGGER)
    config = builder.create_builder_config()
    network = builder.create_network(0)
    
  2. Add the Input layer to the network with the input dimensions.

    auto data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});

    data = network.add_input(INPUT_BLOB_NAME, dt, (1, INPUT_H, INPUT_W))
    
  3. Add the Convolution layer with hidden layer input nodes, strides, and weights for filter and bias.

    // addInput returns an ITensor*, so the tensor is passed to the layer directly.
    auto conv1 = network->addConvolutionNd(*data, 20, DimsHW{5, 5}, weightMap["conv1filter"], weightMap["conv1bias"]);
    conv1->setStrideNd(DimsHW{1, 1});

    # add_input returns an ITensor, so the tensor is passed to the layer directly.
    conv1 = network.add_convolution_nd(
        data, 20, (5, 5),
        weight_map["conv1filter"], weight_map["conv1bias"],
    )
    conv1.stride_nd = (1, 1)
    
  4. Set the Convolution layer to run on DLA. As above, layer precision (FP16 or INT8) is configured at the network level via a strongly typed network or pre-quantized ONNX import; the per-precision BuilderFlag values have been removed in TensorRT 11.0.

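    // canRunOnDLA and setDeviceType are IBuilderConfig functions.
    if (config->canRunOnDLA(conv1))
    {
        config->setDeviceType(conv1, DeviceType::kDLA);
    }

    # can_run_on_DLA and set_device_type are IBuilderConfig functions.
    if config.can_run_on_DLA(conv1):
        config.set_device_type(conv1, trt.DeviceType.DLA)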
    
  5. Mark the output.

    network->markOutput(*conv1->getOutput(0));

    network.mark_output(conv1.get_output(0))
    
  6. Set the DLA core to execute on.

    config->setDLACore(0);

    config.DLA_core = 0
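
The per-layer check in step 4 extends naturally to a whole network. The sketch below is a minimal illustration, assuming network and config are the INetworkDefinition and IBuilderConfig created in step 1. The helper name assign_supported_layers_to_dla is hypothetical, and the DLA device type is passed in by the caller (trt.DeviceType.DLA in practice).

```python
def assign_supported_layers_to_dla(network, config, dla_device_type):
    """Pin every DLA-capable layer to DLA and return the names of the rest.

    Layers that cannot run on DLA are left on their current device type.
    """
    not_on_dla = []
    for index in range(network.num_layers):
        layer = network.get_layer(index)
        if config.can_run_on_DLA(layer):
            config.set_device_type(layer, dla_device_type)
        else:
            not_on_dla.append(layer.name)
    return not_on_dla


# Typical use (requires a TensorRT installation):
#   remaining = assign_supported_layers_to_dla(network, config, trt.DeviceType.DLA)
#   print("Layers not placed on DLA:", remaining)
```

Returning the names of the unplaced layers makes it easy to decide whether GPU Fallback Mode is needed for a given network.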
    

Enable DLA Mode when Parsing ONNX Networks

By default, when parsing an ONNX model into a TensorRT network to build a DLA engine, all ONNX operators are marked as supported under the assumption that GPU fallback is enabled. For users who disable GPU fallback and want better diagnostics on which ONNX operators are supported on DLA, a configuration flag is available when creating the parser.

  1. Create the builder, builder config, network, and parser.

    IBuilder* builder = createInferBuilder(gLogger);
    IBuilderConfig* config = builder->createBuilderConfig();
    INetworkDefinition* network = builder->createNetworkV2(0U);
    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);

    builder = trt.Builder(TRT_LOGGER)
    config = builder.create_builder_config()
    network = builder.create_network(0)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    
  2. Attach the builder config and set the REPORT_CAPABILITY_DLA flag on the parser.

    parser->setBuilderConfig(config);
    parser->setFlag(OnnxParserFlag::kREPORT_CAPABILITY_DLA);

    parser.set_builder_config(config)
    parser.set_flag(trt.OnnxParserFlag.REPORT_CAPABILITY_DLA)
    

After setting the flag and builder config, the parser will report an error if it encounters an ONNX node that is not natively supported on DLA.
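
To surface those errors in a script, the parser's error list can be collected after parsing. The sketch below is a minimal illustration that assumes parser is the configured trt.OnnxParser from the steps above and uses its parse, num_errors, and get_error members. The helper name parse_onnx_for_dla and the model path in the usage comment are hypothetical.

```python
def parse_onnx_for_dla(parser, model_bytes):
    """Parse serialized ONNX bytes and return (success, error descriptions).

    With the DLA capability flag set on the parser, the descriptions include
    the nodes that are not natively supported on DLA.
    """
    success = parser.parse(model_bytes)
    errors = [parser.get_error(i).desc() for i in range(parser.num_errors)]
    return success, errors


# Typical use (requires a TensorRT installation; "model.onnx" is a placeholder):
#   with open("model.onnx", "rb") as f:
#       ok, errors = parse_onnx_for_dla(parser, f.read())
#   for message in errors:
#       print(message)
```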

Using the cuDLA API

cuDLA is an extension of the CUDA programming model that integrates DLA runtime software with CUDA. This integration makes it possible to launch DLA loadables using CUDA programming constructs such as streams and graphs.

cuDLA transparently manages shared buffers and synchronizes tasks between the GPU and DLA. Refer to the NVIDIA cuDLA documentation for how to use the cuDLA APIs for these use cases when writing a cuDLA application.

Refer to the DLA Standalone Mode section for more information on using TensorRT to build a standalone DLA loadable usable with cuDLA.