Simultaneous Compute and Graphics#

TensorRT-RTX provides support for inference while simultaneously running graphics workloads (simultaneous compute and graphics, or SCG). The most common use case is video games, where generative AI features such as NPC dialog generation typically execute concurrently with 3D rendering. The following steps must be taken when preparing a model to run in SCG mode:

Shared Memory Limitation#

NVIDIA CUDA compute kernels running concurrently with graphics workloads are limited to 48 KiB or less of shared memory on the NVIDIA Ampere and Ada Lovelace architectures (compute capability 8.x). On the NVIDIA Blackwell architecture and beyond (compute capability 10.x and later), the limit may be increased to 84 KiB, but at the cost of L1 cache. Whether this happens depends on the priority of the DirectX compute command queue, as the code sketch after the following list summarizes:

  • When the compute work runs on a normal-priority queue (D3D12_COMMAND_QUEUE_PRIORITY_NORMAL), the limit is 48 KiB, even on Blackwell. This is appropriate for heavy concurrent rendering workloads that require full access to the L1 cache.

  • When the compute work runs on a high-priority queue (D3D12_COMMAND_QUEUE_PRIORITY_HIGH), the limit is 84 KiB on Blackwell and later architectures. This is appropriate for relatively light graphics workloads that do not require a large L1 cache.
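
The decision rule from the list above can be summarized in a few lines of code. The following is a minimal sketch; the helper name is illustrative and not part of the TensorRT-RTX API:

// Choose the tactic shared-memory limit in bytes from the compute
// capability major version and the planned compute queue priority
size_t pickTacticSharedMemLimit(int ccMajor, bool highPriorityComputeQueue)
{
    if (ccMajor >= 10 && highPriorityComputeQueue){
        return 84 * 1024; // Blackwell or later, light graphics workload
    }
    return 48 * 1024; // Ampere/Ada, or a normal-priority compute queue
}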

Note

  • Simultaneous graphics and compute workloads are not supported on the NVIDIA Turing architecture (compute capability 7.5) or earlier.

  • This shared memory limitation restricts the choice of CUDA kernels for performing inference, which will generally result in reduced performance compared to running standalone inference.

Choose an option below to implement this constraint.

Option 1: Using the tensorrt_rtx Executable#

When building a TensorRT-RTX engine with the tensorrt_rtx executable for deployment in SCG mode, add one of the following flags:

  • When deploying on NVIDIA Ampere or NVIDIA Ada Lovelace:

    --memPoolSize=tacticSharedMem:48K
    
  • When deploying on NVIDIA Blackwell and later, assuming a light concurrent graphics workload:

    --memPoolSize=tacticSharedMem:84K
    

These are the maximum shared memory limits that will allow the engine to work in SCG mode. Lower limits are possible (for example, --memPoolSize=tacticSharedMem:32K) but will typically degrade performance and are therefore not recommended.
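
For example, a complete build command targeting NVIDIA Ampere or Ada Lovelace might look as follows (the model and engine file names are placeholders):

tensorrt_rtx --onnx=example_model.onnx --saveEngine=example_engine.trt --memPoolSize=tacticSharedMem:48K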

If you want to choose the limit programmatically (for example, as part of an installation script), you can use the nvidia-smi command-line tool to query the compute capability of the current system. The following Microsoft Windows PowerShell script demonstrates this:

$nvsmiPath = (Get-Command nvidia-smi.exe -ErrorAction SilentlyContinue).Source

if (-not $nvsmiPath){
    Throw "'nvidia-smi' not found: please install or add to PATH"
}

$ccString = & $nvsmiPath --query-gpu=compute_cap --format=csv,noheader
if (-not $ccString){
    Throw "Error querying compute capability: ensure GPU is installed"
}
# Take the first GPU if multiple are installed (PowerShell returns one string per output line)
$ccString = @($ccString)[0].Trim()

# Parse the major version
$majorVersion = [int]($ccString.Split(".")[0])

# Assume a light concurrent graphics workload; otherwise use the 48K limit even on Blackwell
if ($majorVersion -ge 10){
    Write-Host "Blackwell or later (CC $ccString)"
    $scgFlag = "--memPoolSize=tacticSharedMem:84K"
} elseif ($majorVersion -eq 8){
    Write-Host "Ampere / Ada (CC $ccString)"
    $scgFlag = "--memPoolSize=tacticSharedMem:48K"
} else {
    Throw "Unsupported CC $ccString"
}

tensorrt_rtx.exe --onnx=example_model.onnx --saveEngine=example_engine.trt $scgFlag

Option 2: Using the C++ or Python API#

Both the C++ and the Python API allow setting the same memory limits programmatically. Note that these APIs accept the limit in bytes rather than the KiB notation used by the command-line flag.

// C++, assuming a single installed GPU
#include <fstream>
#include <stdexcept>

cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, 0);
size_t memLimit{};
if (deviceProp.major >= 10){ // Blackwell or later
    memLimit = 84 * 1024;
} else if (deviceProp.major == 8){ // Ada or Ampere
    memLimit = 48 * 1024;
} else {
    throw std::runtime_error("SCG not supported for this compute capability");
}
auto logger = getLogger();
auto builder = createInferBuilder(*logger);
auto config = builder->createBuilderConfig();
config->setMemoryPoolLimit(MemoryPoolType::kTACTIC_SHARED_MEMORY, memLimit);
// assemble your network…
if (builder->isNetworkSupported(*network, *config)){
    auto hostMem = builder->buildSerializedNetwork(*network, *config);
    // write the serialized engine to a file
    std::ofstream output_file("example_engine.trt", std::ios::binary);
    output_file.write(static_cast<const char*>(hostMem->data()), hostMem->size());
    output_file.close();
}

The Python API is completely analogous.

import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt_rtx as trt_rtx

device = cuda.Context.get_device()
major, minor = device.compute_capability()
if major >= 10: # Blackwell or later
    memLimit = 84 * 1024
elif major == 8: # Ada or Ampere
    memLimit = 48 * 1024
else:
    raise RuntimeError("SCG not supported for this compute capability")
logger = trt_rtx.Logger()
builder = trt_rtx.Builder(logger)
config = builder.create_builder_config()
config.set_memory_pool_limit(trt_rtx.MemoryPoolType.TACTIC_SHARED_MEMORY, memLimit)
# assemble your network…
if builder.is_network_supported(network, config):
    hostMem = builder.build_serialized_network(network, config)
    with open("example_engine.trt", "wb") as fout:
        fout.write(hostMem)

CUDA Context in CiG Mode#

Before starting inference at runtime, care must be taken to ensure that the current CUDA context runs in CUDA-in-Graphics (CiG) mode (refer to the CUDA Driver API documentation for more information).

CiG mode is available in CUDA version 12.6 and later (NVIDIA RTX driver 555 and later) and will not work with earlier CUDA or driver versions.
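
If your application may run on systems that do not meet these requirements, the driver's CUDA version can be checked at startup. The following is a minimal sketch using cuDriverGetVersion, which reports the version as 1000 × major + 10 × minor (for example, 12060 for CUDA 12.6):

// Verify that the installed driver supports CUDA 12.6 or later
// before attempting to create a CiG context
int driverVersion = 0;
cuDriverGetVersion(&driverVersion);
if (driverVersion < 12060){
    throw std::runtime_error("CiG mode requires CUDA 12.6 or later");
}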

While recent driver versions support CiG with the Vulkan graphics API, TensorRT-RTX currently supports only DirectX interoperability; Vulkan support will be added in a future version.

The following code demonstrates how to enable CiG in the CUDA context.

Note

The cuCtxCreate_v4 function automatically associates the newly created CUDA context with the current thread; therefore, the following code can be used as part of the normal setup of TensorRT-RTX for inference.

// Initialize the CUDA driver API and get the CUDA device ID,
// assuming a single device on the current system
cuInit(0);
int deviceIndex;
cudaGetDevice(&deviceIndex);
CUdevice device;
cuDeviceGet(&device, deviceIndex);
// Check that CiG is supported for the current hardware with DirectX
int isSupported;
cuDeviceGetAttribute(&isSupported, CU_DEVICE_ATTRIBUTE_D3D12_CIG_SUPPORTED, device);
if (!isSupported){
    throw std::runtime_error("Simultaneous graphics / compute not supported for current hardware");
}
// Create DirectX device
ID3D12Device *dxDevice;
D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&dxDevice));

// Create two DirectX command queues, one for graphics and one for compute
D3D12_COMMAND_QUEUE_DESC graphicsQueueDesc{};
graphicsQueueDesc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;
graphicsQueueDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
ID3D12CommandQueue* graphicsQueue;
dxDevice->CreateCommandQueue(&graphicsQueueDesc, IID_PPV_ARGS(&graphicsQueue));

D3D12_COMMAND_QUEUE_DESC computeQueueDesc{};
computeQueueDesc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;
computeQueueDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
computeQueueDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
ID3D12CommandQueue* dxComputeQueue;
dxDevice->CreateCommandQueue(&computeQueueDesc, IID_PPV_ARGS(&dxComputeQueue));

// Set context creation parameters
CUctxCigParam ctxCigParam{};
ctxCigParam.sharedDataType = CIG_DATA_TYPE_D3D12_COMMAND_QUEUE;
ctxCigParam.sharedData = dxComputeQueue;
CUctxCreateParams ctxCreateParams{};
// Exactly one of execAffinityParams and cigParams must be non-null
ctxCreateParams.execAffinityParams = nullptr;
ctxCreateParams.numExecAffinityParams = 0;
ctxCreateParams.cigParams = &ctxCigParam;
// Create a CUDA context with CiG enabled
CUcontext ctx;
cuCtxCreate_v4(&ctx, &ctxCreateParams, 0, device);
// Verify that the current CUDA context has CiG enabled
size_t limit;
cuCtxGetLimit(&limit, CU_LIMIT_CIG_ENABLED);
if (limit == 0){
    throw "Enabling CiG in the current CUDA context failed";
}
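
With the CiG context current on the calling thread, inference proceeds through the usual TensorRT-RTX runtime path, and CUDA streams created on this thread inherit the CiG context. The following is a minimal sketch, assuming the runtime API mirrors standard TensorRT (createInferRuntime, deserializeCudaEngine, enqueueV3) and that loadFile is a hypothetical helper that reads the engine built earlier; tensor I/O setup and error handling are omitted:

// Run the SCG engine inside the CiG context (sketch only)
std::vector<char> engineData = loadFile("example_engine.trt"); // hypothetical helper
auto runtime = createInferRuntime(*logger);
auto engine = runtime->deserializeCudaEngine(engineData.data(), engineData.size());
auto context = engine->createExecutionContext();
cudaStream_t stream;
cudaStreamCreate(&stream); // this stream runs in the CiG context
// ... set input and output tensor addresses on the execution context ...
context->enqueueV3(stream); // inference runs concurrently with graphics
cudaStreamSynchronize(stream);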