GPU Scheduling for AI

Games typically optimize graphics to maximize GPU utilization, so when you add AI compute to a game it is essential to configure the GPU scheduler to minimize the effect on the game’s frame rate. This document describes how to do this.

Supported plugins

Optimized GPU scheduling is currently used by the following plugins.

CUDA

  • plugins/sdk in current SDK

    • Automatic Speech Recognition (nvigi::plugin::asr::ggml::cuda)

    • Embed (nvigi::plugin::embed::ggml::cuda)

    • Generative Pre-Trained Transformer (nvigi::plugin::gpt::ggml::cuda)

    • Riva Magpie Flow / ASqFlow Text to Speech (nvigi::plugin::tts::asqflow_trt, nvigi::plugin::tts::asqflow_ggml)

  • ACE SDK plugins - not available in all packs

    • Audio To Emotion (nvigi::plugin::a2e::trt::cuda)

    • Audio To Face (nvigi::plugin::a2f::trt::cuda)

    • Audio To X Pipeline (nvigi::plugin::a2x::pipeline)

D3D

  • Automatic Speech Recognition (nvigi::plugin::asr::ggml::d3d12)

  • Generative Pre-Trained Transformer (nvigi::plugin::gpt::ggml::d3d12)

Vulkan

Optimized GPU scheduling for selected Vulkan plugins is planned for a future release.

Supported graphics APIs

Where optimized GPU scheduling is available, inference compute and graphics workloads may run in parallel on the GPU. NVIGI’s GPU scheduling uses D3D async compute for D3D plugins and CUDA in Graphics (CIG) for CUDA plugins. It also uses a driver interface to set the relative priority of CUDA versus D3D graphics workloads, and of D3D compute versus D3D graphics workloads.

Support for optimized GPU scheduling depends on both the compute API used by the NVIGI plugin and the graphics API used by the application. The following table shows the currently supported combinations. The value in parentheses is the minimum driver version required for full functionality, including settable priorities and correct interaction with other libraries such as NVIDIA DLFG.

Plugin's compute API    App's graphics API
                        D3D Graphics           Vulkan Graphics
CUDA                    Optimized (575.00)*    Optimized (580.88)
D3D Compute             Optimized (580.88)     Not optimized
Vulkan Compute          Not optimized          Not optimized**

(*) CUDA plugins require hardware-accelerated GPU scheduling to be enabled in Windows 10 and 11. Windows 11 enables it by default.

(**) Although running Vulkan Compute in parallel with Vulkan Graphics is possible in general, it isn’t currently supported in NVIGI. We plan to fix this in a future release.
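
As a rough illustration, the support table above can be encoded as a small lookup that an application might consult at startup. This is a minimal sketch and not part of the NVIGI API: the ComputeApi, GraphicsApi and DriverVersion types and both function names are hypothetical, and obtaining the installed driver version is left to the application. It does not check the hardware scheduling requirement from footnote (*).

```cpp
#include <cassert>
#include <optional>

// Stand-in types for this sketch only; they are not NVIGI types.
enum class ComputeApi { Cuda, D3D12, Vulkan };
enum class GraphicsApi { D3D12, Vulkan };

struct DriverVersion
{
    int majorVersion; // e.g. 580 in "580.88"
    int minorVersion; // e.g. 88 in "580.88"
};

// Returns the minimum driver listed in the table for the given combination,
// or std::nullopt when the table lists the combination as "Not optimized".
inline std::optional<DriverVersion> minDriverForOptimizedScheduling(ComputeApi compute,
                                                                    GraphicsApi graphics)
{
    if (compute == ComputeApi::Cuda)
    {
        return graphics == GraphicsApi::D3D12 ? DriverVersion{575, 0}
                                              : DriverVersion{580, 88};
    }
    if (compute == ComputeApi::D3D12 && graphics == GraphicsApi::D3D12)
    {
        return DriverVersion{580, 88};
    }
    return std::nullopt;
}

// True when the combination is optimized and the installed driver is at
// least the minimum version from the table.
inline bool isSchedulingOptimized(ComputeApi compute, GraphicsApi graphics,
                                  DriverVersion installed)
{
    assert(installed.majorVersion > 0 && installed.minorVersion >= 0);
    const auto required = minDriverForOptimizedScheduling(compute, graphics);
    if (!required)
    {
        return false;
    }
    return installed.majorVersion > required->majorVersion ||
           (installed.majorVersion == required->majorVersion &&
            installed.minorVersion >= required->minorVersion);
}
```

An application could use the result to decide whether to expose AI features that rely on parallel execution, or only to log a warning about the driver version.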

A note on performance measurement

Do not use GPU begin/end event queries (such as D3D12_QUERY_TYPE_TIMESTAMP or CUDA events) to measure the cost of AI features. Such measurements do not account for how much graphics work runs in parallel with the AI compute during the measured interval. For example, if AI uses 10% of the GPU’s execution units for 1 ms while graphics runs in parallel on the other 90%, the queries report an AI cost of 1 ms, which overstates the real impact on the frame. Instead, add a way to toggle the AI feature (for example, a command-line parameter), run the same benchmark once with the feature enabled and once with it disabled, and report the difference in total frame time.
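
The A/B measurement described above amounts to subtracting mean frame times. The following sketch assumes that the benchmark has already recorded per-frame times in milliseconds for both runs; the function names are illustrative and not part of NVIGI.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Mean of a list of per-frame times, in milliseconds.
inline double meanFrameTimeMs(const std::vector<double>& frameTimesMs)
{
    assert(!frameTimesMs.empty());
    const double total = std::accumulate(frameTimesMs.begin(), frameTimesMs.end(), 0.0);
    return total / static_cast<double>(frameTimesMs.size());
}

// Estimated AI cost per frame: mean frame time with the AI feature enabled
// minus mean frame time with it disabled, over the same benchmark scene.
inline double aiCostPerFrameMs(const std::vector<double>& withAiMs,
                               const std::vector<double>& withoutAiMs)
{
    return meanFrameTimeMs(withAiMs) - meanFrameTimeMs(withoutAiMs);
}
```

For example, runs that average 10.5 ms per frame with the feature enabled and 10.0 ms with it disabled give an AI cost of 0.5 ms per frame. Both runs should use the same scene and settings so that the difference reflects only the AI feature.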

How to use

In order to schedule graphics and compute efficiently, NVIGI needs to know the D3D direct queue that your game is using for graphics. To do this, fill in a D3D12Parameters struct and chain it to the parameters of all NVIGI instances that you create. The following example shows how to do this for the ASR plugin.

nvigi::ASRWhisperCreationParameters asrParams{};
asrParams.language = "en";
asrParams.common = &common;
asrParams.common->numThreads = myNumCPUThreads;
asrParams.common->vramBudgetMB = myVRAMBudget;
// ... set any other creation parameters here ...

nvigi::D3D12Parameters d3d12Parameters{};
d3d12Parameters.queue = myGraphicsQueue;

// Tell ASR to run in parallel with graphics
if(NVIGI_FAILED(asrParams.chain(d3d12Parameters)))
{
    // Handle error
}

iasr->createInstance(asrParams, &asrInstance);

This ensures that compute work submitted by NVIGI will run efficiently with graphics running in parallel.

How to set the relative priority of compute and graphics

In the past, GPU compute workloads in games were often directly coupled to graphics frame generation, for example ray tracing and animation. In contrast, NVIGI pipelines, which consist of automatic speech recognition (ASR), language models (LM), text to speech (TTS) and so on, can run asynchronously to frame generation. Their latency requirements are dictated not by human visual perception (measured in milliseconds), but by how fast humans listen, think and speak (measured in hundreds of milliseconds). Each invocation of the pipeline (for example, to process a question) can run across many graphics frames. For this reason, the workload can be thought of as floating relative to the game’s graphics and compute workload.

For past compute workloads coupled to graphics generation, the compute work was often on the critical path of the graphics frame, and so the hardware executed compute with high priority to try to maximize graphics frame rate.

Because NVIGI pipelines run independently of graphics generation, there are now two variables we might want to optimize: graphics frame rate and AI inference latency (and throughput). An unloaded GPU can produce AI tokens many times faster than a human can listen to and understand them, so it often doesn’t make sense to run AI at the maximum rate at the expense of graphics frame rate.

However, different game situations have different needs, and they might change at runtime. For example, a player might want snappy interaction with an NPC between quests (where graphics load might be light), but maximum FPS during gameplay. For this reason, NVIGI exposes the following three GPU scheduling modes:

  • SchedulingMode::kPrioritizeGraphics - try to maximize game FPS

  • SchedulingMode::kPrioritizeCompute - try to minimize AI inference latency

  • SchedulingMode::kBalance - balance scheduling between graphics and compute

The default scheduling mode in NVIGI is kBalance. The default in normal CUDA usage is kPrioritizeCompute. Note that these modes are hints to the GPU scheduling hardware. Results may vary according to workload composition and GPU type.

These settings are currently supported by the following CUDA plugins, when run in parallel with D3D12 graphics:

  • Automatic Speech Recognition (nvigi::plugin::asr::ggml::cuda)

  • Generative Pre-Trained Transformer (nvigi::plugin::gpt::ggml::cuda)

  • ASqFlow Text to Speech (nvigi::plugin::tts::asqflow_trt, nvigi::plugin::tts::asqflow_ggml::cuda)

These settings are currently supported by the following D3D12 plugins, when run in parallel with D3D12 graphics:

  • Automatic Speech Recognition (nvigi::plugin::asr::ggml::d3d12)

  • Generative Pre-Trained Transformer (nvigi::plugin::gpt::ggml::d3d12)

  • ASqFlow Text to Speech (nvigi::plugin::tts::asqflow_ggml::d3d12)

These settings are not currently supported for Vulkan graphics.
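
Because the best mode can change at runtime, a game might keep the choice in one place and derive it from its own state. The following is a minimal sketch of that idea. The GamePhase enum and chooseSchedulingMode() are hypothetical, and the SchedulingMode enum here is a local stand-in that mirrors the three mode names above so that the block compiles without the SDK; a real integration would use the NVIGI type instead.

```cpp
#include <cassert>

// Local stand-in for the NVIGI enum, for illustration only.
enum class SchedulingMode { kPrioritizeGraphics, kPrioritizeCompute, kBalance };

// Hypothetical game phases; a real game would define its own.
enum class GamePhase { Menu, NpcConversation, Combat };

// Maps a game phase to a scheduling hint:
//  - conversations favor low AI latency,
//  - combat favors frame rate,
//  - everything else uses the balanced mode.
inline SchedulingMode chooseSchedulingMode(GamePhase phase)
{
    switch (phase)
    {
    case GamePhase::NpcConversation:
        return SchedulingMode::kPrioritizeCompute;
    case GamePhase::Combat:
        return SchedulingMode::kPrioritizeGraphics;
    case GamePhase::Menu:
        return SchedulingMode::kBalance;
    }
    assert(false && "unhandled GamePhase");
    return SchedulingMode::kBalance;
}
```

Since the modes are hints, the mapping is best tuned by measuring frame time and response latency on the target hardware.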

API

You can set the priority at any time using SetGpuInferenceSchedulingMode(). The evaluate() call of supported plugins applies the priority to all CUDA kernels it launches, and those kernels may execute after evaluate() returns. The following example shows a game with two phases: the first, before gameplay starts, prioritizes NPC responsiveness; the second prioritizes FPS during gameplay.

nvigi::IHWICommon* ihwiCommon{};
nvigiGetInterfaceDynamic(plugin::hwi::common::kId, &ihwiCommon, nvigiLoadInterface);

ihwiCommon->SetGpuInferenceSchedulingMode(SchedulingMode::kPrioritizeCompute);
// evaluate() calls for pre-quest NPC interaction go here

ihwiCommon->SetGpuInferenceSchedulingMode(SchedulingMode::kPrioritizeGraphics);
// evaluate() calls for high FPS quest gameplay go here
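
When a phase maps onto a C++ scope, one possible pattern is a small RAII guard that sets one mode on entry and another on exit. This is a sketch under assumptions: ScopedSchedulingMode and FakeHwi are hypothetical names, the SchedulingMode enum is a local stand-in so that the block compiles without the SDK, and the restore mode is passed explicitly because this document does not describe a way to query the current mode. The guard is templated on the interface type, so it only assumes a SetGpuInferenceSchedulingMode() member like the one shown above.

```cpp
#include <cassert>

// Local stand-in for the NVIGI enum, for illustration only.
enum class SchedulingMode { kPrioritizeGraphics, kPrioritizeCompute, kBalance };

// Sets `active` on construction and `restore` on destruction.
template <typename Interface>
class ScopedSchedulingMode
{
public:
    ScopedSchedulingMode(Interface& iface, SchedulingMode active, SchedulingMode restore)
        : m_iface(iface), m_restore(restore)
    {
        m_iface.SetGpuInferenceSchedulingMode(active);
    }
    ~ScopedSchedulingMode() { m_iface.SetGpuInferenceSchedulingMode(m_restore); }

    ScopedSchedulingMode(const ScopedSchedulingMode&) = delete;
    ScopedSchedulingMode& operator=(const ScopedSchedulingMode&) = delete;

private:
    Interface& m_iface;
    SchedulingMode m_restore;
};

// Fake interface used only to exercise the guard in this sketch.
struct FakeHwi
{
    SchedulingMode current = SchedulingMode::kBalance;
    void SetGpuInferenceSchedulingMode(SchedulingMode mode) { current = mode; }
};

// Returns true if the guard applied the active mode inside the scope and the
// restore mode after leaving it.
inline bool demoGuardRestoresMode()
{
    FakeHwi hwi;
    {
        ScopedSchedulingMode<FakeHwi> guard(hwi, SchedulingMode::kPrioritizeCompute,
                                            SchedulingMode::kPrioritizeGraphics);
        // evaluate() calls for NPC interaction would go here.
        assert(hwi.current == SchedulingMode::kPrioritizeCompute);
    }
    return hwi.current == SchedulingMode::kPrioritizeGraphics;
}
```

The guard only affects evaluate() calls issued while it is alive, which matches the behavior described above where the priority is applied at evaluate() time.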

CIG and D3D Wrappers (e.g. Streamline)

Care should be taken when integrating NVIGI into an existing application that is also using a D3D object wrapper like Streamline. The queue/device parameters passed to NVIGI must be the native objects, not the app-level wrappers. In the case of Streamline, this means using slGetNativeInterface to retrieve the base interface object before passing it to NVIGI.

Limitations

Note that CUDA in Graphics (CIG) contexts are not compatible with compute-only CUDA contexts. In particular, accessing CUDA pointers outside of the CUDA context that created them may cause CUDA API errors, which can cause crashes if not handled. Using multiple CUDA contexts in a game is therefore not recommended. CUDA virtual memory and cudaMallocAsync are also not supported.

Unreal Engine 5 example code

The following example shows how to initialize NVIGI’s GPU scheduling in Unreal Engine 5 (UE5). It uses UE5’s global dynamic RHI to get the D3D device and command queue that the game uses for graphics.

// UE5 specific code
#include "ID3D12DynamicRHI.h" // For GDynamicRHI
ID3D12DynamicRHI* RHI = nullptr;
if (GDynamicRHI && 
    GDynamicRHI->GetInterfaceType()==ERHIInterfaceType::D3D12)
{
    RHI = static_cast<ID3D12DynamicRHI*>(GDynamicRHI);
}
ID3D12CommandQueue* CmdQ = nullptr;
ID3D12Device* D3D12Device = nullptr;
if (RHI)
{
    CmdQ = RHI->RHIGetCommandQueue();
    int DeviceIndex = 0;
    D3D12Device = RHI->RHIGetDevice(DeviceIndex);
}
// Engine-independent code
nvigi::D3D12Parameters d3d12Params;
d3d12Params.device = D3D12Device;
d3d12Params.queue = CmdQ;

// Code to chain d3d12Params to each NVIGI instance's parameters goes here