NVIGI - Programming Guide For Local And Cloud Inference
This guide primarily focuses on the general use of the AI plugins performing local or cloud inference. Before reading this, please read the general programming guide.
IMPORTANT: This guide might contain pseudo code; for the up-to-date implementation and source code that can be copy-pasted, please see the basic sample.
Version 1.1.0 Release
INTRODUCTION
NVIGI AI plugins provide a unified API for both local and cloud inference. This ensures an easy transition between local and cloud services and full flexibility. The AI API is located in the nvigi_ai.h header.
Key Concepts
Each AI plugin implements a certain feature with a specific backend and underlying API, here are some examples:
nvigi.plugin.gpt.ggml.cuda -> implements the GPT feature using the GGML backend and CUDA API for local execution
nvigi.plugin.gpt.cloud.rest -> implements the GPT feature using the CLOUD backend and REST API for remote execution
Models used by the AI local plugins are stored in a specific NVIGI model repository (more details in sections below)
All AI plugins implement and export the same generic interface InferenceInterface
AI plugins can act as “parent” plugins encapsulating multiple features but still exposing the same unified API
InferenceInterface is used to obtain capabilities and requirements (VRAM etc.) and also create and destroy instance(s)
Each created instance is represented by the InferenceInstance interface
InferenceInstance contains generic API for running the inference given the InferenceExecutionContext which contains input slots, callbacks to get results, runtime parameters etc.
All inputs and outputs use generic data slots, like for example InferenceDataByteArray, InferenceDataText, InferenceDataAudio etc.
Host application obtains results (output slots) either via registered callback or polling mechanism
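Putting these concepts together, a typical host application flow looks roughly like this; the following is a condensed sketch (using the GPT plugin as an example, with creationParams and ctx as placeholders) and every step is covered in detail in the sections below:
// 1. Obtain the plugin's inference interface
nvigi::IGeneralPurposeTransformer* igpt{};
nvigiGetInterface(plugin::gpt::ggml::cuda::kId, &igpt);
// 2. Query capabilities and requirements (supported models, VRAM etc.) via the InferenceInterface API
// 3. Create an instance for the selected model (creationParams = chained common + plugin specific parameters)
nvigi::InferenceInstance* instance{};
igpt->createInstance(creationParams, &instance);
// 4. Populate an InferenceExecutionContext (inputs, callback, runtime parameters) and evaluate it
instance->evaluate(&ctx);             // blocking
// or: instance->evaluateAsync(&ctx); // non-blocking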
Model Repository
For consistency, all NVIGI AI inference plugins store their models using the following directory structure:
$ROOT/
└── nvigi.plugin.$name.$backend
    └── {MODEL_GUID}
        └── files
Here is an example structure for the existing NVIGI plugins and models:
$ROOT/
├── nvigi.plugin.gpt.ggml
│   └── {175C5C5D-E978-41AF-8F11-880D0517C524}
│       ├── gpt-7b-chat-q4.gguf
│       └── nvigi.model.config.json
└── nvigi.plugin.asr.ggml
    └── {5CAD3A03-1272-4D43-9F3D-655417526170}
        ├── ggml-asr-small.gguf
        └── nvigi.model.config.json
NOTE: Each plugin can have as many different models (GUIDs) as needed
Each model must provide a nvigi.model.config.json file containing:
model’s card (name, file extension(s) and instructions on how to obtain it)
vram consumption (if local)
other model specific information (e.g. prompt template for LLM models)
Here is an example from nvigi.plugin.gpt.ggml, model information for llama3.1 8B instruct:
{
"name": "llama-3.1-8b-instruct",
"vram": 5124,
"prompt_template": [
"<|begin_of_text|>",
"<|start_header_id|>system<|end_header_id|>\n\n",
"$system",
"\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n",
"$user",
"\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"$assistant"
],
"turn_template": [
"<|start_header_id|>user<|end_header_id|>\n\n",
"$user",
"\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
],
"model":
{
"ext" : "gguf",
"notes": "Must use .gguf extension and format, model(s) can be obtained for free on huggingface",
"file":
{
"command": "curl -L -o llama-3.1-8b-instruct.gguf 'https://huggingface.co/ArtyLLaMa/LLaMa3.1-Instruct-8b-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf?download=true'"
}
}
}
NOTE: Some models will only be accessible via NGC and require a special token for access. These models are normally licensed differently and require developers to contact NVIDIA to obtain access.
An optional configs subfolder, with the identical subfolder structure, can be added under $ROOT to provide nvigi.model.config.json overrides as shown below:
$ROOT/
├── configs
│   └── nvigi.plugin.$name.$backend
│       └── {MODEL_GUID}
│           ├── nvigi.model.config.json
│           └── etc
└── nvigi.plugin.$name.$backend
    └── {MODEL_GUID}
        └── files
NOTE: This allows a quick way of modifying only the model config settings in JSON without having to re-upload the entire model, which can be several GBs in size
Obtaining Models
Local Execution
As mentioned in the above section, each model configuration JSON contains a model card with instructions on how to obtain the model.
To add a new local model to the model repository, please follow these steps:
Generate a new GUID in the registry format, like for example {2467C733-2936-4187-B7EE-B53C145288F3}
Create a new folder under nvigi.plugin.$feature.$backend and name it the above GUID (like for example {2467C733-2936-4187-B7EE-B53C145288F3})
Copy an existing nvigi.model.config.json from already available models
Modify the name field in the JSON to match your model
Modify the vram field to match your model’s VRAM requirements in MB (NVIGI logs VRAM consumption per instance in release/debug build configurations)
Modify any custom model specific bits (for example, each GPT/LLM model requires a specific prompt setup)
Download the model from Hugging Face or another source (NGC etc.)
Unzip any archives and ensure the correct extension(s) are used (for example, if using the GGML backend all models must use the .gguf extension)
NOTE: Please keep in mind that some of the latest models might not work with the backends provided in this version of the NVIGI SDK, hence the plugin(s) would need to be upgraded.
Prompt Templates for LLM Models
Each LLM requires a correct prompt template. These templates are stored in the above mentioned model configuration JSON and look something like this:
"prompt_template": [
"<|begin_of_text|>",
"<|start_header_id|>system<|end_header_id|>\n\n",
"$system",
"\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n",
"$user",
"\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"$assistant"
],
"turn_template": [
"<|start_header_id|>user<|end_header_id|>\n\n",
"$user",
"\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
],
To make these prompt templates, one needs to find the chat template (written in Jinja or a similar very simple templating language) that the model uses to format its prompts; this is typically published alongside the model.
The next step is to find where the system, user, and assistant start and stop markers are:
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ .Content }}{{ if not $last }}<|im_end|>
Now we know the LLM wants something that looks like this:
<|im_start|>system
You are a helpful Ai agent...
<|im_end|>
<|im_start|>user
Can you tell me a story to help my daughter go to sleep?
<|im_end|>
<|im_start|>assistant
Sure, Once upon a time...
<|im_end|>
Hence we make the prompt template like this
"prompt_template": [
"<|im_start|>system\n",
"$system",
"<|im_end|>\n\n",
"<|im_start|>user\n",
"$user",
"<|im_end|>\n<|im_start|>assistant\n",
"$assistant"
],
NOTE: IGI does NOT automatically add newlines, special care must be taken to include \n correctly
IGI uses the “prompt_template” either for the first chat message or when using the LLM in instruct mode (no chat history). For chat mode, the “turn_template” is required for all remaining turns; it terminates the last assistant message properly (if necessary, as sometimes the model emits this itself) and then replicates the user/assistant turn from the prompt template.
So the turn template for SMOLLM 2 becomes:
"turn_template": [
"<|im_start|>user\n",
"$user",
"<|im_end|>\n<|im_start|>assistant\n",
"$assistant"
]
Remote Execution
To add a new remote (cloud) model to the model repository, please follow these steps:
Generate a new GUID in the registry format, like for example {2467C733-2936-4187-B7EE-B53C145288F3}
Create a new folder under nvigi.plugin.gpt.cloud and name it the above GUID (like for example {2467C733-2936-4187-B7EE-B53C145288F3})
Copy an existing nvigi.model.config.json from already available models (for example from model/llama-3.2-3b located at $ROOT\nvigi.plugin.gpt.cloud\{01F43B70-CE23-42CA-9606-74E80C5ED0B6}\nvigi.model.config.json)
Modify the name field in the JSON to match your model
Modify the request_body field to match your model’s JSON body for the REST request
If using NVIDIA NIM APIs, search for and navigate to the model you want to use, then copy-paste the request into the above mentioned JSON. For example, when selecting llama-3_1-70b-instruct the completion code in Python looks like this:
completion = client.chat.completions.create(
model="meta/llama-3.1-70b-instruct",
messages=[{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
temperature=0.2,
top_p=0.7,
max_tokens=1024,
stream=True # NOTE: This is NOT supported by the current version of GPT cloud plugin so it must be set to false (see below)
)
which then translates to the nvigi.model.config.json file looking like this:
{
"name": "llama-3.1-70b-instruct",
"vram": 0,
"request_body": {
"model": "meta/llama-3.1-70b-instruct",
"messages": [
{"role":"system","content":"$system"},
{"role":"user","content":"$user"}
],
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": 1024,
"stream": false
}
}
IMPORTANT: The current version of nvigi.plugin.gpt.cloud.rest does not support cloud streaming so that option needs to be set to false
For 3rd party cloud solutions, like for example OpenAI, please have a look at the model with GUID {E9102ACB-8CD8-4345-BCBF-CCF6DC758E58} which contains the configuration for gpt-3.5-turbo. This model can be used the exact same way as any other NIM-based model provided by NVIDIA and can also be used as a template to clone other models which are based on the OpenAI API (just don’t forget to generate a new GUID). Here is the config file:
{
"name": "openai/gpt-3.5-turbo",
"vram": 0,
"request_body": {
"model": "gpt-3.5-turbo",
"messages": [
{"role":"system","content":"$system"},
{"role":"user","content":"$user"}
],
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": 1024,
"stream": false
}
}
IMPORTANT: If your custom model does not follow the OpenAI API standard for the REST cloud requests, the request_body section must be modified to match specific server requirements.
Common And Custom Capabilities and Requirements
IMPORTANT NOTE: This section covers a scenario where the host application can instantiate models that were not included when the application was packaged and shipped. If the models and their capabilities are predefined and there is no need for dynamically downloaded models, you can skip to the next section.
If the host application needs to find out more information about the available models in the above mentioned repository, the InferenceInterface provides the getCapsAndRequirements API which returns feature specific caps and requirements (if any) and the common caps and requirements shown below:
//! Generic caps and requirements - apply to all plugins
//!
//! {1213844E-E53B-4C46-A303-741789060B3C}
struct alignas(8) CommonCapabilitiesAndRequirements {
CommonCapabilitiesAndRequirements() {};
NVIGI_UID(UID({ 0x1213844e, 0xe53b, 0x4c46,{ 0xa3, 0x3, 0x74, 0x17, 0x89, 0x6, 0xb, 0x3c } }), kStructVersion1);
size_t numSupportedModels{};
const char** supportedModelGUIDs{};
const char** supportedModelNames{};
size_t* modelMemoryBudgetMB{}; //! IMPORTANT: Provided if known, can be 0 if fully dynamic and depends on inputs
InferenceBackendLocations supportedBackends{};
//! NEW MEMBERS GO HERE, BUMP THE VERSION!
};
Each AI interface provides a generic API (located in nvigi_ai.h) used to obtain CommonCapabilitiesAndRequirements and any custom capabilities and requirements (if any).
//! Returns model information
//!
//! Call this method to find out about the available models and their capabilities and requirements.
//!
//! @param modelInfo Pointer to a structure containing supported model information
//! @param params Optional pointer to the setup parameters (can be null)
//! @return nvigi::kResultOk if successful, error code otherwise (see NVIGI_result.h for details)
//!
//! NOTE: It is recommended to use the templated 'getCapsAndRequirements' helper (see below in this header).
//!
//! This method is NOT thread safe.
nvigi::Result(*getCapsAndRequirements)(nvigi::NVIGIParameter** modelInfo, const nvigi::NVIGIParameter* params);
When obtaining information about specific model(s), the host application can provide CommonCreationParameters (see the next section for more details) as input, so there are several options based on the selected backend:
Local Plugins
provide specific model GUID and VRAM budget and check if that particular model can run within the budget
provide null model GUID and VRAM budget to get a list of models that can run within the budget
provide null model GUID and “infinite” (MAX_INT) VRAM budget to get a list of ALL models
Cloud Plugins
provide specific model GUID to obtain CloudCapabilities which include URL and other information for the endpoint used by the model
provide null model GUID to get a list of ALL models (CloudCapabilities in this case will NOT provide any info)
If a specific feature has custom capabilities and requirements, the common ones will always be either chained together or returned as a pointer within the custom caps. In addition, if a feature is a pipeline (a parent plugin encapsulating two or more plugins) it will return caps and requirements for ALL enclosed plugins. These can be queried from what is returned by the parent plugin’s getCapsAndRequirements using nvigi::findStruct<structtype>(rootstruct).
IMPORTANT: Always have a look at the plugin’s public header nvigi_$feature.h to find out if the plugin has custom caps etc.
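Here is a rough sketch of such a query using the GPT GGML CUDA plugin: it asks for all models that fit within a given VRAM budget and then walks the common caps. The path and budget values are placeholders, the example assumes the usual NVIGI structure to NVIGIParameter conversions, and in practice the templated getCapsAndRequirements helper from nvigi_ai.h can take care of these details:
nvigi::IGeneralPurposeTransformer* igpt{};
nvigiGetInterface(plugin::gpt::ggml::cuda::kId, &igpt);
//! Input: where the model repository lives and how much VRAM we are willing to spend
nvigi::CommonCreationParameters common{};
common.utf8PathToModels = myPathToModelRepository; // $ROOT of the model repository (placeholder)
common.modelGUID = nullptr;                        // null GUID -> enumerate models
common.vramBudgetMB = 8192;                        // placeholder budget, only models fitting within it are reported
//! Query caps and requirements (common caps are always present, custom ones can be located the same way)
nvigi::NVIGIParameter* info{};
if(NVIGI_FAILED(igpt->getCapsAndRequirements(&info, common)))
{
    // Handle error
}
auto caps = nvigi::findStruct<nvigi::CommonCapabilitiesAndRequirements>(info);
for (size_t i = 0; caps && i < caps->numSupportedModels; i++)
{
    //! Inspect caps->supportedModelGUIDs[i], caps->supportedModelNames[i], caps->modelMemoryBudgetMB[i]
}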
Creation Parameters
Common and Custom
Before creating an instance, the host application must specify certain properties like which model should be used, how much VRAM is available, how many CPU threads to use etc. Similar to the previous section, plugins can have custom creation parameters but the common ones always have to be provided. Here is what the common creation parameters look like:
//! Generic creation parameters - apply to all plugins
//!
//! {CC8CAD78-95F0-41B0-AD9C-5D6995988B23}
struct alignas(8) CommonCreationParameters {
CommonCreationParameters() {};
NVIGI_UID(UID({ 0xcc8cad78, 0x95f0, 0x41b0,{ 0xad, 0x9c, 0x5d, 0x69, 0x95, 0x98, 0x8b, 0x23 } }), kStructVersion1);
int32_t numThreads{};
size_t vramBudgetMB = SIZE_MAX;
const char* modelGUID{};
const char* utf8PathToModels{};
//! Optional - additional models downloaded on the system (if any)
const char* utf8PathToAdditionalModels{};
//! NEW MEMBERS GO HERE, BUMP THE VERSION!
};
Plugins cannot create an instance unless they know where the NVIGI model repository is located, what model GUID to use, how much VRAM is OK to use etc. All this information is provided in the common creation parameters. Each plugin can also define custom ones; this obviously depends on what parameters are needed to create an instance.
NOTE: The same model GUID can be used by different plugins if they are implementing different backends, for example a Whisper GGUF model can be loaded by the nvigi.plugin.asr.ggml.cuda and nvigi.plugin.asr.ggml.cpu plugins
Cloud Plugins
Cloud plugins can use two different protocols, REST and gRPC. In addition to the above mentioned common creation parameters, cloud plugins require either RESTParameters or RPCParameters to be chained together with the common ones. Here is an example:
//! Obtain GPT CLOUD REST interface
nvigi::IGeneralPurposeTransformer* igpt{};
nvigiGetInterface(plugin::gpt::cloud::rest::kId, &igpt);
//! Common parameters
CommonCreationParameters common{};
common.utf8PathToModels = params.modelDir.c_str();
common.modelGUID = "{E9102ACB-8CD8-4345-BCBF-CCF6DC758E58}"; // gpt-3.5-turbo
//! GPT parameters
nvigi::GPTCreationParameters gptCreationParams{};
// TODO: Set some GPT specific items here
if(NVIGI_FAILED(gptCreationParams.chain(common)))
{
// Handle error
}
//! Cloud parameters
RESTParameters cloudParams{};
std::string token;
getEnvVar("OPENAI_TOKEN", token);
cloudParams.url = "https://api.openai.com/v1/chat/completions";
cloudParams.authenticationToken = token.c_str();
cloudParams.verboseMode = true;
if(NVIGI_FAILED(gptCreationParams.chain(cloudParams))) // Chaining cloud parameters!
{
// Handle error
}
nvigi::InferenceInstance* instance{};
igpt->createInstance(gptCreationParams, &instance);
Local Plugins
With local plugins there are a few key points to consider when selecting which plugin to use:
Selecting backend and API
How much VRAM can be used?
What is the expected latency?
For example, a fully GPU bottlenecked application, or an application which does not have enough VRAM left, could do the following and run GPT inference completely on the CPU:
//! Obtain GPT GGML CPU interface
nvigi::IGeneralPurposeTransformer* igpt{};
nvigiGetInterface(plugin::gpt::ggml::cpu::kId, &igpt);
On the other hand, a CPU bottlenecked application could do the following and run GPT inference completely on the GPU:
//! Obtain GPT GGML CUDA interface
nvigi::IGeneralPurposeTransformer* igpt{};
nvigiGetInterface(plugin::gpt::ggml::cuda::kId, &igpt);
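For completeness, here is a minimal sketch of creating a local GPT instance once the backend has been selected. It mirrors the cloud example above; the model GUID is taken from the repository example earlier in this guide, and the thread count and VRAM budget are placeholder values to adjust for your application:
//! Common parameters
nvigi::CommonCreationParameters common{};
common.numThreads = 4;      // placeholder CPU thread count
common.vramBudgetMB = 8192; // placeholder VRAM budget in MB
common.utf8PathToModels = params.modelDir.c_str();
common.modelGUID = "{175C5C5D-E978-41AF-8F11-880D0517C524}"; // GUID from the model repository example above
//! GPT parameters
nvigi::GPTCreationParameters gptCreationParams{};
// TODO: Set any GPT specific items here
if(NVIGI_FAILED(gptCreationParams.chain(common)))
{
    // Handle error
}
nvigi::InferenceInstance* instance{};
igpt->createInstance(gptCreationParams, &instance);
If the application renders with D3D12 or Vulkan, chain the CIG parameters described in the next section before calling createInstance.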
Compute In Graphics (CIG)
When selecting local inference plugins which utilize the CUDA API, it is essential to enable CIG if the application is using the D3D12 or Vulkan rendering APIs. This ensures optimal performance and minimizes latency for local inference execution. CIG is enabled via a special interface; here are the steps:
// Obtain special HW interface for CUDA
nvigi::IHWICuda* icig = nullptr;
nvigiGetInterface(nvigi::plugin::hwi::cuda::kId, &icig);
// Specify your device and queue information
nvigi::D3D12Parameters d3d12Params;
d3d12Params.device = <your ID3D12Device*>;
d3d12Params.queue = <your (graphics) ID3D12CommandQueue*>;
// Chain the D3D12 parameters to any creation parameters when generating a local instance
// For example, local GPT using the GGML backend and CUDA API (NOT CPU)
if(NVIGI_FAILED(gptCreationParams.chain(d3d12Params)))
{
// Handle error
}
Data Types
Here are the common inference data types as declared in nvigi_ai.h:
InferenceDataText
InferenceDataAudio
InferenceDataTextByteArray
The underlying raw data can be either on the CPU or GPU so it is represented by the types declared in nvigi_cpu.h, nvigi_d3d12.h, nvigi_vulkan.h etc.
CpuData
D3D12Data
VulkanData
For example, this is how one would set up some audio data located on the CPU using the STL helpers from nvigi_stl_helpers.h:
std::vector<int16_t> my_audio = recordMyMonoAudio();
// Auto convert to the underlying `InferenceDataAudio`, single channel, PCM16
nvigi::InferenceDataAudioSTLHelper audioData(my_audio, 1);
Another example, this time setting up a prompt for the GPT plugin:
std::string text = "Hello World!";
nvigi::InferenceDataTextSTLHelper userPrompt(text);
Input Slots
Once we have our instance we need to provide input data slots that match the input signature for the given instance. The InferenceInstance provides an API to obtain input and output signatures at runtime but they can also be obtained from the plugin’s headers and source code. In this guide we will use Automatic Speech Recognition (ASR) as an example.
//! Audio data slot is coming from our previous step, note that we are using operator to convert audioData to InferenceDataAudio*
std::vector<nvigi::InferenceDataSlot> slots = { {nvigi::kASRDataSlotAudio, audioData} };
nvigi::InferenceDataSlotArray inputs = { slots.size(), slots.data() }; // Input slots
NOTE: STL helpers provide an operator which automatically converts data to the underlying low level type used by NVIGI
Execution Context
Before an instance can be evaluated, the InferenceExecutionContext must be created and populated with all the necessary information. This context contains:
pointer to the instance to use
pointer to the optional callback to receive results
pointer to the optional context for the above callback
pointer to input data slots
pointer to the output data slots (optional and normally provided by plugins)
pointer to any runtime parameters (again can be chained together as needed)
The following sections contain examples showing how to utilize execution context.
Blocking Vs Asynchronous Evaluation
Each plugin can opt to implement a blocking and/or non-blocking API used to evaluate an instance (essentially run an inference pass). For example:
nvigi::InferenceExecutionContext ctx{};
ctx.runtimeParameters = runtime; // optional, chained runtime parameters (if any)
ctx.instance = instance;
ctx.callback = myCallback; // called on a different thread
ctx.callbackUserData = &myCtx;
ctx.inputs = &inputs;
Async approach, returns immediately, callback is triggered from a different thread managed by the instance
ctx.instance->evaluateAsync(&ctx)
Blocking approach, returns when done, callback is triggered on this thread
ctx.instance->evaluate(&ctx)
Obtaining Results
There are two ways to obtain results:
By providing a callback in the InferenceExecutionContext and receiving results either on the host’s or NVIGI’s thread
By NOT providing a callback and forcing the evaluateAsync path, which requires the host app to poll for results
Callback Approach
This is the simplest and easiest way to obtain results. A callback function of the following type must be provided via the InferenceExecutionContext before calling evaluate:
auto inferenceCallback = [](const nvigi::InferenceExecutionContext* execCtx, nvigi::InferenceExecutionState state, void* userData)->nvigi::InferenceExecutionState
{
//! Optional user context to control execution
auto userCtx = (HostProvidedCallbackCtx*)userData;
if (execCtx->outputs)
{
const nvigi::InferenceDataText* text{};
execCtx->outputs->findAndValidateSlot(nvigi::kASRDataSlotTranscribedText, &text);
std::string transcribedText = text->getUtf8Text();
//! Do something with the received text
}
if (state == nvigi::InferenceExecutionStateDone)
{
//! This is all the data we can expect to receive
}
else if(userCtx && userCtx->needToInterruptInference)
{
//! Inform NVIGI that inference should be cancelled
return nvigi::InferenceExecutionStateCancel;
}
return state;
};
nvigi::InferenceExecutionContext asrContext{};
asrContext.instance = asrInstanceLocal; // The instance we created and we want to run inference on
asrContext.callback = inferenceCallback; // Callback (defined above) to receive transcribed text
asrContext.callbackUserData = &hostCallbackCtx; // Optional pointer to the host's HostProvidedCallbackCtx, can be null if not needed
asrContext.inputs = &inputs;
// BLOCKING
if(NVIGI_FAILED(res, asrContext.instance->evaluate(&asrContext)))
{
LOG("NVIGI call failed, code %d", res);
}
IMPORTANT: To cancel inference simply return nvigi::InferenceExecutionStateCancel in the callback
Polling Approach
IMPORTANT: This is an optional way to obtain results and each individual plugin must implement the special interface nvigi::IPolledInferenceInterface in order to enable this functionality. In addition, when using polling, evaluateAsync is the ONLY viable option since we cannot have blocking calls.
Before proceeding any further it is necessary to obtain the polling interface from the plugin, assuming it is actually implemented:
nvigi::IPolledInferenceInterface* ipolled{};
nvigiGetInterface(feature, &ipolled);
Upon successful retrieval of the polling interface, the next step is to skip providing a callback in the execution context. This will automatically make the evaluate call asynchronous (the plugin will create and manage a thread) and the host application will need to check if results are ready before consuming them.
nvigi::InferenceExecutionContext asrContext{};
asrContext.instance = asrInstanceLocal; // The instance we created and we want to run inference on
asrContext.callback = nullptr; // NO CALLBACK WHEN POLLING RESULTS
asrContext.callbackUserData = nullptr;
asrContext.inputs = &inputs;
// ASYNC, note that in this mode one CANNOT use blocking evaluate call
if(NVIGI_FAILED(res, asrContext.instance->evaluateAsync(&asrContext)))
{
LOG("NVIGI call failed, code %d", res);
}
// Poll for results on host's thread
nvigi::InferenceExecutionState state = nvigi::InferenceExecutionStateDataPending;
while (state == nvigi::InferenceExecutionStateDataPending)
{
if (blocking)
{
// Block and wait for results
ipolled->getResults(&asrContext, true, &state);
// Process results and release them
inferenceCallback(&asrContext, state, nullptr);
ipolled->releaseResults(&asrContext, state);
}
else
{
// Check if there are some results, if not move on
if(ipolled->getResults(&asrContext, false, &state) == nvigi::ResultOk)
{
// Process results and release them
inferenceCallback(&asrContext, state, nullptr);
ipolled->releaseResults(&asrContext, state);
}
}
}
NOTE: Even with polling we still ultimately use the callback function to process output slots in the execution context, simply for convenience