The C++ Basic GPT Sample
The C++ GPT sample, nvigi.basic.gpt.cxx, demonstrates language model inference using the NVIGI SDK with modern C++ interfaces. It showcases both local and cloud-based LLM inference, with support for chat conversations, streaming responses, and non-blocking, polling-based operations well suited to game integration.
NOTE: This sample supports multiple backends (D3D12, CUDA, Vulkan, and Cloud), providing maximum flexibility for different deployment scenarios.
Features
Chat Interface: Interactive multi-turn conversations with message history
Streaming Responses: Token-by-token streaming for real-time display
Async/Polling API: Non-blocking operations perfect for game loops
Multiple Backends: D3D12, CUDA, Vulkan (local), or Cloud (REST API)
Cloud Provider Support: OpenAI, NVIDIA NIM, and other OpenAI-compatible APIs
LoRA Support: Load and apply LoRA adapters for model customization
KV Cache Quantization: FP32, FP16, Q4_0, Q8_0 for memory optimization
Custom JSON Requests: Full control over cloud API requests
Download Required Models
Local Models
For local inference, the sample requires one of the following models:
| Plugin | Model Name | GUID |
|---|---|---|
| nvigi.plugin.gpt.ggml.* | Llama 3.2 3B Instruct | 01F43B70-CE23-42CA-9606-74E80C5ED0B6 |
| nvigi.plugin.gpt.ggml.* | Phi-3.5 Mini Instruct | 8E31808B-C182-4016-9ED8-64804FF5B40D |
Cloud Models
For cloud inference, configure the model in your cloud provider dashboard:
| Plugin | Model Name | GUID | URL |
|---|---|---|---|
| nvigi.plugin.gpt.cloud.rest | gpt-3.5-turbo | E9102ACB-8CD8-4345-BCBF-CCF6DC758E58 | https://api.openai.com/v1/chat/completions |
| nvigi.plugin.gpt.cloud.rest | Llama 3.2 3B Instruct | 01F43B70-CE23-42CA-9606-74E80C5ED0B6 | https://integrate.api.nvidia.com/v1/chat/completions |
See the top-level documentation that shipped with your development pack for information on how to download these models.
Building the Sample
The sample is built as part of the SDK build process. After building, copy the binaries using:
copy_sdk_binaries.bat <cfg>
This ensures all DLLs and the executable are in the same directory (bin\x64).
How to Use the GPT Sample
Basic Chat (Local Model)
Open a command prompt in <SDK_ROOT>
Run the command:
bin\x64\nvigi.basic.gpt.cxx.exe --models <SDK_MODELS> --sdk bin\x64
In a standard-layout binary development pack or GitHub source tree:
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64
Wait for the initial system prompt to complete
Type your message at the User> prompt and press Enter
The AI response will stream token-by-token to the console
Continue the conversation, or type quit or exit to end the session
Cloud Provider Usage
To use a cloud provider (e.g., OpenAI):
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 --backend cloud --token YOUR_API_KEY --guid {E9102ACB-8CD8-4345-BCBF-CCF6DC758E58}
To use NVIDIA NIM:
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 --backend cloud --token YOUR_NVIDIA_API_KEY --guid {01F43B70-CE23-42CA-9606-74E80C5ED0B6}
Command Line Options
Usage: nvigi.basic.gpt.cxx [options]
--sdk sdk location, if none provided assuming exe location
--plugin plugin location, if none provided assuming sdk location
-m, --models model repo location (REQUIRED)
-t, --threads number of threads (default: 1)
--backend backend to use - d3d12, cuda, vulkan, cloud (default: d3d12)
--guid gpt model guid in registry format (default: {8E31808B-C182-4016-9ED8-64804FF5B40D})
--url URL to use, if none provided default is taken from model JSON
--json custom JSON body for cloud request (path to JSON file)
--token authorization token for the cloud provider
--vram the amount of vram to use in MB (default: 8192)
--cache-type KV cache quantization type: fp16, fp32, q4_0, q8_0 (default: fp16)
--log-level logging level 0-2 (default: 0)
--print-system-info print system information
Examples
Use a different local model:
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 --guid {01F43B70-CE23-42CA-9606-74E80C5ED0B6}
Use Vulkan backend with 4-bit cache quantization:
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 --backend vulkan --cache-type q4_0
Custom VRAM budget for large models:
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 --vram 16384
Cloud with custom JSON request body:
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 --backend cloud --token YOUR_KEY --json custom_request.json
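For reference, a minimal OpenAI-compatible body for --json might look like the following. The model and messages fields follow the standard chat completions schema; temperature, max_tokens, and stream are illustrative, and which fields the cloud plugin forwards is provider-dependent:
{
  "model": "gpt-3.5-turbo",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": true
}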
API Patterns Demonstrated
Blocking Chat
The sample demonstrates a simple blocking chat pattern where each message blocks until the full response is generated:
chat.send_message(
{ .role = User, .content = "Hello!" },
[](std::string_view response, ExecutionState state) -> ExecutionState {
std::cout << response; // Stream tokens to console
return state; // Continue
}
);
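The lambda is invoked once per streamed chunk; returning the state it received tells the wrapper to continue generating, and send_message itself does not return until the full response is complete.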
Non-Blocking Polling (Game Loop Pattern)
The sample also demonstrates a polling-based async pattern perfect for game loops:
auto op = chat.send_message_polled({ .role = User, .content = "Hello!" });
// Game loop
while (game_running) {
// Poll for tokens (non-blocking)
if (auto result = op.try_get_results()) {
std::cout << result->tokens; // Display immediately
if (result->state == ExecutionState::Done) {
chat.finalize_async_response(op);
break;
}
}
// Continue game logic
render_frame();
update_physics();
process_input();
// Can cancel at any time
if (user_pressed_cancel) {
op.cancel();
}
}
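Because try_get_results() returns immediately whether or not new tokens have arrived, the frame loop never stalls waiting on inference; once a result reports ExecutionState::Done, finalize_async_response performs the matching cleanup for the polled operation.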
KV Cache Quantization
The sample supports different KV cache quantization types to balance memory usage and quality:
fp32: Full precision (highest quality, most memory)
fp16: Half precision (default, good balance)
q4_0: 4-bit quantization (significant memory savings)
q8_0: 8-bit quantization (moderate memory savings)
Use --cache-type to select:
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 --cache-type q4_0
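To make the tradeoff concrete: KV cache size is roughly 2 (K and V) x layers x context length x per-token KV width x bytes per element. The sketch below computes this for each cache type using hypothetical round numbers (the layer count and KV width are illustrative, not the parameters of any particular NVIGI model); the q4_0 and q8_0 byte counts assume GGML block formats, which add a small per-block scale overhead:
#include <cstdio>

int main() {
    // Illustrative model/runtime parameters (not from any specific NVIGI model)
    const double n_layers = 28;    // transformer layers
    const double n_ctx    = 4096;  // context window, in tokens
    const double n_kv     = 1024;  // per-token K (or V) width, in elements
    // Approximate bytes per element: fp32 = 4, fp16 = 2,
    // q8_0 ~ 8.5 bits (~1.0625 B), q4_0 ~ 4.5 bits (~0.5625 B) incl. block scales
    const double bytes[] = { 4.0, 2.0, 1.0625, 0.5625 };
    const char*  names[] = { "fp32", "fp16", "q8_0", "q4_0" };
    for (int i = 0; i < 4; ++i) {
        const double mb = 2.0 /* K and V */ * n_layers * n_ctx * n_kv
                        * bytes[i] / (1024.0 * 1024.0);
        std::printf("%-5s ~%6.0f MB\n", names[i], mb);
    }
    return 0;
}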
Cloud Provider Configuration
OpenAI
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 ^
--backend cloud ^
--token sk-... ^
--guid {E9102ACB-8CD8-4345-BCBF-CCF6DC758E58}
NVIDIA NIM
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 ^
--backend cloud ^
--token nvapi-... ^
--guid {01F43B70-CE23-42CA-9606-74E80C5ED0B6}
Custom OpenAI-Compatible API
bin\x64\nvigi.basic.gpt.cxx.exe --models data/nvigi.models --sdk bin\x64 ^
--backend cloud ^
--url https://your-api.com/v1/chat/completions ^
--token your-token ^
--guid {YOUR-MODEL-GUID}
Troubleshooting
“Failed to create inference instance”
Verify the model GUID exists in your models directory
Check that you have sufficient VRAM (try reducing --vram)
Ensure the backend is properly installed (e.g., D3D12 requires Windows 10+)
Cloud authentication errors
Verify your API token is valid and not expired (a quick standalone check is shown after this list)
Check that the URL is correct for your provider
Ensure you have an active internet connection
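To check a token independently of the sample, you can call the provider endpoint directly; a minimal check against OpenAI (substitute your provider's URL and model name) is:
curl https://api.openai.com/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -H "Authorization: Bearer YOUR_API_KEY" ^
  -d "{\"model\": \"gpt-3.5-turbo\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}"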
Out of memory errors
Reduce the VRAM budget: --vram 4096
Use more aggressive KV cache quantization: --cache-type q4_0
Try a smaller model
Slow inference
Ensure GPU is being used (check Task Manager > Performance > GPU)
Try increasing threads for CPU models: --threads 8
For cloud inference, check network latency
Programming Notes
This sample demonstrates:
Modern C++ wrapper interfaces (nvigi::gpt::Instance)
Chat conversation management with message history
Streaming token-by-token responses
Blocking callbacks for simple use cases
Non-blocking async operations with polling (perfect for games)
Builder pattern for runtime configuration
RAII-based resource management
std::expected for error handling (see the sketch after this list)
Cloud and local inference with a unified API
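As a sketch of the std::expected flow (C++23; the Error struct and create_instance helper below are hypothetical stand-ins for illustration, not the sample's actual wrapper API):
#include <expected>
#include <iostream>
#include <string>

struct Error { std::string message; };  // hypothetical error type
struct Instance { /* RAII handle in the real wrapper */ };

// Hypothetical factory; the real wrapper would load the plugin and model here
std::expected<Instance, Error> create_instance(const std::string& guid) {
    if (guid.empty())
        return std::unexpected(Error{ "empty model GUID" });
    return Instance{};
}

int main() {
    auto instance = create_instance("{8E31808B-C182-4016-9ED8-64804FF5B40D}");
    if (!instance) {
        std::cerr << "GPT init failed: " << instance.error().message << "\n";
        return 1;
    }
    // use *instance for chat; RAII releases it on scope exit
    return 0;
}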
Run in Debugger
To run nvigi.basic.gpt.cxx in the debugger:
One-time setup in the project file:
In the MSVC IDE, edit the project configuration settings for nvigi/samples/nvigi.basic.cxx/gpt
Navigate to the "Debugging" settings
Set "Command" to <SDK_ROOT>\bin\x64\nvigi.basic.gpt.cxx.exe
Set "Command Arguments" as needed (see the command line options above)
Set "Working Directory" to <SDK_ROOT>\bin\x64
Build the desired non-production config; Release is recommended
After each (re-)build, re-run copy_sdk_binaries.bat <cfg>
The sample can now be run in the debugger