Run Inference
Route inference requests through the gateway using model entity routing, provider routing, or OpenAI-compatible routing.
This tutorial assumes you have an external provider registered. The platform pre-configures a system/nvidia-build provider on startup, which most of the examples below use. To register your own provider, see About Models and Inference.
CLI
Python SDK
Listing Models
There are two distinct ways to list models, depending on what you need:
Use nemo models list to manage model configurations. Use nemo inference models list to discover what’s ready to serve requests right now.
CLI
Python SDK
Route by Model Entity
Route requests using the model entity name. The gateway selects an available provider automatically.
Find available model entities:
CLI
Python SDK
Run inference by passing the Model Entity.
CLI
Python SDK
Route by Provider
Route to a specific provider instance. Use for A/B testing or targeting specific deployments.
Find available providers:
CLI
Python SDK
Pass inference request using provider routing.
CLI
Python SDK
Route using OpenAI SDK
Use the OpenAI-compatible endpoint for drop-in SDK replacement. The model field uses the format {workspace}/{model_entity}.
Model entity naming
The {model_entity} segment is validated against a strict regex
(lowercase letters, digits, and hyphens; 2–63 characters; no slashes).
Vendor-style IDs such as meta/llama-3.3-70b-instruct are rejected with
HTTP 422 — register them as a hyphen-only entity (for example
meta-llama-3-3-70b-instruct) and reference them as
{workspace}/meta-llama-3-3-70b-instruct in the request body. Run
nemo inference models list to see the IDs the gateway accepts.
List Available Models
List models currently reachable via the Inference Gateway. Returns results in OpenAI-compatible format; each id is workspace/model_entity_name.
CLI
Python SDK
Using SDK
Make requests to the OpenAI-compatible Inference Gateway route with the NeMo Platform SDK.
Using OpenAI Python SDK
Models service provides a convenient helper method client.models.get_openai_client() that can generate an OpenAI SDK client for the configured workspace. The SDK also offers helper methods for generating OpenAI-compatible URL strings for Inference Gateway. Refer to SDK Helper Methods for more info.
Using Common OpenAI-Compatible Clients
If you construct third-party clients manually, pass the headers returned by client.models.get_client_default_headers() so auth and identity headers are preserved when authorization is enabled.
OpenAI (manual client)
LiteLLM
Using curl
The OpenAI-compatible chat-completions endpoint is reachable directly over HTTP.
The canonical path is
/apis/inference-gateway/v2/workspaces/{workspace}/openai/-/v1/chat/completions
— resolve the base URL with the CLI rather than hand-assembling it:
If authorization is enabled on your platform, also forward the auth headers
the CLI is configured with (nemo config view shows the configured base URL
and token).
Streaming
The OpenAI compatible Inference Gateway endpoint also supports streaming.