Run Inference


Route inference requests through the gateway using model entity routing, provider routing, or OpenAI-compatible routing.

This tutorial assumes you have an external provider registered. The platform pre-configures a system/nvidia-build provider on startup, which most of the examples below use. To register your own provider, see About Models and Inference.

$# Configure CLI (if not already done)
$nemo config set --base-url "$NMP_BASE_URL" --workspace default

Listing Models

There are two distinct ways to list models, depending on what you need:

| Command | What it returns |
| --- | --- |
| nemo models list | All model entities registered in the platform, regardless of whether they have an active deployment. |
| nemo inference models list | Only models currently reachable via the Inference Gateway, in OpenAI-compatible format. Model IDs use the format workspace/model_entity_name. |

Use nemo models list to manage model configurations. Use nemo inference models list to discover what’s ready to serve requests right now.

$# All registered model entities (may include models without active deployments)
$nemo models list
$
$# Models available for inference via the gateway (OpenAI-compatible IDs)
$nemo inference models list

Route by Model Entity

Route requests using the model entity name. The gateway selects an available provider automatically.

Find available model entities:

$nemo models list

Run inference by passing the model entity name:

$# Model entities are auto-discovered from deployments
$# Use the model entity name from 'nemo models list'
$nemo chat meta-llama-3-2-1b-instruct 'Hello!' --max-tokens 100

Route by Provider

Route to a specific provider instance. Use for A/B testing or targeting specific deployments.

Find available providers:

$nemo inference providers list
$
$# List the models available on a provider, if its API is OpenAI-compatible
$nemo inference gateway provider get v1/models --name llama-3-2-1b-deployment

Send an inference request using provider routing:

$# Provider name matches deployment name for auto-created providers
$nemo chat meta/llama-3.2-1b-instruct 'Hello!' \
>--provider llama-3-2-1b-deployment \
>--max-tokens 100
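
As a rough illustration of the A/B testing use case, the sketch below assigns each user to one of two providers deterministically and builds the matching nemo chat command line. It only assembles the arguments and does not call the gateway. The second provider name (llama-3-2-1b-candidate) and the user ID are hypothetical, and hashing the user ID is just one possible way to split traffic, not something the platform prescribes.

```python
import hashlib
import shlex


def pick_provider(user_id: str, providers: list[str]) -> str:
    """Map a user ID to one provider so the same user always lands in the same bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(providers)
    return providers[bucket]


def chat_command(model: str, prompt: str, provider: str, max_tokens: int = 100) -> list[str]:
    """Build the argv for `nemo chat` with provider routing (flags taken from the example above)."""
    return [
        "nemo", "chat", model, prompt,
        "--provider", provider,
        "--max-tokens", str(max_tokens),
    ]


# The first name comes from the example above; the second is a hypothetical candidate deployment.
providers = ["llama-3-2-1b-deployment", "llama-3-2-1b-candidate"]

provider = pick_provider("user-42", providers)
argv = chat_command("meta/llama-3.2-1b-instruct", "Hello!", provider)

# Print a shell-safe command line that could be pasted into a terminal.
print(shlex.join(argv))
```

Because the bucket depends only on the user ID, repeated requests from the same user keep hitting the same provider, which makes per-provider comparisons easier to interpret.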

Route using OpenAI SDK

Use the OpenAI-compatible endpoint for drop-in SDK replacement. The model field uses the format {workspace}/{model_entity}.

Model entity naming

The {model_entity} segment is validated against a strict regex (lowercase letters, digits, and hyphens; 2–63 characters; no slashes). Vendor-style IDs such as meta/llama-3.3-70b-instruct are rejected with HTTP 422 — register them as a hyphen-only entity (for example meta-llama-3-3-70b-instruct) and reference them as {workspace}/meta-llama-3-3-70b-instruct in the request body. Run nemo inference models list to see the IDs the gateway accepts.
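
The naming rule above can be checked client-side before a request is sent. The following is a minimal sketch: the regular expression is an approximation built from the description (lowercase letters, digits, hyphens, 2 to 63 characters), not the gateway's actual pattern, and the helper names is_valid_entity, to_entity_name, and model_field are illustrative only.

```python
import re

# Approximation of the documented rule; the gateway's real regex may be stricter.
ENTITY_RE = re.compile(r"[a-z0-9-]{2,63}")


def is_valid_entity(name: str) -> bool:
    """Return True if the name uses only lowercase letters, digits, and hyphens (2-63 chars)."""
    return ENTITY_RE.fullmatch(name) is not None


def to_entity_name(vendor_id: str) -> str:
    """Turn a vendor-style ID into a hyphen-only entity name."""
    # Replace every run of characters outside [a-z0-9] (slashes, dots, hyphens) with one hyphen.
    name = re.sub(r"[^a-z0-9]+", "-", vendor_id.lower()).strip("-")
    if not is_valid_entity(name):
        raise ValueError(f"cannot derive a valid entity name from {vendor_id!r}")
    return name


def model_field(workspace: str, entity: str) -> str:
    """Build the {workspace}/{model_entity} value for the request's model field."""
    if not is_valid_entity(entity):
        raise ValueError(f"invalid model entity name: {entity!r}")
    return f"{workspace}/{entity}"


entity = to_entity_name("meta/llama-3.3-70b-instruct")
print(entity)
print(model_field("default", entity))
```

Validating locally only saves a round trip; the gateway remains the source of truth, so nemo inference models list is still the way to confirm which IDs it accepts.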

List Available Models

List models currently reachable via the Inference Gateway. Returns results in OpenAI-compatible format; each id is workspace/model_entity_name.

$nemo inference models list
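
Since each id has the form workspace/model_entity_name, a listing can be split back into its two parts. The sketch below assumes you already hold the listing as a parsed JSON object in the usual OpenAI list-models shape (an object with a data array of entries carrying an id). The sample payload and the research workspace are made up for illustration.

```python
from collections import defaultdict

# Hypothetical response in the OpenAI list-models shape; the IDs are illustrative.
sample_response = {
    "object": "list",
    "data": [
        {"id": "default/meta-llama-3-2-1b-instruct", "object": "model"},
        {"id": "default/meta-llama-3-3-70b-instruct", "object": "model"},
        {"id": "research/meta-llama-3-2-1b-instruct", "object": "model"},
    ],
}


def group_by_workspace(response: dict) -> dict[str, list[str]]:
    """Group model entity names by the workspace prefix of each model ID."""
    grouped = defaultdict(list)
    for model in response.get("data", []):
        # Entity names contain no slashes, so the first "/" separates workspace from entity.
        workspace, _, entity = model["id"].partition("/")
        grouped[workspace].append(entity)
    return dict(grouped)


models_by_workspace = group_by_workspace(sample_response)
for workspace, entities in sorted(models_by_workspace.items()):
    print(workspace, entities)
```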

Using SDK

Make requests to the OpenAI-compatible Inference Gateway route with the NeMo Platform SDK.

response = client.inference.gateway.openai.post(
    "v1/chat/completions",
    body={
        "model": "default/meta-llama-3-2-1b-instruct",  # workspace/model-entity
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
    },
)

Using OpenAI Python SDK

The Models service provides a convenient helper method, client.models.get_openai_client(), that returns an OpenAI SDK client for the configured workspace. The SDK also offers helper methods for generating OpenAI-compatible URL strings for the Inference Gateway. Refer to SDK Helper Methods for more information.

# Get pre-configured OpenAI client
oai_client = client.models.get_openai_client()

response = oai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)

Using Common OpenAI-Compatible Clients

If you construct third-party clients manually, pass the headers returned by client.models.get_client_default_headers() so auth and identity headers are preserved when authorization is enabled.

from openai import OpenAI

base_url = client.models.get_openai_route_base_url()
headers = client.models.get_client_default_headers()

oai_client = OpenAI(
    base_url=base_url,
    api_key="not-needed",
    default_headers=headers,
)

response = oai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

Using curl

The OpenAI-compatible chat-completions endpoint is reachable directly over HTTP. The canonical path is /apis/inference-gateway/v2/workspaces/{workspace}/openai/-/v1/chat/completions — resolve the base URL with the CLI rather than hand-assembling it:

$BASE_URL=$(nemo inference get-url)
$
$curl -s "$BASE_URL/chat/completions" \
> -H 'content-type: application/json' \
> -d '{
> "model": "default/meta-llama-3-2-1b-instruct",
> "messages": [{"role": "user", "content": "Hello!"}],
> "max_tokens": 100
> }' | jq

If authorization is enabled on your platform, also forward the auth headers the CLI is configured with (nemo config view shows the configured base URL and token).
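
For scripts that cannot shell out to curl, the same request can be assembled with the Python standard library. This is a sketch under assumptions: the base URL shown is a placeholder standing in for the value printed by nemo inference get-url, build_chat_request is an illustrative helper, and extra_headers is where any auth headers your platform requires would go. The block only builds the request; the commented-out lines show how it could be sent.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=100, extra_headers=None):
    """Build a POST request mirroring the curl call above, without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"content-type": "application/json", **(extra_headers or {})}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder base URL; in practice use the output of `nemo inference get-url`.
request = build_chat_request(
    "https://nmp.example.com/base",
    "default/meta-llama-3-2-1b-instruct",
    "Hello!",
)
print(request.get_method(), request.full_url)

# To send the request against a real gateway:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```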

Streaming

The OpenAI-compatible Inference Gateway endpoint also supports streaming. Pass stream=True and iterate over the returned chunks as they arrive.

oai_client = client.models.get_openai_client()

stream = oai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding."}],
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    # Some chunks may carry no choices or an empty delta; skip those.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
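
When the full text is needed after streaming finishes, the deltas can be accumulated instead of printed. The sketch below shows one way to do that; collect_stream is an illustrative helper, and the fake chunks built with SimpleNamespace only imitate the attributes used above (choices, delta, content) so the example runs without a gateway.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Concatenate the text deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real stream chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Stand-in stream: includes an empty delta and a chunk without choices.
fake_stream = [
    fake_chunk("Hello"),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk(", world"),
]

full_text = collect_stream(fake_stream)
print(full_text)
```

With a real client, the stream object returned by oai_client.chat.completions.create(..., stream=True) would be passed to collect_stream in place of fake_stream.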