Run Inference#
Route inference requests through the gateway using model entity routing, provider routing, or OpenAI-compatible routing.
Note
This tutorial assumes you have a model deployed or external provider registered. Refer to Deploy Models to set up inference endpoints.
# Configure CLI (if not already done)
nmp config set --base-url "$NMP_BASE_URL" --workspace default
import os
from nemo_platform import NeMoPlatform
sdk = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
Listing Models#
There are two distinct ways to list models, depending on what you need:
| Command | What it returns |
|---|---|
| `nmp models list` | All model entities registered in the platform, regardless of whether they have an active deployment. |
| `nmp inference models list` | Only models currently reachable via the Inference Gateway, in OpenAI-compatible format. Model IDs use the format `workspace/model_entity_name`. |
Use nmp models list to manage model configurations. Use nmp inference models list to discover what’s ready to serve requests right now.
# All registered model entities (may include models without active deployments)
nmp models list
# Models available for inference via the gateway (OpenAI-compatible IDs)
nmp inference models list
# All registered model entities
for model in sdk.models.list():
print(model.name)
# Models available for inference via the gateway
models = sdk.inference.models.list()
for model in models.data:
print(model.id) # format: workspace/model_entity_name
Route by Model Entity#
Route requests using the model entity name. The gateway selects an available provider automatically.
Find available model entities:
nmp models list
for model in sdk.models.list():
print(model.name)
Run inference by passing the model entity name.
# Model entities are auto-discovered from deployments
# Use the model entity name from 'nmp models list'
nmp chat meta-llama-3-2-1b-instruct 'Hello!' --max-tokens 100
# Model entities are auto-discovered from deployments
response = sdk.inference.gateway.model.post(
"v1/chat/completions",
name="meta-llama-3-2-1b-instruct", # Model entity name
body={
"model": "meta/llama-3.2-1b-instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}
)
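The post helpers above return the gateway's raw response. Assuming the gateway returns a standard OpenAI-style chat completion payload (a sketch using a hand-built sample payload, not the live response object):

```python
# Sample payload mirroring the OpenAI chat-completion shape; in a live call,
# apply the same access pattern to the `response` returned above.
sample_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello! How can I help you today?"}}
    ]
}

# Extract the assistant's reply from the first choice
reply = sample_response["choices"][0]["message"]["content"]
print(reply)
```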
Route by Provider#
Route to a specific provider instance. Use for A/B testing or targeting specific deployments.
Find available providers:
nmp inference providers list
# List models available on a provider if its API is OpenAI-compliant
nmp inference gateway provider get v1/models --name llama-3-2-1b-deployment
for provider in sdk.inference.providers.list():
print(f"{provider.name}: {provider.status}")
# List models available on a provider if its API is OpenAI-compliant
models = sdk.inference.gateway.provider.get(
"v1/models",
name="llama-3-2-1b-deployment"
)
print(models)
Send the inference request using provider routing.
# Provider name matches deployment name for auto-created providers
nmp chat meta/llama-3.2-1b-instruct 'Hello!' \
--provider llama-3-2-1b-deployment \
--max-tokens 100
# Provider name matches deployment name for auto-created providers
response = sdk.inference.gateway.provider.post(
"v1/chat/completions",
name="llama-3-2-1b-deployment", # Provider name
body={
"model": "meta/llama-3.2-1b-instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}
)
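For the A/B testing use case mentioned earlier, one simple approach is a weighted random split over provider names before calling the provider route. This is an illustrative sketch only; the canary provider name and the 90/10 split are assumptions, not part of the platform:

```python
import random

# Two hypothetical provider instances serving the same model entity;
# "llama-3-2-1b-canary" is an illustrative name, not a real deployment.
providers = {
    "llama-3-2-1b-deployment": 0.9,  # stable: 90% of traffic
    "llama-3-2-1b-canary": 0.1,      # canary: 10% of traffic
}

# Pick a provider per request according to the weights
chosen = random.choices(list(providers), weights=list(providers.values()))[0]
```

The chosen name would then be passed as `name=` to `sdk.inference.gateway.provider.post(...)` as shown above.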
Route using OpenAI SDK#
Use the OpenAI-compatible endpoint as a drop-in replacement for the OpenAI SDK. The model field uses the format {workspace}/{model_entity}.
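Because the model field is just {workspace}/{model_entity}, constructing it is a plain string join. The names below match the examples used throughout this tutorial:

```python
workspace = "default"
model_entity = "meta-llama-3-2-1b-instruct"

# OpenAI-compatible model ID expected by the gateway
model_id = f"{workspace}/{model_entity}"
print(model_id)  # default/meta-llama-3-2-1b-instruct
```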
List Available Models#
List models currently reachable via the Inference Gateway. Returns results in OpenAI-compatible format; each id is workspace/model_entity_name.
nmp inference models list
models = sdk.inference.models.list()
for model in models.data:
print(model.id) # format: workspace/model_entity_name
Using SDK#
Make requests to the OpenAI-compatible Inference Gateway route with the NeMo Platform SDK.
response = sdk.inference.gateway.openai.post(
"v1/chat/completions",
body={
"model": "default/meta-llama-3-2-1b-instruct", # workspace/model-entity
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}
)
Using OpenAI Python SDK#
The Models service provides a convenient helper method, sdk.models.get_openai_client(), that generates an OpenAI SDK client for the configured workspace. The SDK also offers helper methods for generating OpenAI-compatible URL strings for the Inference Gateway. Refer to SDK Helper Methods for more info.
# Get pre-configured OpenAI client
openai_client = sdk.models.get_openai_client()
response = openai_client.chat.completions.create(
model="default/meta-llama-3-2-1b-instruct",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=100
)
Using Common OpenAI-Compatible Clients#
If you construct third-party clients manually, forward sdk.default_headers so auth and identity headers are preserved when authorization is enabled.
from openai import OpenAI
base_url = sdk.models.get_openai_route_base_url()
headers = sdk.models.get_client_default_headers()
client = OpenAI(
base_url=base_url,
api_key="not-needed",
default_headers=headers,
)
response = client.chat.completions.create(
model="default/meta-llama-3-2-1b-instruct",
messages=[{"role": "user", "content": "Hello!"}],
)
# make sure litellm is installed
pip install litellm
from litellm import completion
base_url = sdk.models.get_openai_route_base_url()
headers = sdk.models.get_client_default_headers()
response = completion(
model="openai/meta-llama-3-2-1b-instruct",
messages=[{"role": "user", "content": "Hello!"}],
api_base=base_url,
api_key="not-needed",
extra_headers=headers,
)
Streaming#
The OpenAI-compatible Inference Gateway endpoint also supports streaming.
openai_client = sdk.models.get_openai_client()
stream = openai_client.chat.completions.create(
model="default/meta-llama-3-2-1b-instruct",
messages=[{"role": "user", "content": "Write a haiku about coding."}],
max_tokens=100,
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
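To reassemble the full reply from a stream, concatenate the content deltas. This sketch uses simplified dicts in place of the SDK's chunk objects; the delta-access pattern is the same as in the loop above:

```python
# Simplified stand-ins for streamed chat-completion chunks; a real stream
# yields typed objects, but each carries a delta with optional content.
chunks = [
    {"choices": [{"delta": {"content": "Silent keys at night\n"}}]},
    {"choices": [{"delta": {"content": "loops converge"}}]},
    {"choices": [{"delta": {}}]},  # final chunk often carries no content
]

# Concatenate every non-empty delta into the complete reply
full_reply = "".join(
    chunk["choices"][0]["delta"].get("content", "") for chunk in chunks
)
print(full_reply)
```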