
Run Inference#

Route inference requests through the gateway using model entity routing, provider routing, or OpenAI-compatible routing.

Note

This tutorial assumes you have a deployed model or a registered external provider. Refer to Deploy Models to set up inference endpoints.

# Configure CLI (if not already done)
nmp config set --base-url "$NMP_BASE_URL" --workspace default
import os
from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Listing Models#

There are two distinct ways to list models, depending on what you need:

| Command | What it returns |
| --- | --- |
| nmp models list | All model entities registered in the platform, regardless of whether they have an active deployment. |
| nmp inference models list | Only models currently reachable via the Inference Gateway, in OpenAI-compatible format. Model IDs use the format workspace/model_entity_name. |

Use nmp models list to manage model configurations. Use nmp inference models list to discover what’s ready to serve requests right now.

# All registered model entities (may include models without active deployments)
nmp models list

# Models available for inference via the gateway (OpenAI-compatible IDs)
nmp inference models list
# All registered model entities
for model in sdk.models.list():
    print(model.name)

# Models available for inference via the gateway
models = sdk.inference.models.list()
for model in models.data:
    print(model.id)  # format: workspace/model_entity_name
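
Because gateway model IDs follow the workspace/model_entity_name format, it can be useful to split them, for example to keep only the models in one workspace. The following is a minimal sketch: split_model_id is a hypothetical helper (not part of the SDK), and the sample IDs are illustrative stand-ins for what sdk.inference.models.list() would return.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split a gateway model ID of the form 'workspace/model_entity_name'."""
    # partition() splits on the first "/" only, so anything after it stays
    # in the entity part.
    workspace, sep, entity = model_id.partition("/")
    if not sep or not workspace or not entity:
        raise ValueError(
            f"expected 'workspace/model_entity_name', got {model_id!r}"
        )
    return workspace, entity


# Illustrative IDs (assumption); in practice, use the ids printed above.
sample_ids = ["default/meta-llama-3-2-1b-instruct", "research/my-model"]

# Keep only the models that live in the "default" workspace.
in_default = [mid for mid in sample_ids if split_model_id(mid)[0] == "default"]
print(in_default)
```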

Route by Model Entity#

Route requests using the model entity name. The gateway selects an available provider automatically.

Find available model entities:

nmp models list
for model in sdk.models.list():
    print(model.name)

Run inference by passing the model entity name.

# Model entities are auto-discovered from deployments
# Use the model entity name from 'nmp models list'
nmp chat meta-llama-3-2-1b-instruct 'Hello!' --max-tokens 100
# Model entities are auto-discovered from deployments
response = sdk.inference.gateway.model.post(
    "v1/chat/completions",
    name="meta-llama-3-2-1b-instruct",  # Model entity name
    body={
        "model": "meta/llama-3.2-1b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)
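
The request above returns a chat completion, but the tutorial does not show how to read it. The sketch below assumes the response is available as a parsed JSON dictionary in the standard OpenAI chat completions shape; whether the SDK returns a dict or a typed object is an assumption to verify. first_message_text is a hypothetical helper, and sample_response is hand-written illustration data, not real model output.

```python
from typing import Any, Mapping


def first_message_text(response: Mapping[str, Any]) -> str:
    """Return the text of the first choice in a chat completion response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # "content" can be null (for example, on tool calls), so fall back to "".
    return choices[0]["message"].get("content") or ""


# Hand-written example in the OpenAI chat completions shape (assumption).
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hi there!"},
            "finish_reason": "stop",
        }
    ]
}

print(first_message_text(sample_response))  # prints "Hi there!"
```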

Route by Provider#

Route to a specific provider instance. Use for A/B testing or targeting specific deployments.

Find available providers:

nmp inference providers list

# List models available on a provider if its API is OpenAI compliant
nmp inference gateway provider get v1/models --name llama-3-2-1b-deployment
for provider in sdk.inference.providers.list():
    print(f"{provider.name}: {provider.status}")

# List models available on a provider if its API is OpenAI compliant
models = sdk.inference.gateway.provider.get(
    "v1/models",
    name="llama-3-2-1b-deployment"
)
print(models)

Send an inference request using provider routing.

# Provider name matches deployment name for auto-created providers
nmp chat meta/llama-3.2-1b-instruct 'Hello!' \
    --provider llama-3-2-1b-deployment \
    --max-tokens 100
# Provider name matches deployment name for auto-created providers
response = sdk.inference.gateway.provider.post(
    "v1/chat/completions",
    name="llama-3-2-1b-deployment",  # Provider name
    body={
        "model": "meta/llama-3.2-1b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)
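
For the A/B testing use case mentioned above, one possible approach is to choose the provider name deterministically per user and then pass it as the name argument. This is only a sketch: pick_provider is a hypothetical helper, the hashing scheme is one option among many, and the second provider name is invented for illustration.

```python
import hashlib


def pick_provider(
    user_id: str, provider_a: str, provider_b: str, percent_b: int = 50
) -> str:
    """Deterministically assign a user to provider A or B."""
    # Hash the user ID so the same user always lands in the same bucket.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return provider_b if bucket < percent_b else provider_a


provider_name = pick_provider(
    "user-123",
    provider_a="llama-3-2-1b-deployment",
    provider_b="llama-3-2-1b-deployment-b",  # hypothetical second deployment
    percent_b=10,  # send roughly 10% of users to B
)
print(provider_name)
```

The selected provider_name could then be supplied as name= in the provider routing call shown above.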

Route using OpenAI SDK#

Use the OpenAI-compatible endpoint for drop-in SDK replacement. The model field uses the format {workspace}/{model_entity}.

List Available Models#

List models currently reachable via the Inference Gateway. Returns results in OpenAI-compatible format; each id is workspace/model_entity_name.

nmp inference models list
models = sdk.inference.models.list()
for model in models.data:
    print(model.id)  # format: workspace/model_entity_name

Using SDK#

Make requests to the OpenAI-compatible Inference Gateway route with the NeMo Platform SDK.

response = sdk.inference.gateway.openai.post(
    "v1/chat/completions",
    body={
        "model": "default/meta-llama-3-2-1b-instruct",  # workspace/model-entity
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)

Using OpenAI Python SDK#

The Models service provides a helper method, sdk.models.get_openai_client(), that returns an OpenAI SDK client for the configured workspace. The SDK also offers helper methods that generate OpenAI-compatible URL strings for the Inference Gateway. Refer to SDK Helper Methods for more information.

# Get pre-configured OpenAI client
openai_client = sdk.models.get_openai_client()

response = openai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)

Using Common OpenAI-Compatible Clients#

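If you construct third-party clients manually, forward the SDK's default headers (sdk.default_headers; the examples here obtain them with sdk.models.get_client_default_headers()) so that auth and identity headers are preserved when authorization is enabled.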

from openai import OpenAI

base_url = sdk.models.get_openai_route_base_url()
headers = sdk.models.get_client_default_headers()

client = OpenAI(
    base_url=base_url,
    api_key="not-needed",
    default_headers=headers,
)

response = client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
# Make sure litellm is installed first: pip install litellm
from litellm import completion

base_url = sdk.models.get_openai_route_base_url()
headers = sdk.models.get_client_default_headers()

response = completion(
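    model="openai/default/meta-llama-3-2-1b-instruct",  # "openai/" selects LiteLLM's OpenAI-compatible provider; the rest is workspace/model-entity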
    messages=[{"role": "user", "content": "Hello!"}],
    api_base=base_url,
    api_key="not-needed",
    extra_headers=headers,
)
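
For clients that expose neither default_headers nor extra_headers, the same idea applies at the HTTP level: merge the SDK headers into each request yourself. The sketch below only builds the request with the standard library and does not send it. build_chat_request is a hypothetical helper, and the URL and header values are placeholders; in practice they would come from sdk.models.get_openai_route_base_url() and sdk.models.get_client_default_headers(). The /chat/completions suffix mirrors what the OpenAI client appends to its base_url.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, headers: dict, model: str, prompt: str
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        # Forward the SDK headers and add the JSON content type.
        headers={**headers, "Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    base_url="http://localhost:8080/openai-route-placeholder/v1",  # placeholder
    headers={"X-Example-Header": "example"},  # placeholder header
    model="default/meta-llama-3-2-1b-instruct",
    prompt="Hello!",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is left out here so the sketch runs without a live gateway.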

Streaming#

The OpenAI-compatible Inference Gateway endpoint also supports streaming.

openai_client = sdk.models.get_openai_client()

stream = openai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding."}],
    max_tokens=100,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
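
When the full text is needed after streaming finishes, the chunks can be accumulated instead of printed. The sketch below also skips chunks whose choices list is empty, which OpenAI-style streams can emit (for example, a final usage-only chunk). collect_stream_text is a hypothetical helper, and the fake chunks built with SimpleNamespace stand in for a real stream so the example runs without a gateway.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream_text(stream: Iterable) -> str:
    """Concatenate the delta content of every chunk in a chat stream."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks carry no choices at all
            continue
        content = chunk.choices[0].delta.content
        if content:  # content is None on role-only or final chunks
            parts.append(content)
    return "".join(parts)


def _fake_chunk(content):
    """Build a stand-in object shaped like a streamed chat chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk("Hello"),
    _fake_chunk(None),
    SimpleNamespace(choices=[]),
    _fake_chunk(", world"),
]
print(collect_stream_text(fake_stream))  # prints "Hello, world"
```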