# About Models and Inference
The NeMo Platform provides APIs for deploying models, registering external providers, and routing inference requests through a unified gateway.
```mermaid
flowchart TB
    subgraph Self-Hosted
        MDC[ModelDeploymentConfig] --> MD[ModelDeployment]
        MD --> MP1[ModelProvider]
    end
    subgraph External
        API[External API] --> MP2[ModelProvider]
    end
    MP1 --> IG[Inference Gateway]
    MP2 --> IG
    IG --> Client
```
## Model Registry
The Models service manages model entities, deployment configurations, and deployments.
### Core Objects
**ModelDeploymentConfig** — A versioned blueprint for deploying a NIM container. Specifies GPU count, container image, model name, and optional settings like LoRA support, chat templates, tool calling configuration, or custom environment variables. Configs are reusable. You can create multiple deployments from the same config, and updating a config creates a new version without affecting existing deployments.

**ModelDeployment** — A running instance of a NIM container based on a ModelDeploymentConfig. Deployments progress through lifecycle states (CREATED → PENDING → READY or FAILED) as the container is pulled, started, and initialized. A status history (recent status changes with timestamps) is maintained for troubleshooting; the CLI wait commands and SDK use it to show progression. When a deployment reaches the READY state, a ModelProvider is automatically created.

**Model** — A registered model within the platform, referencing a specific model like nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. The model can be hosted locally via a NIM, with weights stored in FileService or HuggingFace, or it can be an external model made available by a hosted provider (NVIDIA Build, OpenAI, and so on). All Models are served via a ModelProvider.

**ModelProvider** — A routable inference host. The provider may be manually registered for external APIs (NVIDIA Build, OpenAI, and so on) or auto-created by the Models service for ModelDeployments. All inference requests route through a ModelProvider, which serves one or more Models.
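To make the ModelDeploymentConfig fields concrete, here is a rough sketch of the kind of information such a blueprint carries. The field names below (`model_name`, `image`, `gpus`, `lora_enabled`, `env`) are illustrative assumptions, not the platform's actual schema:

```python
# Hypothetical sketch of a ModelDeploymentConfig payload. Field names are
# assumptions for illustration only, not the platform's real schema.
config = {
    "name": "nemotron-nano-config",
    "model_name": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",  # model served by the NIM
    "image": "nvcr.io/nim/example:latest",                       # NIM container image (placeholder)
    "gpus": 1,                                                   # GPU count per replica
    "lora_enabled": False,                                       # optional LoRA support
    "env": {"EXAMPLE_VAR": "value"},                             # custom environment variables
}
```

Because configs are versioned and reusable, the same blueprint can back several deployments, and updating it yields a new version without disturbing deployments created from older versions.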
### Deployment Lifecycle
```mermaid
stateDiagram-v2
    [*] --> CREATED
    CREATED --> PENDING
    PENDING --> READY
    PENDING --> FAILED
    READY --> DELETING
    FAILED --> DELETING
    DELETING --> DELETED
    DELETED --> [*]
```
When a deployment reaches READY:

- A ModelProvider is auto-created pointing to the NIM internal service URL.
- Models are discovered from the NIM `/v1/models` endpoint.
- A Model entity is created for each discovered model.
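The lifecycle progression can be polled until the deployment settles. A minimal sketch, assuming a `fetch_status` callable that returns the current lifecycle state as a string (in practice this might wrap `sdk.inference.deployments.retrieve`):

```python
import time

# READY and FAILED are the terminal states of the create path shown above.
TERMINAL = {"READY", "FAILED"}

def wait_for_ready(fetch_status, timeout_s=600, poll_s=1.0):
    """Poll a deployment's lifecycle state until it settles or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in TERMINAL:
            return status
        time.sleep(poll_s)
    raise TimeoutError("deployment did not reach READY or FAILED in time")

# Demo with a stubbed status source that walks the lifecycle states:
states = iter(["CREATED", "PENDING", "PENDING", "READY"])
print(wait_for_ready(lambda: next(states), poll_s=0))  # READY
```

The CLI wait commands and the SDK offer equivalent behavior built on the deployment's status history; this sketch only shows the polling shape.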
## Model Providers
**Auto-created providers**: Created automatically when a ModelDeployment becomes ready. Named the same as the deployment.
```mermaid
flowchart TB
    subgraph Auto-Created
        MD[ModelDeployment] --> MP1[ModelProvider]
        MP1 --> NIM[Self-hosted NIM]
    end
```
**Manual providers**: Created by users for external inference endpoints. Registering one requires first storing the API key in the Secrets service.
```mermaid
flowchart TB
    subgraph User-Created
        User --> MP2[ModelProvider]
        MP2 --> External[External API]
    end
```
## Chat Templates and Tool Calling
Models can be configured with custom chat templates and tool calling support. Configuration flows through two layers:
1. Fileset `metadata.tool_calling` — Set `chat_template`, `tool_call_parser`, `tool_call_plugin`, and `auto_tool_choice` on the fileset's metadata. The model-spec background task propagates these into the model entity's `spec` automatically.
2. Deployment config — Set `chat_template` and `tool_call_config` on the `nim_deployment`. These override fileset-level values per-deployment.
At deployment time, the platform translates these settings into NIM environment variables (`NIM_CHAT_TEMPLATE`, `NIM_TOOL_CALL_PARSER`, `NIM_TOOL_PARSER_PLUGIN`, `NIM_ENABLE_AUTO_TOOL_CHOICE`).
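That translation amounts to a key mapping from config settings to container environment variables. The mapping pairs come from the variable names above; the helper function itself is an illustrative sketch, not platform code:

```python
# Setting-name to NIM environment-variable mapping, per the names above.
SETTING_TO_ENV = {
    "chat_template": "NIM_CHAT_TEMPLATE",
    "tool_call_parser": "NIM_TOOL_CALL_PARSER",
    "tool_call_plugin": "NIM_TOOL_PARSER_PLUGIN",
    "auto_tool_choice": "NIM_ENABLE_AUTO_TOOL_CHOICE",
}

def to_nim_env(settings: dict) -> dict:
    """Translate tool-calling settings into NIM container env vars (sketch)."""
    return {SETTING_TO_ENV[k]: str(v) for k, v in settings.items() if k in SETTING_TO_ENV}

env = to_nim_env({"chat_template": "/templates/chat.jinja", "auto_tool_choice": True})
print(env)
# {'NIM_CHAT_TEMPLATE': '/templates/chat.jinja', 'NIM_ENABLE_AUTO_TOOL_CHOICE': 'True'}
```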
For step-by-step instructions, see Chat Templates and Tool Calling.
## Inference Gateway
The Inference Gateway is a Layer 7 reverse proxy that provides unified access to all inference endpoints. It supports three routing patterns:
### Routing Patterns
| Pattern | Endpoint | Use Case |
|---|---|---|
| Model Entity | `/v2/workspaces/{workspace}/inference/gateway/model/{model}/-/{path}` | Route by model name |
| Provider | `/v2/workspaces/{workspace}/inference/gateway/provider/{provider}/-/{path}` | Route to specific provider (A/B testing) |
| OpenAI | `/v2/workspaces/{workspace}/inference/gateway/openai/-/{path}` | OpenAI SDK compatibility (model in body) |
All patterns use `/-/` as a separator. Everything after `/-/` is forwarded to the backend unchanged.
### Path Examples
```
# Model entity routing
/v2/workspaces/default/inference/gateway/model/llama-3-2-1b/-/v1/chat/completions

# Provider routing
/v2/workspaces/default/inference/gateway/provider/my-deployment/-/v1/chat/completions

# OpenAI routing (model specified in request body as "workspace/model-entity")
/v2/workspaces/default/inference/gateway/openai/-/v1/chat/completions
```
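The `/-/` convention can be captured in a small helper; a sketch assuming only the path shapes shown above (the function names are illustrative, not SDK methods):

```python
def gateway_path(workspace, pattern, target, backend_path):
    """Build a gateway path; everything after /-/ is forwarded unchanged."""
    prefix = f"/v2/workspaces/{workspace}/inference/gateway/{pattern}"
    if target:  # model/provider patterns name a target; the openai pattern does not
        prefix += f"/{target}"
    return f"{prefix}/-/{backend_path.lstrip('/')}"

def backend_part(path):
    """Return the portion the gateway forwards to the backend (after the first /-/)."""
    return "/" + path.split("/-/", 1)[1]

p = gateway_path("default", "model", "llama-3-2-1b", "/v1/chat/completions")
print(p)
# /v2/workspaces/default/inference/gateway/model/llama-3-2-1b/-/v1/chat/completions
print(backend_part(p))  # /v1/chat/completions
```

Splitting on the first `/-/` mirrors the gateway's behavior: the prefix selects the route, and the remainder is passed through verbatim.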
## SDK Helper Methods
Set up the CLI or Python SDK first:
```shell
# Configure CLI (if not already done)
nmp config set --base-url "$NMP_BASE_URL" --workspace default
```
```python
import os

from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
```
The SDK provides convenience methods for OpenAI compatibility:
```python
# Get a pre-configured OpenAI client
model_name = "my-model"
provider_name = "my-provider"
deployment_name = "my-deployment"
workspace = "default"

openai_client = sdk.models.get_openai_client()

# Get base URLs for the different routing patterns
sdk.models.get_openai_route_base_url()

entity = sdk.models.retrieve(model_name, workspace=workspace)
entity_url = sdk.models.get_model_entity_route_openai_url(entity)
print(entity_url)

provider = sdk.inference.providers.retrieve(provider_name, workspace=workspace)
provider_url = sdk.models.get_provider_route_openai_url(provider)
print(provider_url)

deployment = sdk.inference.deployments.retrieve(deployment_name, workspace=workspace)
deployment_url = sdk.models.get_provider_route_openai_url_for_deployment(deployment)
print(deployment_url)
```
## API Reference
For complete API details, refer to the API Reference and SDK Reference.