About Models and Inference#

The NeMo Platform provides APIs for deploying models, registering external providers, and routing inference requests through a unified gateway.

```mermaid
flowchart TB
    subgraph Self-Hosted
        MDC[ModelDeploymentConfig] --> MD[ModelDeployment]
        MD --> MP1[ModelProvider]
    end

    subgraph External
        API[External API] --> MP2[ModelProvider]
    end

    MP1 --> IG[Inference Gateway]
    MP2 --> IG
    IG --> Client
```

Model Registry#

The Models service manages model entities, deployment configurations, and deployments.

Core Objects#

ModelDeploymentConfig — A versioned blueprint for deploying a NIM container. Specifies GPU count, container image, model name, and optional settings like LoRA support, chat templates, tool calling configuration, or custom environment variables. Configs are reusable. You can create multiple deployments from the same config, and updating a config creates a new version without affecting existing deployments.

ModelDeployment — A running instance of a NIM container based on a ModelDeploymentConfig. Deployments progress through lifecycle states (CREATED → PENDING → READY, or FAILED) as the container is pulled, started, and initialized. A status history (recent status changes with timestamps) is maintained for troubleshooting; the CLI wait commands and SDK use it to show progression. When a deployment reaches the READY state, a ModelProvider is automatically created.

Model — A registered model within the platform, referencing a specific model like nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. The model can be hosted locally via a NIM, with weights stored in FileService or HuggingFace, or it can be an external model made available by a hosted provider (NVIDIA Build, OpenAI, and so on). All Models are served via a ModelProvider.

ModelProvider — A routable inference host. A provider may be registered manually for external APIs (NVIDIA Build, OpenAI, and so on) or auto-created by the Models service for ModelDeployments. All inference requests route through a ModelProvider, which serves one or more Models.
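To make the relationships between the core objects concrete, here is a simplified sketch that models them as Python dataclasses. The field names and values are illustrative assumptions, not the platform's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified models of the core objects and how they
# reference each other; fields are illustrative, not the real schema.

@dataclass
class ModelDeploymentConfig:
    name: str
    version: int          # updating a config creates a new version
    image: str
    gpu_count: int

@dataclass
class ModelDeployment:
    name: str
    config: ModelDeploymentConfig   # created from a versioned, reusable config
    status: str = "CREATED"

@dataclass
class ModelProvider:
    name: str
    endpoint: str                   # internal NIM URL or external API URL
    models: List[str] = field(default_factory=list)

config = ModelDeploymentConfig("llama-cfg", 1, "nvcr.io/nim/llama:latest", 2)
deployment = ModelDeployment("my-deployment", config)
# When the deployment reaches READY, a provider with the same name is auto-created.
provider = ModelProvider(deployment.name, "http://nim.internal:8000", ["llama-3-2-1b"])
```

The key point is the direction of the references: deployments are created from configs, and providers (whether auto-created or manually registered) are the only objects that inference requests ever route through.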

Deployment Lifecycle#

```mermaid
stateDiagram-v2
    [*] --> CREATED
    CREATED --> PENDING
    PENDING --> READY
    PENDING --> FAILED
    READY --> DELETING
    FAILED --> DELETING
    DELETING --> DELETED
    DELETED --> [*]
```

When a deployment reaches READY:

  1. A ModelProvider is auto-created, pointing to the NIM's internal service URL

  2. Models are discovered from the NIM /v1/models endpoint

  3. Model entities are created for each discovered model
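The lifecycle above can be sketched as a small transition table; this is a minimal illustration of the state diagram, not platform code:

```python
# Allowed deployment lifecycle transitions, matching the state diagram above.
TRANSITIONS = {
    "CREATED": {"PENDING"},
    "PENDING": {"READY", "FAILED"},
    "READY": {"DELETING"},
    "FAILED": {"DELETING"},
    "DELETING": {"DELETED"},
    "DELETED": set(),
}

def advance(state: str, target: str) -> str:
    """Move to `target` if the lifecycle allows it, otherwise raise."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

state = "CREATED"
for nxt in ("PENDING", "READY"):
    state = advance(state, nxt)
print(state)  # READY — the point at which a ModelProvider is auto-created
```

Note that READY and FAILED are both terminal with respect to serving: the only way out of either is deletion.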


Model Providers#

Auto-created providers: Created automatically when a ModelDeployment becomes ready. Named the same as the deployment.

```mermaid
flowchart TB
    subgraph Auto-Created
        MD[ModelDeployment] --> MP1[ModelProvider]
        MP1 --> NIM[Self-hosted NIM]
    end
```

Manual providers: Created by users for external inference endpoints. The API key must be stored in the Secrets service before the provider can be registered.

```mermaid
flowchart TB
    subgraph User-Created
        User --> MP2[ModelProvider]
        MP2 --> External[External API]
    end
```

Chat Templates and Tool Calling#

Models can be configured with custom chat templates and tool calling support. Configuration flows through two layers:

  1. Fileset metadata.tool_calling — Set chat_template, tool_call_parser, tool_call_plugin, and auto_tool_choice on the fileset’s metadata. The model-spec background task propagates these into the model entity’s spec automatically.

  2. Deployment config — Set chat_template and tool_call_config on the nim_deployment. These override fileset-level values per-deployment.

At deployment time, the platform translates these settings into NIM environment variables (NIM_CHAT_TEMPLATE, NIM_TOOL_CALL_PARSER, NIM_TOOL_PARSER_PLUGIN, NIM_ENABLE_AUTO_TOOL_CHOICE).
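The two-layer precedence can be sketched as a merge followed by a rename. This is a hypothetical illustration of the behavior described above; the dictionary keys and merge logic are assumptions, while the environment variable names come from the text:

```python
# Hypothetical sketch of how tool-calling settings could map to the NIM
# environment variables named above; keys and merge logic are assumptions.

def to_nim_env(fileset_meta: dict, deployment_cfg: dict) -> dict:
    """Merge fileset-level metadata with per-deployment overrides."""
    merged = {**fileset_meta, **deployment_cfg}  # deployment config wins
    mapping = {
        "chat_template": "NIM_CHAT_TEMPLATE",
        "tool_call_parser": "NIM_TOOL_CALL_PARSER",
        "tool_call_plugin": "NIM_TOOL_PARSER_PLUGIN",
        "auto_tool_choice": "NIM_ENABLE_AUTO_TOOL_CHOICE",
    }
    return {env: str(merged[key]) for key, env in mapping.items() if key in merged}

env = to_nim_env(
    {"chat_template": "default.jinja", "tool_call_parser": "hermes"},
    {"chat_template": "custom.jinja"},  # deployment-level override
)
```

Here the deployment config's `custom.jinja` wins over the fileset's `default.jinja`, while the fileset's parser setting carries through untouched.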

For step-by-step instructions, see Chat Templates and Tool Calling.


Inference Gateway#

Inference Gateway is a Layer 7 reverse proxy providing unified access to all inference endpoints. It supports three routing patterns:

Routing Patterns#

| Pattern | Endpoint | Use Case |
| --- | --- | --- |
| Model Entity | `.../gateway/model/{name}/-/*` | Route by model name |
| Provider | `.../gateway/provider/{name}/-/*` | Route to a specific provider (A/B testing) |
| OpenAI | `.../gateway/openai/-/*` | OpenAI SDK compatibility (model in body) |

All patterns use /-/ as a separator. Everything after /-/ is forwarded to the backend unchanged.
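The separator rule can be illustrated with a small split function. This is a sketch of the gateway's documented behavior, not its implementation:

```python
# Sketch of the /-/ split: everything before the separator selects the route,
# everything after is forwarded to the backend unchanged.

def split_gateway_path(path: str) -> tuple:
    route, sep, forwarded = path.partition("/-/")
    if not sep:
        raise ValueError("gateway paths must contain the /-/ separator")
    return route, "/" + forwarded

route, forwarded = split_gateway_path(
    "/v2/workspaces/default/inference/gateway/model/llama-3-2-1b/-/v1/chat/completions"
)
```

For this path, `route` is the gateway-side model selector and `forwarded` is `/v1/chat/completions`, which reaches the backend exactly as the client sent it.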

Path Examples#

```text
# Model entity routing
/v2/workspaces/default/inference/gateway/model/llama-3-2-1b/-/v1/chat/completions

# Provider routing
/v2/workspaces/default/inference/gateway/provider/my-deployment/-/v1/chat/completions

# OpenAI routing (model specified in request body as "workspace/model-entity")
/v2/workspaces/default/inference/gateway/openai/-/v1/chat/completions
```
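Constructing these paths client-side is straightforward string assembly. The helper below is a hypothetical convenience, not part of the SDK (the SDK's own URL helpers are shown in the next section):

```python
# Hypothetical helper that builds the three gateway URL patterns shown above
# from a base URL and workspace; not part of the SDK.

def gateway_url(base: str, workspace: str, pattern: str, name: str = "") -> str:
    prefix = f"{base}/v2/workspaces/{workspace}/inference/gateway"
    if pattern == "openai":
        return f"{prefix}/openai/-"       # model goes in the request body
    return f"{prefix}/{pattern}/{name}/-"  # model or provider named in the path

base = "http://localhost:8080"
model_url = gateway_url(base, "default", "model", "llama-3-2-1b")
provider_url = gateway_url(base, "default", "provider", "my-deployment")
openai_url = gateway_url(base, "default", "openai")
```

An OpenAI-compatible client would then append `/v1/chat/completions` (or another OpenAI path) after the `/-` separator.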

SDK Helper Methods#

Set up the CLI or Python SDK first:

```shell
# Configure CLI (if not already done)
nmp config set --base-url "$NMP_BASE_URL" --workspace default
```

```python
import os
from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
```

The SDK provides convenience methods for OpenAI compatibility:

```python
model_name = "my-model"
provider_name = "my-provider"
deployment_name = "my-deployment"
workspace = "default"

# Get a pre-configured OpenAI client
openai_client = sdk.models.get_openai_client()

# Get base URLs for the different routing patterns
print(sdk.models.get_openai_route_base_url())

entity = sdk.models.retrieve(model_name, workspace=workspace)
print(sdk.models.get_model_entity_route_openai_url(entity))

provider = sdk.inference.providers.retrieve(provider_name, workspace=workspace)
print(sdk.models.get_provider_route_openai_url(provider))

deployment = sdk.inference.deployments.retrieve(deployment_name, workspace=workspace)
print(sdk.models.get_provider_route_openai_url_for_deployment(deployment))
```

API Reference#

For complete API details, refer to the API Reference and SDK Reference.