Deploy Models
Deploy models from NGC or HuggingFace. Register external providers like OpenAI or NVIDIA Build.
Resource names for deployments, deployment configs, and providers must contain only letters (a-z, A-Z), digits (0-9), underscores, hyphens, and dots. For example: llama-3.1-8b, my-custom-model, qwen-fs-config.
CLI
Python SDK
Add External Providers
Register external inference APIs like NVIDIA Build or OpenAI.
NVIDIA Build
By default, the platform pre-configures an external provider for NVIDIA Build named nvidia-build in the system workspace.
The example below demonstrates how to recreate it in your own workspace.
For disambiguation purposes, this example names the manually-created version my-nvidia-build.
CLI
Python SDK
OpenAI
CLI
Python SDK
Anthropic
Anthropic’s /v1/messages API expects the API key in an X-Api-Key: header (not Authorization: Bearer) and requires an anthropic-version header on every request. Use --auth-header-format (Jinja2 template, must contain exactly one {{ auth_secret }} variable) to override the default Authorization: Bearer {{ auth_secret }} and pass the API-version pin via --default-extra-headers. Without these, Anthropic rejects every request with 401.
CLI
Python SDK
{{ auth_secret }} is substituted with the resolved secret value at request time.
Deploy from NGC
Deploy pre-built NIM containers from NGC.
Deploy Llama 3.2 1B
CLI
Python SDK
Deploy NemoGuard JailbreakDetect
Deploy classification NIMs like NemoGuard for content safety. Uses the /v1/classify endpoint instead of chat completions.
CLI
Python SDK
Deploy from HuggingFace
HuggingFace deployments use the Multi-LLM NIM (nvcr.io/nim/nvidia/llm-nim:1.13.1) by default, which only supports specific model architectures. Check the supported architectures list before deploying. If your model architecture is not listed, you will need a model-specific NIM image — see Deploy from NGC for that approach.
You can register a HuggingFace model through the Files service. This creates a fileset that acts as a proxy. The Files service handles authentication and caches the weights on first download, so subsequent deployments start faster.
CLI
Python SDK
The fileset format is <workspace>/<fileset-name>. This tells the deployment system to pull weights from the Files service, which proxies the download from HuggingFace using the fileset token_secret. For public models like Qwen, the token_secret on the fileset is optional.
Deploy with vLLM
Serve any HuggingFace-format model with vLLM by setting engine: "vllm" on the deployment config. The model is registered through the Files service exactly as in Deploy from HuggingFace — only the engine field on the deployment config changes. Tensor parallelism is computed automatically from the GPU count and the model’s architecture.
This example deploys Qwen3-1.7B with vLLM.
CLI
Python SDK
Deploy with LoRA Adapters
vLLM deployments support LoRA adapters with hot-reload: enable LoRA on the deployment config, then register one or more adapters against the base model entity. An adapter sidecar delivers each enabled adapter to the running server, and vLLM serves it on a per-request basis. Adapters are registered on the base model entity, not as separate models.
This example serves the linear-algebra LoRA on top of Qwen3-1.7B.
CLI
Python SDK
The default vLLM image is configurable by the platform operator. To override the image per deployment, set image_name / image_tag on executor_config. To pass raw vllm serve flags (for example --max-model-len), use additional_args on executor_config.
Deployment Cleanup
CLI
Python SDK
Multi-GPU Deployments
For larger models requiring multiple GPUs, parallelism configuration depends on the NIM type.
Parallelism Strategies
- Tensor Parallel (TP): Splits model layers across GPUs → best for latency
- Pipeline Parallel (PP): Splits model depth across GPUs → best for throughput
- Formula:
gpu=tp_size×pp_size
Model-Specific NIMs
Model-specific NIMs (for example, nvcr.io/nim/meta/llama-3.1-70b-instruct) have TP/PP settings derived from manifest profiles in the container. Configure enough GPUs and the NIM selects the appropriate profile automatically.
Multi-LLM NIM
The multi-LLM NIM (nvcr.io/nim/nvidia/llm-nim:1.13.1) requires explicit parallelism configuration via environment variables (NIM_TENSOR_PARALLEL_SIZE, NIM_PIPELINE_PARALLEL_SIZE). By default, it uses all GPUs for tensor parallelism (TP=gpu, PP=1).
This example deploys Qwen2.5-14B-Instruct across 2 GPUs using tensor parallelism.
CLI
Python SDK
Custom Parallelism Configuration
For larger models requiring more GPUs, you can configure specific TP/PP splits using additional_envs. The formula is: gpu = NIM_TENSOR_PARALLEL_SIZE × NIM_PIPELINE_PARALLEL_SIZE.
CLI
Python SDK
Choosing Parallelism Strategy
- TP=8, PP=1 (default): Lowest latency, best for real-time applications
- TP=4, PP=2: Balanced latency and throughput
- TP=2, PP=4: Highest throughput, best for batch processing
For custom models, match deployment parallelism to training parallelism for optimal performance.
Chat Templates and Tool Calling
Configure custom chat templates and tool calling for NIM deployments. These settings control how the model formats chat messages and handles function/tool calling.
For more information on chat templates, see the Hugging Face chat templating guide and the NeMo chat templates documentation.
Security consideration: Chat templates are Jinja2 programs that execute on every inference call. While NIM uses a sandboxed Jinja2 environment (mitigating arbitrary code execution), a malicious or misconfigured template can still alter model behavior — for example, by injecting hidden instructions, rewriting messages, or degrading output quality. Grant chat template permissions only to trusted users, and review templates before deploying to production. See Inference-Time Backdoors via Chat Templates (IEEE S&P 2026) for further background.
Configuration Sources
There are two ways to configure chat templates and tool calling, with deployment-level settings taking highest priority:
When both are set, deployment config values override fileset values.
How It Works
- User sets
chat_template,tool_call_parser,tool_call_plugin, and/orauto_tool_choiceon the fileset viametadata.model.tool_calling - The model-spec background task reads the fileset’s
metadata.model.tool_callingand writes them into the model entity’sspec - At deployment time, the platform reads
model_entity.specand any deployment-level overrides to set NIM environment variables
Option 1: Set via Fileset (Recommended)
Set chat_template and tool calling configuration with fileset metadata.model.tool_calling. The platform automatically propagates these into the model entity spec when the model-spec background task runs.
CLI
Python SDK
Option 2: Set via Deployment Config (Override)
Set chat_template and tool_call_config directly on the deployment config. These override any values from the fileset.
CLI
Python SDK
Change Tool Calling Config for an Existing Model
Updating a fileset’s metadata.model.tool_calling does not propagate changes to an existing model entity. The model entity’s spec is populated from the fileset only at creation time. To change the tool calling configuration, create a new fileset with the updated config and a new model entity that references it.
CLI
Python SDK
Custom Tool Call Plugin
For custom tool calling parsers, store the plugin Python file in a separate fileset and reference it via tool_call_plugin.
Because plugins execute arbitrary Python code inside the NIM container, tool_call_plugin is disabled by default at the platform level.
To enable it, set models.tool_call_plugin.enabled: true in the platform configuration and ensure the user has the models.tool-call-plugin.set permission (granted to Admin and PlatformAdmin roles by default).
Python SDK
The platform downloads the plugin fileset at deployment time and passes the .py file path to NIM via the NIM_TOOL_PARSER_PLUGIN environment variable.
Tool Call Config Reference
NIM Environment Variable Mapping
The platform translates tool calling configuration into NIM environment variables:
Controlling Reasoning at Request Level
Some models (for example, nvidia/nemotron-3-nano-30b-a3b) enable reasoning/thinking by default. (From the model card: “[nemotron-3] responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response”) You can disable it on a per-request basis by passing chat_template_kwargs directly in the request body:
CLI
Python SDK
chat_template_kwargs is passed directly in the request body, not nested under extra_body. For more details on request-level reasoning overrides, see the vLLM documentation.
This parameter only applies to models which use vLLM under the hood. For non-vLLM providers such as OpenAI or NVIDIA Build, the parameter that controls reasoning differs. Consult the provider’s documentation for the specific parameter.