Model-Free NIM#
Run any supported model without a model-specific container image.
By default, NIM containers ship with a built-in model manifest that defines which model to serve. In model-free mode, you point a generic NIM container at any supported model, such as a Hugging Face repo, an S3 bucket, or a local directory. NIM then generates a manifest at startup and serves that model.
Model-free NIM is useful in the following scenarios:
Flexible single-container deployments: One container image can pass security review and serve any supported model.
Day-zero model support: You can serve newly released models without waiting for a model-specific NIM container.
Custom and fine-tuned models: You can serve your own models from any supported source.
Note
NIM can serve a model only if the inference backend in your container (vLLM or SGLang) supports its architecture, regardless of where the model is hosted.
Configuring the Model#
Supply the model via either of these methods. If both are provided, the CLI positional argument takes precedence.
NIM_MODEL_PATH Environment Variable#
Use NIM_MODEL_PATH to point the container to the model at runtime:
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e HF_TOKEN=<yourtoken> \
-p 8000:8000 \
${NIM_LLM_IMAGE}
Backend CLI Positional Argument#
Use the backend positional argument to pass the model path at runtime. This passthrough works for both the vLLM and SGLang backends; the profile descriptions are prefixed vllm- or sglang- depending on the image you run.
docker run --gpus=all \
-e HF_TOKEN=<yourtoken> \
-p 8000:8000 \
${NIM_LLM_IMAGE} \
hf://meta-llama/Llama-3.1-8B-Instruct
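The precedence rule between the two methods can be pictured with a small shell helper. This is only an illustrative sketch: `resolve_model_path` is a hypothetical function written for this page, not part of NIM, and it mirrors the documented behavior (the positional argument wins over NIM_MODEL_PATH) rather than NIM's actual implementation.

```shell
# Hypothetical helper (not part of NIM): shows which model path would win
# under the documented precedence rule.
# Usage: resolve_model_path [positional-model-argument]
resolve_model_path() {
  if [ -n "${1:-}" ]; then
    # A CLI positional argument takes precedence when both are provided.
    printf '%s\n' "$1"
  elif [ -n "${NIM_MODEL_PATH:-}" ]; then
    # Otherwise fall back to the environment variable.
    printf '%s\n' "$NIM_MODEL_PATH"
  else
    echo "no model configured: set NIM_MODEL_PATH or pass a positional argument" >&2
    return 1
  fi
}
```

A wrapper script could call a helper like this before `docker run` to log which model a deployment will actually serve.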
Supported Model Sources#
The following table lists the model sources supported by model-free NIM, along with the required URI prefix and authentication method.
| Prefix | Source | Example | Authentication |
|---|---|---|---|
| hf:// | Hugging Face Hub | | |
| ngc:// | NVIDIA NGC | | |
| s3:// | AWS S3 / S3-compatible | | |
| | ModelScope Hub | | |
| | Google Cloud Storage | | |
| (absolute path) | Local directory | | None |
For details on downloading models from each source (including URI formats, proxy configuration, and cache management), refer to Model Download.
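As a rough illustration of how the URI prefix determines the source, the following sketch classifies a model path by its prefix. `classify_model_source` is a hypothetical helper, and it covers only the prefixes that appear elsewhere on this page (hf://, ngc://, s3://, and absolute paths); the ModelScope and Google Cloud Storage prefixes are not assumed here, so those paths fall through to "unknown".

```shell
# Hypothetical helper (not part of NIM): classify a model path by prefix.
# Only prefixes that appear on this page are handled.
classify_model_source() {
  case "$1" in
    hf://*)  echo "huggingface" ;;
    ngc://*) echo "ngc" ;;
    s3://*)  echo "s3" ;;
    /*)      echo "local" ;;          # absolute path to a local directory
    *)       echo "unknown"; return 1 ;;
  esac
}
```

A deployment script could use the result to decide which credentials to require before starting the container.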
Configuring Deployment Options#
Model-free NIM generates profiles for combinations of tensor parallelism (TP), pipeline parallelism (PP), and LoRA. To select a deployment configuration, use either of the following methods. If both are provided, backend CLI arguments take precedence.
NIM_MODEL_PROFILE Environment Variable#
Run list-model-profiles to see available profiles, then select one:
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e NIM_MODEL_PROFILE=vllm-bf16-tp2-pp1 \
-e HF_TOKEN=<yourtoken> \
-p 8000:8000 \
${NIM_LLM_IMAGE}
On an SGLang model-free image, use the matching sglang- profile description (or its hash) instead, for example -e NIM_MODEL_PROFILE=sglang-bf16-tp2-pp1. Run list-model-profiles on your chosen image to see whether the generated profiles carry vllm- or sglang- descriptions.
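To check which backend and parallelism a profile description implies before you set NIM_MODEL_PROFILE, a script can pull the fields out of the description. The sketch below assumes descriptions follow the pattern seen in the examples on this page (a vllm- or sglang- prefix followed somewhere by tpN-ppM); that pattern is inferred from those examples and is not a documented grammar. `parse_profile` is a hypothetical name.

```shell
# Hypothetical helper (not part of NIM): extract backend, TP, and PP from a
# profile description such as vllm-bf16-tp2-pp1.
# Assumes the naming pattern seen in this page's examples.
parse_profile() {
  printf '%s\n' "$1" |
    sed -nE 's/^(vllm|sglang)-(.*-)?tp([0-9]+)-pp([0-9]+)(-.*)?$/backend=\1 tp=\3 pp=\4/p'
}

# Prints nothing when the description does not match the assumed pattern.
parse_profile "vllm-bf16-tp2-pp1"
```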
Backend CLI Arguments (passthrough)#
Use backend CLI arguments to pass additional runtime options with the model path. The NIM passthrough mechanism is the same for both backends, but the flag names are backend-specific. The following example uses vLLM flags:
docker run --gpus=all \
-e HF_TOKEN=<yourtoken> \
-p 8000:8000 \
${NIM_LLM_IMAGE} \
hf://meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2
The following arguments are supported on the vLLM backend:
| Argument | Purpose | Default |
|---|---|---|
| --tensor-parallel-size | Number of GPUs | 1 |
| | Number of nodes | 1 |
| | Enable LoRA adapter support | Disabled |
Note
Flag names differ by backend. On an SGLang image, run list-model-profiles to confirm the available sglang-* profiles and refer to the SGLang documentation for the equivalent passthrough flags.
Listing Profiles#
Use the list-model-profiles command to view the profiles generated for a given model:
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e HF_TOKEN=<yourtoken> \
${NIM_LLM_IMAGE} \
list-model-profiles
For more details on profile selection, refer to Model Profiles and Selection.
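When scripting profile selection, it can be handy to reduce the list-model-profiles output to one "hash description" pair per line. The following sketch assumes the output format shown in the Hugging Face example on this page (a dash, a hexadecimal profile hash, then the description in parentheses); other images or versions may print something different. `extract_profiles` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of NIM): read list-model-profiles output on
# stdin and print "<hash> <description>" for each profile line.
# Header lines such as "- With LoRA support:" are skipped because they do
# not contain a hash followed by a parenthesized description.
extract_profiles() {
  sed -nE 's/^[[:space:]]*-[[:space:]]+([0-9a-f]+)[[:space:]]+\(([^)]+)\).*$/\1 \2/p'
}

# Example usage (requires Docker and a GPU host, so it is left commented out):
# docker run --gpus=all -e NIM_MODEL_PATH=$MODEL -e HF_TOKEN ${NIM_LLM_IMAGE} \
#   list-model-profiles | extract_profiles
```

Piping the result through `grep -v feat_lora` would then keep only the profiles without LoRA support.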
S3-Specific Environment Variables#
| Variable | Required | Purpose |
|---|---|---|
| AWS_ACCESS_KEY_ID | Yes | AWS access key |
| AWS_SECRET_ACCESS_KEY | Yes | AWS secret key |
| AWS_REGION | Yes | AWS region (e.g., |
| | Only for S3-compatible | Custom endpoint (e.g., |
| | Only for S3-compatible | Set to |
For complete authentication and configuration details for each source, refer to Model Download.
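A pre-flight check in a launch script can catch missing S3 credentials before the container starts. The sketch below checks only the three variables that the S3 example on this page passes to the container (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION); `check_s3_env` is a hypothetical helper, and the variables needed for S3-compatible endpoints are not covered.

```shell
# Hypothetical helper (not part of NIM): fail early if a required S3
# variable is unset or empty in the current shell.
check_s3_env() {
  missing=""
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_REGION; do
    # Indirect variable lookup that also works in plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing required S3 variables:$missing" >&2
    return 1
  fi
}
```

Calling `check_s3_env || exit 1` ahead of `docker run` surfaces the problem on the host instead of during model download.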
Air-Gap Deployments#
How model-free NIM behaves in an air-gap environment depends on the type of NIM_MODEL_PATH value you use.
Local Path (/abs/path/to/model)#
NIM reads the model directory directly with no network access. No manifest regeneration occurs, and no credentials are needed. This is the simplest air-gap workflow: ensure the model directory is pre-staged on the host and mounted into the container.
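Because this workflow depends on the directory being staged correctly, a quick host-side check before starting the container can save a failed launch. The following is a minimal sketch; `validate_local_model` is a hypothetical helper, and it checks only that the path is absolute, exists as a directory, and is not empty. It does not verify that the contents are a valid model.

```shell
# Hypothetical helper (not part of NIM): sanity-check a pre-staged local
# model directory on the host before mounting it into the container.
validate_local_model() {
  case "$1" in
    /*) ;;  # local sources are given as absolute paths
    *) echo "not an absolute path: $1" >&2; return 1 ;;
  esac
  [ -d "$1" ] || { echo "not a directory: $1" >&2; return 1; }
  # An empty directory usually means the staging or mount step was skipped.
  [ -n "$(ls -A "$1")" ] || { echo "directory is empty: $1" >&2; return 1; }
}
```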
Remote URI (ngc://, hf://, s3://, and so on)#
NIM generates a runtime manifest from the URI on the first deployment and automatically saves a copy in the container's cache directory (NIM_CACHE_PATH). On subsequent restarts, including in a strict air-gap environment, NIM finds the cached manifest in the same cache volume and reuses it without any outbound network or authentication calls. No additional environment variables are required.
This enables the following workflow for air-gap redeployment:
First deploy (network-connected): Run the container with credentials and the remote URI. NIM downloads the model, generates the manifest, and saves a copy to NIM_CACHE_PATH.
Transfer: Ensure the NIM cache is on a PVC or persistent volume that survives pod restarts, or transfer the cache directory to the air-gap environment.
Redeploy (air-gapped): Mount the same cache volume. NIM finds the cached manifest and skips regeneration. No credentials or network access are required.
Tip
To force manifest regeneration after an upstream model update, delete nim_runtime_manifest.yaml from your persistent cache directory before restarting. This is the directory mounted to /opt/nim/.cache (the default value of NIM_CACHE_PATH inside the container), for example the host directory in LOCAL_NIM_CACHE.
For a complete walkthrough, refer to Air-Gap Deployment.
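Before moving a cache volume into an air-gapped environment, you may want to confirm that the cached manifest is actually present. The sketch below searches a host cache directory for nim_runtime_manifest.yaml. `find_cached_manifest` is a hypothetical helper, and because this page does not state where inside the cache the file is stored, the search is recursive rather than tied to a fixed location.

```shell
# Hypothetical helper (not part of NIM): print the path of the cached
# runtime manifest under a host cache directory, or fail if none is found.
# The file's location within the cache is an assumption, so search recursively.
find_cached_manifest() {
  cache_dir="$1"
  [ -d "$cache_dir" ] || return 1
  match="$(find "$cache_dir" -type f -name nim_runtime_manifest.yaml | head -n 1)"
  [ -n "$match" ] || return 1
  printf '%s\n' "$match"
}

# Example usage on the host:
# find_cached_manifest "$LOCAL_NIM_CACHE" || echo "no cached manifest yet"
```

The printed path is also the file to delete when you want to force manifest regeneration.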
Examples#
Use the following examples to run model-free NIM with different model sources and profile selection methods.
Hugging Face Model with TP=2#
Use this example to inspect available profiles for a Hugging Face model and then run the model with TP=2.
Set the model path:
export MODEL=hf://meta-llama/Llama-3.2-1B
List the available profiles:
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e HF_TOKEN \
-e NIM_MODEL_PATH=$MODEL \
${NIM_LLM_IMAGE} \
list-model-profiles
Example output:
- Compatible with system and runnable:
- c214460d2ad7a379660126062912d2aeecaa74a3ce14ab9966cd135de49a73f2 (vllm-tp1-pp1-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73) [requires >=13 GB/gpu]
- With LoRA support:
- 289b03eb8c26104f416dd0a1055004e31fd9e4b0f84fe2e59754a3ceb710976a (vllm-tp1-pp1-feat_lora-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73) [requires >=13 GB/gpu]
Note
The example output above is from a vLLM model-free image. On an SGLang model-free image, the profile descriptions are prefixed sglang- instead (for example, sglang-tp1-pp1-...).
Start NIM by using one of the following methods:
Select a profile with NIM_MODEL_PROFILE:
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e HF_TOKEN \
-e NIM_MODEL_PROFILE=vllm-tp1-pp1-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73 \
-e NIM_MODEL_PATH=$MODEL \
-p 8000:8000 \
${NIM_LLM_IMAGE}
Override the profile through a vLLM CLI argument:
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e HF_TOKEN \
-p 8000:8000 \
${NIM_LLM_IMAGE} \
$MODEL --tensor-parallel-size 2
Model Hosted in S3 Using Default Profile from Automatic Profile Selection#
Use this example to serve a model from S3 and let NIM select the default compatible profile automatically.
Set the model path:
export MODEL=s3://my-bucket/my-org/my-fine-tuned-model
Start NIM with the required S3 credentials:
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-e NIM_MODEL_PATH=$MODEL \
-e AWS_ACCESS_KEY_ID=<key> \
-e AWS_SECRET_ACCESS_KEY=<secret> \
-e AWS_REGION=us-east-1 \
-p 8000:8000 \
${NIM_LLM_IMAGE}
Local Model with TP=8 Profile Specified#
Use this example to serve a local model and select a TP=8 profile explicitly.
Set the model path:
export MODEL=/mnt/models/my-120b-model
Start NIM with the selected profile:
docker run --gpus=all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-v /mnt/models:/mnt/models \
-e NIM_MODEL_PROFILE=<tp8_profile> \
-e NIM_MODEL_PATH=$MODEL \
-p 8000:8000 \
${NIM_LLM_IMAGE}