Model-Free NIM#
Run any supported model without a model-specific container image.
By default, NIM containers ship with a built-in model manifest that defines which model to serve. In model-free mode, you point a generic NIM container at any supported model, such as a Hugging Face repo, an S3 bucket, or a local directory. NIM then generates a manifest at startup and serves that model.
Model-free NIM is useful in the following scenarios:
Flexible single-container deployments: One container image can pass security review and serve any supported model.
Day-zero model support: You can serve newly released models without waiting for a model-specific NIM container.
Custom and fine-tuned models: You can serve your own models from any supported source.
Note
Regardless of where a model is hosted, NIM can serve only model architectures that vLLM supports; an architecture unsupported by vLLM is also unsupported by NIM.
Configuring the Model#
Supply the model via either of these methods. If both are provided, the CLI positional argument takes precedence.
NIM_MODEL_PATH Environment Variable#
Use NIM_MODEL_PATH to point the container to the model at runtime:
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e HF_TOKEN=<yourtoken> \
-p 8000:8000 \
${NIM_LLM_IMAGE}
vLLM CLI Positional Argument#
Use the vLLM positional argument to pass the model path at runtime:
docker run --gpus=all \
-e HF_TOKEN=<yourtoken> \
-p 8000:8000 \
${NIM_LLM_IMAGE} \
hf://meta-llama/Llama-3.1-8B-Instruct
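Either launch method starts the server on the mapped port, but the model takes time to download and load. The following sketch polls NIM's /v1/health/ready endpoint before sending traffic; the URL assumes the -p 8000:8000 mapping shown above.

```shell
# Poll NIM's readiness endpoint until the model is loaded and the
# server can accept requests. Default URL assumes -p 8000:8000.
wait_for_nim() {
  url=${1:-http://localhost:8000/v1/health/ready}
  until curl -sf "$url" > /dev/null; do
    echo "waiting for NIM..."
    sleep 5
  done
  echo "NIM is ready"
}
```

Call `wait_for_nim` after `docker run`, or pass a different URL if you remap the port.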
Supported Model Sources#
| Prefix | Source | Example | Authentication |
|---|---|---|---|
| `hf://` | Hugging Face Hub | `hf://meta-llama/Llama-3.1-8B-Instruct` | Hugging Face token (`HF_TOKEN`) |
| `ngc://` | NVIDIA NGC | `ngc://my-org/my-model` | NGC API key |
| `s3://` | AWS S3 / S3-compatible | `s3://my-bucket/path/to/model` | AWS credentials (see below) |
| `ms://` | ModelScope Hub | `ms://my-org/my-model` | ModelScope token |
| `gs://` | Google Cloud Storage | `gs://my-bucket/path/to/model` | GCP credentials |
| `/abs/path` (absolute path) | Local directory | `/opt/models/my-model` | None |
For details on downloading models from each source (including URI formats, proxy configuration, and cache management), refer to Model Download.
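All of the sources above supply the same NIM_MODEL_PATH variable; only the URI scheme changes. The values below are illustrative: the Hugging Face repo is the one used elsewhere on this page, and the remaining paths are placeholders.

```shell
# Each source uses the same variable; only the URI scheme changes.
export NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct   # Hugging Face Hub
# export NIM_MODEL_PATH=ngc://my-org/my-model                 # NVIDIA NGC (placeholder)
# export NIM_MODEL_PATH=s3://my-bucket/path/to/model          # S3 or S3-compatible (placeholder)
# export NIM_MODEL_PATH=/opt/models/my-model                  # local directory (placeholder)
```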
Configuring Deployment Options#
Model-free NIM generates profiles for combinations of tensor parallelism (TP), pipeline parallelism (PP), and LoRA. To select a deployment configuration, use either of the following methods. If both are provided, vLLM CLI arguments take precedence.
NIM_MODEL_PROFILE Environment Variable#
Run list-model-profiles to see available profiles, then select one:
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e NIM_MODEL_PROFILE=vllm-bf16-tp2-pp1 \
-e HF_TOKEN=<yourtoken> \
-p 8000:8000 \
${NIM_LLM_IMAGE}
vLLM CLI Arguments#
Use vLLM CLI arguments to pass additional runtime options with the model path:
docker run --gpus=all \
-e HF_TOKEN=<yourtoken> \
-p 8000:8000 \
${NIM_LLM_IMAGE} \
hf://meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2
The following vLLM CLI arguments are supported:
| Argument | Purpose | Default |
|---|---|---|
| `--tensor-parallel-size` | Number of GPUs | 1 |
| `--pipeline-parallel-size` | Number of nodes | 1 |
| `--enable-lora` | Enable LoRA adapter support | Disabled |
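These arguments append after the positional model path. As a sketch, the snippet below only assembles such a command as a string for review; the image tag is a placeholder, and the flag names follow vLLM's CLI.

```shell
# Assemble a launch command for review; nothing is executed here.
NIM_LLM_IMAGE=${NIM_LLM_IMAGE:-nim-llm:latest}   # placeholder image tag
MODEL=hf://meta-llama/Llama-3.1-8B-Instruct
cmd="docker run --gpus=all -e HF_TOKEN -p 8000:8000 $NIM_LLM_IMAGE $MODEL --tensor-parallel-size 2 --enable-lora"
echo "$cmd"
```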
Listing Profiles#
Use the list-model-profiles command to view the profiles generated for a given model:
docker run --gpus=all \
-e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
-e HF_TOKEN=<yourtoken> \
${NIM_LLM_IMAGE} \
list-model-profiles
For more details on profile selection, refer to Model Profiles and Selection.
S3-Specific Environment Variables#
| Variable | Required | Purpose |
|---|---|---|
| `AWS_ACCESS_KEY_ID` | Yes | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | Yes | AWS secret key |
| `AWS_REGION` | Yes | AWS region (e.g., `us-east-1`) |
| `AWS_ENDPOINT_URL` | Only for S3-compatible | Custom endpoint (e.g., a MinIO URL) |
For complete authentication and configuration details for each source, refer to Model Download.
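For an S3-compatible store such as MinIO, the same credential variables apply plus an endpoint override. All values below are placeholders, and AWS_ENDPOINT_URL is the AWS SDK's standard custom-endpoint variable; confirm the exact variable set for your store in Model Download.

```shell
# Placeholder credentials for an S3-compatible store (e.g., MinIO).
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_REGION=us-east-1
export AWS_ENDPOINT_URL=http://minio.internal:9000   # assumption: your store's endpoint
export NIM_MODEL_PATH=s3://models/my-fine-tuned-model
```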
Air-Gap Deployments#
How model-free NIM behaves in an air-gap environment depends on the type of NIM_MODEL_PATH value
you use.
Local path (/abs/path/to/model)#
NIM reads the model directory directly with no network access. No manifest regeneration occurs, and no credentials are needed. This is the simplest air-gap workflow — ensure the model directory is pre-staged and mounted into the container.
Remote URI (ngc://, hf://, s3://, and so on)#
NIM generates a runtime manifest from the URI on the first deployment and automatically saves
a copy inside NIM_CACHE_PATH. On subsequent restarts — including in a strict air-gap environment
— NIM finds the cached manifest in the same cache volume and reuses it without any outbound
network or authentication calls. No additional environment variables are required.
This enables the following workflow for air-gap redeployment:
First deploy (network-connected): Run the container with credentials and the remote URI. NIM downloads the model, generates the manifest, and saves a copy to NIM_CACHE_PATH.
Transfer: Ensure the NIM cache is on a PVC or persistent volume that survives pod restarts, or transfer the cache directory to the air-gap environment.
Redeploy (air-gapped): Mount the same cache volume. NIM finds the cached manifest and skips regeneration. No credentials or network access are required.
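The transfer step can be as simple as archiving the cache directory and extracting it on the air-gapped host. This sketch uses a scratch directory as a stand-in for the real cache; substitute your actual NIM_CACHE_PATH and destination.

```shell
# Stand-in cache directory; in a real deployment this is NIM_CACHE_PATH,
# already populated by the first (network-connected) deploy.
NIM_CACHE_PATH=./local_cache
mkdir -p "$NIM_CACHE_PATH"
touch "$NIM_CACHE_PATH/nim_runtime_manifest.yaml"   # simulated cached manifest

# Archive the cache for transfer to the air-gapped host.
tar -czf nim_cache.tar.gz -C "$(dirname "$NIM_CACHE_PATH")" "$(basename "$NIM_CACHE_PATH")"

# On the air-gapped host: extract, then mount this directory into the container.
mkdir -p ./airgap
tar -xzf nim_cache.tar.gz -C ./airgap
```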
Tip
To force manifest regeneration after an upstream model update, delete nim_runtime_manifest.yaml
from your persistent cache directory (NIM_CACHE_PATH) before restarting.
For a complete walkthrough, refer to Air-Gap Deployment.
Examples#
Hugging Face Model with TP=2#
Use this example to inspect available profiles for a Hugging Face model and then run the model with TP=2.
Set the model path:
export MODEL=hf://meta-llama/Llama-3.2-1B
List the available profiles:
docker run --gpus=all \
-v $(pwd)/local_cache:/opt/nim/.cache \
-e HF_TOKEN \
-e NIM_MODEL_PATH=$MODEL \
${NIM_LLM_IMAGE} \
list-model-profiles
Example output:
- Compatible with system and runnable:
- c214460d2ad7a379660126062912d2aeecaa74a3ce14ab9966cd135de49a73f2 (vllm-tp1-pp1-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73) [requires >=13 GB/gpu]
- With LoRA support:
- 289b03eb8c26104f416dd0a1055004e31fd9e4b0f84fe2e59754a3ceb710976a (vllm-tp1-pp1-feat_lora-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73) [requires >=13 GB/gpu]
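Each compatible profile is listed as a 64-character hexadecimal profile ID followed by a readable name in parentheses. A quick filter pulls the leading IDs out of saved output; the sample text here repeats the example output above.

```shell
# Extract profile IDs from saved list-model-profiles output.
output='- Compatible with system and runnable:
  - c214460d2ad7a379660126062912d2aeecaa74a3ce14ab9966cd135de49a73f2 (vllm-tp1-pp1-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73) [requires >=13 GB/gpu]
- With LoRA support:
  - 289b03eb8c26104f416dd0a1055004e31fd9e4b0f84fe2e59754a3ceb710976a (vllm-tp1-pp1-feat_lora-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73) [requires >=13 GB/gpu]'

# A bullet line whose first field is a 64-char hex string carries a profile ID.
profile_ids=$(printf '%s\n' "$output" | awk '$1 == "-" && $2 ~ /^[0-9a-f]+$/ && length($2) == 64 {print $2}')
printf '%s\n' "$profile_ids"
```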
Start NIM by using one of the following methods:
Select a profile with NIM_MODEL_PROFILE:
docker run --gpus=all \
-v $(pwd)/local_cache:/opt/nim/.cache \
-e HF_TOKEN \
-e NIM_MODEL_PROFILE=vllm-tp1-pp1-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73 \
-e NIM_MODEL_PATH=$MODEL \
-p 8000:8000 \
${NIM_LLM_IMAGE}
Override the profile through a vLLM CLI argument:
docker run --gpus=all \
-v $(pwd)/local_cache:/opt/nim/.cache \
-e HF_TOKEN \
-p 8000:8000 \
${NIM_LLM_IMAGE} \
$MODEL --tensor-parallel-size 2
Model Hosted in S3 Using Default Profile from Automatic Profile Selection#
Use this example to serve a model from S3 and let NIM select the default compatible profile automatically.
Set the model path:
export MODEL=s3://my-bucket/my-org/my-fine-tuned-model
Start NIM with the required S3 credentials:
docker run --gpus=all \
-v $(pwd)/local_cache:/opt/nim/.cache \
-e NIM_MODEL_PATH=$MODEL \
-e AWS_ACCESS_KEY_ID=<key> \
-e AWS_SECRET_ACCESS_KEY=<secret> \
-e AWS_REGION=us-east-1 \
-p 8000:8000 \
${NIM_LLM_IMAGE}
Local Model with TP=8 Profile Specified#
Use this example to serve a local model and select a TP=8 profile explicitly.
Set the model path:
export MODEL=/mnt/models/my-120b-model
Start NIM with the selected profile:
docker run --gpus=all \
-v $(pwd)/local_cache:/opt/nim/.cache \
-v /mnt/models:/mnt/models \
-e NIM_MODEL_PROFILE=<tp8_profile> \
-e NIM_MODEL_PATH=$MODEL \
-p 8000:8000 \
${NIM_LLM_IMAGE}