Model-Free NIM#
Run any supported model without a model-specific container image.
Overview#
By default, NIM containers ship with a baked-in model manifest that defines which model to serve. Model-free mode lets you point a generic NIM container at any model — a Hugging Face repo, an NGC model, an S3 bucket, or a local directory — and NIM will generate a manifest at startup and serve that model.
Model-free NIM is useful for:
- **Flexible single-container deployments:** one container image passes security review and can serve any supported model.
- **Day-zero model support:** serve newly released models without waiting for a model-specific NIM container.
- **Custom and fine-tuned models:** serve your own models from any supported source.
Note
Regardless of where the model is hosted, NIM can only serve model architectures that vLLM supports; an architecture that is unsupported by vLLM is also unsupported by NIM.
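One way to check ahead of time is to inspect the `architectures` field of the model's `config.json` and compare it against the architectures your vLLM version supports. The snippet below is an illustrative sketch only: the supported-architecture set shown is a small sample, not the full vLLM registry, so consult the vLLM documentation for the authoritative list.

```python
import json

# Small sample of vLLM-supported architectures -- illustrative only,
# not the full registry; check your vLLM version's documentation.
SAMPLE_SUPPORTED = {"LlamaForCausalLM", "MistralForCausalLM", "Qwen2ForCausalLM"}

def architectures_supported(config_json: str, supported: set) -> bool:
    """Return True if every architecture declared in config.json is supported."""
    config = json.loads(config_json)
    archs = config.get("architectures", [])
    return bool(archs) and all(a in supported for a in archs)

# A Llama checkpoint's config.json declares LlamaForCausalLM
llama_config = json.dumps({"architectures": ["LlamaForCausalLM"]})
print(architectures_supported(llama_config, SAMPLE_SUPPORTED))  # True
```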
Configuring the Model#
Supply the model via either of these methods. If both are provided, the CLI positional argument takes precedence.
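The precedence rule can be sketched as a small resolver. Here `cli_model` and `env_model` stand in for the positional argument and `NIM_MODEL_PATH`; this is an illustration of the rule, not NIM's actual startup code:

```python
from typing import Optional

def resolve_model(cli_model: Optional[str], env_model: Optional[str]) -> str:
    """The CLI positional argument wins over NIM_MODEL_PATH when both are set."""
    if cli_model:
        return cli_model
    if env_model:
        return env_model
    raise ValueError("No model specified: set NIM_MODEL_PATH or pass a positional argument")

print(resolve_model("hf://meta-llama/Llama-3.1-8B-Instruct", "s3://my-bucket/other-model"))
# -> hf://meta-llama/Llama-3.1-8B-Instruct (the CLI argument wins)
```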
NIM_MODEL_PATH Environment Variable#
```bash
docker run --gpus=all \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -e HF_TOKEN=<yourtoken> \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE}
```
vLLM CLI Positional Argument#
```bash
docker run --gpus=all \
  -e HF_TOKEN=<yourtoken> \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE} \
  hf://meta-llama/Llama-3.1-8B-Instruct
```
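Once the container reports it is ready, you can exercise the OpenAI-compatible API it serves on port 8000. The snippet below only constructs the request (sending it requires a running container, and the `model` field must match the model being served):

```python
import json
from urllib import request

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the container is up:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```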
Supported Model Sources#
| Prefix | Source | Authentication |
|---|---|---|
| `hf://` | Hugging Face Hub | `HF_TOKEN` |
| `ngc://` | NVIDIA NGC | `NGC_API_KEY` |
| `s3://` | AWS S3 / S3-compatible | AWS credential variables (see below) |
| `ms://` | ModelScope Hub | ModelScope token (refer to Model Download) |
| `gs://` | Google Cloud Storage | Google Cloud credentials (refer to Model Download) |
| (absolute path) | Local directory | None |
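The URI prefix is what routes the download to the right source. A classifier over the prefix might look like the following sketch; it covers a subset of the prefixes listed above and is illustrative, not NIM's internal code:

```python
def classify_model_source(uri: str) -> str:
    """Map a model URI to its source based on the prefix."""
    prefixes = {
        "hf://": "Hugging Face Hub",
        "ngc://": "NVIDIA NGC",
        "s3://": "AWS S3 / S3-compatible",
        "gs://": "Google Cloud Storage",
    }
    for prefix, source in prefixes.items():
        if uri.startswith(prefix):
            return source
    if uri.startswith("/"):  # absolute paths are treated as local directories
        return "Local directory"
    raise ValueError(f"Unrecognized model source: {uri}")

print(classify_model_source("hf://meta-llama/Llama-3.1-8B-Instruct"))  # Hugging Face Hub
print(classify_model_source("/mnt/models/my-120b-model"))              # Local directory
```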
For details on downloading models from each source (including URI formats, proxy configuration, and cache management), refer to Model Download.
Configuring Deployment Options#
Model-free NIM generates profiles for combinations of tensor parallelism (TP), pipeline parallelism (PP), and LoRA. To select a deployment configuration, use either of the following methods. If both are provided, vLLM CLI arguments take precedence.
NIM_MODEL_PROFILE Environment Variable#
Run `list-model-profiles` to see the available profiles, then select one:
```bash
docker run --gpus=all \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -e NIM_MODEL_PROFILE=vllm-bf16-tp2-pp1 \
  -e HF_TOKEN=<yourtoken> \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE}
```
vLLM CLI Arguments#
```bash
docker run --gpus=all \
  -e HF_TOKEN=<yourtoken> \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE} \
  hf://meta-llama/Llama-3.1-8B-Instruct \
  -tp 2
```
The following vLLM CLI arguments are supported:
| Argument | Purpose | Default |
|---|---|---|
| `-tp` / `--tensor-parallel-size` | Number of GPUs | 1 |
| `-pp` / `--pipeline-parallel-size` | Number of nodes | 1 |
| `--enable-lora` | Enable LoRA adapter support | Disabled |
Listing Profiles#
Use the `list-model-profiles` command to view the profiles generated for a given model:
```bash
docker run --gpus=all \
  -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.1-8B-Instruct \
  -e HF_TOKEN=<yourtoken> \
  ${NIM_LLM_IMAGE} \
  list-model-profiles
```
For more details on profile selection, refer to Model Profiles and Selection.
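Profile names encode the deployment configuration: backend, optional precision, TP/PP degrees, and optional features such as LoRA. A sketch of parsing such names, based on the profile strings that appear in this document (the naming scheme is inferred from those examples, not an official specification):

```python
import re

def parse_profile(name: str) -> dict:
    """Extract tp/pp degrees and the LoRA flag from a profile name
    such as 'vllm-bf16-tp2-pp1' or 'vllm-tp1-pp1-feat_lora-<hash>'."""
    tp = re.search(r"-tp(\d+)", name)
    pp = re.search(r"-pp(\d+)", name)
    return {
        "tp": int(tp.group(1)) if tp else 1,
        "pp": int(pp.group(1)) if pp else 1,
        "lora": "feat_lora" in name,
    }

print(parse_profile("vllm-bf16-tp2-pp1"))
# {'tp': 2, 'pp': 1, 'lora': False}
print(parse_profile("vllm-tp1-pp1-feat_lora-0bdd169f"))
# {'tp': 1, 'pp': 1, 'lora': True}
```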
S3-Specific Environment Variables#
| Variable | Required | Purpose |
|---|---|---|
| `AWS_ACCESS_KEY_ID` | Yes | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | Yes | AWS secret key |
| `AWS_REGION` | Yes | AWS region (e.g., `us-east-1`) |
| `AWS_ENDPOINT_URL` | Only for S3-compatible stores | Custom endpoint URL |
Additional variables for S3-compatible endpoints, along with complete authentication and configuration details for each source, are covered in Model Download.
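Before launching, it can be worth checking that the required variables are present. A minimal pre-flight sketch (the required set follows the table above; this is an illustration, not part of NIM):

```python
REQUIRED_S3_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")

def missing_s3_vars(env: dict) -> list:
    """Return the required S3 variables that are absent or empty in `env`."""
    return [v for v in REQUIRED_S3_VARS if not env.get(v)]

# Forgetting the secret key is caught before the container starts
env = {"AWS_ACCESS_KEY_ID": "AKIA...", "AWS_REGION": "us-east-1"}
print(missing_s3_vars(env))  # ['AWS_SECRET_ACCESS_KEY']
```

In practice you would pass `dict(os.environ)` rather than a hand-built dict.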
Examples#
Hugging Face Model with TP=2#
```bash
export MODEL=hf://meta-llama/Llama-3.2-1B

# List available profiles
docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e HF_TOKEN \
  -e NIM_MODEL_PATH=$MODEL \
  ${NIM_LLM_IMAGE} \
  list-model-profiles

# Example output:
# - Compatible with system and runnable:
#   - c214460d2ad7a379660126062912d2aeecaa74a3ce14ab9966cd135de49a73f2 (vllm-tp1-pp1-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73) [requires >=13 GB/gpu]
# - With LoRA support:
#   - 289b03eb8c26104f416dd0a1055004e31fd9e4b0f84fe2e59754a3ceb710976a (vllm-tp1-pp1-feat_lora-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73) [requires >=13 GB/gpu]

# Run with profile selected via NIM_MODEL_PROFILE
docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e HF_TOKEN \
  -e NIM_MODEL_PROFILE=vllm-tp1-pp1-0bdd169fb413e457cef3feda64108b085f73d16b6252a860e5e9ee85f533de73 \
  -e NIM_MODEL_PATH=$MODEL \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE}

# OR run with profile selected via vLLM CLI argument override
docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e HF_TOKEN \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE} \
  $MODEL -tp 2
```
Model Hosted in S3 Using Automatic Profile Selection#
```bash
export MODEL=s3://my-bucket/my-org/my-fine-tuned-model

docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e NIM_MODEL_PATH=$MODEL \
  -e AWS_ACCESS_KEY_ID=<key> \
  -e AWS_SECRET_ACCESS_KEY=<secret> \
  -e AWS_REGION=us-east-1 \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE}
```
Local Model with a TP=8 Profile Specified#
```bash
export MODEL=/mnt/models/my-120b-model

docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -v /mnt/models:/mnt/models \
  -e NIM_MODEL_PROFILE=<tp8_profile> \
  -e NIM_MODEL_PATH=$MODEL \
  -p 8000:8000 \
  ${NIM_LLM_IMAGE}
```
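As a back-of-the-envelope check for the example above: a 120B-parameter model in bf16 needs roughly 2 bytes per parameter for the weights alone, so TP=8 places about 30 GB of weights on each GPU, before accounting for KV cache and activation overhead. A quick sketch of that arithmetic:

```python
def weight_gb_per_gpu(params_billions: float, bytes_per_param: int, tp: int) -> float:
    """Approximate weight memory per GPU; ignores KV cache and activations."""
    total_gb = params_billions * 1e9 * bytes_per_param / 1e9  # total GB of weights
    return total_gb / tp

# 120B parameters, bf16 (2 bytes/param), tensor parallelism of 8
print(weight_gb_per_gpu(120, 2, 8))  # 30.0
```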