Supported Architectures for LLM-agnostic NIM#
Use this documentation to learn the details of supported architectures for the LLM-agnostic NIM container.
Note
If you are looking for the list of supported models for the LLM-specific NIM container, refer to Supported Models for NVIDIA NIM for LLMs instead.
Text-only Language Models#
The following table lists which text-only model architectures and inference engines have been verified to work with the LLM-agnostic NIM container. Each cell indicates whether the feature is supported (Yes) or not supported (No) for that model architecture and inference engine.
| Model Architecture (with verified HF model(s)) | vLLM: Base Model | vLLM: LoRA | vLLM: Function Calling | vLLM: Guided Decoding | TRTLLM: Base Model | TRTLLM: LoRA | TRTLLM: Function Calling | TRTLLM: Guided Decoding | SGLang: Base Model | SGLang: LoRA | SGLang: Function Calling | SGLang: Guided Decoding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | No | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
Warning
NVIDIA cannot guarantee the security of models hosted on non-NVIDIA systems such as HuggingFace. Malicious or insecure models can result in serious security risks, up to and including full remote code execution. We strongly recommend that you manually verify the safety of any model not provided by NVIDIA before attempting to load it, through mechanisms such as a) ensuring that the model weights are serialized in the Safetensors format, b) conducting a manual review of the model and inference code to ensure that it is free of obfuscated or malicious code, and c) validating the model's signature, if available, to ensure that it comes from a trusted source and has not been modified.
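As a concrete illustration of point a) above, the following Python sketch flags weight files in a downloaded checkpoint that are not serialized as safetensors. The directory path is hypothetical, and passing this check is not a substitute for a full review.

```python
from pathlib import Path

# Hypothetical local checkout of a third-party model; adjust as needed.
model_dir = Path("/models/candidate-model")

# Pickle-based serialization formats can run arbitrary code on load,
# unlike safetensors, which stores raw tensors only.
SUSPECT_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".ckpt"}

for path in model_dir.rglob("*"):
    if path.suffix in SUSPECT_SUFFIXES:
        print(f"WARNING: non-safetensors weight file found: {path}")
```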
Model Format#
NVIDIA NIM for LLMs supports the following model formats:
HuggingFace safetensors
GGUF
Unified HuggingFace safetensors
TRTLLM checkpoint
TRTLLM engine
HuggingFace safetensors#
The HuggingFace safetensors checkpoint should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors # [Required] Model weights stored as safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── ...
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
└── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
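As an illustration (this is not NIM's internal validation logic, and the checkpoint path is hypothetical), a pre-flight check for the [Required] files above could look like this:

```python
from pathlib import Path

# Hypothetical path to a local HuggingFace safetensors checkpoint.
ckpt = Path("/models/my-hf-checkpoint")

# config.json and at least one *.safetensors shard are marked [Required] above.
assert (ckpt / "config.json").is_file(), "missing config.json"
assert any(ckpt.glob("*.safetensors")), "missing *.safetensors weight shards"
print("checkpoint layout looks complete")
```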
Unified HuggingFace safetensors#
The Unified HuggingFace checkpoint should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors # [Required] Model weights stored as safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── ...
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
├── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
└── hf_quant_config.json # [Required] HuggingFace quantization configuration
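The hf_quant_config.json file is what distinguishes this layout: it records the quantization recipe the checkpoint was exported with. The sketch below simply reads it; the path is hypothetical, and the exact schema depends on the tool that exported the checkpoint (checkpoints exported by TensorRT Model Optimizer, for example, typically carry a quantization block with a quant_algo field).

```python
import json
from pathlib import Path

# Hypothetical path to a local unified HF checkpoint.
ckpt = Path("/models/my-unified-checkpoint")

# hf_quant_config.json is the [Required] file that makes this a unified
# checkpoint; inspect it to see which quantization was applied.
quant_cfg = json.loads((ckpt / "hf_quant_config.json").read_text())
print(quant_cfg.get("quantization", {}).get("quant_algo"))
```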
GGUF#
The GGUF checkpoint should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf # [Required] Model weights stored in GGUF format
├── ...
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
└── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
TRTLLM checkpoints#
The TRTLLM checkpoints should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
├── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json # [Required] TRTLLM pretrained configuration
    ├── rank0.safetensors # [Required] TRTLLM checkpoint safetensors
    └── ...
Note
The TRTLLM checkpoint root directory should contain the HF tokenizer and configuration files. The subfolder (named trtllm_ckpt in this example) should contain the TRTLLM checkpoint configuration file and the weight tensors.
TRTLLM engines#
The TRTLLM engine should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
├── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json # [Required] TRTLLM engine configuration
    ├── rank0.engine # [Required] TRTLLM serialized engine
    └── ...
Note
The required files in each of the folder structures above are strictly validated by NIM. If one or more required files are missing, NIM throws an exception and prompts the user to provide the missing files. Tokenizer files are used for inference; if they are not present, each backend falls back to a non-optimal default tokenizer. Refer to Troubleshooting for more details.
Ensure that the tensor parallel and pipeline parallel sizes configured at start-up through NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE match the ranks of the TRTLLM checkpoint or engine. Otherwise, NIM throws an exception prompting you to fix the inference configuration.
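As an illustrative sanity check (not part of NIM; the checkpoint path is hypothetical), you can compare the product of the two parallel sizes against the number of per-rank files before launching:

```python
import os
from pathlib import Path

# Parallel sizes as they would be passed to the NIM container.
tp = int(os.environ.get("NIM_TENSOR_PARALLEL_SIZE", "1"))
pp = int(os.environ.get("NIM_PIPELINE_PARALLEL_SIZE", "1"))

# Count the rank*.safetensors files in the layout shown above
# (count rank*.engine instead for a serialized TRTLLM engine).
ranks = len(list(Path("/models/my-model/trtllm_ckpt").glob("rank*.safetensors")))

assert tp * pp == ranks, f"TP*PP={tp * pp} does not match {ranks} checkpoint ranks"
```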
Model Precisions#
All full-precision models can be deployed in NVIDIA NIM for LLMs, provided that the architecture and model format are supported. For quantized models, refer to the following section.
Quantization Formats#
NVIDIA NIM for LLMs supports the following quantization formats:
| Quantization algorithm | vLLM | TRTLLM | SGLang |
|---|---|---|---|
| | Yes | Yes | Yes |
| | Yes | Yes | Yes |
| | No | Yes | No |