Supported Architectures for LLM-agnostic NIM#

Use this documentation to learn the details of supported architectures for the LLM-agnostic NIM container.

Note

If you are looking for the list of supported models for the LLM-specific NIM container, refer to Supported Models for NVIDIA NIM for LLMs instead.

Text-only Language Models#

The following table lists which text-only model architectures and inference engines have been verified to work with the LLM-agnostic NIM container. Each cell indicates whether that specific feature is supported (Yes) or not supported (No) for the given model architecture and inference engine.

| Model Architecture | Verified HF Model(s) | vLLM Base Model | vLLM LoRA | vLLM Function Calling | vLLM Guided Decoding | TRTLLM Base Model | TRTLLM LoRA | TRTLLM Function Calling | TRTLLM Guided Decoding | SGLang Base Model | SGLang LoRA | SGLang Function Calling | SGLang Guided Decoding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BartForConditionalGeneration | facebook/bart-large-cnn | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| BloomForCausalLM | bigscience/bloom-560m | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| ChatGLMModel | THUDM/chatglm3-6b | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| DeciLMForCausalLM | nvidia/Llama-3_3-Nemotron-Super-49B-v1 | No | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| DeepseekV2ForCausalLM | deepseek-ai/DeepSeek-V2-Lite-Chat | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| DeepseekV3ForCausalLM | RedHatAI-DeepSeek-CoderV2 | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| FalconForCausalLM | tiiuae/falcon-7b | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| FalconMambaForCausalLM | tiiuae/falcon-mamba-7b-instruct | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| GemmaForCausalLM | google/gemma-1.1-2b-it | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| Gemma2ForCausalLM | google/gemma-2-9b | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| GlmForCausalLM | THUDM/glm-4-9b-chat-hf | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| GPTBigCodeForCausalLM | bigcode/starcoder | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| GPT2LMHeadModel | openai-community/gpt2, distilbert/distilgpt2 | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| GPTNeoXForCausalLM | EleutherAI/pythia-70m | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| GraniteForCausalLM | ibm-granite/granite-3.3-2b-instruct, ibm-granite/granite-3.3-8b-instruct | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| GraniteMoeForCausalLM | ibm/PowerMoE-3b | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| GritLM | GritLM-7B | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| InternLM2ForCausalLM | internlm/internlm2-chat-7b | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| InternLM3ForCausalLM | internlm/internlm3-8b-instruct | Yes | Yes | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| JambaForCausalLM | ai21labs/AI21-Jamba-Mini-1.5, ai21labs/AI21-Jamba-Mini-1.6 | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| LlamaForCausalLM | meta-llama/Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3-8B, openGPT-X/Teuken-7B-instruct-commercial-v0.4, RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8, utter-project/EuroLLM-9B-Instruct, meta-llama/Llama-2-7b-chat-hf, modularai/Llama-3.1-8B-Instruct-GGUF, petals-team/StableBeluga2, meta-llama/Llama-3.2-1B-Instruct, TinyLlama/TinyLlama-1.1B-Chat-v1.0, unsloth/phi-4 (base), unsloth/phi-4 (Multi-LoRA) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| MambaForCausalLM | state-spaces/mamba-370m-hf | Yes | No | Yes | Yes | Yes | No | No | No | No | No | No | No |
| MistralForCausalLM | mistralai/Codestral-22B-v0.1, mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Small-24B-Instruct-2501, speakleash/Bielik-11B-v2.3-Instruct, Fastweb/FastwebMIIA-7B, sarvamai/sarvam-m, PocketDoc/Dans-PersonalityEngine-V1.3.0-24b, Delta-Vector/Sol-Reaver-15B-Instruct, mistralai/Mistral-7B-Instruct-v0.2 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| MolmoForCausalLM | allenai-molmo | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| Olmo2ForCausalLM | allenai/OLMo-2-0425-1B | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| OlmoeForCausalLM | allenai-OLMoe-1B | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| PhiMoEForCausalLM | microsoft/Phi-3.5-MoE-instruct | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| Phi3ForCausalLM | microsoft/Phi-3-mini-4k-instruct | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| Phi3SmallForCausalLM | microsoft/Phi-3-small-8k-instruct | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| QWenLMHeadModel | Qwen-1_8B-Chat | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| Qwen2ForCausalLM | Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B-Instruct-AWQ, Qwen/Qwen2.5-7B-Instruct-1M | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Qwen2MoeForCausalLM | Qwen/Qwen1.5-MoE-A2.7B-Chat | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| RWForCausalLM | tiiuae/falcon-40b | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| SolarForCausalLM | upstage/solar-pro-preview-instruct | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| StableLMEpochForCausalLM | TroyDoesAI-Mermaid | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| StableLmForCausalLM | AI4free-jarvis-3b | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| StarCoder2ForCausalLM | bigcode/starcoder2-15b | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
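
Once a NIM instance serving one of these architectures is running, you can exercise features such as guided decoding through the OpenAI-compatible chat completions endpoint. The sketch below is illustrative, not normative: the host, port, and served model name are deployment-specific placeholders, and the `nvext.guided_json` extension field is assumed to be available for structured generation in your NIM version.

```python
import requests

# Deployment-specific assumptions: a NIM instance listening on localhost:8000
# that serves a model registered as "meta-llama/Llama-3.1-8B-Instruct".
BASE_URL = "http://localhost:8000/v1"

schema = {
    "type": "object",
    "properties": {"sentiment": {"type": "string", "enum": ["positive", "negative"]}},
    "required": ["sentiment"],
}

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Classify: 'This NIM was easy to deploy.'"}],
        # "nvext" carries NIM-specific extensions; guided_json constrains the
        # response to the JSON schema above (guided decoding).
        "nvext": {"guided_json": schema},
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Function calling is typically exercised the same way, using the standard OpenAI `tools` parameter instead of the `nvext` extension.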

Warning

NVIDIA cannot guarantee the security of models hosted on non-NVIDIA systems such as HuggingFace. Malicious or insecure models can result in serious security risks, up to and including full remote code execution. We strongly recommend that you manually verify the safety of any model not provided by NVIDIA before attempting to load it, for example by (a) ensuring that the model weights are serialized in the Safetensors format, (b) manually reviewing any model or inference code to confirm that it is free of obfuscated or malicious code, and (c) validating the model's signature, if available, to ensure that it comes from a trusted source and has not been modified.
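
A quick, illustrative first pass on point (a) is to scan a downloaded checkpoint for pickle-based weight files, which can execute arbitrary code when loaded. This is no substitute for the manual review described above, and the directory path below is a placeholder.

```python
from pathlib import Path

# Illustrative, not exhaustive: flag pickle-based weight files that can run
# arbitrary code on load, in contrast to the Safetensors format.
RISKY_SUFFIXES = {".bin", ".pt", ".pth", ".ckpt", ".pkl"}

def flag_risky_files(checkpoint_dir: str) -> list[Path]:
    return [
        p for p in Path(checkpoint_dir).rglob("*")
        if p.is_file() and p.suffix in RISKY_SUFFIXES
    ]

for path in flag_risky_files("/models/my-checkpoint"):  # placeholder path
    print(f"Review before loading: {path}")
```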

Model Format#

NVIDIA NIM for LLMs supports the following model formats:

  • HF safetensors

  • GGUF

  • Unified HF safetensors

  • TRTLLM checkpoint

  • TRTLLM engine

HuggingFace safetensors#

The HuggingFace safetensors checkpoint should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors      # [Required] Model weights stored as safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── ...
├── model.safetensors.index.json          # [Optional] Weights mapping
├── special_tokens_map.json               # [Optional] Special tokens mapping
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
└── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
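
One convenient way to obtain a checkpoint in this layout is the huggingface_hub library's snapshot_download helper. The repo ID and destination directory below are placeholders, and gated repositories additionally require an HF access token.

```python
from huggingface_hub import snapshot_download

# Download a HF safetensors checkpoint into a local directory that NIM can
# mount. Repo ID and destination are placeholders for your own deployment;
# gated repos also need an HF access token configured.
local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="/models/llama-3.1-8b-instruct",
    allow_patterns=["*.safetensors", "*.json"],  # skip pickle-format weights
)
print(f"Checkpoint ready at {local_dir}")
```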

Unified HuggingFace safetensors#

The Unified HuggingFace checkpoint should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors      # [Required] Model weights stored as safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── ...
├── model.safetensors.index.json          # [Optional] Weights mapping
├── special_tokens_map.json               # [Optional] Special tokens mapping
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
├── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
└── hf_quant_config.json                  # [Required] HuggingFace quantization configuration

GGUF#

The GGUF checkpoint should have the following directory structure:

├── config.json                             # [Required] HuggingFace model configuration
├── generation_config.json                  # [Optional] Parameters to guide text generation
├── Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf  # [Required] Model weights stored in GGUF format
├── ...
├── model.safetensors.index.json            # [Optional] Weights mapping
├── special_tokens_map.json                 # [Optional] Special tokens mapping
├── tokenizer.json                          # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
└── tokenizer_config.json                   # [Optional] Configuration details for a specific model's tokenizer
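
Because the weights here are a single binary file rather than safetensors shards, a cheap sanity check before deployment is to verify the GGUF header: any valid GGUF file begins with the 4-byte magic `GGUF` followed by a little-endian version number. The file path below is a placeholder.

```python
import struct

def check_gguf(path: str) -> int:
    """Return the GGUF format version, or raise if the file is not GGUF."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))  # little-endian uint32
    return version

# Placeholder path matching the example layout above.
print(check_gguf("/models/llama-gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
```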

TRTLLM checkpoints#

The TRTLLM checkpoints should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model.safetensors.index.json          # [Optional] Weights mapping
├── special_tokens_map.json               # [Optional] Special tokens mapping
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
├── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json                       # [Required] TRTLLM pretrained configuration
    ├── rank0.safetensors                 # [Required] TRTLLM checkpoint safetensors
    └── ...

Note

The TRTLLM checkpoint root directory must contain the HF tokenizer and configuration files. The subfolder (named trtllm_ckpt in this example) must contain the TRTLLM checkpoint configuration file and the weight tensors.

TRTLLM engines#

The TRTLLM engine should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model.safetensors.index.json          # [Optional] Weights mapping
├── special_tokens_map.json               # [Optional] Special tokens mapping
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
├── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json                       # [Required] TRTLLM engine configuration
    ├── rank0.engine                      # [Required] TRTLLM serialized engine
    └── ...

Note

NIM strictly validates the required files in each of the above folder structures. If one or more required files are missing, NIM throws an exception and prompts the user to provide the missing files. Tokenizer files are used for inference; if they are not present, each backend falls back to a default tokenizer, which may be suboptimal. Refer to Troubleshooting for more details.
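
A minimal pre-flight check along the same lines can catch an incomplete checkpoint before the container starts. This sketch mirrors the required entries of the HF safetensors layout and is illustrative only; NIM's own validation remains authoritative.

```python
from pathlib import Path

# Illustrative pre-flight check for the HF safetensors layout shown above.
def check_hf_checkpoint(root: str) -> list[str]:
    root_path = Path(root)
    missing = []
    if not (root_path / "config.json").is_file():
        missing.append("config.json")
    if not list(root_path.glob("*.safetensors")):
        missing.append("*.safetensors weight files")
    return missing

if problems := check_hf_checkpoint("/models/my-checkpoint"):  # placeholder
    raise FileNotFoundError(f"Checkpoint incomplete, missing: {problems}")
```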

Ensure that the tensor parallel and pipeline parallel sizes configured at startup with NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE match the number of ranks in the TRTLLM checkpoint or engine. Otherwise, NIM throws an exception prompting you to correct the inference configuration.
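
Assuming one rank file per world rank (world size = TP size × PP size), a quick consistency check before launch might look like the following sketch; the checkpoint path is a placeholder.

```python
import os
from pathlib import Path

# Check that the number of TRTLLM rank files matches the parallelism
# configured through NIM's environment variables (TP size x PP size).
ckpt_dir = Path("/models/my-checkpoint/trtllm_ckpt")  # placeholder path
ranks = (len(list(ckpt_dir.glob("rank*.safetensors")))
         or len(list(ckpt_dir.glob("rank*.engine"))))

tp = int(os.environ.get("NIM_TENSOR_PARALLEL_SIZE", "1"))
pp = int(os.environ.get("NIM_PIPELINE_PARALLEL_SIZE", "1"))
if ranks != tp * pp:
    raise ValueError(f"{ranks} rank file(s) but TP*PP = {tp * pp}; "
                     "adjust the NIM parallelism settings")
```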

Model Precisions#

Any full-precision model can be deployed in NVIDIA NIM for LLMs, provided that its architecture and model format are supported. For quantized models, refer to the following section.

Quantization Formats#

NVIDIA NIM for LLMs supports the following quantization formats:

| Quantization algorithm | vLLM | TRTLLM | SGLang |
|---|---|---|---|
| INT4 AWQ | Yes | Yes | Yes |
| FP8 | Yes | Yes | Yes |
| NVFP4 | No | Yes | No |
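
To see which quantization format a checkpoint uses before choosing a backend, you can inspect its configuration files. The field names in this sketch (`quantization`/`quant_algo` in hf_quant_config.json, `quantization_config`/`quant_method` in config.json) follow common conventions but vary by exporting tool, so treat the result as a heuristic.

```python
import json
from pathlib import Path

# Best-effort detection of a checkpoint's quantization format. Field names
# follow common conventions and differ by exporting tool; treat as heuristic.
def detect_quantization(root: str) -> str | None:
    root_path = Path(root)
    unified = root_path / "hf_quant_config.json"   # unified HF checkpoints
    if unified.is_file():
        cfg = json.loads(unified.read_text())
        return cfg.get("quantization", {}).get("quant_algo")
    cfg = json.loads((root_path / "config.json").read_text())
    quant = cfg.get("quantization_config")          # plain HF checkpoints
    return quant.get("quant_method") if quant else None

print(detect_quantization("/models/my-checkpoint"))  # placeholder path
```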