Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Deploy NeMo Models by Exporting to Inference Optimized Libraries
The NeMo Framework offers scripts and APIs to export models to two inference-optimized libraries, TensorRT-LLM and vLLM, and to deploy the exported models with the NVIDIA Triton Inference Server. Check the table below to see which models are supported.
Supported LLMs
The following table shows which LLMs in the distributed NeMo checkpoint format are supported by each inference-optimized library.
| Model Name | Model Parameters | TensorRT-LLM | vLLM |
|---|---|---|---|
| GPT | 2B, 8B, 43B | ✓ | ✗ |
| Nemotron | 8B, 22B | ✓ | ✗ |
| Llama 2 | 7B, 13B, 70B | ✓ | ✓ |
| Llama 3 | 8B, 70B | ✓ | ✓ |
| Llama 3.1 | 8B, 70B, 405B | ✓ | ✗ |
| Falcon | 7B, 40B | ✓ | ✗ |
| Gemma | 2B, 7B | ✓ | ✓ |
| StarCoder1 | 15B | ✓ | ✗ |
| StarCoder2 | 3B, 7B, 15B | ✓ | ✓ |
| Mistral | 7B | ✓ | ✓ |
| Mixtral | 8x7B | ✓ | ✓ |
You can find details about the TensorRT-LLM-based and vLLM-based deployment options below.
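As a quick orientation, here is a minimal sketch of the export-and-deploy flow using the NeMo Python APIs for TensorRT-LLM export and Triton deployment. The checkpoint path, engine directory, and model name are illustrative, and exact import paths and argument names may vary between NeMo versions; refer to the detailed sections below for the authoritative options.

```python
# Minimal sketch: export a NeMo checkpoint to TensorRT-LLM and serve it with Triton.
# Paths and the model name are illustrative assumptions, not real artifacts.
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export a distributed NeMo checkpoint to a TensorRT-LLM engine directory.
exporter = TensorRTLLM(model_dir="/opt/checkpoints/llama_trt_llm_engine")
exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/llama-2-7b.nemo",  # illustrative path
    model_type="llama",
)

# Deploy the exported engine with the NVIDIA Triton Inference Server.
nm = DeployPyTriton(model=exporter, triton_model_name="llama")
nm.deploy()
nm.serve()
```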