Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Deploy NeMo Models by Exporting to Inference Optimized Libraries#

NeMo Framework offers scripts and APIs to export models to two inference-optimized libraries, TensorRT-LLM and vLLM, and to deploy the exported models with the NVIDIA Triton Inference Server. Refer to the table below to see which models are supported.
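As a quick orientation, the sketch below outlines the TensorRT-LLM path: export a distributed NeMo checkpoint to a TensorRT-LLM engine and serve it with Triton. The checkpoint path, engine directory, and Triton model name are hypothetical placeholders, and exact argument names can vary between NeMo Framework releases, so treat this as an outline; the sections below cover each deployment option in detail.

```python
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export a distributed NeMo checkpoint to a TensorRT-LLM engine.
# The checkpoint path and engine directory are placeholders.
exporter = TensorRTLLM(model_dir="/opt/checkpoints/trt_llm_engine/")
exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/llama2-7b/llama2-7b.nemo",
    model_type="llama",
    n_gpus=1,
)

# Serve the exported engine with the NVIDIA Triton Inference Server.
nm = DeployPyTriton(model=exporter, triton_model_name="llama2-7b", port=8000)
nm.deploy()
nm.serve()
```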

Supported LLMs#

The following table lists the LLMs supported in the distributed NeMo checkpoint format and the model sizes available for export.

| Model Name | Model Parameters |
|------------|------------------|
| GPT        | 2B, 8B, 43B      |
| Nemotron   | 8B, 22B          |
| Llama 2    | 7B, 13B, 70B     |
| Llama 3    | 8B, 70B          |
| Llama 3.1  | 8B, 70B, 405B    |
| Falcon     | 7B, 40B          |
| Gemma      | 2B, 7B           |
| StarCoder1 | 15B              |
| StarCoder2 | 3B, 7B, 15B      |
| Mistral    | 7B               |
| Mixtral    | 8x7B             |

You can find details about TensorRT-LLM and vLLM-based deployment options below.
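Once a model is deployed, it can be queried from Python. The sketch below uses NemoQueryLLM against a model served under the hypothetical name from the earlier export example; the server URL, model name, and argument names such as max_output_len are assumptions and may differ between NeMo Framework releases.

```python
from nemo.deploy.nlp import NemoQueryLLM

# Connect to the Triton server started in the export sketch above.
# The URL and model name are placeholders from that example.
nq = NemoQueryLLM(url="localhost:8000", model_name="llama2-7b")

# Send a prompt and print the generated text.
output = nq.query_llm(
    prompts=["What is the color of a banana?"],
    max_output_len=32,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print(output)
```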