Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

Deploy NeMo Models by Exporting to Inference Optimized Libraries

NeMo Framework offers scripts and APIs to export models to two inference-optimized libraries, TensorRT-LLM and vLLM, and to deploy the exported models with the NVIDIA Triton Inference Server. Please check the table below to see which models are supported.
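For orientation, the following sketch shows how the export and deployment APIs fit together, assuming the TensorRTLLM exporter and the DeployPyTriton class from the nemo.export and nemo.deploy modules; the exact module paths, argument names, and the file paths used here are illustrative and may vary between NeMo releases.

    # Minimal sketch: export a NeMo checkpoint to TensorRT-LLM and serve it
    # with the NVIDIA Triton Inference Server. Paths and names below are
    # hypothetical examples, not fixed values.
    from nemo.deploy import DeployPyTriton
    from nemo.export.tensorrt_llm import TensorRTLLM

    # Export a distributed NeMo checkpoint to a TensorRT-LLM engine directory.
    exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")        # hypothetical engine dir
    exporter.export(
        nemo_checkpoint_path="/models/llama3-8b.nemo",             # hypothetical checkpoint
        model_type="llama",
    )

    # Deploy the exported engine behind Triton and start serving requests.
    nm = DeployPyTriton(model=exporter, triton_model_name="llama3-8b")
    nm.deploy()
    nm.serve()

The vLLM path follows the same export-then-deploy pattern with the vLLM exporter class in place of TensorRTLLM; see the deployment options referenced at the end of this section.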

Supported LLMs

The following table lists the LLMs, in the distributed NeMo checkpoint format, that can be exported to these inference-optimized libraries, along with the supported model parameter sizes.

Model Name    Model Parameters
------------  ----------------
GPT           2B, 8B, 43B
Nemotron      8B, 22B
Llama 2       7B, 13B, 70B
Llama 3       8B, 70B
Llama 3.1     8B, 70B, 405B
Falcon        7B, 40B
Gemma         2B, 7B
StarCoder1    15B
StarCoder2    3B, 7B, 15B
Mistral       7B
Mixtral       8x7B

You can find details about the TensorRT-LLM-based and vLLM-based deployment options below.