Deploy NeMo Models by Exporting to Inference Optimized Libraries
NeMo Framework offers scripts and APIs to export models to two inference-optimized libraries, TensorRT-LLM and vLLM, and to deploy the exported models with the NVIDIA Triton Inference Server. Refer to the table below to see which models are supported.
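For example, exporting a Llama checkpoint to TensorRT-LLM and serving it with Triton can look roughly like the sketch below. It uses the `TensorRTLLM` exporter from `nemo.export` and the `DeployPyTriton` wrapper from `nemo.deploy`; the checkpoint path, engine directory, and Triton model name are placeholders, and exact argument names can vary between NeMo releases.

```python
from nemo.deploy import DeployPyTriton
from nemo.export.tensorrt_llm import TensorRTLLM

# Export the distributed NeMo checkpoint to a TensorRT-LLM engine.
# Both paths below are placeholders.
exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")
exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/llama",
    model_type="llama",
)

# Serve the exported engine with the NVIDIA Triton Inference Server.
nm = DeployPyTriton(model=exporter, triton_model_name="llama")
nm.deploy()
nm.serve()
```

Once the server is running, the model can be queried through standard Triton clients, such as the query helpers shipped in `nemo.deploy`.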
Supported LLMs
The following table lists the supported LLMs in the distributed NeMo checkpoint format and the inference-optimized libraries each can be exported to.
| Model Name | Model Parameters | NeMo 1.0 to TensorRT-LLM | NeMo 2.0 to TensorRT-LLM | NeMo 1.0 to vLLM | NeMo 2.0 to vLLM |
| --- | --- | --- | --- | --- | --- |
| GPT | 2B, 8B, 43B | ✓ | ✓ | ✗ | ✗ |
| Nemotron | 8B, 22B | ✓ | ✓ | ✗ | ✗ |
| Llama 2 | 7B, 13B, 70B | ✓ | ✗ | ✓ | ✓ |
| Llama 3 | 8B, 70B | ✓ | ✓ | ✓ | ✓ |
| Llama 3.1 | 8B, 70B, 405B | ✓ | ✓ | ✓ | ✓ |
| Falcon | 7B, 40B | ✓ | ✗ | ✗ | ✗ |
| Gemma | 2B, 7B | ✓ | ✗ | ✓ | ✓ |
| StarCoder1 | 15B | ✓ | ✗ | ✗ | ✗ |
| StarCoder2 | 3B, 7B, 15B | ✓ | ✗ | ✓ | ✓ |
| Mistral | 7B | ✓ | ✗ | ✓ | ✓ |
| Mixtral | 8x7B | ✓ | ✓ | ✓ | ✓ |
Note
As we transition support for deploying community models from NeMo 1.0 to NeMo 2.0, not all models are available in NeMo 2.0 yet. The support matrix above shows which models are currently supported. To use a model that is not yet supported in NeMo 2.0, please refer to the NeMo 24.07 documentation, which is based on NeMo 1.0.
You can find details about the TensorRT-LLM-based and vLLM-based deployment options below.
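As a preview of the vLLM path, the sketch below follows the same export-then-deploy pattern using the `vLLMExporter` class from `nemo.export.vllm_exporter`. Treat the argument names as assumptions that may differ between NeMo releases; the paths are placeholders.

```python
from nemo.deploy import DeployPyTriton
from nemo.export.vllm_exporter import vLLMExporter

# Convert the distributed NeMo checkpoint into a vLLM-servable model.
# Both paths below are placeholders.
exporter = vLLMExporter()
exporter.export(
    nemo_checkpoint="/opt/checkpoints/llama",
    model_dir="/tmp/vllm_model",
    model_type="llama",
)

# The vLLM exporter plugs into the same Triton deployment wrapper
# as the TensorRT-LLM exporter shown earlier.
nm = DeployPyTriton(model=exporter, triton_model_name="llama")
nm.deploy()
nm.serve()
```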