Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container:
nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Llama and CodeLlama
Meta’s Llama builds on the general transformer decoder framework with some key additions such as pre-normalization, SwiGLU activations, and Rotary Positional Embeddings (RoPE). More information is available in the companion paper “Llama: Open and Efficient Foundation Language Models”. With a wide variety of model sizes, Llama has options for every inference budget.
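As a rough, framework-agnostic sketch of two of these components, the PyTorch snippet below shows RMSNorm pre-normalization wrapped around a SwiGLU feed-forward block. The hidden sizes and module names are illustrative only and do not correspond to NeMo's implementation.

```python
# Minimal sketch of a SwiGLU feed-forward block with RMSNorm pre-normalization,
# as used in Llama-style decoders. Shapes and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(x W_gate) multiplied elementwise with x W_up.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Pre-normalization: the norm is applied before the sublayer, inside the
# residual branch, rather than after the residual addition.
norm = RMSNorm(dim=4096)
ffn = SwiGLUFeedForward(dim=4096, hidden_dim=11008)
x = torch.randn(2, 16, 4096)   # (batch, sequence, hidden)
y = x + ffn(norm(x))           # residual connection around the block
```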
New: Llama 3.1 Support
Llama 3.1 is the latest generation of Llama models and is available in three sizes: 8B, 70B, and 405B.
Llama 3.1 applies a scaling method to its rotary positional embeddings that differs from previous Llama versions.
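As a rough sketch of what such scaling looks like, the snippet below adjusts standard RoPE inverse frequencies so that long-wavelength components are compressed for longer contexts. The constants (scale factor 8, low/high-frequency factors 1 and 4, original 8192-token context, RoPE base 500000) follow the publicly released Llama 3.1 reference code as an assumption and are not NeMo configuration values.

```python
# Illustrative sketch of Llama 3.1-style scaling of RoPE inverse frequencies.
# Constants are assumptions taken from the public reference implementation.
import math

def scale_rope_frequencies(inv_freqs, scale_factor=8.0,
                           low_freq_factor=1.0, high_freq_factor=4.0,
                           original_context_len=8192):
    low_freq_wavelen = original_context_len / low_freq_factor
    high_freq_wavelen = original_context_len / high_freq_factor
    scaled = []
    for freq in inv_freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            # High-frequency components are left unchanged.
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency components are scaled down to extend the context.
            scaled.append(freq / scale_factor)
        else:
            # Smoothly interpolate between the two regimes.
            smooth = (original_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return scaled

# Standard RoPE inverse frequencies for a 128-dim head with base 500000.
head_dim, base = 128, 500000.0
inv_freqs = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
scaled_freqs = scale_rope_frequencies(inv_freqs)
```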
To launch a pretraining job with Llama 3.1, use the “Training with Predefined Configurations” tab below, but select the Llama 3.1 predefined configuration Llama/Llama3_1_<model_size> from the NeMo-Framework-Launcher training configs: Llama Training Config. Use the “Parameter Efficient Fine-Tuning (PEFT)” tab below for LoRA and other PEFT methods, and use the Llama 3.1 checkpoints from NVIDIA NGC Models.
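To make the PEFT reference concrete, the snippet below is a minimal, generic sketch of the LoRA idea: the pretrained weight is frozen and a low-rank update is trained in its place. It is not NeMo's PEFT API, and the layer sizes, rank, and scaling are arbitrary.

```python
# Generic illustration of LoRA (Low-Rank Adaptation) as used in PEFT:
# the pretrained weight W is frozen and a low-rank update B @ A is learned.
# This is not NeMo's PEFT API; names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen base projection + scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), rank=8)
out = layer(torch.randn(2, 16, 4096))
```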
For Llama 3.1, use the nvcr.io/nvidia/nemo:24.05.Llama3.1 container, which includes the latest changes for Llama 3.1. This container is available from NVIDIA NGC Containers.
Clone the latest NeMo-Framework-Launcher:
git clone git@github.com:NVIDIA/NeMo-Framework-Launcher.git
Launch the Docker container with the repository mounted:
docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/NeMo-Framework-Launcher:/NeMo-Framework-Launcher nvcr.io/nvidia/nemo:24.05.Llama3.1
For a step-by-step tutorial on PEFT for Llama 3.1, refer to the PEFT tutorial (note that the original tutorial link has been archived).
The rest of this user guide applies to Llama 3.1 without any changes.
| Feature | Status |
|---|---|
| Data parallelism | ✓ |
| Tensor parallelism | ✓ |
| Pipeline parallelism | ✓ |
| Interleaved Pipeline Parallelism Schedule | N/A |
| Sequence parallelism | ✓ |
| Selective activation checkpointing | ✓ |
| Gradient checkpointing | ✓ |
| Partial gradient checkpointing | ✓ |
| FP32/TF32 | ✓ |
| AMP/FP16 | ✗ |
| BF16 | ✓ |
| TransformerEngine/FP8 | ✗ |
| Multi-GPU | ✓ |
| Multi-Node | ✓ |
| Inference | N/A |
| Slurm | ✓ |
| Base Command Manager | ✓ |
| Base Command Platform | ✓ |
| Distributed data preprocessing | ✓ |
| NVfuser | ✗ |
| P-Tuning and Prompt Tuning | ✓ |
| IA3 and Adapter learning | ✓ |
| Distributed Optimizer | ✓ |
| Distributed Checkpoint | ✓ |
| Fully Sharded Data Parallel | ✓ |