Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Llama and CodeLlama
Meta’s Llama builds on the general transformer decoder framework with some key additions such as pre-normalization, SwiGLU activations, and Rotary Positional Embeddings (RoPE). More information is available in the companion paper “LLaMA: Open and Efficient Foundation Language Models”. With a wide variety of model sizes, Llama has options for every inference budget.
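To make these additions concrete, the sketch below shows a minimal PyTorch implementation of two of these building blocks, a SwiGLU feed-forward layer and RoPE applied to attention inputs. The names, shapes, and the `base` value are illustrative assumptions for explanation only, not the NeMo implementation.

```python
# Minimal, illustrative sketch (not the NeMo code) of two Llama building blocks:
# a SwiGLU feed-forward layer and rotary positional embeddings (RoPE).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block: (SiLU(x @ W_gate) * (x @ W_up)) @ W_down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs by position-dependent angles (RoPE).

    x: [batch, seq_len, heads, head_dim] with an even head_dim.
    """
    b, s, h, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.outer(torch.arange(s, dtype=torch.float32), inv_freq)  # [s, d/2]
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```

For example, `SwiGLU(dim=4096, hidden_dim=11008)` roughly matches the hidden sizes of the smallest Llama variant; `apply_rope` would be applied to the query and key tensors before attention.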
New: Llama 3.1 Support
Llama 3.1 is the latest generation of Llama models and is available in three sizes: 8B, 70B, and 405B.
Llama 3.1 uses a scaling method for its rotary positional embeddings that differs from previous Llama versions.
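As a rough sketch of what this scaling does, the snippet below rescales the RoPE inverse frequencies so that low-frequency components are stretched for longer context while high-frequency components are left unchanged. This is an assumption based on the commonly published Llama 3.1 rotary-scaling recipe, not the NeMo source, and the constants should be checked against the actual configuration.

```python
# Illustrative sketch (not the NeMo implementation) of frequency-dependent RoPE scaling
# in the style used by Llama 3.1. The constants below follow the commonly published
# Llama 3.1 values and are assumptions for illustration.
import math
import torch

def scale_rope_frequencies(
    inv_freq: torch.Tensor,
    scale_factor: float = 8.0,
    low_freq_factor: float = 1.0,
    high_freq_factor: float = 4.0,
    original_context_len: int = 8192,
) -> torch.Tensor:
    wavelen = 2 * math.pi / inv_freq
    low_freq_wavelen = original_context_len / low_freq_factor
    high_freq_wavelen = original_context_len / high_freq_factor

    # Long wavelengths (low frequencies) are divided by the scale factor; short ones kept as-is.
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / scale_factor, inv_freq)

    # Smoothly blend between scaled and unscaled frequencies in the transition band.
    smooth = (original_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
    blended = (1 - smooth) / scale_factor * inv_freq + smooth * inv_freq
    in_between = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
    return torch.where(in_between, blended, scaled)
```

This composes with the `inv_freq` term computed in the earlier RoPE sketch: the scaled frequencies replace the unscaled ones before the rotation angles are formed.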
To launch a pretraining job with Llama 3.1, use the “Training with Predefined Configurations” tab below, but select the Llama 3.1 predefined configuration
Llama/Llama3_1_<model_size>
from the NeMo-Framework-Launcher training configs: Llama Training Config. Use the “Parameter Efficient Fine-Tuning (PEFT)” tab below for LoRA and other PEFT methods, and use the Llama 3.1 checkpoints from NVIDIA NGC Models.
For Llama 3.1, users should use the
nvcr.io/nvidia/nemo:24.05.Llama3.1
container, which has the latest changes for Llama 3.1. This container is available from NVIDIA NGC Containers. Clone the latest NeMo-Framework-Launcher:
git clone git@github.com:NVIDIA/NeMo-Framework-Launcher.git
Launch the Docker container with the repository above mounted:
docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/NeMo-Framework-Launcher:/NeMo-Framework-Launcher nvcr.io/nvidia/nemo:24.05.Llama3.1
For a step-by-step walkthrough of PEFT for Llama 3.1, see this tutorial.
The rest of this user guide applies to Llama 3.1 without any changes.
| Feature | Status |
|---|---|
| Data parallelism | ✓ |
| Tensor parallelism | ✓ |
| Pipeline parallelism | ✓ |
| Interleaved Pipeline Parallelism Schedule | N/A |
| Sequence parallelism | ✓ |
| Selective activation checkpointing | ✓ |
| Gradient checkpointing | ✓ |
| Partial gradient checkpointing | ✓ |
| FP32/TF32 | ✓ |
| AMP/FP16 | ✗ |
| BF16 | ✓ |
| TransformerEngine/FP8 | ✗ |
| Multi-GPU | ✓ |
| Multi-Node | ✓ |
| Inference | N/A |
| Slurm | ✓ |
| Base Command Manager | ✓ |
| Base Command Platform | ✓ |
| Distributed data preprocessing | ✓ |
| NVfuser | ✗ |
| P-Tuning and Prompt Tuning | ✓ |
| IA3 and Adapter learning | ✓ |
| Distributed Optimizer | ✓ |
| Distributed Checkpoint | ✓ |
| Fully Sharded Data Parallel | ✓ |