Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Llama and CodeLlama
Meta’s Llama builds on the general transformer decoder framework with some key additions such as pre-normalization, SwiGLU activations, and Rotary Positional Embeddings (RoPE). More information is available in the companion paper “LLaMA: Open and Efficient Foundation Language Models”. With a wide variety of model sizes, Llama has options for every inference budget.
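To make these additions concrete, the sketch below shows a minimal PyTorch implementation of two of these building blocks, a SwiGLU feed-forward layer and RoPE applied to attention inputs. The names, shapes, and the `base` value are illustrative assumptions for explanation only, not the NeMo implementation.

```python
# Minimal, illustrative sketch (not the NeMo code) of two Llama building blocks:
# a SwiGLU feed-forward layer and rotary positional embeddings (RoPE).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block: (SiLU(x @ W_gate) * (x @ W_up)) @ W_down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs by position-dependent angles (RoPE).

    x: [batch, seq_len, heads, head_dim] with an even head_dim.
    """
    b, s, h, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.outer(torch.arange(s, dtype=torch.float32), inv_freq)  # [s, d/2]
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```

For example, `SwiGLU(dim=4096, hidden_dim=11008)` roughly matches the hidden sizes of the smallest Llama variant; `apply_rope` would be applied to the query and key tensors before attention.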
New: Llama 3.1 Support
Llama 3.1 is the latest generation of Llama models and is available in three sizes: 8B, 70B, and 405B.
Llama 3.1 uses a scaling method for its rotary positional embeddings that differs from previous Llama versions.
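As a rough sketch of what this scaling does, the snippet below rescales the RoPE inverse frequencies so that low-frequency components are stretched for longer context while high-frequency components are left unchanged. This is an assumption based on the commonly published Llama 3.1 rotary-scaling recipe, not the NeMo source, and the constants should be checked against the actual configuration.

```python
# Illustrative sketch (not the NeMo implementation) of frequency-dependent RoPE scaling
# in the style used by Llama 3.1. The constants below follow the commonly published
# Llama 3.1 values and are assumptions for illustration.
import math
import torch

def scale_rope_frequencies(
    inv_freq: torch.Tensor,
    scale_factor: float = 8.0,
    low_freq_factor: float = 1.0,
    high_freq_factor: float = 4.0,
    original_context_len: int = 8192,
) -> torch.Tensor:
    wavelen = 2 * math.pi / inv_freq
    low_freq_wavelen = original_context_len / low_freq_factor
    high_freq_wavelen = original_context_len / high_freq_factor

    # Long wavelengths (low frequencies) are divided by the scale factor; short ones kept as-is.
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / scale_factor, inv_freq)

    # Smoothly blend between scaled and unscaled frequencies in the transition band.
    smooth = (original_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
    blended = (1 - smooth) / scale_factor * inv_freq + smooth * inv_freq
    in_between = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
    return torch.where(in_between, blended, scaled)
```

This composes with the `inv_freq` term computed in the earlier RoPE sketch: the scaled frequencies replace the unscaled ones before the rotation angles are formed.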
To launch a pretraining job with Llama 3.1, use the “Training with Predefined Configurations” tab below, but select the Llama 3.1 predefined configuration
Llama/Llama3_1_<model_size>
from the NeMo-Framework-Launcher training configs: Llama Training Config. Use the “Parameter Efficient Fine-Tuning (PEFT)” tab below for LoRA and other PEFT methods, and use the Llama 3.1 checkpoints from NVIDIA NGC Models.
For Llama 3.1, users should use the
nvcr.io/nvidia/nemo:24.05.Llama3.1
container, which has the latest changes for Llama 3.1. This container is available from NVIDIA NGC Containers. Clone the latest NeMo-Framework-Launcher:
git clone git@github.com:NVIDIA/NeMo-Framework-Launcher.git
Launch the Docker container with the repository above mounted:
docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/NeMo-Framework-Launcher:/NeMo-Framework-Launcher nvcr.io/nvidia/nemo:24.05.Llama3.1
For a step-by-step walkthrough of PEFT for Llama 3.1, see this tutorial.
The rest of this user guide applies to Llama 3.1 without any changes.
| Feature | Status |
|---|---|
| Data parallelism | ✓ |
| Tensor parallelism | ✓ |
| Pipeline parallelism | ✓ |
| Interleaved Pipeline Parallelism Schedule | N/A |
| Sequence parallelism | ✓ |
| Selective activation checkpointing | ✓ |
| Gradient checkpointing | ✓ |
| Partial gradient checkpointing | ✓ |
| FP32/TF32 | ✓ |
| AMP/FP16 | ✗ |
| BF16 | ✓ |
| TransformerEngine/FP8 | ✗ |
| Multi-GPU | ✓ |
| Multi-Node | ✓ |
| Inference | N/A |
| Slurm | ✓ |
| Base Command Manager | ✓ |
| Base Command Platform | ✓ |
| Distributed data preprocessing | ✓ |
| NVfuser | ✗ |
| P-Tuning and Prompt Tuning | ✓ |
| IA3 and Adapter learning | ✓ |
| Distributed Optimizer | ✓ |
| Distributed Checkpoint | ✓ |
| Fully Sharded Data Parallel | ✓ |