Llama and CodeLlama

Meta’s Llama builds on the general transformer decoder framework with some key additions, such as pre-normalization (RMSNorm), SwiGLU activations, and Rotary Positional Embeddings (RoPE). More information is available in the companion paper “LLaMA: Open and Efficient Foundation Language Models”. With a wide variety of model sizes, Llama has an option for every inference budget.

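A minimal PyTorch sketch of two of these components, RMSNorm pre-normalization and a SwiGLU feed-forward, is shown below. The class names, dimensions, and the stand-in attention layer are illustrative assumptions rather than NeMo’s implementation; RoPE and causal masking are omitted for brevity.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class RMSNorm(nn.Module):
      """Root-mean-square norm used for pre-normalization in Llama-style blocks."""
      def __init__(self, dim: int, eps: float = 1e-6):
          super().__init__()
          self.eps = eps
          self.weight = nn.Parameter(torch.ones(dim))

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # Normalize by the RMS of the features, then apply a learned scale.
          rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
          return self.weight * x * rms

  class SwiGLU(nn.Module):
      """SwiGLU feed-forward: silu(x W1) * (x W3), projected back with W2."""
      def __init__(self, dim: int, hidden_dim: int):
          super().__init__()
          self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
          self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
          self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          return self.w2(F.silu(self.w1(x)) * self.w3(x))

  class DecoderBlock(nn.Module):
      """Pre-norm decoder block: x + Attn(RMSNorm(x)), then x + SwiGLU(RMSNorm(x))."""
      def __init__(self, dim: int = 512, n_heads: int = 8, hidden_dim: int = 1376):
          super().__init__()
          self.attn_norm = RMSNorm(dim)
          # Stand-in attention; real Llama uses RoPE-rotated, causally masked attention.
          self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
          self.ffn_norm = RMSNorm(dim)
          self.ffn = SwiGLU(dim, hidden_dim)

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          h = self.attn_norm(x)
          attn_out, _ = self.attn(h, h, h, need_weights=False)
          x = x + attn_out
          return x + self.ffn(self.ffn_norm(x))

  if __name__ == "__main__":
      block = DecoderBlock()
      tokens = torch.randn(2, 16, 512)  # (batch, sequence, hidden)
      print(block(tokens).shape)        # torch.Size([2, 16, 512])
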
New: Llama 3.1 Support

  1. Llama 3.1 is the latest generation of Llama models. It is available in three sizes: 8B, 70B, and 405B.

  2. Llama 3.1 uses a new scaling method for its rotary positional embeddings (RoPE) that differs from previous Llama versions; see the sketch after this list.

  3. To launch a pretraining job with Llama 3.1, use the “Training with Predefined Configurations” tab below, but select the Llama 3.1 predefined configuration Llama/Llama3_1_<model_size> from the NeMo-Framework-Launcher training configs: Llama Training Config. An example invocation is sketched after this list.

  4. Use the “Parameter Efficient Fine-Tuning (PEFT)” tab below for LoRA and other PEFT methods, using the Llama 3.1 checkpoints from NVIDIA NGC Models.

  5. For Llama 3.1, use the nvcr.io/nvidia/nemo:24.05.Llama3.1 container, which includes the latest changes for Llama 3.1. This container is available from NVIDIA NGC Containers.

    • Clone the latest NeMo-Framework-Launcher:

      git clone git@github.com:NVIDIA/NeMo-Framework-Launcher.git
      
    • Launch the docker container mounted with the above repository:

      docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/NeMo-Framework-Launcher:/NeMo-Framework-Launcher nvcr.io/nvidia/nemo:24.05.Llama3.1
      
  6. For a step-by-step tutorial on PEFT with Llama 3.1, see this tutorial.

  7. The rest of this user guide applies to Llama 3.1 without any changes.
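
The following is a minimal PyTorch sketch of the kind of RoPE frequency rescaling used by Llama 3.1 (item 2 above): long-wavelength components are scaled down, short-wavelength components are kept, and the band in between is interpolated. The function name apply_llama31_rope_scaling, the default constants, and the rope base in the example are illustrative assumptions; the authoritative values live in the predefined NeMo configurations.

  import math
  import torch

  def apply_llama31_rope_scaling(
      freqs: torch.Tensor,
      scale_factor: float = 8.0,         # assumed defaults; check the predefined
      low_freq_factor: float = 1.0,      # NeMo/Launcher configs for the values
      high_freq_factor: float = 4.0,     # actually used per model size
      original_context_len: int = 8192,
  ) -> torch.Tensor:
      """Rescale RoPE inverse frequencies in the Llama 3.1 style."""
      low_freq_wavelen = original_context_len / low_freq_factor
      high_freq_wavelen = original_context_len / high_freq_factor

      wavelen = 2 * math.pi / freqs
      # Smooth interpolation coefficient for the mid band.
      smooth = (original_context_len / wavelen - low_freq_factor) / (
          high_freq_factor - low_freq_factor
      )
      return torch.where(
          wavelen > low_freq_wavelen,          # long wavelengths: scale down
          freqs / scale_factor,
          torch.where(
              wavelen < high_freq_wavelen,     # short wavelengths: unchanged
              freqs,
              (1 - smooth) * freqs / scale_factor + smooth * freqs,
          ),
      )

  # Example: inverse frequencies for a 128-dimensional rotary embedding
  # (rope base 500000 assumed here for the Llama 3 family).
  dim = 128
  inv_freqs = 1.0 / (500000.0 ** (torch.arange(0, dim, 2).float() / dim))
  print(apply_llama31_rope_scaling(inv_freqs)[:4])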

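Below is a hypothetical launch command for item 3, showing the general shape of a NeMo-Framework-Launcher run from inside the container. The configuration name llama/llama3_1_8b, the override keys, and the paths are placeholders and assumptions; consult the Llama Training Config referenced above for the exact names.

  cd /NeMo-Framework-Launcher/launcher_scripts
  python3 main.py \
      training=llama/llama3_1_8b \
      'stages=[training]' \
      launcher_scripts_path=${PWD} \
      base_results_dir=${PWD}/results \
      training.trainer.num_nodes=1
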
Feature                                   Status
Data parallelism                          ✓
Tensor parallelism                        ✓
Pipeline parallelism                      ✓
Interleaved Pipeline Parallelism Sched    N/A
Sequence parallelism                      ✓
Selective activation checkpointing        ✓
Gradient checkpointing                    ✓
Partial gradient checkpointing            ✓
FP32/TF32                                 ✓
AMP/FP16                                  ✓
BF16                                      ✓
TransformerEngine/FP8                     ✓
Multi-GPU                                 ✓
Multi-Node                                ✓
Inference                                 N/A
Slurm                                     ✓
Base Command Manager                      ✓
Base Command Platform                     ✓
Distributed data preprocessing            ✓
NVfuser                                   ✓
P-Tuning and Prompt Tuning                ✓
IA3 and Adapter learning                  ✓
Distributed Optimizer                     ✓
Distributed Checkpoint                    ✓
Fully Sharded Data Parallel               ✓