Llama 4 Models#

Meta’s Llama 4 models introduce advanced capabilities in visual recognition, image reasoning, captioning, and answering general image-related questions. The Llama 4 family includes two powerful multimodal models:

  • Llama 4 Scout: A 17B active parameter model with 16 experts that outperforms all previous Llama generations while fitting on a single NVIDIA H100 GPU. It features an industry-leading 10M context window and delivers superior results compared to Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across numerous benchmarks.

  • Llama 4 Maverick: A 17B active parameter model with 128 experts that surpasses GPT-4o and Gemini 2.0 Flash across various benchmarks, while achieving comparable results to DeepSeek v3 on reasoning and coding with less than half the active parameters. It offers an exceptional performance-to-cost ratio, with its experimental chat version scoring 1417 ELO on LMArena.

Import from Hugging Face to NeMo 2.0#

To import the Hugging Face (HF) model and convert it to NeMo 2.0 format, run the following Python script. This step only needs to be performed once:

from nemo.collections import llm, vlm
from nemo.collections.vlm.llama4.model.llama4_omni import Llama4ScoutExperts16Config

if __name__ == '__main__':
    # Specify the Hugging Face model ID (e.g., Scout 16 Experts Instruct)
    hf_model_id = 'meta-llama/Llama-4-Scout-17B-16E-Instruct'
    # Import the model and convert to NeMo 2.0 format
    llm.import_ckpt(
        model=vlm.Llama4OmniModel(Llama4ScoutExperts16Config()),
        source=f"hf://{hf_model_id}",
    )

The script above saves the converted checkpoint in the NeMo cache folder, located at ~/.cache/nemo.

If needed, you can change the default cache directory by setting the NEMO_CACHE_DIR environment variable before running the script.
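
For example, here is a minimal sketch of the same conversion with a custom cache location; the /workspace/nemo_cache path is purely illustrative:

import os

# Point the NeMo cache at a custom location (illustrative path).
# Set the variable before importing NeMo so the override is picked up.
os.environ["NEMO_CACHE_DIR"] = "/workspace/nemo_cache"

from nemo.collections import llm, vlm
from nemo.collections.vlm.llama4.model.llama4_omni import Llama4ScoutExperts16Config

if __name__ == '__main__':
    llm.import_ckpt(
        model=vlm.Llama4OmniModel(Llama4ScoutExperts16Config()),
        source="hf://meta-llama/Llama-4-Scout-17B-16E-Instruct",
    )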

NeMo 2.0 Llama 4 Scripts#

The scripts for working with Llama 4 models within the NeMo Framework are located in scripts/vlm/llama4.

  • convert_llama4_hf.py: Converts Llama 4 models from Hugging Face format to NeMo 2.0 format.

    Usage:

    python scripts/vlm/llama4/convert_llama4_hf.py
    
  • llama4_generate.py: Performs inference (generation) using a fine-tuned or pre-converted Llama 4 NeMo 2.0 model.

    Usage:

    python scripts/vlm/llama4/llama4_generate.py \
        --local_model_path=<path_to_nemo_model>
    
  • Multi-Node Usage (Example with SLURM and Pyxis):

    The following example demonstrates how to run text generation inference on 4 nodes with 8 GPUs each (32 GPUs in total) using SLURM, with tensor parallelism 8 and pipeline parallelism 4 (8 × 4 = 32 GPUs). It assumes a containerized environment managed by Pyxis.

    srun --mpi=pmix --no-kill \
        --container-image <path_to_container_image> \
        --container-mounts <necessary_mounts> \
        -N 4 --ntasks-per-node=8 -p <partition_name> --pty \
        bash -c " \
            python scripts/vlm/llama4/llama4_generate.py \
                --local_model_path=<path_to_nemo_model> \
                --tp 8 \
                --pp 4 \
        "
    
  • llama4_finetune.py: Fine-tunes a Llama 4 model on a given dataset.

    Usage:

    torchrun --nproc_per_node=2 scripts/vlm/llama4/llama4_finetune.py \
      --devices=2 --tp=2 --data_type=mock \
      --mbs=1 --gbs=4 --use_toy_model
    

    Note

    Multi-node fine-tuning can be launched similarly to the multi-node generation example above, using job schedulers like SLURM. Replace llama4_generate.py with llama4_finetune.py and adjust the script parameters and SLURM configuration (nodes, tasks, etc.) accordingly.

NeMo 2.0 Fine-Tuning Recipes#

We provide pre-defined recipes for fine-tuning Llama 4 vision models (Llama4OmniModel) using NeMo 2.0 and NeMo-Run. These recipes configure a run.Partial for one of the nemo.collections.llm API functions introduced in NeMo 2.0. The recipes are hosted in the llama4_omni_16e and llama4_omni_128e files.

Note

The recipes use the Llama4MockDataModule for the data argument. You are expected to replace the Llama4MockDataModule with your custom dataset module.

By default, the non-instruct version of the model is loaded. To load a different model, set finetune.resume.restore_config.path=nemo://<hf_model_id> or finetune.resume.restore_config.path=<local_model_path> (an example appears after the list below).

We provide an example below on how to invoke the default recipe and override the data argument:

from nemo.collections import vlm

# Get the fine-tuning recipe function (adjust for the specific Llama 4 model)
finetune = vlm.llama4_omni_16e.finetune_recipe(
    name="llama4_omni_16e_finetune",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='lora', # or 'none' for full fine-tuning
)

By default, the fine-tuning recipe applies LoRA to all linear layers in the language model, including cross-attention layers, while keeping the vision model unfrozen.

  • To configure which layers to apply LoRA: Set finetune.peft.target_modules. For example, to apply LoRA only on the self-attention qkv projection layers, set finetune.peft.target_modules=["*.language_model.*.linear_qkv"], as shown in the sketch after this list.

  • To freeze the vision model: Set finetune.peft.freeze_vision_model=True.

  • To fine-tune the entire model without LoRA: Set peft_scheme='none' in the recipe argument.
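
Putting these options together, here is a minimal sketch that creates the recipe and then applies the overrides described above. The checkpoint path and target-module pattern are illustrative, and the freeze_vision_model flag is assumed to be exposed by the PEFT config as noted in the list:

from nemo.collections import vlm

finetune = vlm.llama4_omni_16e.finetune_recipe(
    name="llama4_omni_16e_finetune",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='lora',
)

# Resume from the instruct checkpoint instead of the default non-instruct one
# (model ID shown only as an example).
finetune.resume.restore_config.path = "nemo://meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Apply LoRA only to the self-attention qkv projection layers of the language model.
finetune.peft.target_modules = ["*.language_model.*.linear_qkv"]

# Freeze the vision model during fine-tuning (assumed PEFT config flag; see above).
finetune.peft.freeze_vision_model = True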

Note

The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
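
As a quick illustration of this attribute-style configuration, the sketch below overrides two nested fields on the finetune object. The field names (trainer.max_steps and optim.config.lr) follow the typical NeMo 2.0 recipe layout and are assumptions here; adjust them to match your recipe:

# Continuing with the `finetune` run.Partial created above:
finetune.trainer.max_steps = 1000   # assumed trainer field: total fine-tuning steps
finetune.optim.config.lr = 1e-4     # assumed optimizer field: learning rate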

Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which just runs the fine-tuning locally in a separate process. You can use it as follows:

import nemo_run as run

run.run(finetune, executor=run.LocalExecutor())

Alternatively, you can run it directly in the same Python process as follows:

run.run(finetune, direct=True)

Bring Your Own Data#

Replace the Llama4MockDataModule in the default recipes with your custom dataset module. Here is an example of switching to an Energon dataset:

# Import the vlm collection, which exposes the Llama 4 recipe modules
from nemo.collections import vlm

# Import your custom Llama 4 data module and necessary configs
from nemo.collections.vlm.data.data_module import EnergonDataModule
from nemo.collections.vlm.llama4.data.task_encoder import TaskEncoder as Llama4TaskEncoder
from nemo.collections.vlm.llama4.data.task_encoder import TaskEncoderConfig as Llama4TaskEncoderConfig

# Define the fine-tuning recipe using the appropriate Llama 4 recipe (adjust name if needed)
finetune = vlm.llama4_omni_16e.finetune_recipe(
    name="llama4_omni_16e_finetune",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='lora', # or 'none'
)

# Example custom dataset configuration (replace with your actual Llama 4 data setup)
task_encoder = Llama4TaskEncoder(
    config=Llama4TaskEncoderConfig(
        hf_path='meta-llama/Llama-4-Scout-17B-16E-Instruct', # Use the appropriate model path
    )
)
custom_data = EnergonDataModule(
    path="/path/to/energon/dataset", # Path to your Energon dataset
    train_encoder=task_encoder,
    seq_length=8192, # Adjust as needed
    global_batch_size=16, # Adjust based on GPU memory
    micro_batch_size=1, # Adjust based on GPU memory
    num_workers=8, # Adjust based on system capabilities
)

# Assign custom data to the fine-tuning recipe
finetune.data = custom_data

Please refer to Data Preparation to Use Megatron-Energon Dataloader for instructions on preparing your LLaVA-like data for fine-tuning.

A comprehensive list of fine-tuning recipes that we currently support or plan to support soon is provided below for reference:

Recipe                                       Status
Llama 4 Scout 17B 16E LoRA                   Yes
Llama 4 Scout 17B 16E Full fine-tuning       Yes
Llama 4 Maverick 17B 128E LoRA               Yes
Llama 4 Maverick 17B 128E Full fine-tuning   Yes