Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Llama 3.2 Vision Models#
Meta’s Llama 3.2 models introduce advanced capabilities in visual recognition, image reasoning, captioning, and answering general image-related questions.
The Llama 3.2 vision-language models are available in two parameter sizes: 11B and 90B. Each size is offered in both base and instruction-tuned versions, providing flexibility for various use cases.
Resources:
Hugging Face Llama 3.2 Collection: HF llama32 collection
Meta LLaMA Source Code: GitHub Repository
Import from Hugging Face to NeMo 2.0#
To import the Hugging Face (HF) model and convert it to NeMo 2.0 format, run the following command. This step only needs to be performed once:
from nemo.collections.llm import import_ckpt
from nemo.collections import vlm
if __name__ == '__main__':
hf_model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
import_ckpt(
model=vlm.MLlamaModel(vlm.MLlamaConfig11BInstruct(),
source=f"hf://{hf_model_id}",
)
The command above saves the converted file in the NeMo cache folder, located at: ~/.cache/nemo
.
If needed, you can change the default cache directory by setting the NEMO_CACHE_DIR
environment variable before running the script.
NeMo 2.0 Fine-Tuning Recipes#
We provide pre-defined recipes for fine-tuning Llama 3.2 vision models (also known as MLlama) using NeMo 2.0 and NeMo-Run.
These recipes configure a run.Partial
for one of the nemo.collections.llm api functions introduced in NeMo 2.0.
The recipes are hosted in mllama_11b
and mllama_90b files. The recipes use the MockDataModule
for the data
argument.
Note
The recipes use the MockDataModule
for the data
argument. You are expected to replace the MockDataModule
with your custom dataset.
By default, the non-instruct version of the model is loaded. To load a different model, set
finetune.resume.restore_config.path=nemo://<hf_model_id>
or
finetune.resume.restore_config.path=<local_model_path>
We provide an example below on how to invoke the default recipe and override the data argument:
from nemo.collections import vlm
finetune = vlm.mllama_11b.finetune_recipe(
name="mllama_11b_finetune",
dir=f"/path/to/checkpoints",
num_nodes=1,
num_gpus_per_node=8,
peft_scheme='lora', # 'lora', 'none'
)
By default, the fine-tuning recipe applies LoRA to all linear layers in the language model, including cross-attention layers, while keeping the vision model unfrozen.
To configure which layers to apply LoRA: Set
finetune.peft.target_modules
. For example, to apply LoRA only on the self-attention qkv projection layers, setfinetune.peft.target_modules=["*.language_model.*.linear_qkv"]
.To freeze the vision model: Set
finetune.peft.freeze_vision_model=True
.To fine-tune the entire model without LoRA: Set
peft_scheme='none'
in the recipe argument.
Note
The configuration in the recipes is done using the NeMo-Run run.Config
and run.Partial
configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which just runs the pretraining locally in a separate process. You can use it as follows:
import nemo_run as run
run.run(finetune, executor=run.LocalExecutor())
Additionally, you can also run it directly in the same Python process as follows:
run.run(finetune, direct=True)
Bring Your Own Data#
Replace the MockDataModule
in default recipes with your custom dataset.
from nemo.collections import vlm
# Define the finetuning recipe
finetune = vlm.mllama_11b.finetune_recipe(
name="mllama_11b_finetune",
dir=f"/path/to/checkpoints",
num_nodes=1,
num_gpus_per_node=8,
peft_scheme='lora', # 'lora', 'none'
)
# Example of a custom dataset configuration
# Data configuration
data_config = vlm.ImageDataConfig(
image_folder="/path/to/images",
conv_template="mllama", # Customize based on your dataset needs
)
# Data module setup
custom_data = vlm.MLlamaLazyDataModule(
paths="/path/to/dataset.json", # Path to your llava-like dataset
data_config=data_config,
seq_length=6404,
decoder_sequence_length=2048,
global_batch_size=16, # Global batch size
micro_batch_size=1, # Micro batch size
tokenizer=None, # Define your tokenizer if needed
image_processor=None, # Add an image processor if required
num_workers=8, # Number of workers for data loading
)
# Assign custom data to the finetuning recipe
finetune.data = custom_data
A comprehensive list of pretraining recipes that we currently support or plan to support soon is provided below for reference:
Recipe |
Status |
---|---|
Llama 3.2 11B LoRA |
Yes |
Llama 3.2 11B Full fine-tuning |
Yes |
Llama 3.2 90B LoRA |
Yes |
Llama 3.2 90B Full fine-tuning |
Yes |