LLaVA-Next#
LLaVA-Next is an extension of the LLaVA model designed to handle high-resolution images efficiently through tiling. This enables users to work with larger image sizes for improved accuracy on a variety of vision-language tasks. For more details about LLaVA-Next, refer to the LLaVA-Next blog.
We have extended the NeVA model to support LLaVA-Next, so users can switch to LLaVA-Next for high-resolution image tiling support with minimal configuration changes. To switch from NeVA (LLaVA) to LLaVA-Next, replace the task encoder with LlavaNextTaskEncoder, which is designed to handle image tiling for the LLaVA-Next architecture (see the Use the Energon Dataloader section below).
To get started with LLaVA-Next, follow the steps below; they mirror the NeVA workflow with only minor modifications.
Import from Hugging Face to NeMo 2.0#
The following script downloads the checkpoint for the LLM backbone (Vicuna 7B) and converts it to NeMo format. The converted checkpoint is stored in the NeMo cache folder at ~/.cache/nemo. For example, when used with the NeMo container, the full path is /root/.cache/nemo/models/lmsys/vicuna-7b-v1.5/. This checkpoint can be used to initialize the LLM for pretraining LLaVA-Next.
from nemo.collections.llm import import_ckpt
from nemo.collections.llm import Llama2Config7B, LlamaModel

if __name__ == "__main__":
    # Specify the Hugging Face model ID
    hf_model_id = "lmsys/vicuna-7b-v1.5"

    # Import the model and convert to NeMo 2.0 format
    import_ckpt(
        model=LlamaModel(Llama2Config7B()),
        source=f"hf://{hf_model_id}",
    )
The following step is optional and is intended for users who want to fine-tune the LLaVA-Next model starting from a pretrained Hugging Face checkpoint.
from nemo.collections.llm import import_ckpt
from nemo.collections import vlm
if __name__ == "__main__":
    # Specify the Hugging Face model ID
    hf_model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"

    # Import the model and convert to NeMo 2.0 format
    import_ckpt(
        model=vlm.LlavaNextModel(vlm.Llava16Config7B()),  # Model configuration
        source=f"hf://{hf_model_id}",  # Hugging Face model source
    )
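If you prefer to write the converted checkpoint to an explicit location rather than the default NeMo cache, import_ckpt also accepts an output_path argument in recent NeMo releases. The sketch below assumes that signature and uses a hypothetical target directory; verify both against your installed version.

from pathlib import Path

from nemo.collections import vlm
from nemo.collections.llm import import_ckpt

if __name__ == "__main__":
    import_ckpt(
        model=vlm.LlavaNextModel(vlm.Llava16Config7B()),
        source="hf://llava-hf/llava-v1.6-vicuna-7b-hf",
        # Assumed argument: writes the converted checkpoint here instead of ~/.cache/nemo
        output_path=Path("/workspace/checkpoints/llava-next-7b-nemo"),
    )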
NeMo 2.0 Pretraining Recipes#
As with the NeVA model, we provide a default recipe for pretraining LLaVA-Next: llava_next_7b.
from nemo.collections import vlm
pretrain = vlm.llava_next_7b.pretrain_recipe(
    name="llava_next_7b_pretrain",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    # Path to the converted Vicuna 7B checkpoint; set to None or change to your local checkpoint path
    language_model_from_pretrained='/root/.cache/nemo/models/lmsys/vicuna-7b-v1.5/',
)
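Because the recipe is a NeMo-Run configuration object, you can override its fields before launching. The sketch below assumes the standard attribute layout of NeMo 2.0 recipes (trainer settings and data batch sizes); verify the exact field names against your installed version.

# Override selected recipe fields before running (attribute names follow the
# common NeMo 2.0 recipe layout and should be checked against your version).
pretrain.trainer.max_steps = 1000
pretrain.trainer.val_check_interval = 500
pretrain.data.global_batch_size = 64
pretrain.data.micro_batch_size = 2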
NeMo 2.0 Fine-Tuning Recipes#
We also provide a fine-tuning recipe, llava_next_7b, that you can use.
from nemo.collections import vlm
finetune = vlm.llava_next_7b.finetune_recipe(
    name="llava_next_7b_finetune",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='none',  # 'lora', 'none'
)
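To use parameter-efficient fine-tuning instead of updating all weights, set peft_scheme to 'lora', one of the options noted in the comment above (the run name below is just an example):

# Same recipe, but with LoRA adapters enabled
finetune = vlm.llava_next_7b.finetune_recipe(
    name="llava_next_7b_finetune_lora",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='lora',
)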
Note
The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Note
The recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset; see the Use the Energon Dataloader section below for an example.
Once you have your final configuration ready, you can execute it using any of the NeMo-Run supported executors. The simplest option is the local executor, which runs the pretraining locally in a separate process. You can use it as follows:
import nemo_run as run
run.run(finetune, executor=run.LocalExecutor())
Additionally, you can also run it directly in the same Python process as follows:
run.run(finetune, direct=True)
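For multi-GPU runs, the local executor can launch one process per GPU through torchrun. A minimal sketch, assuming the ntasks_per_node and launcher arguments exposed by recent NeMo-Run releases; check the NeMo-Run documentation for your version:

import nemo_run as run

# Launch 8 processes (one per GPU) on the local node via torchrun
executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
run.run(finetune, executor=executor)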
Use the Energon Dataloader#
Below is an example of how to set up the Energon data module for LLaVA-Next training:
from nemo.collections.multimodal.data.energon.config import MultiModalSampleConfig
from nemo.collections.vlm import LlavaNextTaskEncoder
from nemo.collections.multimodal.data.energon import EnergonMultiModalDataModule
from transformers import AutoProcessor
# Load the processor from the pretrained LLaVA-Next model
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
# Paths and configuration
data_path = "<path_to_dataset>"
image_processor = processor.image_processor
tokenizer = processor.tokenizer
# Define multimodal sample configuration
multimodal_sample_config = MultiModalSampleConfig()
# Initialize the LLaVA-Next task encoder
task_encoder = LlavaNextTaskEncoder(
    tokenizer=tokenizer,
    image_processor=image_processor,
    multimodal_sample_config=multimodal_sample_config,
)

# Create the data module
data = EnergonMultiModalDataModule(
    path=data_path,
    tokenizer=tokenizer,
    image_processor=image_processor,
    num_workers=8,
    micro_batch_size=4,
    global_batch_size=32,
    multimodal_sample_config=multimodal_sample_config,
    task_encoder=task_encoder,
)
Replace the MockDataModule in the default recipes with the data module defined above.
from nemo.collections import vlm
# Define the fine-tuning recipe
finetune = vlm.llava_next_7b.finetune_recipe(
    name="llava_next_7b_finetune",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='none',  # 'lora', 'none'
)
# Assign the above data module to the finetuning recipe
finetune.data = data
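You can then launch the customized recipe with any of the NeMo-Run supported executors, exactly as shown earlier. For example, using the local executor:

import nemo_run as run

run.run(finetune, executor=run.LocalExecutor())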
We have also included additional example scripts to further customize LLaVA-Next training:
Pretraining: llava_next_pretrain.py
Finetuning: llava_next_finetune.py
Generation: llava_next_generation.py
NeMo Run: llava_next_nemo_run.py
These scripts allow for flexible and comprehensive training workflows tailored to your requirements.