
Vision Language Models with NeMo AutoModel#

Introduction#

Vision Language Models (VLMs) combine vision and language processing. They are trained on large datasets of image-text pairs and can be used to generate text descriptions of images or to answer questions about images.

NeMo AutoModel LLM APIs can be easily extended to support VLM tasks. While most of the training setup is the same, some additional steps are required to prepare the data and model for VLM training.

In this guide, we walk through the data preparation steps for two datasets and provide a table of scripts and configurations that have been tested with NeMo AutoModel. The code for both datasets is available in the NeMo repository.
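
The collate functions shown below receive a Hugging Face processor as an argument. As a minimal sketch (the checkpoint is only an example; any of the tested models listed later in this guide can be used), the processor can be loaded with the transformers AutoProcessor API:

from transformers import AutoProcessor

# Example checkpoint; any of the tested models listed below can be used instead.
model_id = "Qwen/Qwen2-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)

# The underlying tokenizer is available as processor.tokenizer, e.g., to look up
# the token ids (such as padding or image placeholder tokens) that the collate
# functions below mask out of the loss via `skipped_tokens`.
print(processor.tokenizer.pad_token_id)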

rdr-items dataset#

The rdr-items dataset is a small dataset of 48 images with text descriptions. To make sure the data is in the correct format, we apply a collate function that pairs each image with a user instruction to describe it.

import torch


def collate_fn(examples, processor):
    def fmt(sample):
        instruction = "Describe accurately the given image."
        conversation = [
            {
                "role": "user",
                "content": [{"type": "text", "text": instruction}, {"type": "image", "image": sample["image"]}],
            },
            {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
        ]
        return {"conversation": conversation, "images": [sample["image"].convert("RGB")]}

    text = []
    images = []
    for example in map(fmt, examples):
        text.append(
            processor.apply_chat_template(example["conversation"], tokenize=False, add_generation_prompt=False)
        )
        images += example["images"]

    # Tokenize the text and process the images
    batch = processor(text=text, images=images, padding=True, return_tensors="pt")

    # Cast pixel values to bf16 to match the training precision
    batch["pixel_values"] = batch["pixel_values"].to(torch.bfloat16)

    # Mask tokens listed in `skipped_tokens` (defined elsewhere in the recipe, e.g.,
    # padding and image placeholder token ids) so they are ignored by the loss
    labels = batch["input_ids"].clone()
    labels[torch.isin(labels, skipped_tokens)] = -100
    batch["labels"] = labels
    return batch

This code block ensures that the images are processed correctly and that the chat template is applied before the text is tokenized.
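
To use this collate function for training, bind it to the processor and pass it to a standard PyTorch DataLoader. The sketch below is illustrative rather than the exact NeMo recipe: the Hugging Face dataset id "quintend/rdr-items" and the batch size are assumptions you may need to adjust.

from functools import partial

from datasets import load_dataset
from torch.utils.data import DataLoader

# Assumed dataset id for the rdr-items dataset; substitute the one your recipe uses.
dataset = load_dataset("quintend/rdr-items", split="train")

# Bind the processor so the DataLoader can call collate_fn(list_of_examples) directly.
train_loader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=partial(collate_fn, processor=processor),
)

batch = next(iter(train_loader))
print(batch["input_ids"].shape, batch["pixel_values"].dtype)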

cord-v2 dataset#

The cord-v2 dataset contains images of receipts paired with annotations in JSON format. To ensure the data is in the correct format, we apply the following collate function, which converts the JSON annotations into a text sequence with special tokens. While we do not add these special tokens to the tokenizer, it is possible to do so. In addition to converting JSON to text, we also apply an appropriate chat template to the text.

import json
import random

import torch


def json2token(obj, sort_json_key: bool = True):
    """
    Convert an ordered JSON object into a token sequence.
    """
    if type(obj) == dict:
        if len(obj) == 1 and "text_sequence" in obj:
            return obj["text_sequence"]
        else:
            output = ""
            if sort_json_key:
                keys = sorted(obj.keys(), reverse=True)
            else:
                keys = obj.keys()
            for k in keys:
                # Wrap each value in <s_{key}> ... </s_{key}> special tokens
                output += (
                    fr"<s_{k}>"
                    + json2token(obj[k], sort_json_key)
                    + fr"</s_{k}>"
                )
            return output
    elif type(obj) == list:
        return r"<sep/>".join(
            [json2token(item, sort_json_key) for item in obj]
        )
    else:
        return str(obj)


def train_collate_fn(examples, processor):
    processed_examples = []
    for example in examples:
        ground_truth = json.loads(example["ground_truth"])
        if "gt_parses" in ground_truth:  # when multiple ground truths are available, e.g., docvqa
            assert isinstance(ground_truth["gt_parses"], list)
            gt_jsons = ground_truth["gt_parses"]
        else:
            assert "gt_parse" in ground_truth and isinstance(ground_truth["gt_parse"], dict)
            gt_jsons = [ground_truth["gt_parse"]]

        # Pick one ground-truth parse at random and flatten it into a token sequence
        text = random.choice([json2token(gt_json, sort_json_key=True) for gt_json in gt_jsons])
        processed_examples.append((example["image"], text))

    images = []
    texts = []
    for image, ground_truth in processed_examples:
        images.append(image)

        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": "Extract JSON"},
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": ground_truth},
                ],
            },
        ]
        texts.append(processor.apply_chat_template(conversation))

    batch = processor(text=texts, images=images, padding=True, truncation=True,
                      return_tensors="pt")

    # Mask tokens listed in `skipped_tokens` (defined elsewhere in the recipe) so they
    # are ignored by the loss
    labels = batch["input_ids"].clone()
    labels[torch.isin(labels, skipped_tokens)] = -100
    batch["labels"] = labels
    return batch
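
To make the JSON flattening concrete, the snippet below (with made-up receipt fields) shows the token sequence that json2token produces; train_collate_fn can then be wired into a DataLoader in the same way as the rdr-items collate function above.

sample = {"menu": [{"nm": "Latte", "price": "4.50"}, {"nm": "Bagel", "price": "2.00"}]}
print(json2token(sample))
# <s_menu><s_price>4.50</s_price><s_nm>Latte</s_nm><sep/><s_price>2.00</s_price><s_nm>Bagel</s_nm></s_menu>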

Models#

While most VLMs from Hugging Face are compatible with NeMo AutoModel, we have specifically tested the following models for convergence with the datasets mentioned above. You can find the script for running these models in the NeMo repository.

Supported Models#

| Model | Dataset | FSDP2 | 4-bit model | PEFT |
|---|---|---|---|---|
| Gemma 3-4B & 27B | naver-clova-ix & rdr-items | Supported | Supported | Supported |
| Qwen2-VL-2B-Instruct & Qwen2.5-VL-3B-Instruct | cord-v2 | Supported | Supported | Supported |
| llava-v1.6 | cord-v2 & naver-clova-ix | Supported | Supported | Supported |
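
Here, "4-bit model" refers to loading the base model with 4-bit quantized weights and "PEFT" to parameter-efficient fine-tuning such as LoRA. The sketch below illustrates that combination using plain Hugging Face primitives (bitsandbytes quantization plus a peft LoRA adapter); it is not the NeMo AutoModel recipe itself, and the LoRA hyperparameters are arbitrary.

import torch
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# Load the base VLM with 4-bit (NF4) quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach a LoRA adapter so only a small fraction of the parameters is trained.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()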

Additional Notes#

To run vision language models with NeMo AutoModel, please use NeMo container version 25.02.rc5 or later. Additionally, you might have to install the latest version of the transformers library using the following command:

pip install git+https://github.com/huggingface/transformers.git