Multi-Modal Context in NeMo Data Designer#
Data Designer supports multi-modal context, allowing you to incorporate images into your LLM-based column generation. This feature enables more sophisticated synthetic data generation pipelines by having vision-enabled LLMs analyze and respond to visual content alongside text prompts.
Overview#
Multi-modal context injection allows you to reference image data from columns in your dataset when generating content with LLM-based columns. This is particularly useful for workflows that combine text and visual information:
Generating descriptions and captions of images
Generating question-answer pairs from images such as charts and tables for enterprise document intelligence
Creating content based on visual analysis
Image Context Configuration#
To use multi-modal context, you need to configure `ImageContext` objects with the following parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `column_name` | `str` | Yes | — | The name of the column containing image data in your dataset |
| `data_type` | `ModalityDataType` | Yes | — | How the image is stored. Options: `URL`, `BASE64` |
| `image_format` | `ImageFormat` | No | `None` | The format of the image. Options include `PNG` |
| `modality` | `str` | No | `image` | The type of modality. Currently only `image` is supported |
Image Data Types#
When your images are stored as URLs in your dataset:
from nemo_microservices.beta.data_designer.config import params as P
image_context = P.ImageContext(
    column_name="image_urls",
    data_type=P.ModalityDataType.URL,
)
When your images are stored as base64-encoded strings in your dataset:
from nemo_microservices.beta.data_designer.config import params as P
image_context = P.ImageContext(
    column_name="image_data",
    data_type=P.ModalityDataType.BASE64,
    image_format=P.ImageFormat.PNG,
)
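For reference, a base64 string of this kind can be produced from a local image file in a couple of lines (the file path here is hypothetical):
import base64
# Read a local PNG and encode it as the base64 string the column should contain
with open("example.png", "rb") as f:  # hypothetical path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")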
Basic Example: Image Description Generation#
Here’s an example configuration that generates descriptions of images:
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P
# create the config_builder object
...
# Add a column with image URLs. Replace these with publicly accessible images of your choice.
config_builder.add_column(
    C.SamplerColumn(
        name="image_urls",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "https://example.com/image1.jpg",
                "https://example.com/image2.jpg",
                "https://example.com/image3.jpg",
            ]
        ),
    )
)
# Add LLM column that generates descriptions using image context
config_builder.add_column(
    C.LLMTextColumn(
        name="image_description",
        prompt="Describe this image in detail. Focus on the visual elements, colors, composition, and any objects or scenes you can identify.",
        model_alias="vision_model",
        multi_modal_context=[
            P.ImageContext(
                column_name="image_urls",
                data_type=P.ModalityDataType.URL,
            )
        ],
    )
)
# Generate a preview of the data
preview = data_designer_client.preview(config_builder)
preview.display_sample_record()
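The preview call generates a small sample for inspection. To generate a full dataset, recent Data Designer examples use a create call on the client; the method and argument names below follow those examples and are assumptions here, so verify them against your client version:
# Launch a full generation job (method names follow other Data Designer
# examples; verify against your client version)
job = data_designer_client.create(config_builder, num_records=100)
job.wait_until_done()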
Working with Base64 Images from Seed Datasets#
A more practical approach is to load images from a local directory, encode them as base64, and use them as a seed dataset. This allows you to work with your own image collections.
Loading Images from Directory#
Here’s how to create a seed dataset with base64-encoded images:
import base64
import io
import pandas as pd
from pathlib import Path
from PIL import Image
def create_image_dataset(image_directory: str, output_parquet: str = "image_dataset.parquet") -> pd.DataFrame:
    """Create a Parquet dataset from images in a directory, converting all images to PNG, and return it as a DataFrame."""
    image_dir = Path(image_directory)
    image_files = list(image_dir.glob("*.jpg")) + list(image_dir.glob("*.png")) + list(image_dir.glob("*.jpeg"))
    data = []
    for img_path in image_files:
        try:
            # Open the image with PIL and re-encode it as PNG
            with Image.open(img_path) as img:
                # Convert alpha/palette modes to RGB for consistent downstream handling
                if img.mode in ("RGBA", "LA", "P"):
                    img = img.convert("RGB")
                buffer = io.BytesIO()
                img.save(buffer, format="PNG")
                image_bytes = buffer.getvalue()
            base64_data = base64.b64encode(image_bytes).decode("utf-8")
            data.append({
                "image_filename": img_path.name,
                "image_path": str(img_path),
                "image_base64": base64_data,
                "image_format": "png",
            })
        except Exception as e:
            print(f"Error processing {img_path}: {e}")
    df = pd.DataFrame(data)
    df.to_parquet(output_parquet, index=False)
    return df
# Create the dataset
image_dataset = create_image_dataset("./images")
print(f"Created dataset with {len(image_dataset)} images")
print(image_dataset.head())
Using the Seed Dataset with Multi-Modal Context#
Now you can use the image dataset created above as a seed for Data Designer:
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P
# create the config builder object
...
# Load the seed dataset with base64 images
config_builder.with_seed_dataset(
    repo_id="sample/image-dataset",
    dataset_path="image_dataset.parquet",
    sampling_strategy="shuffle",
    with_replacement=True,
    datastore={"endpoint": "http://localhost:3000/v1/hf"},
)
# Add LLM column that generates descriptions using the base64 images
config_builder.add_column(
    C.LLMTextColumn(
        name="image_description",
        prompt="Analyze this image and provide a detailed description. Focus on the visual elements, colors, composition, and any objects or scenes you can identify.",
        model_alias="vision_model",
        multi_modal_context=[
            P.ImageContext(
                column_name="image_base64",
                data_type=P.ModalityDataType.BASE64,
                image_format=P.ImageFormat.PNG,
            )
        ],
    )
)
# Generate a preview of the data
preview = data_designer_client.preview(config_builder)
preview.display_sample_record()
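Note that the seed dataset must be uploaded to the datastore before it can be referenced. A minimal upload sketch, assuming the datastore exposes a Hugging Face-compatible API at the endpoint used above:
from huggingface_hub import HfApi
# Point the Hugging Face client at the datastore's HF-compatible endpoint
hf_api = HfApi(endpoint="http://localhost:3000/v1/hf")
# Create the dataset repo (no-op if it already exists) and upload the Parquet file
hf_api.create_repo(repo_id="sample/image-dataset", repo_type="dataset", exist_ok=True)
hf_api.upload_file(
    path_or_fileobj="image_dataset.parquet",
    path_in_repo="image_dataset.parquet",
    repo_id="sample/image-dataset",
    repo_type="dataset",
)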
Best Practices#
Model Selection#
Ensure you’re using a vision-capable model; text-only models cannot process image context. Common vision-capable models include:
mistralai/mistral-medium-3-instruct
meta/llama-3.2-90b-vision-instruct
meta/llama-4-maverick-17b-128e-instruct
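The examples above pass model_alias="vision_model", which assumes that alias was registered when the config builder was created. A sketch of such a registration, with class and field names following other Data Designer examples (they are assumptions here; verify against your client version):
from nemo_microservices.beta.data_designer import DataDesignerConfigBuilder
from nemo_microservices.beta.data_designer.config import params as P
# Register a vision-capable model under the "vision_model" alias used above
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias="vision_model",
            model="meta/llama-3.2-90b-vision-instruct",
        )
    ]
)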
Image Format Considerations#
For URL-based images, ensure the URLs are accessible from wherever Data Designer and the models are running.
Base64 data must be properly encoded and match the specified image format (see the validation sketch below).
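A quick validation sketch for the DataFrame built earlier (column name matches that example):
import base64
import io
from PIL import Image
def validate_base64_images(df, column="image_base64", expected_format="PNG"):
    """Decode each base64 string and let PIL confirm it is readable."""
    for b64 in df[column]:
        raw = base64.b64decode(b64, validate=True)  # raises on malformed base64
        with Image.open(io.BytesIO(raw)) as img:
            assert img.format == expected_format, f"expected {expected_format}, got {img.format}"
            img.verify()  # raises if the image data is corrupt
    print(f"Validated {len(df)} images")
validate_base64_images(image_dataset)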
Performance Considerations#
Vision models typically have higher latency than text-only models.
Consider the size, complexity, and clarity of the images in your dataset; downscaling large images before encoding reduces payload size and latency (see the sketch below).
Multiple images in a single context will increase processing time.
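A small resizing sketch using PIL that can be dropped into the dataset-creation loop above (the size cap is an illustrative choice):
from PIL import Image
MAX_DIM = 1024  # illustrative cap on the longest side
def downscale(img: Image.Image, max_dim: int = MAX_DIM) -> Image.Image:
    """Shrink img so its longest side is at most max_dim, preserving aspect ratio."""
    img.thumbnail((max_dim, max_dim))  # in-place; only ever shrinks
    return img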