Multi-Modal Context in NeMo Data Designer#
Data Designer supports multi-modal context, allowing you to incorporate images into your LLM-based column generation. This feature enables more sophisticated synthetic data generation pipelines by having vision-enabled LLMs analyze and respond to visual content alongside text prompts.
Overview#
Multi-modal context injection allows you to reference image data from columns in your dataset when generating content with LLM-based columns. This is particularly useful for workflows that combine text and visual information:
Generating descriptions and captions of images
Generating question-answer pairs from images such as charts and tables for enterprise document intelligence
Creating content based on visual analysis
Image Context Configuration#
To use multi-modal context, you need to configure `ImageContext` objects with the following parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `column_name` | `str` | Yes | — | The name of the column containing image data in your dataset |
| `data_type` | `ModalityDataType` | Yes | — | How the image is stored. Options: `URL`, `BASE64` |
| `image_format` | `ImageFormat` | No | `None` | The format of the image. Options include `PNG` |
| `modality` | `str` | No | `image` | The type of modality. Currently only `image` is supported |
Image Data Types#
When your images are stored as URLs in your dataset:
from nemo_microservices.beta.data_designer.config import params as P
image_context = P.ImageContext(
    column_name="image_urls",
    data_type=P.ModalityDataType.URL,
)
When your images are stored as base64-encoded strings in your dataset:
from nemo_microservices.beta.data_designer.config import params as P
image_context = P.ImageContext(
    column_name="image_data",
    data_type=P.ModalityDataType.BASE64,
    image_format=P.ImageFormat.PNG,
)
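For reference, a base64 string of this kind can be produced from a local image file in a couple of lines (the file path here is hypothetical):
import base64
# Read a local PNG and encode it as the base64 string the column should contain
with open("example.png", "rb") as f:  # hypothetical path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")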
Basic Example: Image Description Generation#
Here’s an example configuration that generates descriptions of images:
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P
# create the config_builder object
...
# Add a column with image URLs. Replace these with publicly accessible images of your choice.
config_builder.add_column(
    C.SamplerColumn(
        name="image_urls",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "https://example.com/image1.jpg",
                "https://example.com/image2.jpg",
                "https://example.com/image3.jpg",
            ]
        ),
    )
)
# Add LLM column that generates descriptions using image context
config_builder.add_column(
    C.LLMTextColumn(
        name="image_description",
        prompt="Describe this image in detail. Focus on the visual elements, colors, composition, and any objects or scenes you can identify.",
        model_alias="vision_model",
        multi_modal_context=[
            P.ImageContext(
                column_name="image_urls",
                data_type=P.ModalityDataType.URL,
            )
        ],
    )
)
# Generate a preview of the data
preview = data_designer_client.preview(config_builder)
preview.display_sample_record()
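The preview call generates a small sample for inspection. To generate a full dataset, recent Data Designer examples use a create call on the client; the method and argument names below follow those examples and are assumptions here, so verify them against your client version:
# Launch a full generation job (method names follow other Data Designer
# examples; verify against your client version)
job = data_designer_client.create(config_builder, num_records=100)
job.wait_until_done()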
Working with Base64 Images from Seed Datasets#
A more practical approach is to load images from a local directory, encode them as base64, and use them as a seed dataset. This allows you to work with your own image collections.
Loading Images from Directory#
Here’s how to create a seed dataset with base64-encoded images:
import base64
import io
import pandas as pd
from pathlib import Path
from PIL import Image
def create_image_dataset(image_directory: str, output_parquet: str = "image_dataset.parquet") -> pd.DataFrame:
    """Create a Parquet dataset from images in a directory, converting all images to PNG, and return it as a DataFrame."""
    image_dir = Path(image_directory)
    image_files = list(image_dir.glob("*.jpg")) + list(image_dir.glob("*.png")) + list(image_dir.glob("*.jpeg"))
    data = []
    for img_path in image_files:
        try:
            # Open the image with PIL and re-encode it as PNG
            with Image.open(img_path) as img:
                # Convert alpha/palette modes to RGB for consistent downstream handling
                if img.mode in ("RGBA", "LA", "P"):
                    img = img.convert("RGB")
                buffer = io.BytesIO()
                img.save(buffer, format="PNG")
                image_bytes = buffer.getvalue()
            base64_data = base64.b64encode(image_bytes).decode("utf-8")
            data.append({
                "image_filename": img_path.name,
                "image_path": str(img_path),
                "image_base64": base64_data,
                "image_format": "png",
            })
        except Exception as e:
            print(f"Error processing {img_path}: {e}")
    df = pd.DataFrame(data)
    df.to_parquet(output_parquet, index=False)
    return df
# Create the dataset
image_dataset = create_image_dataset("./images")
print(f"Created dataset with {len(image_dataset)} images")
print(image_dataset.head())
Using the Seed Dataset with Multi-Modal Context#
Now you can use the image dataset created above as a seed for Data Designer:
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P
# create the config builder object
...
# Load the seed dataset with base64 images
config_builder.with_seed_dataset(
    repo_id="sample/image-dataset",
    dataset_path="image_dataset.parquet",
    sampling_strategy="shuffle",
    with_replacement=True,
    datastore={"endpoint": "http://localhost:3000/v1/hf"},
)
# Add LLM column that generates descriptions using the base64 images
config_builder.add_column(
    C.LLMTextColumn(
        name="image_description",
        prompt="Analyze this image and provide a detailed description. Focus on the visual elements, colors, composition, and any objects or scenes you can identify.",
        model_alias="vision_model",
        multi_modal_context=[
            P.ImageContext(
                column_name="image_base64",
                data_type=P.ModalityDataType.BASE64,
                image_format=P.ImageFormat.PNG,
            )
        ],
    )
)
# Generate a preview of the data
preview = data_designer_client.preview(config_builder)
preview.display_sample_record()
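Note that the seed dataset must be uploaded to the datastore before it can be referenced. A minimal upload sketch, assuming the datastore exposes a Hugging Face-compatible API at the endpoint used above:
from huggingface_hub import HfApi
# Point the Hugging Face client at the datastore's HF-compatible endpoint
hf_api = HfApi(endpoint="http://localhost:3000/v1/hf")
# Create the dataset repo (no-op if it already exists) and upload the Parquet file
hf_api.create_repo(repo_id="sample/image-dataset", repo_type="dataset", exist_ok=True)
hf_api.upload_file(
    path_or_fileobj="image_dataset.parquet",
    path_in_repo="image_dataset.parquet",
    repo_id="sample/image-dataset",
    repo_type="dataset",
)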
Best Practices#
Model Selection#
Ensure you’re using a vision-capable model; text-only models cannot process image context. Common vision-capable models include:
mistralai/mistral-medium-3-instruct
meta/llama-3.2-90b-vision-instruct
meta/llama-4-maverick-17b-128e-instruct
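The examples above pass model_alias="vision_model", which assumes that alias was registered when the config builder was created. A sketch of such a registration, with class and field names following other Data Designer examples (they are assumptions here; verify against your client version):
from nemo_microservices.beta.data_designer import DataDesignerConfigBuilder
from nemo_microservices.beta.data_designer.config import params as P
# Register a vision-capable model under the "vision_model" alias used above
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias="vision_model",
            model="meta/llama-3.2-90b-vision-instruct",
        )
    ]
)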
Image Format Considerations#
For URL-based images, ensure the URLs are accessible from wherever Data Designer and the models are running.
Base64 data must be properly encoded and match the specified image format (see the validation sketch below).
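A quick validation sketch for the DataFrame built earlier (column name matches that example):
import base64
import io
from PIL import Image
def validate_base64_images(df, column="image_base64", expected_format="PNG"):
    """Decode each base64 string and let PIL confirm it is readable."""
    for b64 in df[column]:
        raw = base64.b64decode(b64, validate=True)  # raises on malformed base64
        with Image.open(io.BytesIO(raw)) as img:
            assert img.format == expected_format, f"expected {expected_format}, got {img.format}"
            img.verify()  # raises if the image data is corrupt
    print(f"Validated {len(df)} images")
validate_base64_images(image_dataset)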
Performance Considerations#
Vision models typically have higher latency than text-only models.
Consider the size, complexity, and clarity of the images in your dataset; downscaling large images before encoding reduces payload size and latency (see the sketch below).
Multiple images in a single context will increase processing time.
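A small resizing sketch using PIL that can be dropped into the dataset-creation loop above (the size cap is an illustrative choice):
from PIL import Image
MAX_DIM = 1024  # illustrative cap on the longest side
def downscale(img: Image.Image, max_dim: int = MAX_DIM) -> Image.Image:
    """Shrink img so its longest side is at most max_dim, preserving aspect ratio."""
    img.thumbnail((max_dim, max_dim))  # in-place; only ever shrinks
    return img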