Providing Images as Context
🎨 Data Designer Tutorial: Providing Images as Context for Vision-Based Data Generation
📚 What you'll learn
This notebook demonstrates how to provide images as context to generate text descriptions using vision-language models.
The same multi_modal_context field can also carry audio or video context when the selected model supports those modalities.
- ✨ Visual Document Processing: Converting images to chat-ready format for model consumption
- 🔍 Vision-Language Generation: Using vision models to generate detailed summaries from images
- 🧩 Media Context Pattern: Understanding how ImageContext, AudioContext, and VideoContext fit into the same configuration field
If this is your first time using Data Designer, we recommend starting with the first notebook in this tutorial series.
📦 Import Data Designer
- data_designer.config provides access to the configuration API.
- DataDesigner is the main interface for data generation.
⚙️ Initialize the Data Designer interface
- DataDesigner is the main object responsible for managing the data generation process.
- When initialized without arguments, the default model providers are used.
🏗️ Initialize the Data Designer Config Builder
- The Data Designer config defines the dataset schema and generation process.
- The config builder provides an intuitive interface for building this configuration.
- When initialized without arguments, the default model configurations are used.
🌱 Seed Dataset Creation
In this section, we'll prepare our visual documents as a seed dataset for summarization:
- Loading Visual Documents: We use a small pets image dataset containing labeled images
- Image Processing: Convert images to base64 format for vision model consumption
- Metadata Extraction: Preserve relevant image information (label, etc.)
The seed dataset will be used to generate detailed text descriptions of each image.
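The preparation steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual cell: `make_solid_png` and `to_seed_record` are hypothetical helpers, and the tiny generated PNG stands in for a real pet photo. The output keys mirror the label, base64_image, and uuid columns of the seed dataset.

```python
import base64
import struct
import uuid
import zlib


def make_solid_png(width: int, height: int, rgb: tuple[int, int, int]) -> bytes:
    """Build a tiny single-color PNG (a stand-in for a real image file)."""

    def chunk(tag: bytes, data: bytes) -> bytes:
        # Each PNG chunk is: length, tag, data, CRC32 over tag + data.
        crc = zlib.crc32(tag + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)

    # 8-bit RGB, no interlacing; every scanline starts with filter byte 0.
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    scanlines = b"".join(b"\x00" + bytes(rgb) * width for _ in range(height))
    return (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", header)
        + chunk(b"IDAT", zlib.compress(scanlines))
        + chunk(b"IEND", b"")
    )


def to_seed_record(png_bytes: bytes, label: int) -> dict:
    """Encode image bytes as raw base64 and attach a unique row id."""
    return {
        "label": label,
        "base64_image": base64.b64encode(png_bytes).decode("utf-8"),
        "uuid": str(uuid.uuid4()),
    }


record = to_seed_record(make_solid_png(4, 4, (255, 128, 0)), label=0)
print(record["base64_image"][:11])
```

Any base64-encoded PNG begins with the same characters because the PNG file signature is fixed, which gives a quick sanity check on the encoded column.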
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
📥 Loading and processing images...
✅ Loaded 512 images with columns: ['image', 'label', 'base64_image', 'uuid']
| | image | label | base64_image | uuid |
|---|---|---|---|---|
| 0 | <PIL.JpegImagePlugin.JpegImageFile image mode=... | 0 | iVBORw0KGgoAAAANSUhEUgAAAeQAAAIACAIAAADc8YinAA... | 87f84627-9911-4344-9e18-07d39c8f36d1 |
| 1 | <PIL.JpegImagePlugin.JpegImageFile image mode=... | 0 | iVBORw0KGgoAAAANSUhEUgAAAiQAAAIACAIAAAA9rOAHAA... | c8ae8bad-9f5b-40fc-b292-3662b5a9d742 |
| 2 | <PIL.JpegImagePlugin.JpegImageFile image mode=... | 0 | iVBORw0KGgoAAAANSUhEUgAAAqoAAAIACAIAAADFYNm1AA... | 85eb94d7-a00d-436a-a75c-f3a867b6c64c |
| 3 | <PIL.JpegImagePlugin.JpegImageFile image mode=... | 0 | iVBORw0KGgoAAAANSUhEUgAAAwAAAAIACAIAAAC6lJxtAA... | a4095d29-0b51-4c1c-a3be-003f66f3dc1b |
| 4 | <PIL.PngImagePlugin.PngImageFile image mode=RG... | 0 | iVBORw0KGgoAAAANSUhEUgAAAqoAAAIACAIAAADFYNm1AA... | 7db77bef-1bca-4d57-babd-6426ff5632af |
DataDesignerConfigBuilder( seed_dataset: df seed )
🧩 Media context and model capabilities
multi_modal_context accepts media context descriptors such as ImageContext, AudioContext, and VideoContext. Data Designer reads the referenced seed columns and serializes them for the model request, but the selected model still determines which modalities are valid.
This notebook uses image context only because image-capable VLMs are broadly available. Before combining image, audio, and video in one column, choose a model alias backed by an omni or otherwise modality-compatible model, and check that the provider accepts every context type you send.
For base64 seed columns, store the raw base64 payload without a data:<media-type>;base64, prefix and specify the media format on the context object:
media_context = [
    dd.ImageContext(
        column_name="image_base64",
        data_type=dd.ModalityDataType.BASE64,
        image_format=dd.ImageFormat.PNG,
    ),
    dd.AudioContext(
        column_name="audio_base64",
        data_type=dd.ModalityDataType.BASE64,
        audio_format=dd.AudioFormat.MP3,
    ),
    dd.VideoContext(
        column_name="video_base64",
        data_type=dd.ModalityDataType.BASE64,
        video_format=dd.VideoFormat.MP4,
    ),
]
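Seed data exported from a browser or web API often arrives as data URIs rather than raw payloads. Here is a minimal sketch of normalizing such values before they go into a base64 seed column, assuming a hypothetical helper named `to_raw_base64` (it is not part of Data Designer):

```python
import base64
import binascii


def to_raw_base64(value: str) -> str:
    """Return the bare base64 payload, dropping any data-URI prefix."""
    if value.startswith("data:"):
        _header, sep, payload = value.partition(";base64,")
        if not sep:
            raise ValueError("data URI is not base64-encoded")
        value = payload
    value = value.strip()
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"not a valid base64 payload: {exc}") from exc
    return value


print(to_raw_base64("data:image/png;base64,iVBORw0KGgo="))
```

Values that are already raw base64 pass through unchanged, so the helper can be applied to a whole column without first checking which form each row uses.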
URL-backed media can use data_type=dd.ModalityDataType.URL, subject to the provider's URL support and file-size limits. Local audio/video paths require explicit URL mode and require the model endpoint to have filesystem access to the same paths, typically a colocated vLLM server configured for local media access.
[13:25:53] [INFO] ✅ Validation passed
🔁 Iteration is key – preview the dataset!
- Use the preview method to generate a sample of records quickly.
- Inspect the results for quality and format issues.
- Adjust column configurations, prompts, or parameters as needed.
- Re-run the preview until satisfied.
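One way to make the inspection step repeatable is a small format check over the generated column. The sketch below is an illustration rather than a Data Designer feature: `find_format_issues` is a hypothetical helper, the sample rows are invented for the demo, and the rule (a description should open with a Markdown heading) is only one plausible expectation for this kind of prompt.

```python
def find_format_issues(records: list[dict], column: str = "description") -> list[tuple[int, str]]:
    """Return (row index, reason) pairs for rows that break simple format rules."""
    issues = []
    for index, record in enumerate(records):
        text = (record.get(column) or "").strip()
        if not text:
            issues.append((index, "empty description"))
        elif not text.startswith("#"):
            # A conversational preamble usually means the prompt needs tightening.
            issues.append((index, "does not start with a Markdown heading"))
    return issues


# Invented sample rows standing in for preview output.
sample = [
    {"description": "# Cat on Wooden Floor\n\n## Main Subject\nA cat sits on the floor."},
    {"description": "Sure! Here is what I can see in the picture."},
    {"description": ""},
]

for index, reason in find_format_issues(sample):
    print(f"row {index}: {reason}")
```

If a check like this flags rows, tightening the prompt and re-running the preview is usually cheaper than cleaning up after a full generation job.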
[13:25:53] [INFO] 👁️ Preview generation in progress
[13:25:53] [INFO] |-- 🔒 Jinja rendering engine: secure
[13:25:53] [INFO] ✅ Validation passed
[13:25:53] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[13:25:53] [INFO] 🩺 Running health checks for models...
[13:25:53] [INFO] |-- 👀 Checking 'nvidia/nemotron-3-nano-omni-30b-a3b-reasoning' in provider named 'nvidia' for model alias 'nvidia-vision'...
[13:25:55] [INFO] |-- ✅ Passed!
[13:25:55] [INFO] ⚡ DATA_DESIGNER_ASYNC_ENGINE is enabled - using async task-queue preview
[13:25:55] [INFO] 📝 llm-text model config for column 'description'
[13:25:55] [INFO] |-- model: 'nvidia/nemotron-3-nano-omni-30b-a3b-reasoning'
[13:25:55] [INFO] |-- model alias: 'nvidia-vision'
[13:25:55] [INFO] |-- model provider: 'nvidia'
[13:25:55] [INFO] |-- inference parameters:
[13:25:55] [INFO] | |-- generation_type=chat-completion
[13:25:55] [INFO] | |-- max_parallel_requests=4
[13:25:55] [INFO] | |-- temperature=0.60
[13:25:55] [INFO] | |-- top_p=0.95
[13:25:55] [INFO] ⚡️ Async generation: 1 column(s) (description), 2 tasks across 1 row group(s)
[13:25:55] [INFO] 🚀 (1/1) Dispatching with 2 records
[13:25:55] [INFO] 🌱 (1/1) Sampling 2 records from seed dataset
[13:25:55] [INFO] |-- seed dataset size: 512 records
[13:25:55] [INFO] |-- sampling strategy: ordered
[13:26:00] [INFO] 📊 Progress [4.8s]:
[13:26:00] [INFO] |-- 🤩 description: 2/2 (100%) 0.4 rec/s
[13:26:00] [INFO] ✅ Async generation complete [4.8s]: 2 ok, 0 failed across 1 column(s)
[13:26:00] [INFO] 📊 Model usage summary:
[13:26:00] [INFO] |-- model: nvidia/nemotron-3-nano-omni-30b-a3b-reasoning
[13:26:00] [INFO] |-- tokens: input=658, output=1985, reasoning=1198 (estimated), total=2643, tps=545
[13:26:00] [INFO] |-- reasoning token count estimated with tiktoken
[13:26:00] [INFO] |-- requests: success=2, failed=0, total=2, rpm=24
[13:26:00] [INFO] 📐 Measuring dataset column statistics:
[13:26:00] [INFO] |-- 📝 column: 'description'
[13:26:00] [INFO] 🥳 Preview complete!
Seed Columns

| Name | Value |
|---|---|
| uuid | 87f84627-9911-4344-9e18-07d39c8f36d1 |
| label | 0 |
| base64_image | iVBORw0KGgoAAAANSUhEUgAAAeQAAAIACAIAAADc8YinAAEAAElEQVR4nOy9V5ckuZEmamZwEREpSna1YAv28JLDHT… |

Generated Columns

description:

    # Close-up Portrait of a Black and White Cat

    ## Main Subject
    The image features a close-up shot of a domestic cat, likely a tuxedo or bicolor breed. The cat is positioned centrally, filling most of the frame from the chest up. It is looking slightly upward and forward with an attentive, wide-eyed expression.

    * **Eyes:** The cat has large, round eyes that are a striking yellow-green color with vertical black pupils. The gaze is intense and focused.
    * **Fur Pattern:** The fur is distinctly two-toned. The ears, the sides of the head, and patches around the eyes are black. A broad stripe of white fur runs down the center of the forehead, between the eyes, and covers the nose, mouth area, and chest.
    * **Face:** The nose is small, black, and triangular. The mouth is closed in a neutral, slightly downturned line, giving the cat a somewhat serious or curious look. Long, thin white whiskers extend outward from the muzzle on both sides.
    * **Ears:** The ears are pointed and upright, indicating alertness. The inside of the ears shows some lighter fur mixed with black.

    ## Background
    The background is simple and out of focus, which helps emphasize the cat as the main subject.
    * **Left/Top:** A plain, light-colored wall (appearing off-white or very light grey).
    * **Right:** A vertical section of a light brown, possibly wooden surface, likely a door frame or furniture edge.

    ## Colors and Lighting
    * **Color Palette:** The dominant colors are black, white, and the yellow-green of the eyes. The background introduces neutral tones of white/grey and tan/brown.
    * **Lighting:** The lighting appears to be soft and diffuse, coming from the front. It illuminates the cat's face evenly without creating harsh shadows, highlighting the texture of the fur and the shine in the eyes.
| | uuid | label | base64_image | description |
|---|---|---|---|---|
| 0 | 87f84627-9911-4344-9e18-07d39c8f36d1 | 0 | iVBORw0KGgoAAAANSUhEUgAAAeQAAAIACAIAAADc8YinAA... | # Close-up Portrait of a Black and White Cat\n... |
| 1 | c8ae8bad-9f5b-40fc-b292-3662b5a9d742 | 0 | iVBORw0KGgoAAAANSUhEUgAAAiQAAAIACAIAAAA9rOAHAA... | # Detailed Description of the Image\n\n**Main ... |
📊 Analyze the generated data
- Data Designer automatically generates a basic statistical analysis of the generated data.
- This analysis is available via the analysis property of generation result objects.
🎨 Data Designer Dataset Profile

Dataset Overview

| number of records | number of columns | percent complete records |
|---|---|---|
| 2 | 1 | 100.0% |

📝 LLM-Text Columns

| column name | data type | number unique values | prompt tokens per record | completion tokens per record |
|---|---|---|---|---|
| description | string | 2 (100.0%) | 29.0 +/- 0.0 | 383.0 +/- 25.5 |

Table Notes

1. All token statistics are based on a sample of max(1000, len(dataset)) records.
2. Tokens are calculated using tiktoken's cl100k_base tokenizer.
🔎 Visual Inspection
Let's compare the original image with the generated description to validate quality:
📄 Original Image:
📝 Generated Description:
Image Description

    # Close-up Portrait of a Black and White Cat

    ## Main Subject
    The image features a close-up shot of a domestic cat, likely a tuxedo or bicolor breed. The cat is positioned centrally, filling most of the frame from the chest up. It is looking slightly upward and forward with an attentive, wide-eyed expression.

    * **Eyes:** The cat has large, round eyes that are a striking yellow-green color with vertical black pupils. The gaze is intense and focused.
    * **Fur Pattern:** The fur is distinctly two-toned. The ears, the sides of the head, and patches around the eyes are black. A broad stripe of white fur runs down the center of the forehead, between the eyes, and covers the nose, mouth area, and chest.
    * **Face:** The nose is small, black, and triangular. The mouth is closed in a neutral, slightly downturned line, giving the cat a somewhat serious or curious look. Long, thin white whiskers extend outward from the muzzle on both sides.
    * **Ears:** The ears are pointed and upright, indicating alertness. The inside of the ears shows some lighter fur mixed with black.

    ## Background
    The background is simple and out of focus, which helps emphasize the cat as the main subject.
    * **Left/Top:** A plain, light-colored wall (appearing off-white or very light grey).
    * **Right:** A vertical section of a light brown, possibly wooden surface, likely a door frame or furniture edge.

    ## Colors and Lighting
    * **Color Palette:** The dominant colors are black, white, and the yellow-green of the eyes. The background introduces neutral tones of white/grey and tan/brown.
    * **Lighting:** The lighting appears to be soft and diffuse, coming from the front. It illuminates the cat's face evenly without creating harsh shadows, highlighting the texture of the fur and the shine in the eyes.
🆙 Scale up!
- Happy with your preview data?
- Use the create method to submit larger Data Designer generation jobs.
[13:26:00] [INFO] 🎨 Creating Data Designer dataset
[13:26:00] [INFO] |-- 🔒 Jinja rendering engine: secure
[13:26:00] [INFO] ✅ Validation passed
[13:26:00] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[13:26:00] [INFO] 🩺 Running health checks for models...
[13:26:00] [INFO] |-- 👀 Checking 'nvidia/nemotron-3-nano-omni-30b-a3b-reasoning' in provider named 'nvidia' for model alias 'nvidia-vision'...
[13:26:00] [INFO] |-- ✅ Passed!
[13:26:00] [INFO] ⚡ DATA_DESIGNER_ASYNC_ENGINE is enabled - using async task-queue builder
[13:26:00] [INFO] 📝 llm-text model config for column 'description'
[13:26:00] [INFO] |-- model: 'nvidia/nemotron-3-nano-omni-30b-a3b-reasoning'
[13:26:00] [INFO] |-- model alias: 'nvidia-vision'
[13:26:00] [INFO] |-- model provider: 'nvidia'
[13:26:00] [INFO] |-- inference parameters:
[13:26:00] [INFO] | |-- generation_type=chat-completion
[13:26:00] [INFO] | |-- max_parallel_requests=4
[13:26:00] [INFO] | |-- temperature=0.60
[13:26:00] [INFO] | |-- top_p=0.95
[13:26:00] [INFO] ⚡️ Async generation: 1 column(s) (description), 10 tasks across 1 row group(s)
[13:26:00] [INFO] 🚀 (1/1) Dispatching with 10 records
[13:26:00] [INFO] 🌱 (1/1) Sampling 10 records from seed dataset
[13:26:00] [INFO] |-- seed dataset size: 512 records
[13:26:00] [INFO] |-- sampling strategy: ordered
[13:26:05] [INFO] 📊 Progress [5.1s]:
[13:26:05] [INFO] |-- 🌦️ description: 3/10 (30%) 0.6 rec/s
[13:26:11] [INFO] 📊 Progress [10.5s]:
[13:26:11] [INFO] |-- ⛅ description: 6/10 (60%) 0.6 rec/s
[13:26:16] [INFO] 📊 Progress [15.5s]:
[13:26:16] [INFO] |-- ☀️ description: 10/10 (100%) 0.6 rec/s
[13:26:16] [INFO] ✅ Async generation complete [15.6s]: 10 ok, 0 failed across 1 column(s)
[13:26:16] [INFO] 📊 Model usage summary:
[13:26:16] [INFO] |-- model: nvidia/nemotron-3-nano-omni-30b-a3b-reasoning
[13:26:16] [INFO] |-- tokens: input=3720, output=9214, reasoning=5820 (estimated), total=12934, tps=822
[13:26:16] [INFO] |-- reasoning token count estimated with tiktoken
[13:26:16] [INFO] |-- requests: success=10, failed=0, total=10, rpm=38
[13:26:16] [INFO] 📐 Measuring dataset column statistics:
[13:26:16] [INFO] |-- 📝 column: 'description'
| | uuid | label | base64_image | description |
|---|---|---|---|---|
| 0 | 87f84627-9911-4344-9e18-07d39c8f36d1 | 0 | iVBORw0KGgoAAAANSUhEUgAAAeQAAAIACAIAAADc8YinAA... | # Detailed Description ## Main Subject The im... |
| 1 | c8ae8bad-9f5b-40fc-b292-3662b5a9d742 | 0 | iVBORw0KGgoAAAANSUhEUgAAAiQAAAIACAIAAAA9rOAHAA... | # Detailed Description ## Main Subject The pr... |
| 2 | 85eb94d7-a00d-436a-a75c-f3a867b6c64c | 0 | iVBORw0KGgoAAAANSUhEUgAAAqoAAAIACAIAAADFYNm1AA... | # Cat on Wooden Floor ## Main Subject The pri... |
| 3 | a4095d29-0b51-4c1c-a3be-003f66f3dc1b | 0 | iVBORw0KGgoAAAANSUhEUgAAAwAAAAIACAIAAAC6lJxtAA... | # Cat in a Green Container ## Main Subject Th... |
| 4 | 7db77bef-1bca-4d57-babd-6426ff5632af | 0 | iVBORw0KGgoAAAANSUhEUgAAAqoAAAIACAIAAADFYNm1AA... | Based on the image provided, here is a detaile... |
🎨 Data Designer Dataset Profile

Dataset Overview

| number of records | number of columns | percent complete records |
|---|---|---|
| 10 | 1 | 100.0% |

📝 LLM-Text Columns

| column name | data type | number unique values | prompt tokens per record | completion tokens per record |
|---|---|---|---|---|
| description | string | 10 (100.0%) | 29.0 +/- 0.0 | 306.5 +/- 61.2 |

Table Notes

1. All token statistics are based on a sample of max(1000, len(dataset)) records.
2. Tokens are calculated using tiktoken's cl100k_base tokenizer.
⏭️ Next Steps
Now that you've learned how to use visual context for image summarization in Data Designer, explore more:
- Experiment with different vision models for specific image types
- Try different prompt variations to generate specialized descriptions (e.g., technical details, key findings)
- Combine image, audio, or video context with other column types after confirming your selected model supports those modalities
- Apply this pattern to other vision tasks like image captioning, OCR validation, or visual question answering
- Generating images with Data Designer