
# Columns

Columns are the fundamental building blocks in Data Designer. Each column represents a field in your dataset and defines how to generate it—whether that's sampling from a distribution, calling an LLM, or applying a transformation.

**The Declarative Approach:** Columns are **declarative specifications**. You describe *what* you want, and the framework handles *how* to generate it, managing execution order, batching, parallelization, and resources automatically.

## Column Types

Data Designer provides eleven built-in column types, each optimized for different generation scenarios.

### 🎲 Sampler Columns

Sampler columns generate data using numerical sampling—fast, deterministic, and ideal for numerical and categorical dataset fields. They're significantly faster than LLMs and can produce data following specific distributions (Poisson for event counts, Gaussian for measurements, etc.).

Available sampler types:

* **UUID**: Unique identifiers
* **Category**: Categorical values with optional probability weights
* **Subcategory**: Hierarchical categorical data (states within countries, models within brands)
* **Uniform**: Evenly distributed numbers (integers or floats)
* **Gaussian**: Normally distributed values with configurable mean and standard deviation
* **Bernoulli**: Binary outcomes with specified success probability
* **Bernoulli Mixture**: Binary outcomes from multiple probability components
* **Binomial**: Count of successes in repeated trials
* **Poisson**: Count data and event frequencies
* **Scipy**: Access to the full scipy.stats distribution library
* **Person**: Realistic synthetic individuals with names, demographics, and attributes
* **Datetime**: Timestamps within specified ranges
* **Timedelta**: Time duration values
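
As a rough illustration of what a few of these samplers produce, the sketch below mimics Category, Gaussian, Uniform, and Bernoulli sampling with Python's standard `random` module. It does not use Data Designer's sampler configuration classes; the column names, weights, and parameter values are invented for the example.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible


def sample_category(values, weights=None):
    """Pick one categorical value, optionally using probability weights."""
    return rng.choices(values, weights=weights, k=1)[0]


def sample_bernoulli(p):
    """Return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def sample_row():
    # Hypothetical columns, one per sampler style.
    return {
        "plan": sample_category(["free", "pro", "enterprise"], weights=[0.7, 0.2, 0.1]),
        "latency_ms": rng.gauss(120.0, 15.0),  # Gaussian: mean 120, std dev 15
        "discount": rng.uniform(0.0, 0.3),     # Uniform float between 0.0 and 0.3
        "churned": sample_bernoulli(0.05),     # Bernoulli: 5% success probability
    }


rows = [sample_row() for _ in range(5)]
```

No model call is involved at any point, which is why this style of generation is cheap compared with LLM columns.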

**Conditional Sampling:** Samplers support **conditional parameters** that change behavior based on other columns. Want age distributions that vary by country? Income ranges that depend on occupation? Just define conditions on existing column values.
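
The idea behind conditional parameters can be sketched in plain Python: the parameters of one sampler are looked up from the value of an already generated column. This is a conceptual stand-in, not Data Designer's configuration syntax; the country codes and the (mean, standard deviation) pairs are made-up illustrative numbers.

```python
import random

rng = random.Random(0)

# Hypothetical Gaussian parameters (mean, std dev) keyed by country code.
AGE_PARAMS = {"JP": (48.0, 12.0), "NG": (24.0, 8.0)}
DEFAULT_AGE_PARAMS = (38.0, 10.0)  # used when no condition matches


def sample_age(row):
    """Sample an age whose distribution depends on the row's country."""
    mean, std = AGE_PARAMS.get(row["country"], DEFAULT_AGE_PARAMS)
    age = rng.gauss(mean, std)
    return max(18, min(90, round(age)))  # clamp to a plausible adult range


rows = [{"country": rng.choice(["JP", "NG", "BR"])} for _ in range(6)]
for row in rows:
    row["age"] = sample_age(row)  # "country" must exist before "age" is sampled
```

Note the ordering constraint: the conditioning column has to be generated first, which is the dependency the framework tracks for you.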

### 📝 LLM-Text Columns

LLM-Text columns generate natural language text: product descriptions, customer reviews, narrative summaries, email threads, or anything requiring semantic understanding and creativity.

Use **Jinja2 templating** in prompts to reference other columns. Data Designer automatically manages dependencies and injects the referenced column values into the prompt.
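
To make the injection step concrete, here is a minimal stand-in that fills `{{ column }}` placeholders from a row using a regular expression. Data Designer uses Jinja2 for this; the regex, the `render_prompt` helper, and the sample row are assumptions made for illustration, and the stand-in supports only bare variable names (no filters or expressions).

```python
import re

# Matches {{ name }} with optional whitespace around a bare identifier.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, row):
    """Replace each {{ name }} placeholder with the row's value for that column."""

    def replace(match):
        # Raises KeyError if the template references a column the row lacks.
        return str(row[match.group(1)])

    return PLACEHOLDER.sub(replace, template)


row = {"product": "trail running shoes", "rating": 4}
prompt = render_prompt("Write a {{ rating }}-star review of {{ product }}.", row)
```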

**Generation Traces:** LLM columns can optionally capture message traces in a separate `{column_name}__trace` column. Set `with_trace` on the column config to control what's captured: `TraceType.NONE` (default, no trace), `TraceType.LAST_MESSAGE` (final assistant message only), or `TraceType.ALL_MESSAGES` (full conversation history). The trace includes the ordered message history for the final generation attempt (system/user/assistant/tool calls/tool results), and may include model reasoning fields when the provider exposes them.

**Extracting Reasoning Content:** Some models expose chain-of-thought reasoning separately from the main response via a `reasoning_content` field. To capture only this reasoning (without the full trace), set `extract_reasoning_content=True`:

```python
dd.LLMTextColumnConfig(
    name="answer",
    model_alias="reasoning-model",
    prompt="Solve this problem: {{ problem }}",
    extract_reasoning_content=True,  # Creates answer__reasoning_content column
)
```

This creates a `{column_name}__reasoning_content` column containing the stripped reasoning content from the final assistant response, or `None` if the model didn't provide reasoning. This is independent of `with_trace`—you can use either or both.

**Tool Use in LLM Columns:** LLM columns can invoke external tools during generation via MCP (Model Context Protocol). Enable tools by setting `tool_alias` to reference a configured `ToolConfig`:

```python
dd.LLMTextColumnConfig(
    name="answer",
    model_alias="nvidia-text",
    prompt="Search for information and answer: {{ question }}",
    tool_alias="search-tools",  # References a ToolConfig
    with_trace=dd.TraceType.ALL_MESSAGES,  # Capture tool call history
)
```

When `tool_alias` is set, the model can request tool calls during generation. Data Designer executes the tools via configured MCP providers and feeds results back until the model produces a final answer. See [Tool Use & MCP](/concepts/tool-use-and-mcp/overview) for full configuration details.

**Performance:** LLM columns are parallelized within each batch using `max_parallel_requests` from your model's inference parameters. See the [Architecture & Performance](/concepts/architecture-and-performance) guide for optimization strategies.
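
The effect of a request cap can be sketched with a thread pool: at most `max_parallel_requests` calls are in flight at once, and results come back in row order. This is an illustration of bounded parallelism in general, not Data Designer's scheduler; `fake_llm_call` and `generate_batch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(prompt):
    """Placeholder for a network call to a model endpoint."""
    return f"response to: {prompt}"


def generate_batch(prompts, max_parallel_requests=4):
    """Run one call per prompt with a cap on concurrent requests."""
    with ThreadPoolExecutor(max_workers=max_parallel_requests) as pool:
        # map() returns results in input order even if calls finish out of order.
        return list(pool.map(fake_llm_call, prompts))


responses = generate_batch([f"prompt {i}" for i in range(10)], max_parallel_requests=3)
```

Raising the cap helps only while the model endpoint can absorb the extra load; past that point requests queue or get rate limited.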

### 💻 LLM-Code Columns

LLM-Code columns generate code in specific programming languages. They handle the prompting and parsing necessary to extract clean code from the LLM's response—automatically detecting and extracting code from markdown blocks. You provide the prompt and choose the model; the column handles the extraction.

Supported languages: **Bash, C, C++, C#, COBOL, Go, Java, JavaScript, Kotlin, Python, Ruby, Rust, Scala, Swift, TypeScript**, plus **SQL** dialects (SQLite, PostgreSQL, MySQL, T-SQL, BigQuery, ANSI SQL).
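
The extraction step described above can be approximated in a few lines. This sketch pulls the first fenced block out of a model response and falls back to the whole response when no fence is present. It is a simplified assumption about how such extraction might work, not the column's actual parser.

```python
import re

FENCE = "`" * 3  # a literal triple backtick, built this way to keep the example readable

# Opening fence, optional language tag, newline, then the shortest body up to the closing fence.
CODE_BLOCK = re.compile(FENCE + r"[\w+#-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response):
    """Return the first fenced code block, or the stripped response if none is found."""
    match = CODE_BLOCK.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()


response = "Here you go:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nHope it helps."
code = extract_code(response)
```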

### 🗂️ LLM-Structured Columns

LLM-Structured columns generate JSON with a *guaranteed schema*. Define your structure using a Pydantic model or JSON schema, and Data Designer ensures the LLM output conforms—no parsing errors, no schema drift.

Use for complex nested structures: API responses, configuration files, database records with multiple related fields, or any structured data where type safety matters. Schemas can be arbitrarily complex with nested objects, arrays, enums, and validation constraints, but success depends on the model's capabilities.

**Schema Complexity and Model Choice:** Flat schemas with simple fields are easier and more robustly produced across models. Deeply nested schemas with complex validation constraints are more sensitive to model choice; stronger models handle complexity better. If you're experiencing schema conformance issues, try simplifying the schema or switching to a more capable model.
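
To show what "conforming to a schema" means in practice, the sketch below checks raw model output against a flat schema using only the standard library. In Data Designer you would define the schema with a Pydantic model or JSON schema instead; the `SCHEMA` dict, the field names, and `conformance_errors` are hypothetical stand-ins.

```python
import json

# Hypothetical flat schema: field name -> expected Python type.
SCHEMA = {"sku": str, "price": float, "in_stock": bool}


def conformance_errors(raw_output, schema):
    """Return a list of problems; an empty list means the output conforms."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


good = conformance_errors('{"sku": "A1", "price": 19.99, "in_stock": true}', SCHEMA)
bad = conformance_errors('{"sku": "A1", "price": "cheap"}', SCHEMA)
```

Every extra level of nesting or constraint adds more ways for a weaker model to fail a check like this, which is why flatter schemas tend to be more robust.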

### ⚖️ LLM-Judge Columns

LLM-Judge columns score generated content across multiple quality dimensions using LLMs as evaluators.

Define scoring rubrics (relevance, accuracy, fluency, helpfulness) and the judge model evaluates each record. Score rubrics specify criteria and scoring options (1-5 scales, categorical grades, etc.), producing quantified quality metrics for every data point.

Use judge columns for data quality filtering (e.g., keep only 4+ rated responses), A/B testing generation strategies, and quality monitoring over time.
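
Filtering on judge scores is a simple post-processing step. The sketch below keeps only records whose every rubric score is 4 or higher; the record layout (a `scores` dict keyed by rubric name) is an assumption for illustration and may not match the judge column's actual output format.

```python
# Hypothetical judged records: each rubric maps to a 1-5 score.
records = [
    {"response": "A", "scores": {"relevance": 5, "accuracy": 4}},
    {"response": "B", "scores": {"relevance": 3, "accuracy": 5}},
    {"response": "C", "scores": {"relevance": 4, "accuracy": 4}},
]


def passes(record, threshold=4):
    """True when every rubric score meets the threshold."""
    return all(score >= threshold for score in record["scores"].values())


kept = [r["response"] for r in records if passes(r)]
```

Requiring every rubric to pass is the strict choice; averaging scores or gating on a single rubric are equally valid policies depending on what the dataset is for.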

### 🖼️ Image Columns

Image columns generate images from text prompts using either **diffusion** models (DALL·E, Stable Diffusion, Imagen) or **autoregressive** models (Gemini image, GPT image).

Use **Jinja2 templating** in the prompt to reference other columns, driving diversity across generated images. For example, reference sampled attributes like style, subject, and composition to produce varied images without manually writing different prompts.

Image columns require a model configured with `ImageInferenceParams`. Model-specific options (size, quality, aspect ratio) are passed via `extra_body` in the inference parameters.

**Output modes:**

* **Preview** (`data_designer.preview()`): Images are stored as base64-encoded strings directly in the DataFrame for quick iteration
* **Create** (`data_designer.create()`): Images are saved to disk in an `images/<column_name>/` folder with UUID filenames; the DataFrame stores relative paths
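
The two output modes differ only in where the image bytes end up. The following sketch mirrors that difference with the standard library: one helper returns a base64 string for an in-memory cell, the other writes a UUID-named file under `images/<column_name>/` and returns the relative path. The helper names, the `.png` extension, and the column name are assumptions, not Data Designer's implementation.

```python
import base64
import tempfile
import uuid
from pathlib import Path


def store_preview(image_bytes):
    """Preview-style storage: keep the image inline as a base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def store_on_disk(image_bytes, root, column_name, extension="png"):
    """Create-style storage: write the file and return a path relative to root."""
    folder = Path(root) / "images" / column_name
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{uuid.uuid4()}.{extension}"
    path.write_bytes(image_bytes)
    return path.relative_to(root).as_posix()


fake_image = b"\x89PNG placeholder bytes"  # stands in for real image data
preview_cell = store_preview(fake_image)

with tempfile.TemporaryDirectory() as root:
    create_cell = store_on_disk(fake_image, root, "product_photo")
    saved = (Path(root) / create_cell).read_bytes()
```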

Image columns also support `multi_modal_context` for autoregressive models that accept image inputs, enabling image-to-image generation workflows.

**Tutorials:** The image tutorials cover three workflows: [Providing Images as Context](/tutorials/providing-images-as-context) (image → text), [Generating Images](/tutorials/generating-images) (text → image), and [Editing Images with Image Context](/tutorials/image-to-image-editing) (image → image).

### 🧬 Embedding Columns

Embedding columns generate vector embeddings (numerical representations) for text content using embedding models. These embeddings capture semantic meaning, enabling similarity search, clustering, and semantic analysis.

Specify a `target_column` containing text, and Data Designer generates embeddings for that content. The target column can contain either a single text string or a list of text strings in stringified JSON format. In the latter case, embeddings are generated for each text string in the list.

Common use cases:

* **Semantic search**: Generate embeddings for documents, then find similar content by vector similarity
* **Clustering**: Group similar texts based on embedding proximity
* **Recommendation systems**: Match content by semantic similarity
* **Anomaly detection**: Identify outliers in embedding space
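
Most of these use cases reduce to comparing vectors. Here is a minimal sketch of cosine similarity and a nearest-neighbor lookup over toy three-dimensional vectors. The vectors and labels are invented; real embedding models return much higher-dimensional output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings keyed by the text they represent (illustrative values only).
embeddings = {
    "refund policy": [0.9, 0.1, 0.0],
    "return an item": [0.8, 0.2, 0.1],
    "gpu drivers": [0.0, 0.1, 0.9],
}


def most_similar(query, k=1):
    """Return the k other texts whose embeddings are closest to the query's."""
    others = [name for name in embeddings if name != query]
    others.sort(key=lambda name: cosine_similarity(embeddings[query], embeddings[name]), reverse=True)
    return others[:k]


nearest = most_similar("refund policy")
```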

**Embedding Models:** Embedding columns require an embedding model configured with `EmbeddingInferenceParams`. These models differ from chat completion models: they output vectors rather than text. The generation type is automatically determined by the inference parameters type.

### 🧩 Expression Columns

Expression columns handle simple transformations using **Jinja2 templates**—concatenate first and last names, calculate numerical totals, format date strings. No LLM overhead needed.

Template capabilities:

* **Variable substitution**: Pull values from any existing column
* **String filters**: Uppercase, lowercase, strip whitespace, replace patterns
* **Conditional logic**: if/elif/else support
* **Arithmetic**: Add, subtract, multiply, divide

### 🔍 Validation Columns

Validation columns check generated content against rules and return structured pass/fail results.

Built-in validation types:

**Code validation** runs Python or SQL code through a linter.

**Local callable validation** accepts a Python function directly when using Data Designer as a library.

**Remote validation** sends data to HTTP endpoints for validation-as-a-service. Useful for linters, security scanners, or proprietary systems.
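
A validator of any of these kinds ultimately maps a value to a structured pass/fail result. The sketch below shows one possible shape for a local Python check, using `ast.parse` as a lightweight syntax check in place of a real linter. The function name and the result keys (`is_valid`, `error`) are hypothetical and may differ from what Data Designer expects from a callable.

```python
import ast


def validate_python(code):
    """Return a structured pass/fail result for one Python source string."""
    try:
        ast.parse(code)  # syntax check only; a real linter does far more
    except SyntaxError as exc:
        return {"is_valid": False, "error": f"line {exc.lineno}: {exc.msg}"}
    return {"is_valid": True, "error": None}


results = [validate_python(src) for src in ("x = 1 + 2", "def broken(:")]
```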

### 🌱 Seed Dataset Columns

Seed dataset columns bootstrap generation from existing data. Provide a real dataset, and those columns become available as context for generating new synthetic data.

Typical pattern: use seed data for one part of your schema (real product names and categories), then generate synthetic fields around it (customer reviews, purchase histories, ratings). The seed data provides realism and constraints; generated columns add volume and variation.

### 🔧 Custom Columns

Custom columns let you implement your own generation logic using Python functions. Use the `@custom_column_generator` decorator to declare dependencies, and the framework handles DAG ordering and parallelization.

Two generation strategies:

* **`cell_by_cell`** (default): The function receives one row at a time, and the framework parallelizes across rows
* **`full_column`**: The function receives the entire DataFrame for vectorized operations
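
The difference between the two strategies is easiest to see with plain functions. In this sketch, the row-wise function needs only its own row, while the ranking function needs every row at once, which is the kind of logic that calls for `full_column`. It deliberately omits the `@custom_column_generator` decorator and uses a list of dicts in place of a DataFrame; all function and column names are hypothetical.

```python
def shout_cell(row):
    """cell_by_cell style: one row in, one value out."""
    return row["name"].upper()


def rank_full_column(rows):
    """full_column style: all rows in, one value per row out."""
    # Indices sorted by descending score; the best row gets rank 1.
    order = sorted(range(len(rows)), key=lambda i: rows[i]["score"], reverse=True)
    ranks = [0] * len(rows)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


rows = [
    {"name": "ada", "score": 71},
    {"name": "lin", "score": 94},
    {"name": "sam", "score": 88},
]

for row in rows:  # independent per row, so safe to parallelize
    row["name_upper"] = shout_cell(row)

for row, rank in zip(rows, rank_full_column(rows)):  # needs the whole column
    row["rank"] = rank
```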

For LLM access, declare `model_aliases` in the decorator and receive a `models` dict as the third argument. See [Custom Columns](/concepts/custom-columns) for details.

## Shared Column Properties

Every column configuration inherits from `SingleColumnConfig` with these standard properties:

### `name`

The column's identifier. It must be unique within your configuration, is used in Jinja2 references, and becomes the column name in the output DataFrame. Choose descriptive names: prefer `user_review` over `col_17`.

### `drop`

Boolean flag (default: `False`) controlling whether the column appears in final output. Setting `drop=True` generates the column (available as a dependency) but excludes it from final output.

**When to drop columns:**

* Intermediate calculations that feed expressions but aren't meaningful standalone
* Context columns used only for LLM prompt templates
* Validation results that are useful during development but unwanted in production

Dropped columns participate fully in generation and the dependency graph; they are only filtered out at the end.
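
Conceptually, dropping is a final projection step. In the sketch below, two intermediate columns feed a derived column and are then removed from the output. Plain dicts stand in for column configs and rows; this is not the framework's internal representation.

```python
# Hypothetical configs: only the name and the drop flag matter here.
column_configs = [
    {"name": "first_name", "drop": True},
    {"name": "last_name", "drop": True},
    {"name": "full_name", "drop": False},
]

# All columns are generated, including the ones marked for dropping.
row = {"first_name": "Ada", "last_name": "Lovelace"}
row["full_name"] = f"{row['first_name']} {row['last_name']}"


def final_output(row, configs):
    """Keep only the columns whose config does not set drop=True."""
    keep = [config["name"] for config in configs if not config["drop"]]
    return {name: row[name] for name in keep}


output = final_output(row, column_configs)
```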

### `column_type`

Literal string identifying the column type: `"sampler"`, `"llm-text"`, `"expression"`, etc. It is set automatically by each configuration class and serves as Pydantic's discriminator for deserialization.

You rarely set this manually—instantiating `LLMTextColumnConfig` automatically sets `column_type="llm-text"`. Serialization is reversible: save to YAML, load later, and Pydantic reconstructs the exact objects.
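
The role of a discriminator field can be shown without Pydantic. In this sketch, each class fixes its own `column_type`, and a registry uses that value to pick the right class when loading serialized data. The dataclasses, their fields, and the JSON round trip are stand-ins chosen to stay within the standard library; they are not the real configuration classes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExpressionConfig:  # hypothetical stand-in, not the real class
    name: str
    expr: str
    column_type: str = "expression"  # set by the class, not by the caller


@dataclass
class LLMTextConfig:  # hypothetical stand-in, not LLMTextColumnConfig
    name: str
    prompt: str
    column_type: str = "llm-text"


# The discriminator value selects which class to rebuild.
REGISTRY = {"expression": ExpressionConfig, "llm-text": LLMTextConfig}


def load_column(payload):
    """Reconstruct a config object from a plain dict using its column_type."""
    return REGISTRY[payload["column_type"]](**payload)


original = LLMTextConfig(name="review", prompt="Review {{ product }}")
restored = load_column(json.loads(json.dumps(asdict(original))))
```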

### `required_columns`

Computed property listing columns that must be generated before this one. The framework derives this automatically:

* For LLM/Expression columns: extracted from Jinja2 template `{{ variables }}`
* For Validation columns: explicitly listed target columns
* For Sampler columns with conditional parameters: columns referenced in conditions

You can read this property for introspection, but you never set it; it is always computed from the configuration.
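
Deriving dependencies from templates and turning them into an execution order can be sketched with a regular expression and the standard library's `graphlib`. This approximates the idea only: the regex handles bare `{{ name }}` references rather than full Jinja2 syntax, and the column names and prompts are invented.

```python
import re
from graphlib import TopologicalSorter

# Captures the identifier that follows an opening "{{".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)")

# Hypothetical columns: samplers have no template, LLM columns have a prompt.
templates = {
    "country": None,
    "age": None,
    "bio": "Write a bio for a {{ age }}-year-old from {{ country }}.",
    "headline": "Summarize in five words: {{ bio }}",
}


def required_columns(template):
    """List the distinct column names a template references, in order of appearance."""
    if template is None:
        return []
    seen = []
    for name in PLACEHOLDER.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


# Map each column to the columns it depends on, then sort dependencies first.
graph = {name: required_columns(template) for name, template in templates.items()}
order = list(TopologicalSorter(graph).static_order())
```

`TopologicalSorter` raises `CycleError` if two columns reference each other, which is the failure a dependency-aware framework has to detect before generation starts.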

### `side_effect_columns`

Computed property listing columns created implicitly alongside the primary column. Currently, only LLM columns produce side effects:

* `{name}__trace`: Created when `with_trace` is not `TraceType.NONE` on the column.
* `{name}__reasoning_content`: Created when `extract_reasoning_content=True` on the column.
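
The two rules above translate directly into a small function. The `TraceType` enum here is a local stand-in with assumed member values, and `side_effect_columns` is written as a free function for illustration rather than as the real computed property.

```python
from enum import Enum


class TraceType(Enum):  # local stand-in; member values are assumptions
    NONE = "none"
    LAST_MESSAGE = "last_message"
    ALL_MESSAGES = "all_messages"


def side_effect_columns(name, with_trace=TraceType.NONE, extract_reasoning_content=False):
    """Names of the extra columns an LLM column would create alongside itself."""
    columns = []
    if with_trace is not TraceType.NONE:
        columns.append(f"{name}__trace")
    if extract_reasoning_content:
        columns.append(f"{name}__reasoning_content")
    return columns


both = side_effect_columns("answer", TraceType.ALL_MESSAGES, extract_reasoning_content=True)
```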

For detailed information on each column type, refer to the [column configuration code reference](/code-reference/config/column-configs).