Copyright © 2026, NVIDIA Corporation.

NeMo Data Designer
data_designer.engine.resources.seed_reader

Module Contents

Classes

| Name | Description |
| --- | --- |
| SeedReaderFileSystemContext | Filesystem and root path available to filesystem seed-reader plugins. |
| SeedReaderBatch | Batch object returned by seed readers and convertible to a DataFrame. |
| SeedReaderBatchReader | Reader that yields seed batches until exhausted. |
| PandasSeedReaderBatch | Seed-reader batch backed by an in-memory pandas DataFrame. |
| DuckDBSeedReaderBatchReader | None |
| HydratingSeedReaderBatchReader | None |
| SeedReader | Base class for reading a seed dataset. |
| LocalFileSeedReader | Base class for reading a seed dataset. |
| HuggingFaceSeedReader | Base class for reading a seed dataset. |
| DataFrameSeedReader | Base class for reading a seed dataset. |
| FileSystemSeedReader | Base class for filesystem-derived seed readers. |
| DirectorySeedReader | Base class for filesystem-derived seed readers. |
| FileContentsSeedReader | Base class for filesystem-derived seed readers. |
| AgentRolloutSeedReader | Base class for filesystem-derived seed readers. |
| SeedReaderRegistry | None |

Functions

| Name | Description |
| --- | --- |
| create_seed_reader_output_dataframe | Create a DataFrame and verify hydrated records match the declared output schema. |
| _build_metadata_record | None |
| _normalize_relative_path | None |
| _normalize_hydrated_row_output | None |

Data

  • logger
  • SourceT
  • FileSystemSourceT

API

logger = getLogger(...)

class data_designer.engine.resources.seed_reader.SeedReaderError

Bases: data_designer.errors.DataDesignerError

class data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext

Filesystem and root path available to filesystem seed-reader plugins.

fs: fsspec.spec.AbstractFileSystem

root_path: pathlib.Path

class data_designer.engine.resources.seed_reader.SeedReaderBatch

Bases: typing.Protocol

Batch object returned by seed readers and convertible to a DataFrame.

to_pandas() -> pandas.DataFrame

class data_designer.engine.resources.seed_reader.SeedReaderBatchReader

Bases: typing.Protocol

Reader that yields seed batches until exhausted.

read_next_batch() -> data_designer.engine.resources.seed_reader.SeedReaderBatch
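
Because both classes above are `typing.Protocol` types, any object with matching methods satisfies them structurally. The following is a minimal, dependency-free sketch of that idea. `BatchLike`, `BatchReaderLike`, `ListBatch`, `ListBatchReader`, and `drain` are hypothetical stand-ins rather than part of the module, the batch returns a plain list where a real one returns a `pandas.DataFrame`, and signalling exhaustion with `StopIteration` is an assumption, since this page does not say how a reader reports that it is exhausted.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class BatchLike(Protocol):
    """Stand-in for SeedReaderBatch (the real method returns a pandas.DataFrame)."""

    def to_pandas(self) -> Any: ...


@runtime_checkable
class BatchReaderLike(Protocol):
    """Stand-in for SeedReaderBatchReader."""

    def read_next_batch(self) -> BatchLike: ...


class ListBatch:
    """Hypothetical batch that keeps rows as a list of dicts."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self.rows = rows

    def to_pandas(self) -> Any:
        # A real batch would build a DataFrame; the sketch returns the raw rows
        # so that it runs without third-party packages.
        return self.rows


class ListBatchReader:
    """Hypothetical reader that slices a list of rows into fixed-size batches."""

    def __init__(self, rows: list[dict[str, Any]], batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._rows = rows
        self._batch_size = batch_size
        self._offset = 0

    def read_next_batch(self) -> ListBatch:
        if self._offset >= len(self._rows):
            # Assumption: exhaustion is signalled by raising StopIteration.
            raise StopIteration
        chunk = self._rows[self._offset : self._offset + self._batch_size]
        self._offset += len(chunk)
        return ListBatch(chunk)


def drain(reader: BatchReaderLike) -> list[Any]:
    """Collect every batch from a reader until it reports exhaustion."""
    batches = []
    while True:
        try:
            batches.append(reader.read_next_batch().to_pandas())
        except StopIteration:
            return batches
```

Neither sketch class inherits from the protocols; the `isinstance` check succeeds only because the method names match.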
class data_designer.engine.resources.seed_reader.PandasSeedReaderBatch

Seed-reader batch backed by an in-memory pandas DataFrame.

dataframe: pandas.DataFrame

to_pandas() -> pandas.DataFrame

Return the batch as a pandas DataFrame.

data_designer.engine.resources.seed_reader.create_seed_reader_output_dataframe(
    *,
    records: list[dict[str, typing.Any]],
    output_columns: list[str]
) -> pandas.DataFrame

Create a DataFrame and verify hydrated records match the declared output schema.
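
To make the schema check concrete, here is a rough sketch of what verifying records against declared output columns could look like. It is not the library's implementation: `verify_records` is a hypothetical helper, it returns ordered row lists instead of a `pandas.DataFrame` so that it has no dependencies, and the choice to require an exact key match and raise `ValueError` is an assumption.

```python
from typing import Any


def verify_records(
    *, records: list[dict[str, Any]], output_columns: list[str]
) -> list[list[Any]]:
    """Check each record against the declared columns and order its values."""
    expected = set(output_columns)
    rows = []
    for index, record in enumerate(records):
        actual = set(record)
        if actual != expected:
            # Report both directions of the mismatch to ease debugging.
            missing = sorted(expected - actual)
            extra = sorted(actual - expected)
            raise ValueError(
                f"record {index} does not match output_columns: "
                f"missing={missing}, extra={extra}"
            )
        # Emit values in the declared column order, not the dict's own order.
        rows.append([record[name] for name in output_columns])
    return rows
```

Calling `verify_records(records=[{"b": 2, "a": 1}], output_columns=["a", "b"])` returns the values reordered to the declared columns, while a record with a missing or unexpected key fails fast with the index of the offending record.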

class data_designer.engine.resources.seed_reader.DuckDBSeedReaderBatchReader(
    *,
    conn: duckdb.DuckDBPyConnection,
    query_result: typing.Any,
    batch_size: int
)

read_next_batch() -> data_designer.engine.resources.seed_reader.SeedReaderBatch

class data_designer.engine.resources.seed_reader.HydratingSeedReaderBatchReader(
    *,
    manifest_batch_reader: data_designer.engine.resources.seed_reader.SeedReaderBatchReader,
    hydrate_records: collections.abc.Callable[[list[dict[str, typing.Any]]], list[dict[str, typing.Any]]],
    output_columns: list[str],
    no_rows_error_message: str
)

read_next_batch() -> data_designer.engine.resources.seed_reader.SeedReaderBatch

SourceT = TypeVar(...)

FileSystemSourceT = TypeVar(...)

class data_designer.engine.resources.seed_reader.SeedReader

Bases: abc.ABC, typing.Generic[data_designer.engine.resources.seed_reader.SourceT]

Base class for reading a seed dataset.

Seeds are read using DuckDB. Reader implementations define the DuckDB connection setup and how to obtain a URI that DuckDB can query (i.e., “… FROM <uri> …”).

The Data Designer engine automatically supplies the appropriate SeedSource and a SecretResolver to use for any secret fields in the config via attach(...). Subclasses that need per-attachment setup can override on_attach(...) without needing to call super().
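
The attach-then-hook flow described above can be sketched in isolation. This is an illustration of the pattern only, not the real base class: `ReaderSketch` and `CsvReaderSketch` are hypothetical names, the source is a plain string rather than a `SeedSource`, and the `SELECT * FROM '<uri>'` query shape is an assumed example of the “… FROM <uri> …” form.

```python
from abc import ABC, abstractmethod
from typing import Any, Generic, TypeVar

SourceT = TypeVar("SourceT")


class ReaderSketch(ABC, Generic[SourceT]):
    """Hypothetical miniature of a reader that receives its inputs via attach()."""

    source: SourceT
    secret_resolver: Any

    def attach(self, source: SourceT, secret_resolver: Any) -> None:
        # The engine supplies these, so the constructor needs no arguments.
        self.source = source
        self.secret_resolver = secret_resolver
        self.on_attach()

    def on_attach(self) -> None:
        """No-op hook, so overrides do not need to call super()."""

    @abstractmethod
    def get_dataset_uri(self) -> str: ...

    def build_query(self) -> str:
        # Assumed query shape; the real query is built by the library.
        return f"SELECT * FROM '{self.get_dataset_uri()}'"


class CsvReaderSketch(ReaderSketch[str]):
    """Hypothetical subclass whose source is simply a file path string."""

    def on_attach(self) -> None:
        # Per-attachment setup: count how often this instance was attached.
        self.attach_count = getattr(self, "attach_count", 0) + 1

    def get_dataset_uri(self) -> str:
        return self.source
```

After `reader = CsvReaderSketch()` and `reader.attach("data/seed.csv", secret_resolver=None)`, `reader.build_query()` yields `SELECT * FROM 'data/seed.csv'`. Keeping the base hook empty is what makes overriding it safe without a `super()` call.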

source: data_designer.engine.resources.seed_reader.SourceT

secret_resolver: data_designer.engine.secret_resolver.SecretResolver

get_dataset_uri() -> str

create_duckdb_connection() -> duckdb.DuckDBPyConnection

attach(
    source: data_designer.engine.resources.seed_reader.SourceT,
    secret_resolver: data_designer.engine.secret_resolver.SecretResolver
) -> None

Attach a source and secret resolver to the instance.

This is called internally by the engine so that these objects do not need to be provided in the reader’s constructor.

on_attach() -> None

Hook for subclasses that need per-attachment setup.

_reset_attachment_state() -> None

create_dataframe_duckdb_connection(
    *,
    table_name: str,
    dataframe: pandas.DataFrame
) -> duckdb.DuckDBPyConnection

get_seed_dataset_size() -> int

create_batch_reader(
    *,
    batch_size: int,
    index_range: data_designer.config.seed.IndexRange | None,
    shuffle: bool
) -> data_designer.engine.resources.seed_reader.SeedReaderBatchReader

create_filesystem_context(root_path: pathlib.Path | str) -> data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext

Create a rooted filesystem context for directory-backed seed readers.

get_matching_relative_paths(
    *,
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext,
    file_pattern: str,
    recursive: bool
) -> list[str]
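
This page does not document how `file_pattern` and `recursive` are interpreted, so the following is only a guess at the general shape of such a lookup, written against `pathlib` instead of the `fsspec` filesystem held by the real context. `matching_relative_paths` and `demo` are hypothetical, and matching the pattern against the file name alone is an assumption.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def matching_relative_paths(
    *, root: Path, file_pattern: str, recursive: bool
) -> list[str]:
    """Return sorted POSIX-style paths, relative to root, of matching files."""
    # rglob walks the whole tree; glob stays in the top-level directory.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        path.relative_to(root).as_posix()
        for path in candidates
        # Assumption: the pattern is applied to the file name only.
        if path.is_file() and fnmatch(path.name, file_pattern)
    )


def demo() -> tuple[list[str], list[str]]:
    """Build a small temporary tree and query it both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "nested").mkdir()
        (root / "a.jsonl").write_text("{}\n", encoding="utf-8")
        (root / "nested" / "b.jsonl").write_text("{}\n", encoding="utf-8")
        (root / "notes.txt").write_text("skip me\n", encoding="utf-8")
        flat = matching_relative_paths(
            root=root, file_pattern="*.jsonl", recursive=False
        )
        deep = matching_relative_paths(
            root=root, file_pattern="*.jsonl", recursive=True
        )
        return flat, deep
```

In the demo tree, the non-recursive call sees only the top-level `a.jsonl`, while the recursive call also finds `nested/b.jsonl`.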
get_column_names() -> list[str]

Returns the seed dataset’s column names.

_get_duckdb_connection() -> duckdb.DuckDBPyConnection

_ensure_attached() -> None

build_dataset_read_query(
    *,
    dataset_uri: str,
    index_range: data_designer.config.seed.IndexRange | None,
    shuffle: bool
) -> str

get_seed_type() -> str

Return the seed_type of the source class this reader is generic over.

class data_designer.engine.resources.seed_reader.LocalFileSeedReader

Bases: data_designer.engine.resources.seed_reader.SeedReader[data_designer.config.seed_source.LocalFileSeedSource]

create_duckdb_connection() -> duckdb.DuckDBPyConnection

get_dataset_uri() -> str

class data_designer.engine.resources.seed_reader.HuggingFaceSeedReader

Bases: data_designer.engine.resources.seed_reader.SeedReader[data_designer.config.seed_source.HuggingFaceSeedSource]

create_duckdb_connection() -> duckdb.DuckDBPyConnection

get_dataset_uri() -> str

class data_designer.engine.resources.seed_reader.DataFrameSeedReader

Bases: data_designer.engine.resources.seed_reader.SeedReader[data_designer.config.seed_source_dataframe.DataFrameSeedSource]

_table_name = df

create_duckdb_connection() -> duckdb.DuckDBPyConnection

get_dataset_uri() -> str

class data_designer.engine.resources.seed_reader.FileSystemSeedReader

Bases: data_designer.engine.resources.seed_reader.SeedReader[data_designer.engine.resources.seed_reader.FileSystemSourceT], abc.ABC

Base class for filesystem-derived seed readers.

Plugin authors implement build_manifest(...) to describe the cheap logical rows available under the configured filesystem root. Readers that need expensive enrichment can optionally override hydrate_row(...) to emit one record dict or an iterable of record dicts per manifest row. When emitted records change the manifest schema, output_columns must declare the exact hydrated output schema for each emitted record. The framework owns attachment-scoped filesystem context reuse, manifest sampling, partitioning, randomization, batching, and DuckDB registration details.
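
The split between a cheap manifest and an optional, more expensive hydration step can be sketched without the framework. The functions below are hypothetical stand-ins that take a plain `pathlib.Path` root instead of a `SeedReaderFileSystemContext`; the column names in `OUTPUT_COLUMNS` and the per-line fan-out are invented for illustration, and the framework's sampling, batching, and DuckDB registration are deliberately omitted.

```python
import tempfile
from collections.abc import Iterable
from pathlib import Path
from typing import Any

# Hydration changes the schema, so the sketch declares the exact output columns.
OUTPUT_COLUMNS = ["relative_path", "line_number", "text"]


def build_manifest(root: Path) -> list[dict[str, Any]]:
    """Cheap step: list candidate files without opening any of them."""
    return [
        {"relative_path": path.relative_to(root).as_posix()}
        for path in sorted(root.rglob("*.txt"))
    ]


def hydrate_row(manifest_row: dict[str, Any], root: Path) -> Iterable[dict[str, Any]]:
    """Expensive step: read one file and emit a record per non-empty line."""
    relative_path = manifest_row["relative_path"]
    text = (root / relative_path).read_text(encoding="utf-8")
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            yield {"relative_path": relative_path, "line_number": number, "text": line}


def read_all(root: Path) -> list[dict[str, Any]]:
    """Hydrate every manifest row and check records against OUTPUT_COLUMNS."""
    records = []
    for row in build_manifest(root):
        for record in hydrate_row(row, root):
            if set(record) != set(OUTPUT_COLUMNS):
                raise ValueError(f"unexpected columns: {sorted(record)}")
            records.append(record)
    return records


def demo() -> list[dict[str, Any]]:
    """Run the sketch against two small files in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text("alpha\n\nbeta\n", encoding="utf-8")
        (root / "b.txt").write_text("gamma\n", encoding="utf-8")
        return read_all(root)
```

Here two manifest rows fan out into three hydrated records, which is the case where the output schema differs from the manifest schema and therefore has to be declared up front.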

output_columns: typing.ClassVar[list[str] | None]

_reset_attachment_state() -> None

create_duckdb_connection() -> duckdb.DuckDBPyConnection

get_dataset_uri() -> str

get_output_column_names() -> list[str]

build_manifest(
    *,
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext
) -> pandas.DataFrame | list[dict[str, typing.Any]]

hydrate_row(
    *,
    manifest_row: dict[str, typing.Any],
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext
) -> dict[str, typing.Any] | collections.abc.Iterable[dict[str, typing.Any]]

get_column_names() -> list[str]

get_seed_dataset_size() -> int

create_batch_reader(
    *,
    batch_size: int,
    index_range: data_designer.config.seed.IndexRange | None,
    shuffle: bool
) -> data_designer.engine.resources.seed_reader.SeedReaderBatchReader

_get_row_manifest_dataframe() -> pandas.DataFrame

_get_output_dataframe() -> pandas.DataFrame

_get_filesystem_context() -> data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext

_get_manifest_dataset_uri() -> str

_build_internal_table_name(suffix: str) -> str

_get_empty_selected_manifest_rows_error_message() -> str

_normalize_rows_to_dataframe(rows: pandas.DataFrame | list[dict[str, typing.Any]]) -> pandas.DataFrame

_hydrate_rows(
    *,
    manifest_rows: list[dict[str, typing.Any]],
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext
) -> list[dict[str, typing.Any]]

class data_designer.engine.resources.seed_reader.DirectorySeedReader

Bases: data_designer.engine.resources.seed_reader.FileSystemSeedReader[data_designer.config.seed_source.DirectorySeedSource]

build_manifest(
    *,
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext
) -> pandas.DataFrame | list[dict[str, typing.Any]]

class data_designer.engine.resources.seed_reader.FileContentsSeedReader

Bases: data_designer.engine.resources.seed_reader.FileSystemSeedReader[data_designer.config.seed_source.FileContentsSeedSource]

output_columns = ['source_kind', 'source_path', 'relative_path', 'file_name', 'content']

build_manifest(
    *,
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext
) -> pandas.DataFrame | list[dict[str, typing.Any]]

hydrate_row(
    *,
    manifest_row: dict[str, typing.Any],
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext
) -> dict[str, typing.Any]

class data_designer.engine.resources.seed_reader.AgentRolloutSeedReader

Bases: data_designer.engine.resources.seed_reader.FileSystemSeedReader[data_designer.config.seed_source.AgentRolloutSeedSource]

output_columns = get_field_names(...)

_PARSE_CONTEXT_UNSET: data_designer.engine.resources.agent_rollout.AgentRolloutParseContext | None = object(...)

_reset_attachment_state() -> None

build_manifest(
    *,
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext
) -> list[dict[str, typing.Any]]

hydrate_row(
    *,
    manifest_row: dict[str, typing.Any],
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext
) -> list[dict[str, typing.Any]]

get_format_handler() -> data_designer.engine.resources.agent_rollout.AgentRolloutFormatHandler

_get_parse_context(context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext) -> data_designer.engine.resources.agent_rollout.AgentRolloutParseContext | None

class data_designer.engine.resources.seed_reader.SeedReaderRegistry(readers: collections.abc.Sequence[data_designer.engine.resources.seed_reader.SeedReader])

add_reader(reader: data_designer.engine.resources.seed_reader.SeedReader) -> typing_extensions.Self

get_reader(
    seed_dataset_source: data_designer.config.seed_source.SeedSource,
    secret_resolver: data_designer.engine.secret_resolver.SecretResolver
) -> data_designer.engine.resources.seed_reader.SeedReader

_get_reader_for_source(seed_dataset_source: data_designer.config.seed_source.SeedSource) -> data_designer.engine.resources.seed_reader.SeedReader
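
The registry has no docstring, so the sketch below only shows one plausible way the documented signatures could fit together. Everything in it is hypothetical: the class names, the idea that readers are looked up by the `seed_type` of the source, the rejection of duplicate registrations, and the call to `attach(...)` inside `get_reader(...)` are all assumptions.

```python
from collections.abc import Sequence
from dataclasses import dataclass
from typing import Any, ClassVar


@dataclass
class LocalSourceSketch:
    path: str
    seed_type: ClassVar[str] = "local"


@dataclass
class HubSourceSketch:
    repo: str
    seed_type: ClassVar[str] = "hub"


class ReaderSketch:
    """Hypothetical reader bound to exactly one source class."""

    source_type: ClassVar[type]

    def get_seed_type(self) -> str:
        # Mirrors the documented idea of returning the source class's seed_type.
        return self.source_type.seed_type

    def attach(self, source: Any, secret_resolver: Any) -> None:
        self.source = source
        self.secret_resolver = secret_resolver


class LocalReaderSketch(ReaderSketch):
    source_type = LocalSourceSketch


class HubReaderSketch(ReaderSketch):
    source_type = HubSourceSketch


class RegistrySketch:
    """Hypothetical registry that maps seed_type strings to reader instances."""

    def __init__(self, readers: Sequence[ReaderSketch]) -> None:
        self._readers: dict[str, ReaderSketch] = {}
        for reader in readers:
            self.add_reader(reader)

    def add_reader(self, reader: ReaderSketch) -> "RegistrySketch":
        seed_type = reader.get_seed_type()
        if seed_type in self._readers:
            raise ValueError(f"a reader for {seed_type!r} is already registered")
        self._readers[seed_type] = reader
        return self  # returning self allows chained add_reader calls

    def get_reader(self, seed_dataset_source: Any, secret_resolver: Any) -> ReaderSketch:
        try:
            reader = self._readers[seed_dataset_source.seed_type]
        except KeyError:
            raise KeyError(
                f"no reader registered for {seed_dataset_source.seed_type!r}"
            ) from None
        # Assumption: the registry attaches the source before handing it back.
        reader.attach(seed_dataset_source, secret_resolver)
        return reader
```

With `registry = RegistrySketch([LocalReaderSketch()]).add_reader(HubReaderSketch())`, a call to `registry.get_reader(HubSourceSketch(repo="org/name"), secret_resolver=None)` returns the hub reader with that source attached.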
data_designer.engine.resources.seed_reader._build_metadata_record(
    *,
    context: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext,
    relative_path: str,
    source_kind: str
) -> dict[str, str]

data_designer.engine.resources.seed_reader._normalize_relative_path(path: str) -> str

data_designer.engine.resources.seed_reader._normalize_hydrated_row_output(
    *,
    hydrated_row_output: dict[str, typing.Any] | collections.abc.Iterable[dict[str, typing.Any]],
    manifest_row_index: int
) -> list[dict[str, typing.Any]]