Copyright © 2026, NVIDIA Corporation.

FileSystemSeedReader Plugins

Experimental in this version: plugins were experimental in v0.5.8 and v0.5.9. For stable plugin docs, see the v0.6.0 plugin docs.

FileSystemSeedReader is the simplest way to build a seed reader plugin when your source data lives in a directory of files. You describe the files cheaply in build_manifest(...), then optionally read and reshape them in hydrate_row(...).

This guide focuses on the filesystem-specific contract. The fastest way to learn it is usually to start with an inline reader over DirectorySeedSource, then package that reader later only if you need automatic plugin discovery or a brand-new seed_type. For a runnable single-file example, see the Markdown Section Seed Reader recipe.

What the framework owns

When you inherit from FileSystemSeedReader, Data Designer already handles:

  • attachment-scoped filesystem context reuse
  • file matching with file_pattern and recursive
  • manifest sampling, IndexRange, PartitionBlock, and shuffle
  • batching and DuckDB registration
  • hydrated output schema validation via output_columns

Most readers only need to implement build_manifest(...) and hydrate_row(...).

Start with an existing filesystem config

If your source data already fits DirectorySeedSource or FileContentsSeedSource, you do not need a new config model just to learn or prototype a reader. Reuse the built-in source type and override how one DataDesigner instance interprets that seed source.

The Markdown recipe uses DirectorySeedSource(path=..., file_pattern="*.md") and pairs it with an inline reader:

from pathlib import Path
from typing import Any

import data_designer.config as dd
from data_designer.engine.resources.seed_reader import FileSystemSeedReader, SeedReaderFileSystemContext


class MarkdownSectionDirectorySeedReader(FileSystemSeedReader[dd.DirectorySeedSource]):
    # Exact schema of every record emitted by hydrate_row(...).
    output_columns = [
        "relative_path",
        "file_name",
        "section_index",
        "section_header",
        "section_content",
    ]

    def build_manifest(self, *, context: SeedReaderFileSystemContext) -> list[dict[str, str]]:
        # Enumerate matching files only; file contents are not read here.
        matched_paths = self.get_matching_relative_paths(
            context=context,
            file_pattern=self.source.file_pattern,
            recursive=self.source.recursive,
        )
        return [
            {
                "relative_path": relative_path,
                "file_name": Path(relative_path).name,
            }
            for relative_path in matched_paths
        ]

    def hydrate_row(
        self,
        *,
        manifest_row: dict[str, Any],
        context: SeedReaderFileSystemContext,
    ) -> list[dict[str, Any]]:
        # Body elided here; see the Markdown Section Seed Reader recipe.
        ...

This approach lets you inspect the manifest and hydration contract without first creating a package, entry points, or a new seed_type.

Step 1: Build a cheap manifest

build_manifest(...) should be inexpensive. Usually that means enumerating matching files and returning one logical row per file, without reading file contents yet.

In this example, the manifest only tracks:

  • relative_path
  • file_name

That keeps selection and partitioning file-based.
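To make "cheap" concrete, here is a minimal standard-library sketch of the same idea outside Data Designer. It is not the framework's implementation: `build_manifest_rows` is a hypothetical helper, and it uses `pathlib` globbing where a real reader would call `get_matching_relative_paths(...)`. The point is that each manifest row is built from path metadata only, and no file is opened.

```python
import tempfile
from pathlib import Path


def build_manifest_rows(root: Path, file_pattern: str, recursive: bool) -> list[dict[str, str]]:
    """Hypothetical helper: list matching files under root without opening them."""
    candidates = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    rows = []
    for path in sorted(candidates):
        if path.is_file():
            rows.append(
                {
                    "relative_path": path.relative_to(root).as_posix(),
                    "file_name": path.name,
                }
            )
    return rows


# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "faq.md").write_text("# FAQ\n")
    (root / "guide.md").write_text("# Guide\n")
    (root / "notes.txt").write_text("ignored\n")
    (root / "nested").mkdir()
    (root / "nested" / "deep.md").write_text("# Deep\n")

    flat_rows = build_manifest_rows(root, "*.md", recursive=False)
    deep_rows = build_manifest_rows(root, "*.md", recursive=True)

print(flat_rows)
print(deep_rows)
```

The cost of this step scales with the number of files, not with their size, which is what keeps selection and partitioning inexpensive.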

Step 2: Hydrate one file into one or many rows

hydrate_row(...) can return either:

  • a single record dict for 1:1 hydration
  • an iterable of record dicts for 1:N hydration
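The two return shapes can be treated uniformly by whatever consumes them. The sketch below is only an illustration of that idea in plain Python; `normalize_hydrated` is a hypothetical helper, not a Data Designer API, and how the framework handles this internally is not shown in this guide.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping
from typing import Any


def normalize_hydrated(result: Mapping[str, Any] | Iterable[Mapping[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical helper: return a list whether hydration gave one dict or many."""
    if isinstance(result, Mapping):
        # 1:1 hydration: a single record dict.
        return [dict(result)]
    # 1:N hydration: any iterable of record dicts (list, generator, ...).
    return [dict(record) for record in result]


one_to_one = normalize_hydrated({"relative_path": "faq.md", "text": "hello"})
one_to_many = normalize_hydrated(
    {"relative_path": "guide.md", "section_index": i} for i in range(2)
)
print(len(one_to_one), len(one_to_many))
```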

If hydration changes the schema, set output_columns to the exact emitted schema:

output_columns = [
    "relative_path",
    "file_name",
    "section_index",
    "section_header",
    "section_content",
]

In the recipe implementation, hydrate_row(...) reads one file and emits one record per ATX heading section.
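The recipe's own code is not reproduced here. As a rough sketch of what "one record per ATX heading section" can look like, the following standalone function splits Markdown text on lines that start with one to six `#` characters. The function name, the regular expression, and the simplifications (text before the first heading is dropped, and `#` lines inside code blocks are not special-cased) are assumptions made for illustration.

```python
import re
from typing import Any

# One to six '#' characters, whitespace, then the heading text.
ATX_HEADING = re.compile(r"^(#{1,6})[ \t]+(.*?)[ \t]*$")


def split_markdown_sections(manifest_row: dict[str, Any], text: str) -> list[dict[str, Any]]:
    """Hypothetical helper: emit one record per ATX heading section."""
    records: list[dict[str, Any]] = []
    current_header = None
    current_lines: list[str] = []

    def flush() -> None:
        # Close the section collected so far, if any.
        if current_header is not None:
            records.append(
                {
                    **manifest_row,
                    "section_index": len(records),
                    "section_header": current_header,
                    "section_content": "\n".join(current_lines).strip(),
                }
            )

    for line in text.splitlines():
        match = ATX_HEADING.match(line)
        if match:
            flush()
            current_header = match.group(2)
            current_lines = []
        elif current_header is not None:
            current_lines.append(line)
    flush()
    return records


row = {"relative_path": "guide.md", "file_name": "guide.md"}
sections = split_markdown_sections(row, "# Guide\nIntro text.\n\n## Install\nRun the installer.\n")
for section in sections:
    print(section)
```

Each emitted record carries the two manifest fields plus the three section fields, so its keys line up with the five-column `output_columns` list above.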

Every emitted record must match output_columns exactly. Data Designer will raise a plugin-facing error if a hydrated record is missing a declared column or includes an undeclared one.
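When testing a reader in isolation, it can help to run the same kind of exact-match check yourself before handing records to the framework. The sketch below is a toy version of that rule: `check_record` and its `ValueError` message are hypothetical, and the error type and wording that Data Designer actually raises are not shown in this guide.

```python
from typing import Any

OUTPUT_COLUMNS = [
    "relative_path",
    "file_name",
    "section_index",
    "section_header",
    "section_content",
]


def check_record(record: dict[str, Any], output_columns: list[str]) -> None:
    """Hypothetical helper: fail if keys differ from the declared columns."""
    declared = set(output_columns)
    emitted = set(record)
    missing = sorted(declared - emitted)
    undeclared = sorted(emitted - declared)
    if missing or undeclared:
        raise ValueError(f"missing columns: {missing}; undeclared columns: {undeclared}")


good = {
    "relative_path": "guide.md",
    "file_name": "guide.md",
    "section_index": 0,
    "section_header": "Guide",
    "section_content": "Intro text.",
}
check_record(good, OUTPUT_COLUMNS)  # passes silently

# Drop one declared column and add one undeclared column.
bad = {k: v for k, v in good.items() if k != "section_header"}
bad["extra"] = 1
try:
    check_record(bad, OUTPUT_COLUMNS)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```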

Step 3: Pass the reader to Data Designer

Register the inline reader on the DataDesigner instance you want to use:

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner(seed_readers=[MarkdownSectionDirectorySeedReader()])

builder = dd.DataDesignerConfigBuilder()
builder.with_seed_dataset(
    dd.DirectorySeedSource(path="sample_data", file_pattern="*.md"),
)

That pattern overrides how this DataDesigner instance handles the built-in directory seed source. Because seed_readers sets the registry for that instance, include any other readers you still want available. This is a good fit for local experiments, tests, and docs recipes.

Manifest-based selection semantics

Selection stays manifest-based even when hydrate_row(...) fans out.

If the matched files are:

0 -> faq.md
1 -> guide.md

and guide.md hydrates into two section rows, then:

import data_designer.config as dd
from data_designer.config.seed import IndexRange

builder.with_seed_dataset(
    dd.DirectorySeedSource(path="sample_data", file_pattern="*.md"),
    selection_strategy=IndexRange(start=1, end=1),
)

selects only guide.md, then returns all section rows emitted from guide.md.

That means get_seed_dataset_size(), IndexRange, PartitionBlock, and shuffle all operate on manifest rows before hydration.
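The ordering (select manifest rows first, hydrate afterwards) can be simulated in a few lines of plain Python. This is a toy model rather than framework code: `hydrate`, `select_then_hydrate`, and the `SECTION_COUNTS` table are hypothetical, and the inclusive `end` bound is inferred from the example above, where `IndexRange(start=1, end=1)` selects exactly one file.

```python
from typing import Any

manifest = [
    {"relative_path": "faq.md", "file_name": "faq.md"},
    {"relative_path": "guide.md", "file_name": "guide.md"},
]

# Hypothetical fan-out: how many section rows each file hydrates into.
SECTION_COUNTS = {"faq.md": 1, "guide.md": 2}


def hydrate(row: dict[str, Any]) -> list[dict[str, Any]]:
    """Stand-in for hydrate_row(...): one manifest row becomes N records."""
    return [{**row, "section_index": i} for i in range(SECTION_COUNTS[row["file_name"]])]


def select_then_hydrate(rows: list[dict[str, Any]], start: int, end: int) -> list[dict[str, Any]]:
    # Selection indexes manifest rows (files), with both bounds inclusive.
    selected = rows[start : end + 1]
    # Hydration runs only on the selected rows and may fan out.
    return [record for row in selected for record in hydrate(row)]


dataset_size = len(manifest)  # counted in manifest rows, not hydrated rows
result = select_then_hydrate(manifest, start=1, end=1)
print(dataset_size, result)
```

In this model the dataset size is 2 (two files) even though full hydration would yield three rows, and selecting index 1 returns both section rows from guide.md.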

Package it later when needed

If you want the same reader to be installable and auto-discovered as a plugin, then move from the inline pattern to a package:

  • define a config class that inherits from FileSystemSeedSource
  • give it a unique seed_type
  • create a Plugin object with plugin_type=PluginType.SEED_READER
  • register that plugin via a data_designer.plugins entry point

That extra packaging step is only necessary when you need a reusable plugin boundary. The reader logic itself still lives in the same build_manifest(...) and hydrate_row(...) methods shown above.

Advanced hooks

If you need more control, FileSystemSeedReader also lets you override:

  • on_attach(...) for per-attachment setup
  • create_filesystem_context(...) for custom rooted filesystem behavior

Most filesystem plugins do not need either hook.