
Copyright © 2026, NVIDIA Corporation.


Push Datasets to Hugging Face Hub


Nabin Mulepati, Researcher at NVIDIA
Daniel van Strien, Machine Learning Librarian at Hugging Face

You just generated 10k multilingual greetings (or some other cool dataset). Now what — email a parquet file? Nah. Call .push_to_hub() and you’ve got a live dataset page on Hugging Face. Done and dusted 🚢.

[Figure: Push to Hub Hero]


Here’s the full flow — build a multilingual greeting dataset with a conversation training processor, generate it, and push it to the Hub in one go:

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="language",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["English", "Spanish", "French", "German", "Italian"],
        ),
        drop=True,
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="greeting",
        model_alias="nvidia-text",
        prompt="Write a casual greeting in {{ language }}.",
    )
)
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="response",
        model_alias="nvidia-text",
        prompt="Write a helpful agent response to this greeting: '{{ greeting }}'.",
    )
)

# Reshape into an OpenAI-style conversation training format
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)

results = data_designer.create(config_builder, num_records=10_000)

# Ship it:
url = results.push_to_hub(
    "my-org/multilingual-greetings",
    "10k synthetic agent/user conversations across 5 languages.",
    tags=["greetings", "multilingual", "conversation"],
)
print(url)  # https://huggingface.co/datasets/my-org/multilingual-greetings

Two Ways In, Same Outcome

From results (the happy path) — you just ran .create(), you have the results object, call .push_to_hub() on it.

From a folder (the “I closed my notebook” path) — you saved artifacts to disk earlier and want to push them later:

from data_designer.integrations.huggingface import HuggingFaceHubClient

url = HuggingFaceHubClient.push_to_hub_from_folder(
    dataset_path="./my-saved-dataset",
    repo_id="my-org/multilingual-greetings",
    description="10k synthetic agent/user conversations across 5 languages.",
)

What You Get on the Hub

Once pushed, your dataset is live in the Hugging Face ecosystem:

  • Dataset Viewer — browsable in the browser immediately. Each processor config shows up as a separate subset tab (more on this in Processors Get First-Class Treatment).

  • Streaming — parquet means consumers can stream without downloading:

    from datasets import load_dataset

    ds = load_dataset("my-org/multilingual-greetings", "conversations", split="train", streaming=True)
  • Dataset Viewer API — row pagination, text search, column statistics, and parquet shard URLs with no extra setup.


What Gets Uploaded

[Figure: Push to Hub Pipeline]

Everything. The upload pipeline runs in this order:

1. README.md ← auto-generated dataset card
2. data/*.parquet ← your main dataset (remapped from parquet-files/)
3. images/* ← if you have image columns (skipped otherwise)
4. {processor}/* ← processor outputs (remapped from processors-files/)
5. builder_config.json
6. metadata.json ← paths rewritten to match HF repo layout

Each step is its own commit on the HF repo, so you get a clean history.

This is especially nice for large datasets. Data Designer writes output in batched parquet partitions — generate 100k records and you’ll have dozens of parquet files across parquet-files/, processors-files/, and maybe images/. Manually uploading all of that, organizing it into the right HF repo structure, writing the dataset card YAML configs, and rewriting metadata paths would be tedious and error-prone. push_to_hub handles the whole thing in one call — folder uploads, path remapping, config registration, dataset card generation, all of it.

Re-pushing to the same repo_id updates the existing repo — no need to delete and recreate.


Processors Get First-Class Treatment

[Figure: Schema Transform for Conversation Training]

Notice the SchemaTransformProcessorConfig in the example above. That’s doing the heavy lifting — it takes the raw greeting and response columns and reshapes each row into an OpenAI-style messages array:

config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)

The template is Jinja2 all the way down. Keys become columns in the output, values get rendered per-row with the actual column data. The template dict must be JSON-serializable — strings, lists, nested objects, all fair game. So you can build arbitrarily complex conversation schemas (multi-turn, system prompts, tool calls) just by adding more entries to the messages list.
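To make the per-row rendering concrete, here is a minimal sketch of how a nested template can be walked and filled in for one row. It is not Data Designer's implementation: `render_template` is a hypothetical helper, and a small regex stands in for Jinja2 so the example runs with the standard library only (it handles plain `{{ column }}` placeholders and nothing else).

```python
import re
from typing import Any

# Stand-in for Jinja2 (assumption): matches only simple "{{ column_name }}" placeholders.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: Any, row: dict[str, Any]) -> Any:
    """Recursively render every string in a JSON-like template against one row."""
    if isinstance(template, str):
        return _PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)
    if isinstance(template, list):
        return [render_template(item, row) for item in template]
    if isinstance(template, dict):
        # Keys are kept as-is (they become output columns); only values are rendered.
        return {key: render_template(value, row) for key, value in template.items()}
    return template  # numbers, booleans, and None pass through unchanged


template = {
    "messages": [
        {"role": "user", "content": "{{ greeting }}"},
        {"role": "assistant", "content": "{{ response }}"},
    ]
}
row = {"greeting": "Hey there!", "response": "Hi! How can I help?"}
rendered = render_template(template, row)
print(rendered["messages"][0])
```

Adding a system prompt or extra turns is just more entries in the `messages` list; the recursive walk does not care how deep the structure goes.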

The processor runs after each batch and writes its output to a separate parquet file alongside the main dataset. The main dataset (data/) still has the raw columns — the processor output is an additional view, not a replacement.

When you push to hub, each processor gets its own top-level directory and its own HF dataset config. So the conversations processor from our example ends up like this on HF:

my-org/multilingual-greetings/
├── README.md
├── data/
│ ├── batch_00000.parquet ← raw columns (greeting, response)
│ └── batch_00001.parquet
├── conversations/
│ ├── batch_00000.parquet ← transformed (messages array)
│ └── batch_00001.parquet
├── builder_config.json
└── metadata.json

The dataset card YAML frontmatter registers each processor as its own named config:

configs:
- config_name: data
  data_files: "data/*.parquet"
  default: true
- config_name: conversations
  data_files: "conversations/*.parquet"
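The mapping from processors to configs is mechanical, which is why it can be generated for you. As a rough sketch (the helper name `build_card_configs` is hypothetical, and it assumes the main dataset is always the default config named `data`, as in the YAML above):

```python
from typing import Any


def build_card_configs(processor_names: list[str]) -> list[dict[str, Any]]:
    """Build the `configs` entries: the main dataset first, then one per processor."""
    configs: list[dict[str, Any]] = [
        {"config_name": "data", "data_files": "data/*.parquet", "default": True}
    ]
    for name in processor_names:
        # Each processor directory becomes its own named config.
        configs.append({"config_name": name, "data_files": f"{name}/*.parquet"})
    return configs


configs = build_card_configs(["conversations"])
for entry in configs:
    print(entry)
```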

So consumers grab exactly the format they need:

from datasets import load_dataset

# Raw columns: good for analysis
df = load_dataset("my-org/multilingual-greetings", "data", split="train")

# Conversation format: ready for fine-tuning
df_conv = load_dataset("my-org/multilingual-greetings", "conversations", split="train")
print(df_conv[0])
# {'messages': [{'role': 'user', 'content': 'Hey! Como estás?'},
#               {'role': 'assistant', 'content': 'Hola! Estoy bien, gracias...'}]}

The Quick Start section in the generated README includes these snippets automatically — one load_dataset call per processor.

Metadata paths are rewritten too. Local paths like processors-files/conversations/batch_00000.parquet become conversations/batch_00000.parquet so file references in the metadata match the actual HF repo structure.

If there are no processors, all of this is silently skipped — no empty directories, no phantom configs.
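The path rewriting described above boils down to swapping a local directory prefix for its location in the repo. Here is a minimal sketch of that idea, using only the two mappings shown in this post. `to_repo_path` is a hypothetical helper, not the client's actual code, and leaving every other path unchanged is an assumption.

```python
# Local directory prefixes, taken from the examples in this post.
MAIN_PREFIX = "parquet-files/"
PROCESSOR_PREFIX = "processors-files/"


def to_repo_path(local_path: str) -> str:
    """Map a local artifact path to its location in the Hugging Face repo."""
    if local_path.startswith(MAIN_PREFIX):
        # parquet-files/<file> -> data/<file>
        return "data/" + local_path[len(MAIN_PREFIX):]
    if local_path.startswith(PROCESSOR_PREFIX):
        # processors-files/<name>/<file> -> <name>/<file>
        return local_path[len(PROCESSOR_PREFIX):]
    return local_path  # assumption: anything else keeps its relative path


print(to_repo_path("parquet-files/batch_00000.parquet"))
print(to_repo_path("processors-files/conversations/batch_00000.parquet"))
```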


The Auto-Generated Dataset Card

This is the fun part. The upload generates a full Hugging Face dataset card from your run metadata. It pulls from metadata.json and builder_config.json to build:

  • A Quick Start section with load_dataset code (including processor subsets)
  • A Dataset Summary with record count, column count, completion %
  • A Schema & Statistics table — per-column type, uniqueness, null rate, token stats
  • Generation Details — how many columns of each config type
  • A Citation block so people can cite your dataset

Tags default to ["synthetic", "datadesigner"] plus whatever you pass in. Size category (n<1K, 1K<n<10K, etc.) is auto-computed. These tags make your dataset discoverable in Hub search — you can browse all Data Designer datasets in one place.
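For a feel of what "auto-computed" means here, the sketch below merges user tags with the defaults and buckets a record count into the Hub's size category labels. Both helpers are hypothetical. The de-duplication of tags and the boundary handling (10,000 records landing in `10K<n<100K`) are assumptions, so the real card generator may draw the lines differently.

```python
from typing import Optional

DEFAULT_TAGS = ["synthetic", "datadesigner"]

# (exclusive upper bound, Hub size category label)
_SIZE_BUCKETS = [
    (1_000, "n<1K"),
    (10_000, "1K<n<10K"),
    (100_000, "10K<n<100K"),
    (1_000_000, "100K<n<1M"),
    (10_000_000, "1M<n<10M"),
    (100_000_000, "10M<n<100M"),
    (1_000_000_000, "100M<n<1B"),
    (10_000_000_000, "1B<n<10B"),
    (100_000_000_000, "10B<n<100B"),
    (1_000_000_000_000, "100B<n<1T"),
]


def size_category(num_records: int) -> str:
    """Return the size label for a record count (lower bound inclusive, by assumption)."""
    for upper_bound, label in _SIZE_BUCKETS:
        if num_records < upper_bound:
            return label
    return "n>1T"


def merge_tags(user_tags: Optional[list[str]] = None) -> list[str]:
    """Defaults first, then user tags, dropping duplicates while keeping order."""
    merged: list[str] = []
    for tag in DEFAULT_TAGS + list(user_tags or []):
        if tag not in merged:
            merged.append(tag)
    return merged


print(merge_tags(["greetings", "multilingual", "conversation"]))
print(size_category(10_000))
```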

The template lives at packages/data-designer/src/data_designer/integrations/huggingface/dataset_card_template.md if you want to see the Jinja2 source.


Auth

Token resolution follows the standard huggingface_hub chain:

  1. Explicit token= parameter
  2. HF_TOKEN env var
  3. Cached creds from hf auth login

If none of those work, you get a clear error telling you what to do.
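The resolution order above can be sketched as a simple first-match lookup. In real code huggingface_hub does this for you. `resolve_token` and its `read_cached_token` parameter are hypothetical names used for illustration, with the cached-credentials lookup passed in as a callable so the example stays self-contained.

```python
import os
from typing import Callable, Optional


def resolve_token(
    token: Optional[str] = None,
    read_cached_token: Callable[[], Optional[str]] = lambda: None,
) -> str:
    """Return the first token found: explicit argument, HF_TOKEN, then cached login."""
    candidates = (
        lambda: token,                       # 1. explicit token= parameter
        lambda: os.environ.get("HF_TOKEN"),  # 2. environment variable
        read_cached_token,                   # 3. stand-in for cached credentials
    )
    for candidate in candidates:
        value = candidate()
        if value:
            return value
    raise RuntimeError(
        "No Hugging Face token found. Pass token=..., set HF_TOKEN, or run `hf auth login`."
    )


os.environ["HF_TOKEN"] = "hf_from_env"  # placeholder value for the demo
print(resolve_token())                  # falls back to the environment variable
print(resolve_token("hf_explicit"))     # explicit argument wins
```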


Reproducible Pipelines — The Round-Trip

[Figure: Round-Trip Reproducibility]

Here’s the payoff: every dataset you push includes builder_config.json, the full SDG pipeline definition. Anyone (including future-you) can recreate the exact same pipeline from the Hugging Face URL:

import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder.from_config(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
)

That’s it. One call. from_config accepts a raw URL, a local file path, a dict, or a YAML string. When you hand it a Hugging Face Hub blob URL, it rewrites it to a raw URL behind the scenes so the fetch just works (same trick for GitHub blob URLs).
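If you are curious what a blob-to-raw rewrite looks like, here is a small sketch. It relies on two public URL conventions: the Hub serves file contents under `/resolve/` instead of `/blob/`, and GitHub serves raw files from `raw.githubusercontent.com`. The function `to_raw_url` is a hypothetical illustration, not the code inside `from_config`.

```python
from urllib.parse import urlparse, urlunparse


def to_raw_url(url: str) -> str:
    """Rewrite a Hugging Face or GitHub 'blob' page URL to a direct-download URL."""
    parsed = urlparse(url)
    parts = parsed.path.split("/")

    if parsed.netloc == "huggingface.co" and "blob" in parts:
        # .../blob/<rev>/<file> -> .../resolve/<rev>/<file>
        parts[parts.index("blob")] = "resolve"
        return urlunparse(parsed._replace(path="/".join(parts)))

    if parsed.netloc == "github.com" and len(parts) > 4 and parts[3] == "blob":
        # /<owner>/<repo>/blob/<ref>/<file> -> /<owner>/<repo>/<ref>/<file> on the raw host
        raw_path = "/".join(parts[:3] + parts[4:])
        return urlunparse(parsed._replace(netloc="raw.githubusercontent.com", path=raw_path))

    return url  # already raw, or a host this sketch does not know about


print(to_raw_url(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
))
```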

The loaded config builder comes back fully hydrated — columns, model configs, constraints, seed config, all of it. You can inspect it, tweak it, and re-run:

from data_designer.interface import DataDesigner

# Maybe bump the count or swap a model
results = DataDesigner().create(config_builder, num_records=50_000)

# And push the new version right back
results.push_to_hub(
    "my-org/multilingual-greetings-v2",
    "50k version with the same pipeline.",
)

So the full loop is: design → generate → push → share URL → recreate → iterate. The builder_config.json on Hugging Face is the reproducibility artifact.


Gotchas

  • repo_id must be namespace/dataset-name, where the namespace is a username or an organization (like my-org in the examples), with exactly one slash. The client validates this before hitting the network.
  • description is required — it’s the prose that appears right under the title on the dataset card. Make it good.
  • private=True if you don’t want the world to see your dataset yet. You can flip it to public later from the dataset settings page.
  • Metadata paths get rewritten — local paths like parquet-files/batch_00000.parquet become data/batch_00000.parquet in the uploaded metadata.json so references stay valid on HF.
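As a rough picture of the repo_id check from the first gotcha, the sketch below rejects anything that is not two non-empty parts separated by a single slash. `validate_repo_id` is a hypothetical helper. The real client may enforce more (allowed characters, length limits), so treat this as the minimum rule only.

```python
def validate_repo_id(repo_id: str) -> tuple[str, str]:
    """Split 'namespace/dataset-name' and reject anything without exactly one slash."""
    parts = repo_id.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"Invalid repo_id {repo_id!r}: expected 'namespace/dataset-name' with exactly one slash."
        )
    namespace, name = parts
    return namespace, name


print(validate_repo_id("my-org/multilingual-greetings"))

for bad in ("multilingual-greetings", "my-org/team/greetings", "my-org/"):
    try:
        validate_repo_id(bad)
    except ValueError as err:
        print(err)
```

Failing fast like this is cheap, and it means a typo in the repo name surfaces before any upload starts.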