Push Datasets to Hugging Face Hub | NVIDIA NeMo Data Designer

Nabin Mulepati, Researcher at NVIDIA

Daniel van Strien, Machine Learning Librarian at Hugging Face

You just generated 10k multilingual greetings (or some other cool dataset). Now what — email a parquet file? Nah. Call .push_to_hub() and you’ve got a live dataset page on Hugging Face. Done and dusted 🚢.

Here’s the full flow — build a multilingual greeting dataset with a conversation training processor, generate it, and push it to the Hub in one go:

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="language",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["English", "Spanish", "French", "German", "Italian"],
        ),
        drop=True,
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="greeting",
        model_alias="nvidia-text",
        prompt="Write a casual greeting in {{ language }}.",
    )
)
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="response",
        model_alias="nvidia-text",
        prompt="Write a helpful agent response to this greeting: '{{ greeting }}'.",
    )
)

# Reshape into an OpenAI-style conversation training format
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)

results = data_designer.create(config_builder, num_records=10_000)

# Ship it:
url = results.push_to_hub(
    "my-org/multilingual-greetings",
    "10k synthetic agent/user conversations across 5 languages.",
    tags=["greetings", "multilingual", "conversation"],
)
print(url)  # https://huggingface.co/datasets/my-org/multilingual-greetings

Two Ways In, Same Outcome

From results (the happy path) — you just ran .create(), you have the results object, call .push_to_hub() on it.

From a folder (the “I closed my notebook” path) — you saved artifacts to disk earlier and want to push them later:

from data_designer.integrations.huggingface import HuggingFaceHubClient

url = HuggingFaceHubClient.push_to_hub_from_folder(
    dataset_path="./my-saved-dataset",
    repo_id="my-org/multilingual-greetings",
    description="10k synthetic agent/user conversations across 5 languages.",
)

What You Get on the Hub

Once pushed, your dataset is live in the Hugging Face ecosystem:

Dataset Viewer — browsable in the browser immediately. Each processor config shows up as a separate subset tab (more on this in Processors Get First-Class Treatment).

Streaming — parquet means consumers can stream without downloading:

from datasets import load_dataset

ds = load_dataset("my-org/multilingual-greetings", "conversations", split="train", streaming=True)

Dataset Viewer API — row pagination, text search, column statistics, and parquet shard URLs with no extra setup.
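To make those four features concrete, here is a minimal sketch that builds the matching request URLs. It assumes the public Dataset Viewer API base URL and its rows, search, statistics, and parquet endpoints as documented by Hugging Face at the time of writing. The viewer_url helper is hypothetical and not part of Data Designer.

```python
from urllib.parse import urlencode

# Public base URL of the Hugging Face Dataset Viewer API (assumed unchanged).
VIEWER_API = "https://datasets-server.huggingface.co"


def viewer_url(endpoint: str, **params: object) -> str:
    """Build a Dataset Viewer API URL (hypothetical helper for illustration)."""
    return f"{VIEWER_API}/{endpoint}?{urlencode(params)}"


repo = "my-org/multilingual-greetings"
subset = {"dataset": repo, "config": "conversations", "split": "train"}

rows_url = viewer_url("rows", **subset, offset=0, length=10)  # row pagination
search_url = viewer_url("search", **subset, query="hola")     # text search
stats_url = viewer_url("statistics", **subset)                # column statistics
parquet_url = viewer_url("parquet", dataset=repo)             # parquet shard URLs

print(rows_url)
```

Each URL can be fetched with any HTTP client. For a private dataset you would also need to send your token in an Authorization header.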

What Gets Uploaded

[Figure: Push to Hub pipeline]

Everything. The upload pipeline runs in this order:

1. README.md          ← auto-generated dataset card
2. data/*.parquet     ← your main dataset (remapped from parquet-files/)
3. images/*           ← if you have image columns (skipped otherwise)
4. {processor}/*      ← processor outputs (remapped from processors-files/)
5. builder_config.json
6. metadata.json      ← paths rewritten to match HF repo layout

Each step is its own commit on the HF repo, so you get a clean history.

This is especially nice for large datasets. Data Designer writes output in batched parquet partitions — generate 100k records and you’ll have dozens of parquet files across parquet-files/, processors-files/, and maybe images/. Manually uploading all of that, organizing it into the right HF repo structure, writing the dataset card YAML configs, and rewriting metadata paths would be tedious and error-prone. push_to_hub handles the whole thing in one call — folder uploads, path remapping, config registration, dataset card generation, all of it.

Re-pushing to the same repo_id updates the existing repo — no need to delete and recreate.

Processors Get First-Class Treatment

Schema Transform for Conversation Training

Notice the SchemaTransformProcessorConfig in the example above. That’s doing the heavy lifting — it takes the raw greeting and response columns and reshapes each row into an OpenAI-style messages array:

config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)

The template is Jinja2 all the way down. Keys become columns in the output, values get rendered per-row with the actual column data. The template dict must be JSON-serializable — strings, lists, nested objects, all fair game. So you can build arbitrarily complex conversation schemas (multi-turn, system prompts, tool calls) just by adding more entries to the messages list.
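To see what "rendered per-row" means, here is a rough, dependency-free sketch. It is not the processor's actual code: a small regex stands in for Jinja2 and only handles plain {{ column }} placeholders, and the render helper and system prompt are made up for illustration. It walks a nested template, fills in one row, and adds a system message to show that extra turns are just more list entries.

```python
import json
import re

# Matches simple "{{ column }}" placeholders only; real Jinja2 supports far more.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(node, row):
    """Recursively render every string in a nested template against one row."""
    if isinstance(node, str):
        return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), node)
    if isinstance(node, list):
        return [render(item, row) for item in node]
    if isinstance(node, dict):
        return {key: render(value, row) for key, value in node.items()}
    return node  # numbers, booleans, and None pass through unchanged


template = {
    "messages": [
        {"role": "system", "content": "You are a friendly agent."},
        {"role": "user", "content": "{{ greeting }}"},
        {"role": "assistant", "content": "{{ response }}"},
    ]
}
json.dumps(template)  # raises TypeError if the template is not JSON-serializable

row = {"greeting": "Hey there!", "response": "Hi! How can I help?"}
rendered = render(template, row)
print(rendered["messages"][1])  # {'role': 'user', 'content': 'Hey there!'}
```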

The processor runs after each batch and writes its output to a separate parquet file alongside the main dataset. The main dataset (data/) still has the raw columns — the processor output is an additional view, not a replacement.

When you push to hub, each processor gets its own top-level directory and its own HF dataset config. So the conversations processor from our example ends up like this on HF:

my-org/multilingual-greetings/
├── README.md
├── data/
│   ├── batch_00000.parquet      ← raw columns (greeting, response)
│   └── batch_00001.parquet
├── conversations/
│   ├── batch_00000.parquet      ← transformed (messages array)
│   └── batch_00001.parquet
├── builder_config.json
└── metadata.json

The dataset card YAML frontmatter registers each processor as its own named config:

configs:
- config_name: data
  data_files: "data/*.parquet"
  default: true
- config_name: conversations
  data_files: "conversations/*.parquet"

So consumers grab exactly the format they need:

from datasets import load_dataset

# Raw columns: good for analysis
ds = load_dataset("my-org/multilingual-greetings", "data", split="train")

# Conversation format: ready for fine-tuning
ds_conv = load_dataset("my-org/multilingual-greetings", "conversations", split="train")
print(ds_conv[0])
# {'messages': [{'role': 'user', 'content': 'Hey! Como estás?'},
#               {'role': 'assistant', 'content': 'Hola! Estoy bien, gracias...'}]}

The Quick Start section in the generated README includes these snippets automatically — one load_dataset call per processor.

Metadata paths are rewritten too. Local paths like processors-files/conversations/batch_00000.parquet become conversations/batch_00000.parquet so file references in the metadata match the actual HF repo structure.
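The rewrite rule is simple enough to sketch in a few lines. This is an illustration of the mapping described in this post, not the client's real implementation. The to_repo_path helper is hypothetical, and it assumes that paths outside the two local folders (images, builder_config.json) stay unchanged.

```python
# Local artifact prefix -> repo prefix, taken from the layout described in this post.
PREFIX_MAP = {
    "parquet-files/": "data/",
    "processors-files/": "",  # processor folders move to the repo top level
}


def to_repo_path(local_path: str) -> str:
    """Rewrite a local artifact path to its Hub location (illustrative helper)."""
    for local_prefix, repo_prefix in PREFIX_MAP.items():
        if local_path.startswith(local_prefix):
            return repo_prefix + local_path[len(local_prefix):]
    return local_path  # assumption: anything else keeps its path


print(to_repo_path("processors-files/conversations/batch_00000.parquet"))
# conversations/batch_00000.parquet
print(to_repo_path("parquet-files/batch_00000.parquet"))
# data/batch_00000.parquet
```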

If there are no processors, all of this is silently skipped — no empty directories, no phantom configs.

The Auto-Generated Dataset Card

This is the fun part. The upload generates a full Hugging Face dataset card from your run metadata. It pulls from metadata.json and builder_config.json to build:

A Quick Start section with load_dataset code (including processor subsets)
A Dataset Summary with record count, column count, completion %
A Schema & Statistics table — per-column type, uniqueness, null rate, token stats
Generation Details — how many columns of each config type
A Citation block so people can cite your dataset

Tags default to ["synthetic", "datadesigner"] plus whatever you pass in. Size category (n<1K, 1K<n<10K, etc.) is auto-computed. These tags make your dataset discoverable in Hub search — you can browse all Data Designer datasets in one place.
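If you are curious how the tag list and size category could be derived, here is a small sketch. The bucket labels are the standard size_categories values used on Hub dataset cards. Treat the rest as assumptions: the boundary handling (an exact 10,000 lands in the upper bucket here), the de-duplication, and the helper names are illustrative guesses, not Data Designer's actual logic.

```python
DEFAULT_TAGS = ["synthetic", "datadesigner"]

# (exclusive upper bound, Hub size_categories label)
SIZE_BUCKETS = [
    (10**3, "n<1K"),
    (10**4, "1K<n<10K"),
    (10**5, "10K<n<100K"),
    (10**6, "100K<n<1M"),
    (10**7, "1M<n<10M"),
    (10**8, "10M<n<100M"),
    (10**9, "100M<n<1B"),
    (10**10, "1B<n<10B"),
    (10**11, "10B<n<100B"),
    (10**12, "100B<n<1T"),
]


def size_category(num_records: int) -> str:
    """Return the first bucket whose upper bound exceeds num_records."""
    for upper, label in SIZE_BUCKETS:
        if num_records < upper:
            return label
    return "n>1T"


def merge_tags(user_tags):
    """Defaults first, then user tags; duplicates dropped, order kept."""
    return list(dict.fromkeys(DEFAULT_TAGS + list(user_tags)))


print(merge_tags(["greetings", "multilingual", "synthetic"]))
# ['synthetic', 'datadesigner', 'greetings', 'multilingual']
print(size_category(5_000))
# 1K<n<10K
```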

The template lives at packages/data-designer/src/data_designer/integrations/huggingface/dataset_card_template.md if you want to see the Jinja2 source.

Auth

Token resolution follows the standard huggingface_hub chain:

1. Explicit token= parameter
2. HF_TOKEN env var
3. Cached creds from hf auth login

If none of those work, you get a clear error telling you what to do.
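The lookup order can be pictured with a short sketch. This is not how huggingface_hub is implemented internally. The resolve_token helper is hypothetical, and the cached-token location (~/.cache/huggingface/token) is the usual default, assumed here, which can differ if HF_HOME is set.

```python
import os
from pathlib import Path


def resolve_token(explicit=None, env=None, token_file=None):
    """Return the first token found, checking sources in priority order."""
    env = os.environ if env is None else env
    if explicit:  # 1. explicit token= parameter
        return explicit
    if env.get("HF_TOKEN"):  # 2. HF_TOKEN environment variable
        return env["HF_TOKEN"]
    # 3. token cached by `hf auth login` (assumed default location)
    token_file = token_file or Path.home() / ".cache" / "huggingface" / "token"
    if Path(token_file).is_file():
        return Path(token_file).read_text().strip() or None
    return None  # the caller should raise a helpful error here


print(resolve_token("hf_explicit", env={"HF_TOKEN": "hf_env"}))  # hf_explicit
print(resolve_token(env={"HF_TOKEN": "hf_env"}))                 # hf_env
```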

Reproducible Pipelines — The Round-Trip

[Figure: Round-trip reproducibility]

Here’s the payoff: every dataset you push includes builder_config.json, the full synthetic data generation (SDG) pipeline definition. Anyone (including future-you) can recreate the exact same pipeline from the Hugging Face URL:

import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder.from_config(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
)

That’s it. One call. from_config accepts a raw URL, a local file path, a dict, or a YAML string. When you hand it a Hugging Face Hub URL, it auto-rewrites the blob URL to a raw URL behind the scenes so the fetch just works (same trick for GitHub blob URLs).
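For the curious, a blob-to-raw rewrite might look roughly like the sketch below. The exact rules from_config applies are not shown in this post, so this is an assumption. The sketch relies on two well-known URL conventions: the Hub serves file content under /resolve/<revision>/, and GitHub serves it from raw.githubusercontent.com. The to_raw_url helper and the GitHub example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(url: str) -> str:
    """Turn a web 'blob' URL into a directly fetchable one (assumed rules)."""
    parts = urlsplit(url)
    if parts.netloc == "huggingface.co" and "/blob/" in parts.path:
        # The Hub serves file content under /resolve/<revision>/.
        new_path = parts.path.replace("/blob/", "/resolve/", 1)
        return urlunsplit(parts._replace(path=new_path))
    if parts.netloc == "github.com" and "/blob/" in parts.path:
        # GitHub serves raw content from a separate host, without /blob/.
        new_path = parts.path.replace("/blob/", "/", 1)
        return urlunsplit(
            parts._replace(netloc="raw.githubusercontent.com", path=new_path)
        )
    return url  # already raw, or a host this sketch does not handle


hub_raw = to_raw_url(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
)
print(hub_raw)
```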

The loaded config builder comes back fully hydrated — columns, model configs, constraints, seed config, all of it. You can inspect it, tweak it, and re-run:

from data_designer.interface import DataDesigner

# Maybe bump the count or swap a model
results = DataDesigner().create(config_builder, num_records=50_000)

# And push the new version right back
results.push_to_hub(
    "my-org/multilingual-greetings-v2",
    "50k version with the same pipeline.",
)

So the full loop is: design → generate → push → share URL → recreate → iterate. The builder_config.json on Hugging Face is the reproducibility artifact.

Gotchas

repo_id must be namespace/dataset-name, where the namespace is your username or an organization such as my-org, with exactly one slash. The client validates this before hitting the network.
description is required — it’s the prose that appears right under the title on the dataset card. Make it good.
private=True if you don’t want the world to see your dataset yet. You can flip it to public later from the dataset settings page.
Metadata paths get rewritten — local paths like parquet-files/batch_00000.parquet become data/batch_00000.parquet in the uploaded metadata.json so references stay valid on HF.
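If you want to catch a bad repo_id in your own scripts before calling push_to_hub, a check along these lines would do. This is a sketch of the "exactly one slash" rule only. The validate_repo_id helper is hypothetical, and the real client may enforce additional naming rules.

```python
def validate_repo_id(repo_id: str) -> str:
    """Raise ValueError unless repo_id looks like 'namespace/dataset-name'."""
    namespace, sep, name = repo_id.partition("/")
    if not sep or not namespace or not name or "/" in name:
        raise ValueError(
            f"Invalid repo_id {repo_id!r}: expected 'namespace/dataset-name' "
            "with exactly one slash."
        )
    return repo_id


for candidate in ["my-org/multilingual-greetings", "multilingual-greetings", "a/b/c"]:
    try:
        validate_repo_id(candidate)
        print(f"ok:  {candidate}")
    except ValueError as err:
        print(f"bad: {err}")
```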