About Data Curation With NeMo Curator
The nemotron steps run curate/nemo_curator command reads JSONL data, optionally materializes a Hugging Face dataset snapshot, applies lightweight NeMo Curator filters, and writes filtered JSONL shards for downstream translation or training data preparation.
Use this step when you already have JSONL records and need a small, repeatable curation pass before a later step such as translate/nemo_curator, data_prep/pretrain_prep, or data_prep/sft_packing.
When to Use
Use curate/nemo_curator when you need:
A local JSONL reader and writer path using NeMo Curator.
Optional FastText language identification and language filtering.
Optional word-count filtering.
Optional multilingual domain classification and filtering.
Optional Hugging Face dataset snapshot download before the Curator reader runs.
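To make the word-count option concrete, here is a minimal plain-Python sketch of what such a filter does to JSONL records. It is a stand-in for illustration only and does not call NeMo Curator; the function names, the `min_words` and `max_words` parameters, and whitespace-based word splitting are all assumptions rather than the step's actual behavior.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def keep_by_word_count(record: dict, min_words: int, max_words: int,
                       text_field: str = "text") -> bool:
    """Return True when the whitespace-split word count is within bounds."""
    count = len(str(record.get(text_field, "")).split())
    return min_words <= count <= max_words


def curate_file(src: Path, dst: Path, min_words: int = 3,
                max_words: int = 1000) -> int:
    """Write records that pass the filter to dst and return how many were kept."""
    kept = 0
    dst.parent.mkdir(parents=True, exist_ok=True)
    with dst.open("w", encoding="utf-8") as out:
        for record in read_jsonl(src):
            if keep_by_word_count(record, min_words, max_words):
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept
```

The real step may count words differently, for example with language-aware tokenization, so treat the bounds here as illustrative only.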
Note
This step is intentionally lightweight. It does not crawl web pages, extract Common Crawl WARC files, or run large deduplication workflows. Use a dedicated Curator recipe for those jobs before this step, or add a separate step when that behavior is needed.
Pipeline Summary
flowchart LR
A[Optional Hugging Face snapshot] --> B[JSONL files]
C[Local JSONL files] --> B
B --> D[JsonlReader]
D --> E{Language filter enabled?}
E -->|yes| F[FastText language ID]
E -->|no| G{Word-count filter enabled?}
F --> G
G -->|yes| H[WordCountFilter]
G -->|no| I{Domain filter enabled?}
H --> I
I -->|yes| J[MultilingualDomainClassifier]
I -->|no| K[JsonlWriter]
J --> K
K --> L[Filtered JSONL shards]
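The branching in the flowchart can be read as "append a stage only when its filter is enabled." The sketch below expresses that logic in plain Python. `CurationConfig` and its fields are hypothetical names chosen for illustration (only `language_codes` appears in this documentation), and the stage labels are copied from the diagram rather than from any NeMo Curator API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CurationConfig:
    """Hypothetical switches; the real YAML field names may differ."""
    language_codes: List[str] = field(default_factory=list)
    min_words: Optional[int] = None
    max_words: Optional[int] = None
    domains: List[str] = field(default_factory=list)


def plan_stages(config: CurationConfig) -> List[str]:
    """Return the ordered stage labels that the flowchart would visit."""
    stages = ["JsonlReader"]
    if config.language_codes:  # language filter enabled?
        stages.append("FastText language ID")
    if config.min_words is not None or config.max_words is not None:
        stages.append("WordCountFilter")  # word-count filter enabled?
    if config.domains:  # domain filter enabled?
        stages.append("MultilingualDomainClassifier")
    stages.append("JsonlWriter")
    return stages
```

With every filter disabled, the plan reduces to the reader followed by the writer, which matches the all-"no" path through the diagram.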
Documentation Series
Install the Nemotron CLI, run a small local JSONL curation check, and inspect the output shards.
Run local JSONL curation, download a Hugging Face snapshot, and enable optional filters.
YAML parameters, CLI syntax, input/output format, and troubleshooting.
All Documentation
| Guide | What you do |
|---|---|
| Getting Started With Data Curation | Run |
| Guide | Focus |
|---|---|
| Run Curation on Local JSONL | Local JSONL reader/writer path |
| Use a Hugging Face Snapshot | |
| Enable Curation Filters | Language, word-count, and domain filters |
| Guide | Content |
|---|---|
| | YAML field reference |
| curate/nemo_curator CLI | |
| | Input and output shapes |
| | Common failures and fixes |
What You Need
JSONL input with one text field, usually named text.
Optional model assets when filters are enabled, such as a FastText language identification model for language_codes.
A writable output directory for JSONL shards.
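Before running the step, it can help to confirm the first and last prerequisites above. The following sketch samples an input file for a string text field and checks that the output directory is writable. It is not part of the Nemotron CLI; `check_inputs`, `sample_size`, and the message wording are assumptions made for illustration.

```python
import json
import os
from pathlib import Path
from typing import List


def check_inputs(jsonl_path: Path, output_dir: Path, text_field: str = "text",
                 sample_size: int = 100) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    if not jsonl_path.is_file():
        return [f"input file not found: {jsonl_path}"]
    problems = []
    checked = 0
    with jsonl_path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            if checked >= sample_size:
                break  # only sample the first records
            checked += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {line_no}: not valid JSON")
                continue
            if not isinstance(record, dict) or not isinstance(record.get(text_field), str):
                problems.append(f"line {line_no}: missing string field '{text_field}'")
    # Create the output directory if needed, then confirm it is writable.
    output_dir.mkdir(parents=True, exist_ok=True)
    if not os.access(output_dir, os.W_OK):
        problems.append(f"output directory is not writable: {output_dir}")
    return problems
```

If your records use a different field name, pass it as `text_field` so the check matches what the reader will be configured to use.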
Quick Paths
First local run: Getting Started With Data Curation
Local corpus setup: Run Curation on Local JSONL
Hugging Face snapshot setup: Use a Hugging Face Snapshot
Filter setup: Enable Curation Filters
Lookup flags: curate/nemo_curator CLI