About Data Curation With NeMo Curator

The nemotron steps run curate/nemo_curator command optionally downloads a Hugging Face dataset snapshot, reads JSONL data, applies lightweight NeMo Curator filters, and writes the filtered records as JSONL shards for downstream translation or training data preparation.

Use this step when you already have JSONL records and need a small, repeatable curation pass before a later step such as translate/nemo_curator, data_prep/pretrain_prep, or data_prep/sft_packing.

When to Use

Use curate/nemo_curator when you need:

  • A local JSONL reader and writer path using NeMo Curator.

  • Optional FastText language identification and language filtering.

  • Optional word-count filtering.

  • Optional multilingual domain classification and filtering.

  • Optional Hugging Face dataset snapshot download before the Curator reader runs.

Note

This step is intentionally lightweight. It does not crawl web pages, extract Common Crawl WARC files, or run large deduplication workflows. Use a dedicated Curator recipe for those jobs before this step, or add a separate step when that behavior is needed.
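As an illustration of what a word-count filter does to a record set, here is a minimal plain-Python sketch. It is not the NeMo Curator implementation: the function names, the min_words and max_words parameters, and whitespace-based word counting are all assumptions made for this example.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, simplified definition of a word)."""
    return len(text.split())


def filter_by_word_count(records, min_words=2, max_words=1000, text_field="text"):
    """Keep records whose text field holds between min_words and max_words words."""
    kept = []
    for record in records:
        n = word_count(record.get(text_field, ""))
        if min_words <= n <= max_words:
            kept.append(record)
    return kept


# Hypothetical in-memory records standing in for parsed JSONL lines.
records = [
    {"text": "short"},
    {"text": "this record has enough words to pass"},
    {"text": ""},
]
kept = filter_by_word_count(records, min_words=3, max_words=50)
print(kept)
```

The language and domain filters follow the same keep-or-drop pattern, but they depend on model assets, so they are not sketched here.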

Pipeline Summary

flowchart LR
    A[Optional Hugging Face snapshot] --> B[JSONL files]
    C[Local JSONL files] --> B
    B --> D[JsonlReader]
    D --> E{Language filter enabled?}
    E -->|yes| F[FastText language ID]
    E -->|no| G{Word-count filter enabled?}
    F --> G
    G -->|yes| H[WordCountFilter]
    G -->|no| I{Domain filter enabled?}
    H --> I
    I -->|yes| J[MultilingualDomainClassifier]
    I -->|no| K[JsonlWriter]
    J --> K
    K --> L[Filtered JSONL shards]
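The control flow in the flowchart can be sketched in plain Python: each optional stage is skipped when it is disabled, and enabled stages run in the fixed order language, word count, domain. This is a sketch under stated assumptions, not the step's actual code. The helpers build_stages and run_pipeline are hypothetical, the predicates are placeholders, and the lang field is pre-filled here, whereas the real step would derive the language with a FastText model.

```python
import json
import tempfile
from pathlib import Path


def build_stages(language_filter=None, word_count_filter=None, domain_filter=None):
    """Return enabled stages in flowchart order; stages left as None are skipped."""
    ordered = [
        ("language", language_filter),
        ("word_count", word_count_filter),
        ("domain", domain_filter),
    ]
    return [(name, keep) for name, keep in ordered if keep is not None]


def run_pipeline(input_path, output_path, stages):
    """Read JSONL, keep records that pass every enabled stage, and write JSONL."""
    kept = 0
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if all(keep(record) for _, keep in stages):
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.jsonl"
    out = Path(tmp) / "filtered.jsonl"
    src.write_text(
        '{"text": "hello world again", "lang": "en"}\n'
        '{"text": "hola", "lang": "es"}\n',
        encoding="utf-8",
    )
    stages = build_stages(
        # Placeholder predicates; the domain filter is left disabled.
        language_filter=lambda r: r.get("lang") == "en",
        word_count_filter=lambda r: len(r["text"].split()) >= 2,
    )
    kept_count = run_pipeline(src, out, stages)
    stage_names = [name for name, _ in stages]
    output_lines = out.read_text(encoding="utf-8").splitlines()

print(stage_names, kept_count)
```

With every stage disabled, build_stages returns an empty list and run_pipeline copies all records through, which matches the all-"no" path from JsonlReader to JsonlWriter in the diagram.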
    

Documentation Series

Tutorial

Install the Nemotron CLI, validate the setup by running curation on a tiny local JSONL file, and inspect the output shards.

Getting Started With Data Curation

How-To Guides

Run local JSONL curation, download a Hugging Face snapshot, and enable optional filters.

Curation How-To Guides

Reference

YAML parameters, CLI syntax, input/output format, and troubleshooting.

Curation Reference

All Documentation

| Guide | What you do |
| --- | --- |
| Getting Started With Data Curation | Run curate/nemo_curator on the packaged tiny JSONL fixture |

| Guide | Focus |
| --- | --- |
| Run Curation on Local JSONL | Local JSONL reader/writer path |
| Use a Hugging Face Snapshot | dataset block and Hugging Face snapshot download |
| Enable Curation Filters | Language, word-count, and domain filters |

| Guide | Content |
| --- | --- |
| curate/nemo_curator Configuration | YAML field reference |
| curate/nemo_curator CLI | nemotron steps run curate/nemo_curator syntax |
| Curation Input and Output Format | Input and output shapes |
| Curation Troubleshooting | Common failures and fixes |

What You Need

  • JSONL input in which each record has a text field, usually named text.

  • Optional model assets when filters are enabled, such as a FastText language identification model for language_codes.

  • A writable output directory for JSONL shards.
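Before running the step, it can help to confirm the first and last prerequisites with a quick check. The following is a minimal sketch using only the Python standard library. The helper names check_jsonl and output_dir_is_writable are hypothetical and not part of the Nemotron CLI, and the sample file is created only for the demonstration.

```python
import json
import os
import tempfile
from pathlib import Path


def check_jsonl(path, text_field="text"):
    """Return (line_number, problem) pairs for lines that are not usable records."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((number, "invalid JSON"))
                continue
            if not isinstance(record, dict) or not isinstance(record.get(text_field), str):
                problems.append((number, f"missing string field '{text_field}'"))
    return problems


def output_dir_is_writable(path):
    """Create the directory if needed and report whether it can be written to."""
    os.makedirs(path, exist_ok=True)
    return os.access(path, os.W_OK)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"text": "ok"}\nnot json\n{"body": "wrong field"}\n',
        encoding="utf-8",
    )
    problems = check_jsonl(sample)
    writable = output_dir_is_writable(Path(tmp) / "out")

print(problems, writable)
```

If the corpus uses a different field name, pass it as text_field instead of renaming the data.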

Quick Paths

  1. First local run: Getting Started With Data Curation

  2. Local corpus setup: Run Curation on Local JSONL

  3. Hugging Face snapshot setup: Use a Hugging Face Snapshot

  4. Filter setup: Enable Curation Filters

  5. Lookup flags: curate/nemo_curator CLI