curate/nemo_curator Configuration#
The step reads YAML from src/nemotron/steps/curate/nemo_curator/config/.
| File | Purpose |
|---|---|
| | Curator-container initial validation configuration. Optional filters are disabled. Override |
| | Example Hugging Face snapshot workflow for FineWeb-Edu-style JSONL with language and word-count filters enabled. |
Top-Level Fields#
- input_glob#
JSONL file path or glob passed to the NeMo Curator `JsonlReader`.
- output_dir#
Directory where NeMo Curator writes JSONL output shards.
- text_field#
Record field containing the text to curate. Default: `text`.
- dataset#
Optional keyword arguments passed to `huggingface_hub.snapshot_download`. Set `dataset: null` for local-input-only runs.
Common keys are `repo_id`, `repo_type`, `local_dir`, and `allow_patterns`.
- language_codes#
Uppercase language codes to keep. Set `language_codes: []` to skip FastText language identification and language filtering. When this list is non-empty, `models.fasttext_langid` must point at a FastText language identification model.
- domains#
Domains to keep through the NeMo Curator `MultilingualDomainClassifier`. Set `domains: []` to skip domain classification.
- quality_filters#
Optional quality settings. `min_langid_score` applies when language filtering is enabled. `min_words` and `max_words` enable word-count filtering and must be set together. Set `quality_filters: {}` to skip word-count filtering.
- models#
Optional model and cache paths. Common keys:
Set `fasttext_langid` to the path of the FastText language identification model.
Set `hf_cache_dir` to the Hugging Face model cache directory for classifier assets.
- ray.num_cpus#
Optional Ray CPU count. If omitted, the Lepton curate profile can provide `NEMOTRON_CURATOR_RAY_NUM_CPUS`.
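The constraints in the field descriptions above can be checked before launching a run. The following is a minimal sketch, not the step's own validation code: `check_config` and `resolve_ray_num_cpus` are hypothetical helper names, the configuration is assumed to be already parsed into a plain dictionary, and the precedence of `ray.num_cpus` over the environment variable is an assumption.

```python
from typing import Any, List, Mapping, Optional


def check_config(config: Mapping[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper, written only from the field descriptions above.
    """
    problems: List[str] = []
    models = config.get("models") or {}
    quality = config.get("quality_filters") or {}

    # A non-empty language_codes list needs a FastText langid model path.
    if config.get("language_codes") and not models.get("fasttext_langid"):
        problems.append("language_codes is set but models.fasttext_langid is missing")

    # min_words and max_words must be set together.
    if ("min_words" in quality) != ("max_words" in quality):
        problems.append("min_words and max_words must be set together")

    return problems


def resolve_ray_num_cpus(
    config: Mapping[str, Any], env: Mapping[str, str]
) -> Optional[int]:
    """Use ray.num_cpus when present, else NEMOTRON_CURATOR_RAY_NUM_CPUS.

    The precedence order here is an assumption, not documented behavior.
    """
    ray_cfg = config.get("ray") or {}
    if ray_cfg.get("num_cpus") is not None:
        return int(ray_cfg["num_cpus"])
    value = env.get("NEMOTRON_CURATOR_RAY_NUM_CPUS")
    return int(value) if value else None
```

Passing `os.environ` as `env` would apply the fallback to the current process environment; taking the mapping as a parameter keeps the helper easy to test.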
Minimal Local Configuration#
language_codes: []
domains: []
text_field: text
input_glob: ./data/**/*.jsonl
output_dir: ./output/curated-jsonl
dataset: null
models: {}
quality_filters: {}
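To illustrate why the minimal configuration above runs no optional filtering, here is a small sketch that maps the empty values to disabled stages. The function `enabled_optional_stages` and the stage labels it returns are hypothetical names chosen for illustration; they are not identifiers from the step or from NeMo Curator.

```python
from typing import Any, List, Mapping


def enabled_optional_stages(config: Mapping[str, Any]) -> List[str]:
    """List the optional stages a configuration would turn on.

    Stage labels are illustrative only, not names used by the step.
    """
    stages: List[str] = []
    if config.get("dataset"):  # null means local-input-only
        stages.append("snapshot_download")
    if config.get("language_codes"):  # [] skips language identification
        stages.append("language_filter")
    if config.get("domains"):  # [] skips domain classification
        stages.append("domain_classifier")
    quality = config.get("quality_filters") or {}
    if "min_words" in quality and "max_words" in quality:
        stages.append("word_count_filter")
    return stages


# The minimal local configuration, written as a Python dictionary.
minimal = {
    "language_codes": [],
    "domains": [],
    "text_field": "text",
    "input_glob": "./data/**/*.jsonl",
    "output_dir": "./output/curated-jsonl",
    "dataset": None,
    "models": {},
    "quality_filters": {},
}

print(enabled_optional_stages(minimal))  # prints []
```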
Filtered Configuration#
language_codes:
- EN
domains: []
text_field: text
input_glob: ./data/**/*.jsonl
output_dir: ./output/curated-jsonl
dataset: null
models:
  fasttext_langid: ./cache/models/fasttext/lid.176.bin
  hf_cache_dir: ./cache/huggingface
quality_filters:
  min_langid_score: 0.3
  min_words: 50
  max_words: 5000
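As a rough illustration of what the `min_words` and `max_words` settings in the filtered configuration express, the sketch below applies a word-count bound to a single record. It is not NeMo Curator's implementation: `passes_word_count` is a hypothetical helper, and both whitespace tokenization and inclusive bounds are assumptions that the actual filter may not share.

```python
from typing import Any, Mapping


def passes_word_count(record: Mapping[str, Any], config: Mapping[str, Any]) -> bool:
    """Return True when the record's text falls inside the word-count bounds.

    Assumes whitespace-separated words and inclusive bounds (both unverified).
    """
    text = record.get(config.get("text_field", "text"), "")
    quality = config.get("quality_filters") or {}
    if "min_words" not in quality or "max_words" not in quality:
        return True  # word-count filtering is skipped
    n_words = len(text.split())
    return quality["min_words"] <= n_words <= quality["max_words"]


filtered = {
    "text_field": "text",
    "quality_filters": {"min_langid_score": 0.3, "min_words": 50, "max_words": 5000},
}

short_doc = {"text": "too short to keep"}
long_enough = {"text": " ".join(["word"] * 50)}

print(passes_word_count(short_doc, filtered))    # prints False
print(passes_word_count(long_enough, filtered))  # prints True
```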