Enable Curation Filters
Start with no filters, confirm JSONL input and output, then add one filter family at a time.
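Before the first unfiltered run, a quick shape check can catch malformed input early. The sketch below flags lines that are neither blank nor wrapped in braces. It is a cheap heuristic rather than a JSON parser, it assumes one JSON object per line, and `suspect_lines` is a hypothetical helper that is not part of the nemotron CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print file:line:content for every line under a directory
# that is neither blank nor of the form { ... }. Heuristic only, not a JSON parser.
suspect_lines() {
  find "$1" -type f -name '*.jsonl' \
    -exec grep -H -n -v -E '^[[:space:]]*(\{.*\})?[[:space:]]*$' {} + || true
}

# Demo on a throwaway sample (the "id" field is illustrative only).
demo=$(mktemp -d)
printf '%s\n' '{"id": 1}' 'not json' '' '{"id": 2' > "$demo/a.jsonl"
suspect_lines "$demo"   # reports lines 2 and 4 of a.jsonl
```

An empty result means every non-blank line at least looks like a JSON object. The same check can be pointed at the output directory after the unfiltered run.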
Language Filtering
Set language_codes to a list of uppercase language codes (for example, EN) and point models.fasttext_langid at a local FastText language identification model file.
$ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny \
input_glob="${PWD}/data/my_corpus/**/*.jsonl" \
output_dir="${PWD}/output/curated-en" \
language_codes=[EN] \
models.fasttext_langid="${PWD}/cache/models/fasttext/lid.176.bin" \
quality_filters.min_langid_score=0.3
Set language_codes=[] to skip FastText language identification entirely.
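Because the step expects uppercase codes, a small wrapper can normalize them and build the override as a single quoted argument. Quoting also keeps the shell from treating the square brackets as a glob pattern (zsh, for example, aborts with `no matches found` on an unmatched pattern by default). `langs_override` is a hypothetical helper, and the comma-separated form for several codes is an assumption extrapolated from the single-code example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "en de" into language_codes=[EN,DE].
# The comma-separated multi-code syntax is an assumption, not confirmed by the step's docs.
langs_override() {
  local joined
  # Join all arguments with commas, then uppercase the result.
  joined=$(IFS=,; printf '%s' "$*" | tr '[:lower:]' '[:upper:]')
  printf 'language_codes=[%s]\n' "$joined"
}

langs_override en de   # prints language_codes=[EN,DE]
langs_override         # prints language_codes=[]
```

In the command above, the literal override could then be replaced with `"$(langs_override en)"`, keeping the double quotes.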
Word-Count Filtering
Set both quality_filters.min_words and quality_filters.max_words.
The step raises an error if only one of those keys is present.
$ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny \
input_glob="${PWD}/data/my_corpus/**/*.jsonl" \
output_dir="${PWD}/output/curated-word-count" \
quality_filters.min_words=50 \
quality_filters.max_words=5000
Set quality_filters={} to skip word-count filtering.
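To get a feel for reasonable bounds before running the step, word counts can be approximated locally. This sketch counts whitespace-separated tokens with `wc -w` and treats both bounds as inclusive. Both choices are assumptions, since the step's own tokenization and boundary handling may differ. `within_word_bounds` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the text has between min and max words.
# Assumes whitespace tokenization and inclusive bounds; the real step may differ.
within_word_bounds() {
  local text="$1" min="${2:-}" max="${3:-}" n
  # Mirror the step's rule that both bounds must be given together.
  if [ -z "$min" ] || [ -z "$max" ]; then
    echo "set both min and max" >&2
    return 2
  fi
  n=$(printf '%s' "$text" | wc -w)
  n=$((n))   # strip any padding that wc adds on some platforms
  [ "$n" -ge "$min" ] && [ "$n" -le "$max" ]
}

if within_word_bounds "one two three four five" 3 10; then echo keep; else echo drop; fi   # prints keep
if within_word_bounds "too short" 3 10; then echo keep; else echo drop; fi                 # prints drop
```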
Domain Filtering
Set domains to the domains you want to keep.
The step uses NeMo Curator’s multilingual domain classifier.
$ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny \
input_glob="${PWD}/data/my_corpus/**/*.jsonl" \
output_dir="${PWD}/output/curated-domain" \
domains=[STEM] \
models.hf_cache_dir="${PWD}/cache/huggingface"
Tip
Keep the first domain-filtered run small, because the classifier may download and cache model assets on first use.
Filter Order
The step applies filters in this order:
1. FastText language identification and language filtering, when language_codes is non-empty.
2. Word-count filtering, when quality_filters.min_words and quality_filters.max_words are both set.
3. Multilingual domain classification, when domains is non-empty.
When the output is unexpectedly small, disable filters starting from the end of this order (domain filtering first, then word-count filtering), and only then relax thresholds.
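One way to see which filter removes the most data is to write each configuration to its own output directory and compare how many records survive against an unfiltered baseline. The sketch below assumes the outputs are `.jsonl` files with one record per non-blank line. `count_lines`, `retention`, and the directory names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count non-blank lines across all .jsonl files under a directory.
count_lines() {
  find "$1" -type f -name '*.jsonl' -exec awk 'NF' {} + | wc -l
}

# Hypothetical helper: report how many baseline records survive in a filtered run.
retention() {
  local base kept
  base=$(( $(count_lines "$1") ))
  kept=$(( $(count_lines "$2") ))
  if [ "$base" -eq 0 ]; then
    echo "no baseline records" >&2
    return 1
  fi
  # Integer percentage, rounded down.
  echo "${kept}/${base} records kept ($(( 100 * kept / base ))%)"
}

# Demo on throwaway data (the "id" field is illustrative only).
demo=$(mktemp -d)
mkdir -p "$demo/baseline" "$demo/curated-en"
printf '%s\n' '{"id": 1}' '{"id": 2}' '{"id": 3}' '{"id": 4}' > "$demo/baseline/part.jsonl"
printf '%s\n' '{"id": 1}' > "$demo/curated-en/part.jsonl"
retention "$demo/baseline" "$demo/curated-en"   # prints 1/4 records kept (25%)
```

Running `retention` for each filtered output against the same baseline shows where the largest drop occurs, which indicates the filter or threshold to revisit first.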