curate/nemo_curator CLI
Syntax
uv run --no-sync nemotron steps run curate/nemo_curator \
[-c <config-name-or-path>] \
[-r <run-profile> | -b <batch-profile>] \
[-d] \
[<dotlist-overrides>...]
Use -c tiny for a small initial validation configuration and -c default for the Hugging Face snapshot example.
Refer to Nemotron Steps CLI Reference for the shared flag set.
Common Commands
Show the step contract:
$ uv run --no-sync nemotron steps show curate/nemo_curator
Run a local JSONL initial validation:
$ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny \
input_glob="${PWD}/src/nemotron/steps/curate/nemo_curator/data/tiny.jsonl" \
output_dir="${PWD}/output/curate-tiny"
Run on Lepton with the generated Curator profile:
$ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny --batch lepton_curate
Run against local corpus shards:
$ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny \
input_glob="${PWD}/data/my_corpus/**/*.jsonl" \
output_dir="${PWD}/output/curated-jsonl" \
text_field=text \
language_codes=[] \
domains=[] \
quality_filters={}
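Before a run against local shards, it can help to confirm that the input directory actually contains JSONL files. The following is a minimal sketch and not part of the nemotron CLI: `count_jsonl` is a hypothetical helper, `CORPUS_DIR` is a hypothetical variable, and the sketch assumes the `**/*.jsonl` pattern is meant to match files at any depth under the corpus directory.

```shell
# Hypothetical helper (not provided by nemotron): count *.jsonl files
# at any depth under the directory given as the first argument.
count_jsonl() {
  find "$1" -type f -name '*.jsonl' | wc -l | tr -d ' '
}

# CORPUS_DIR is an assumed variable; it falls back to the current directory.
count_jsonl "${CORPUS_DIR:-$PWD}"
```

A count of 0 suggests that the path or pattern passed as input_glob needs checking before the step is launched.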
Dotlist Overrides
All YAML fields can be overridden from the command line with key=value syntax.
Examples:
input_glob=/data/**/*.jsonl
output_dir=/output/curated
text_field=body
language_codes=[EN]
quality_filters.min_words=50
quality_filters.max_words=5000
ray.num_cpus=4
Use shell quoting around globs or lists when your shell expands them unexpectedly.
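The effect of quoting can be checked without running the step. The sketch below uses `show_args`, a hypothetical helper that is not part of nemotron, to print each argument exactly as a command would receive it. Unquoted values such as `language_codes=[EN]` or `/data/**/*.jsonl` are glob patterns to the shell: bash replaces them with matching file names when any exist, and zsh in its default configuration reports an error when nothing matches. Single quotes pass them through unchanged.

```shell
# Hypothetical helper (not part of nemotron): print each argument on its
# own line, as received after the shell has finished its expansions.
show_args() {
  printf '%s\n' "$@"
}

# Single quotes keep glob characters and bracket lists literal.
show_args 'input_glob=/data/**/*.jsonl' 'language_codes=[EN]' 'quality_filters.min_words=50'
```

If the printed values match what was typed, the same quoted arguments can be appended to the `nemotron steps run` command as overrides.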