Run Curation on Local JSONL
Use this path when your corpus already exists as local JSONL files.
Input Requirements
Each input record must contain the configured text_field.
The default field name is text.
Example record:
{"id": "doc-001", "text": "The text to keep, filter, or route downstream."}
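Before launching a run, it can help to check that every record actually carries the configured field. The following is a minimal sketch using only the Python standard library. The helper name find_bad_records is hypothetical and is not part of the nemotron CLI or NeMo Curator.

```python
import json
from pathlib import Path


def find_bad_records(path, text_field="text"):
    """Return (line_number, reason) pairs for records that lack the text field.

    Hypothetical pre-flight helper; not part of the curation step itself.
    """
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict) or text_field not in record:
                problems.append((line_number, f"missing field {text_field!r}"))
    return problems
```

Calling find_bad_records("data/my_corpus/part-0.jsonl") on a clean file should return an empty list. The file name here is only an illustration.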
Minimal Local Run
Start with optional filters disabled. This verifies the reader and writer path before adding model-backed filters.
$ uv sync --extra curate
$ export RAY_ENABLE_UV_RUN_RUNTIME_ENV=0
$ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny \
input_glob="${PWD}/data/my_corpus/**/*.jsonl" \
output_dir="${PWD}/output/curated-jsonl" \
text_field=text \
language_codes=[] \
domains=[] \
quality_filters={}
Use an absolute path or a path relative to the repository root. When running on a remote executor, make sure the path exists inside the container or shared mount.
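A quick way to confirm that the glob resolves to real files is to expand it yourself before the run. This sketch uses Python's glob module with recursive=True. Whether the curation step expands `**` exactly the same way is an assumption, and resolve_input_glob is a hypothetical helper name.

```python
import glob
import os


def resolve_input_glob(pattern):
    """Expand a glob to a sorted list of absolute file paths.

    With recursive=True, '**' matches zero or more nested directories.
    Assumption: the curation step resolves the pattern similarly.
    """
    absolute_pattern = os.path.abspath(os.path.expanduser(pattern))
    matches = glob.glob(absolute_pattern, recursive=True)
    return sorted(path for path in matches if os.path.isfile(path))
```

If resolve_input_glob("data/my_corpus/**/*.jsonl") returns an empty list when run from the repository root, the step is unlikely to find any input either.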
Add CPU Resources
For local runs, you can set the Ray CPU count in YAML or as a CLI override:
$ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny \
input_glob="${PWD}/data/my_corpus/**/*.jsonl" \
output_dir="${PWD}/output/curated-jsonl" \
ray.num_cpus=4
For generated Lepton profiles, NEMOTRON_CURATOR_RAY_NUM_CPUS can provide the CPU count when ray.num_cpus is omitted.
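The fallback order described above can be illustrated with a small sketch. This is not the tool's actual resolution code. The function name effective_num_cpus and the behavior when neither value is set (returning None) are assumptions made for illustration.

```python
import os


def effective_num_cpus(configured_value=None, env=None):
    """Illustrate the fallback: explicit ray.num_cpus first, then the env var.

    Hypothetical helper; returning None when nothing is set is an assumption.
    """
    env = os.environ if env is None else env
    if configured_value is not None:
        return int(configured_value)  # ray.num_cpus was provided explicitly
    raw = env.get("NEMOTRON_CURATOR_RAY_NUM_CPUS")
    return int(raw) if raw else None
```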
Validate Output
After the run:
Confirm that output shards exist under output_dir.
Count records before and after filtering.
Inspect a few output records to confirm that the text_field is present and not empty.
If the output is empty, run again with language_codes=[], domains=[], and quality_filters={} before enabling filters.
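The record counts and the empty-text check can be scripted. The sketch below assumes that the output shards are JSONL files, which the page does not state explicitly. The helper name count_records is hypothetical.

```python
import glob
import json
import os


def count_records(pattern, text_field="text"):
    """Return (total_records, records_with_missing_or_empty_text).

    Assumption: every file matched by the pattern is JSONL, one record per line.
    """
    total = 0
    empty = 0
    for path in sorted(glob.glob(pattern, recursive=True)):
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # ignore blank lines
                total += 1
                record = json.loads(line)
                value = record.get(text_field) if isinstance(record, dict) else None
                if not isinstance(value, str) or not value.strip():
                    empty += 1
    return total, empty
```

Running count_records once on the input glob and once on a glob under output_dir gives the before and after totals. A nonzero second value in the output result points at records whose text field is missing or blank.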