Getting Started With Data Curation#
What You’ll Build: a filtered set of JSONL shards from the packaged tiny fixture. The run reads the fixture, passes it through the NeMo Curator pipeline with all optional filters disabled, and writes output shards to a local directory.
In this tutorial, you will:
Clone the repository and install dependencies.
Configure the Ray runtime environment.
Inspect the packaged fixture.
Run the curation step with the tiny configuration.
Inspect the output shards.
This tutorial requires approximately 5 minutes to complete.
Run the curate/nemo_curator step on the packaged tiny JSONL fixture, then show me
the names and record counts of the output shards.
Prerequisites#
The
uvtool is available in your shell.
Procedure#
Clone the repository, if you haven’t already:
$ git clone https://github.com/NVIDIA-NeMo/Nemotron && cd Nemotron
Install the dependencies for curating data:
$ uv sync --extra curate
Set the Ray runtime environment variable so Ray workers reuse the synchronized project environment:
$ export RAY_ENABLE_UV_RUN_RUNTIME_ENV=0
Inspect the packaged fixture at
src/nemotron/steps/curate/nemo_curator/data/tiny.jsonl. Each record contains atextfield.{"text":"This is a small English document for the curation smoke test with enough words for optional quality filters."} {"text":"Another English sample about mathematics, banking, and clean data processing for a tiny curation validation run."}
Run the curation step. The
tinyconfiguration disables optional language, word-count, and domain filters, making this run a baseline validation of the reader, writer, and Ray startup. The checked-intiny.yamluses container paths, so overrideinput_globandoutput_dirwith host paths:$ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny \ input_glob="${PWD}/src/nemotron/steps/curate/nemo_curator/data/tiny.jsonl" \ output_dir="${PWD}/output/curate-tiny"
Inspect the output directory:
$ find output/curate-tiny -type f
Open a shard and confirm that records contain the configured
text_field. NeMo Curator assigns the exact shard name.
Summary#
In this tutorial, you completed the following tasks:
Cloned the repository and installed the project dependencies.
Configured the Ray runtime environment.
Ran the
curate/nemo_curatorstep with the packaged tiny JSONL fixture.Inspected the output shards and verified that records contain the configured text field.
The tiny configuration disables all optional filters.
To add language, word-count, or domain filters to a production corpus, refer to
the how-to guides below.
Next Steps#
Point the step at your own JSONL corpus: Run Curation on Local JSONL.
Download a Hugging Face dataset before curation: Use a Hugging Face Snapshot.
Add language, word-count, or domain filters: Enable Curation Filters.
Look up all configuration fields and CLI flags: curate/nemo_curator Configuration and curate/nemo_curator CLI.