Getting Started With Data Curation#

What You’ll Build: a filtered set of JSONL shards from the packaged tiny fixture. The run reads the fixture, passes it through the NeMo Curator pipeline with all optional filters disabled, and writes output shards to a local directory.

In this tutorial, you will:

  1. Clone the repository and install dependencies.

  2. Configure the Ray runtime environment.

  3. Inspect the packaged fixture.

  4. Run the curation step with the tiny configuration.

  5. Inspect the output shards.

This tutorial requires approximately 5 minutes to complete.

Sample Prompt

Run the curate/nemo_curator step on the packaged tiny JSONL fixture, then show me the names and record counts of the output shards.

Prerequisites#

  • The uv tool is available in your shell.

Procedure#

  1. Clone the repository, if you haven’t already:

    $ git clone https://github.com/NVIDIA-NeMo/Nemotron && cd Nemotron
    
  2. Install the dependencies for curating data:

    $ uv sync --extra curate
    
  3. Set the Ray runtime environment variable so Ray workers reuse the synchronized project environment:

    $ export RAY_ENABLE_UV_RUN_RUNTIME_ENV=0
    
  4. Inspect the packaged fixture at src/nemotron/steps/curate/nemo_curator/data/tiny.jsonl. Each record contains a text field.

    {"text":"This is a small English document for the curation smoke test with enough words for optional quality filters."}
    {"text":"Another English sample about mathematics, banking, and clean data processing for a tiny curation validation run."}
    
  5. Run the curation step. The tiny configuration disables optional language, word-count, and domain filters, making this run a baseline validation of the reader, writer, and Ray startup. The checked-in tiny.yaml uses container paths, so override input_glob and output_dir with host paths:

    $ uv run --no-sync nemotron steps run curate/nemo_curator -c tiny \
        input_glob="${PWD}/src/nemotron/steps/curate/nemo_curator/data/tiny.jsonl" \
        output_dir="${PWD}/output/curate-tiny"
    
  6. Inspect the output directory:

    $ find output/curate-tiny -type f
    

    Open a shard and confirm that records contain the configured text_field. NeMo Curator assigns the exact shard name.

Summary#

In this tutorial, you completed the following tasks:

  • Cloned the repository and installed the project dependencies.

  • Configured the Ray runtime environment.

  • Ran the curate/nemo_curator step with the packaged tiny JSONL fixture.

  • Inspected the output shards and verified that records contain the configured text field.

The tiny configuration disables all optional filters. To add language, word-count, or domain filters to a production corpus, refer to the how-to guides below.

Next Steps#