Use a Hugging Face Snapshot#

Set the dataset block when the curation job should call huggingface_hub.snapshot_download before NeMo Curator reads files.

Configuration Shape#

dataset:
  repo_id: HuggingFaceFW/fineweb-edu
  repo_type: dataset
  local_dir: ./data/fineweb-edu
  allow_patterns:
    - data/*.jsonl

input_glob: ./data/fineweb-edu/**/*.jsonl

The dataset block is passed to snapshot_download. After the download finishes, input_glob must point at JSONL files under dataset.local_dir.

Run With the Default Configuration#

The default config demonstrates a FineWeb-Edu-style snapshot. It also enables language filtering, so you must provide the FastText language identification model path if you keep language_codes non-empty.

$ uv sync --extra curate
$ export RAY_ENABLE_UV_RUN_RUNTIME_ENV=0

$ uv run --no-sync nemotron steps run curate/nemo_curator -c default \
    dataset.local_dir="${PWD}/data/fineweb-edu" \
    input_glob="${PWD}/data/fineweb-edu/**/*.jsonl" \
    output_dir="${PWD}/output/fineweb-edu-curated" \
    models.fasttext_langid="${PWD}/cache/models/fasttext/lid.176.bin"

Snapshot Without Optional Filters#

For a first infrastructure run, disable filters and verify that snapshot download and JSONL IO work.

$ uv run --no-sync nemotron steps run curate/nemo_curator -c default \
    dataset.local_dir="${PWD}/data/fineweb-edu" \
    input_glob="${PWD}/data/fineweb-edu/**/*.jsonl" \
    output_dir="${PWD}/output/fineweb-edu-curated" \
    language_codes=[] \
    domains=[] \
    quality_filters={}

Private or Gated Datasets#

If the Hugging Face repository requires authentication, export HF_TOKEN before running. For remote jobs, pass HF_TOKEN through the environment profile.

$ export HF_TOKEN="<hugging-face-token>"