Copyright © 2026, NVIDIA Corporation.


Download Data

Load text data from ArXiv, Common Crawl, Wikipedia, and custom sources using Curator.

Curator provides a task-centric pipeline for downloading and processing large-scale public text datasets. It runs on Ray and converts raw formats like Common Crawl’s .warc.gz into JSONL.
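To make the JSONL target format concrete, here is a minimal standard-library sketch of writing and reading one-JSON-object-per-line files. It does not use Curator; the `url` and `text` field names and the helper functions are illustrative assumptions, not Curator's output schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical extracted documents; the field names are illustrative only.
documents = [
    {"url": "https://example.com/a", "text": "First page body."},
    {"url": "https://example.com/b", "text": "Second page body.\nWith two lines."},
]


def write_jsonl(records, path):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "sample.jsonl"
    write_jsonl(documents, out_path)
    line_count = len(out_path.read_text(encoding="utf-8").splitlines())
    round_tripped = read_jsonl(out_path)
```

Because each record occupies exactly one line, JSONL files can be split, streamed, and processed in parallel without parsing the whole file.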

How it Works

Curator uses a four-step pipeline pattern in which data flows through stages as tasks. Each step is implemented as a ProcessingStage that transforms tasks, following Curator’s pipeline-based architecture.

Data sources provide composite stages that combine these steps into complete download-and-extract pipelines, producing DocumentBatch tasks for further processing.
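The stage-and-task flow described above can be sketched as a toy model in plain Python. This is not Curator's API: `Task`, `Stage`, and `run_pipeline` are hypothetical stand-ins, and the four step names (URL generation, download, iteration, extraction) are an assumption, since this page does not enumerate them. The "archives" are in-memory strings so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Task:
    """A unit of work that flows between stages (toy stand-in)."""
    data: Any


@dataclass
class Stage:
    """Toy processing stage: maps one input task to a list of output tasks."""
    name: str
    fn: Callable[[Task], List[Task]]

    def process(self, task: Task) -> List[Task]:
        return self.fn(task)


def run_pipeline(stages: List[Stage], initial: Task) -> List[Task]:
    """Feed every task produced by one stage into the next stage."""
    tasks = [initial]
    for stage in stages:
        next_tasks: List[Task] = []
        for task in tasks:
            next_tasks.extend(stage.process(task))
        tasks = next_tasks
    return tasks


# In-memory stand-in for remote archives (assumption: no real downloads).
FAKE_ARCHIVES = {
    "mem://archive-0": ["<p>alpha</p>", "<p>beta</p>"],
    "mem://archive-1": ["<p>gamma</p>"],
}

stages = [
    # Step 1: URL generation, one task per archive URL.
    Stage("generate_urls", lambda t: [Task(url) for url in FAKE_ARCHIVES]),
    # Step 2: download, fetch the raw archive contents for a URL.
    Stage("download", lambda t: [Task(FAKE_ARCHIVES[t.data])]),
    # Step 3: iteration, split an archive into raw records.
    Stage("iterate", lambda t: [Task(raw) for raw in t.data]),
    # Step 4: extraction, turn a raw record into a clean text document.
    Stage(
        "extract",
        lambda t: [Task({"text": t.data.replace("<p>", "").replace("</p>", "")})],
    ),
]

results = run_pipeline(stages, Task(None))
texts = [task.data["text"] for task in results]
```

A composite stage in this model would simply be the list of four stages packaged under a single name, which is the role the download-and-extract stages play for each data source.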

Python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create a pipeline for downloading Common Crawl data
pipeline = Pipeline(
    name="common_crawl_download",
    description="Download and process Common Crawl web archives",
)

# Add data loading stage
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="/tmp/cc_downloads",
    crawl_type="main",
    url_limit=10,  # Limit for testing
)
pipeline.add_stage(cc_stage)

# Add writer stage to save as JSONL
writer = JsonlWriter(path="/output/folder")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()
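The snapshot strings in the example, such as "2020-50", appear to follow a year-week pattern that lines up with Common Crawl's public main-crawl identifiers (for example, CC-MAIN-2020-50). The helper below is a hypothetical sketch for checking such a string before passing it to a stage. It is not part of Curator, and the 1 to 53 week range is an assumption based on calendar weeks.

```python
import re

# Matches "YYYY-WW", e.g. "2020-50" (assumed format for main-crawl snapshots).
_SNAPSHOT_RE = re.compile(r"(\d{4})-(\d{2})")


def main_crawl_id(snapshot: str) -> str:
    """Validate a 'YYYY-WW' snapshot string and return a 'CC-MAIN-YYYY-WW' ID."""
    match = _SNAPSHOT_RE.fullmatch(snapshot)
    if match is None:
        raise ValueError(f"expected 'YYYY-WW', got {snapshot!r}")
    year, week = int(match.group(1)), int(match.group(2))
    if not 1 <= week <= 53:
        raise ValueError(f"week must be between 01 and 53, got {week:02d}")
    return f"CC-MAIN-{year:04d}-{week:02d}"


crawl_id = main_crawl_id("2020-50")
```

Validating the string up front surfaces a malformed snapshot immediately, instead of partway through a long-running download.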

Data Sources & File Formats

Load data from public datasets and custom data sources using Curator stages.

Read Existing Data

Read existing JSONL and Parquet datasets using Curator’s reader stages. (Tags: jsonl, parquet)

Common Crawl

Download and extract web archive data from Common Crawl. (Tags: web-data, warc, html-extraction)

Wikipedia

Download and extract Wikipedia articles from Wikipedia dumps. (Tags: articles, multilingual, xml-dumps)

Custom Data Sources

Implement a download and extract pipeline for a custom data source. (Tags: jsonl, parquet, file-partitioning)