Load Data

Load text data from ArXiv, Common Crawl, Wikipedia, and custom sources using Curator.

Curator provides a task-centric pipeline for downloading and processing large-scale public text datasets. It runs on Ray and converts raw formats like Common Crawl’s .warc.gz into JSONL.

How it Works

Curator uses a four-step pipeline pattern in which data flows through the stages as tasks. Each step is implemented as a ProcessingStage that transforms tasks according to Curator's pipeline-based architecture.
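
The pattern is easiest to see stripped of the framework. The sketch below mimics the same stage/task flow in plain Python; the function names and task dictionaries are illustrative stand-ins, not Curator's actual classes.

```python
# Toy illustration of the stage/task pattern (not Curator's real API):
# each stage receives a task, transforms it, and passes it downstream.

def download_stage(task):
    # Pretend to fetch the raw archive named in the task.
    task["raw"] = f"<bytes of {task['url']}>"
    return task

def extract_stage(task):
    # Pretend to pull plain text out of the raw archive.
    task["text"] = task["raw"].upper()
    return task

def run_pipeline(stages, tasks):
    # Data flows through the stages as tasks, one stage at a time.
    for stage in stages:
        tasks = [stage(t) for t in tasks]
    return tasks

results = run_pipeline(
    [download_stage, extract_stage],
    [{"url": "crawl-data/segment-0.warc.gz"}],
)
print(results[0]["text"])
```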

Data sources provide composite stages that combine these steps into complete download-and-extract pipelines, producing DocumentBatch tasks for further processing.

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize the Ray client
ray_client = RayClient()
ray_client.start()

# Create a pipeline for downloading Common Crawl data
pipeline = Pipeline(
    name="common_crawl_download",
    description="Download and process Common Crawl web archives",
)

# Add the data loading stage
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="/tmp/cc_downloads",
    crawl_type="main",
    url_limit=10,  # Limit for testing
)
pipeline.add_stage(cc_stage)

# Add a writer stage to save results as JSONL
writer = JsonlWriter(path="/output/folder")
pipeline.add_stage(writer)

# Execute the pipeline
results = pipeline.run()

# Stop the Ray client
ray_client.stop()
```
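
Once the pipeline finishes, you can sanity-check the output with standard tooling. The snippet below is a minimal sketch: the exact file names written under /output/folder and the "text" field are assumptions, so adjust them to match what the writer actually produces.

```python
import glob
import json

# Peek at the first few extracted documents. The file layout under
# /output/folder and the "text" field name are assumptions.
for path in glob.glob("/output/folder/*.jsonl")[:1]:
    with open(path) as f:
        for line, _ in zip(f, range(3)):
            doc = json.loads(line)
            print(doc.get("text", "")[:200])
```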

Data Sources & File Formats

Load data from public datasets and custom data sources using Curator stages.
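
For custom JSONL sources, the natural counterpart to the JsonlWriter used above is a reader stage. The sketch below assumes a JsonlReader with a file_paths parameter exists at the import path shown; verify both against your Curator version before relying on it.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader  # assumed import path

# Read previously curated JSONL back into a pipeline for further processing.
# JsonlReader mirroring JsonlWriter is an assumption; check your version's API.
pipeline = Pipeline(name="load_custom_jsonl")
pipeline.add_stage(JsonlReader(file_paths="/output/folder"))
results = pipeline.run()
```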