Get Started with Text Curation#
This guide helps you set up and get started with NeMo Curator’s text curation capabilities. Follow these steps to prepare your environment and run your first text curation pipeline.
Prerequisites#
To use NeMo Curator’s text curation modules, ensure you meet the following requirements:
Python 3.10 or 3.12
packaging >= 22.0
Ubuntu 22.04/20.04
NVIDIA GPU (optional for many text modules, required for GPU-accelerated operations)
Volta™ or higher (compute capability 7.0+)
CUDA 12 (or above)
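The Python and packaging requirements above can be checked from the interpreter you plan to use. The following is a minimal sketch, not part of NeMo Curator: the helper names (python_ok, parse_release, packaging_ok) are hypothetical, the supported-version set is copied literally from the list above, and the version parsing is a rough approximation that only reads the leading numeric components.

```python
import sys
from importlib import metadata
from typing import Optional

# Values copied from the prerequisites list above.
SUPPORTED_PYTHON = {(3, 10), (3, 12)}
MIN_PACKAGING = (22, 0)


def python_ok(version_info=sys.version_info) -> bool:
    """Return True if major.minor is one of the listed Python versions."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def parse_release(version: str) -> tuple:
    """Approximate parse: '22.0.post1' -> (22, 0). Stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 2:  # pad so that '24' compares as (24, 0)
        parts.append(0)
    return tuple(parts)


def packaging_ok(installed: Optional[str]) -> bool:
    """Return True if the given packaging version string is at least 22.0."""
    return installed is not None and parse_release(installed) >= MIN_PACKAGING


def installed_packaging_version() -> Optional[str]:
    """Look up the installed 'packaging' distribution, or None if it is absent."""
    try:
        return metadata.version("packaging")
    except metadata.PackageNotFoundError:
        return None


print("Python OK:", python_ok())
print("packaging OK:", packaging_ok(installed_packaging_version()))
```

This sketch does not check the operating system, GPU architecture, or CUDA version; those are easier to confirm with system tools.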
Installation Options#
You can install NeMo Curator in three ways:
Install from PyPI, which is the simplest option:
# CPU-only text curation modules
pip install nemo-curator
# CPU + GPU text curation modules
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[cuda12x]"
# Text curation with bitext processing
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[bitext]"
Note
For other modalities (image, video) or all modules, see the Installation Guide.
Install the latest version from source by cloning the GitHub repository:
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
Note
Replace cuda12x with your desired extras: use "." for CPU-only, ".[bitext]" for bitext processing, or ".[all]" for all modules.
NeMo Curator is available as a standalone container:
Warning
Container Availability: The standalone NeMo Curator container is currently in development. Check the NGC Catalog for the latest availability and container path.
# Pull the container
docker pull nvcr.io/nvidia/nemo-curator:latest
# Run the container
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:latest
See also
For details on container environments and configurations, see Container Environments.
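Whichever installation option you choose, it can help to confirm that the package is visible to the Python interpreter you intend to use. The sketch below relies only on the standard library; is_installed is a hypothetical helper name, and nemo_curator is the import name used in the example later in this guide.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("nemo_curator"):
    print("nemo_curator is importable")
else:
    print("nemo_curator was not found in this environment")
```

This only checks that the module can be located; it does not import it or verify that GPU extras are working.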
Download Sample Configuration#
NeMo Curator provides default configurations for common curation tasks. You can download a sample configuration for English text filtering:
mkdir -p ~/nemo_curator/configs
wget -O ~/nemo_curator/configs/heuristic_filter_en.yaml https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/config/heuristic_filter_en.yaml
This configuration file contains a comprehensive set of heuristic filters for English text, including filters for word count, non-alphanumeric content, repeated patterns, and content quality metrics.
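If wget is not available, the same download can be done from Python. This is a minimal standard-library sketch of the two shell commands above; download_config is a hypothetical helper, and the URL is the one given in this guide.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same URL as the wget command above, split only for line length.
CONFIG_URL = (
    "https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/"
    "main/config/heuristic_filter_en.yaml"
)


def download_config(url: str, dest) -> Path:
    """Create the parent directory (like mkdir -p) and save the URL's content to dest."""
    dest = Path(dest).expanduser()
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest
```

Calling download_config(CONFIG_URL, "~/nemo_curator/configs/heuristic_filter_en.yaml") should leave the file in the same place as the wget command.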
Set Up Data Directory#
Create a directory to store your text datasets:
mkdir -p ~/nemo_curator/data
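The example in the next section reads JSONL files from ~/nemo_curator/data/sample/ and expects a "text" field, but this guide does not say where those files come from. As a stand-in, the sketch below writes a small JSONL file using only the standard library. The helper write_sample_jsonl, the file name, and the sample contents are all hypothetical placeholders, not data shipped with NeMo Curator.

```python
import json
from pathlib import Path


def write_sample_jsonl(path, texts) -> Path:
    """Write one JSON object per line, each with a single "text" field."""
    path = Path(path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for text in texts:
            handle.write(json.dumps({"text": text}) + "\n")
    return path
```

For instance, write_sample_jsonl("~/nemo_curator/data/sample/sample.jsonl", ["first document ...", "second document ..."]) would create a file matching the glob used in the example. Documents shorter than 50 words would be dropped by the word count filter shown there.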
Basic Text Curation Example#
Here’s a simple example to get started with NeMo Curator:
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter, NonAlphaNumericFilter
from nemo_curator.utils.distributed_utils import get_client

# Initialize a Dask client for distributed processing (CPU or GPU)
client = get_client(cluster_type="cpu")  # Use "gpu" for GPU-accelerated processing

# Load sample text data
dataset = DocumentDataset.read_json("~/nemo_curator/data/sample/*.jsonl")

# Create a simple curation pipeline
curation_pipeline = nc.Sequential([
    # Filter documents with 50-10000 words
    nc.ScoreFilter(
        WordCountFilter(min_words=50, max_words=10000),
        text_field="text",
        score_field="word_count",
    ),
    # Filter documents with excessive non-alphanumeric content
    nc.ScoreFilter(
        NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25),
        text_field="text",
        score_field="non_alpha_score",
    ),
])

# Apply the curation pipeline
curated_dataset = curation_pipeline(dataset)

# Save the curated dataset
curated_dataset.to_json("~/nemo_curator/data/curated")
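To build intuition for what the two filters in this pipeline decide, here is a plain-Python approximation that needs no NeMo Curator install. It is an illustrative sketch only, not the library's implementation: it assumes whitespace-separated words and treats any character that is neither alphanumeric nor whitespace as non-alphanumeric, which may differ from how WordCountFilter and NonAlphaNumericFilter actually score documents. The thresholds mirror the values used in the example.

```python
# Thresholds copied from the pipeline example above.
MIN_WORDS = 50
MAX_WORDS = 10000
MAX_NON_ALNUM_RATIO = 0.25


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def non_alnum_ratio(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(text)


def keep(text: str) -> bool:
    """A document is kept only if it passes both checks, as in a sequential pipeline."""
    return (
        MIN_WORDS <= word_count(text) <= MAX_WORDS
        and non_alnum_ratio(text) <= MAX_NON_ALNUM_RATIO
    )


docs = ["word " * 60, "too short", "@@@ " * 60]
print([keep(doc) for doc in docs])  # prints [True, False, False]
```

The second document fails the word count check and the third fails the ratio check, which is the same pass-or-drop behavior the pipeline applies to each document.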
Next Steps#
Explore the Text Curation documentation for more advanced filtering techniques, GPU acceleration options, and large-scale processing workflows.
Key areas to explore next:
Advanced Filtering: Learn about the 30+ built-in filters for quality assessment
GPU Acceleration: Scale your processing with RAPIDS and GPU clusters
Configuration Files: Use YAML configurations for complex filter pipelines
Distributed Processing: Process datasets across multiple machines
Quality Classification: Use machine learning models for document scoring