Get Started with Text Curation
This guide provides step-by-step instructions for setting up NeMo Curator’s text curation capabilities. Follow these instructions to prepare your environment and execute your first text curation pipeline.
Prerequisites
To use NeMo Curator’s text curation modules, ensure your system meets the following requirements:
- Python 3.10, 3.11, or 3.12
- packaging >= 22.0
- uv (for package management and installation)
- Ubuntu 22.04/20.04
- NVIDIA GPU (optional for most text modules, required for GPU-accelerated operations)
  - Volta™ or higher (compute capability 7.0+)
  - CUDA 12 (or later)
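The locally checkable items in this list can be verified with a short script. This is a minimal sketch, not part of NeMo Curator: the function name `check_prerequisites` is hypothetical, and the presence of `nvidia-smi` on the PATH is used only as a rough hint that an NVIDIA driver is installed (it does not confirm compute capability or the CUDA version).

```python
import shutil
import sys
from importlib import metadata

# Python versions listed in the prerequisites above.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}


def check_prerequisites() -> dict:
    """Return a name -> bool report for requirements that can be checked locally."""
    report = {}
    report["python"] = sys.version_info[:2] in SUPPORTED_PYTHON
    report["uv"] = shutil.which("uv") is not None
    try:
        # packaging >= 22.0 is equivalent to a major version of 22 or higher.
        major = int(metadata.version("packaging").split(".")[0])
        report["packaging"] = major >= 22
    except (metadata.PackageNotFoundError, ValueError):
        report["packaging"] = False
    # Assumption: nvidia-smi on PATH suggests a driver is present; GPU is optional.
    report["nvidia_smi"] = shutil.which("nvidia-smi") is not None
    return report


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing or unsupported'}")
```

A `False` entry for `nvidia_smi` is not an error for most text modules, since the GPU is optional for them.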
If uv is not installed, refer to the Installation Guide for setup instructions, or install it with uv's standalone installer:
curl -LsSf https://astral.sh/uv/install.sh | sh
Installation Options
You can install NeMo Curator using one of the following methods:
- PyPI Installation
- Source Installation
- NeMo Curator Container
The PyPI installation is the simplest way to install NeMo Curator.
For other modalities (image, video) or all modules, see the Installation Guide.
Prepare Your Environment
NeMo Curator uses a pipeline-based architecture for processing text data. Before running your first pipeline, set up a directory structure for your input data.
Set Up Data Directory
Create directories for your text datasets, including the ~/nemo_curator/data/sample/ input directory used in this example.
For this example, you need sample JSONL files in ~/nemo_curator/data/sample/. Each line should be a JSON object with at least text and id fields. You can create test data or refer to Read Existing Data and Data Loading for information on downloading data.
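If you have no data at hand, test input in the format described above can be generated with a few lines of Python. This is a minimal sketch: the helper `write_sample_jsonl`, the file name `sample.jsonl`, and the placeholder document text are all hypothetical. Only the `text` and `id` fields and the target directory come from this guide.

```python
import json
from pathlib import Path


def write_sample_jsonl(data_dir: Path, num_docs: int = 3) -> Path:
    """Create data_dir if needed and write num_docs JSON objects, one per line."""
    data_dir.mkdir(parents=True, exist_ok=True)
    out_path = data_dir / "sample.jsonl"  # hypothetical file name
    with out_path.open("w", encoding="utf-8") as f:
        for i in range(num_docs):
            # Each record carries the two fields the example expects.
            record = {"id": f"doc-{i}", "text": f"This is sample document number {i}."}
            f.write(json.dumps(record) + "\n")
    return out_path


# Example call (writes into your home directory):
# write_sample_jsonl(Path("~/nemo_curator/data/sample").expanduser())
```

The call is left commented out so that running the snippet does not write files until you choose a target directory.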
Set your Hugging Face token to avoid rate limiting when downloading models or datasets:
export HF_TOKEN="your_token_here"
Without a token, repeated downloads from Hugging Face may fail with a 429 Client Error (rate limiting). Get a free token at huggingface.co/settings/tokens.
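Before starting a long download, it can help to confirm that the token is visible to the Python process. This is a small sketch using only the standard library. The helper name `hf_token_is_set` is hypothetical, and the check only confirms that the variable is non-empty, not that the token is valid.

```python
import os


def hf_token_is_set(env=None) -> bool:
    """Return True if HF_TOKEN is present and non-blank in the given environment."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if not hf_token_is_set():
    print("HF_TOKEN is not set; repeated downloads may be rate limited.")
```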
Basic Text Curation Example
The basic example runs the sample JSONL files prepared above through NeMo Curator's pipeline-based architecture.
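To illustrate the general shape of a stage-based text pipeline, the sketch below reads JSONL records, passes them through a list of stages, and writes the survivors. It is plain standard-library Python and deliberately does not use NeMo Curator's classes. The names `read_jsonl`, `min_word_filter`, and `run_pipeline`, and the word-count criterion itself, are assumptions chosen for illustration only.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Iterator

Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def read_jsonl(paths: list[Path]) -> Iterator[Record]:
    """Yield one record per non-empty line across the given JSONL files."""
    for path in paths:
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


def min_word_filter(min_words: int) -> Stage:
    """Build a stage that keeps records whose text has at least min_words words."""
    def stage(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            if len(record["text"].split()) >= min_words:
                yield record
    return stage


def run_pipeline(input_dir: Path, output_path: Path, stages: list[Stage]) -> int:
    """Read *.jsonl from input_dir, apply stages in order, write survivors; return count."""
    # List inputs before creating the output so the output is never read back.
    paths = sorted(input_dir.glob("*.jsonl"))
    records: Iterable[Record] = read_jsonl(paths)
    for stage in stages:
        records = stage(records)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    kept = 0
    with output_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            kept += 1
    return kept
```

Because each stage is a generator, records stream through one at a time instead of being loaded into memory together. NeMo Curator's own pipeline and stage classes are described in the Text Curation documentation.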
Next Steps
Explore the Text Curation documentation for more advanced filtering techniques, GPU acceleration options, and large-scale processing workflows.