*** description: >- Step-by-step guide to setting up and running your first text curation pipeline with NeMo Curator categories: * getting-started tags: * text-curation * installation * quickstart * data-loading * quality-filtering * python-api personas: * data-scientist-focused * mle-focused difficulty: beginner content\_type: tutorial modality: text-only *** # Get Started with Text Curation This guide provides step-by-step instructions for setting up NeMo Curator’s text curation capabilities. Follow these instructions to prepare your environment and execute your first text curation pipeline. ## Prerequisites To use NeMo Curator’s text curation modules, ensure your system meets the following requirements: * Python 3.10, 3.11, or 3.12 * packaging >= 22.0 * uv (for package management and installation) * Ubuntu 22.04/20.04 * NVIDIA GPU (optional for most text modules, required for GPU-accelerated operations) * Volta™ or higher (compute capability 7.0+) * CUDA 12 (or later) If `uv` is not installed, refer to the [Installation Guide](/admin/installation) for setup instructions, or install it quickly using: ```bash curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh source $HOME/.local/bin/env ``` *** ## Installation Options You can install NeMo Curator using one of the following methods: The simplest way to install NeMo Curator: ```bash uv pip install "nemo-curator[text_cuda12]" ``` For other modalities (image, video) or all modules, see the [Installation Guide](/admin/installation). Install the latest version directly from GitHub: ```bash git clone https://github.com/NVIDIA-NeMo/Curator.git cd Curator uv sync --extra text_cuda12 --all-groups source .venv/bin/activate ``` Replace `text_cuda12` with your desired extras: use `.` for CPU-only, `.[text_cpu]` for text processing only, or `.[all]` for all modules. NeMo Curator is available as a standalone container: ```bash # Pull the container docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }} # Run the container docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:{{ container_version }} ``` For details on container environments and configurations, see [Container Environments](/reference/infra/container-environments). ## Prepare Your Environment NeMo Curator uses a pipeline-based architecture for processing text data. Before running your first pipeline, ensure you have a proper directory structure: ## Set Up Data Directory Create the following directories for your text datasets: ```bash mkdir -p ~/nemo_curator/data/sample mkdir -p ~/nemo_curator/data/curated ``` For this example, you need sample JSONL files in `~/nemo_curator/data/sample/`. Each line should be a JSON object with at least `text` and `id` fields. You can create test data or refer to [Read Existing Data ](/curate-text/load-data/read-existing) and [Data Loading ](/curate-text/load-data) for information on downloading data. **Set your HuggingFace token** to avoid rate limiting when downloading models or datasets: export HF\_TOKEN="your\_token\_here" Without a token, repeated downloads from Hugging Face may result in `429 Client Error` (rate limiting). Get a free token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). ## Basic Text Curation Example Here's a simple example to get started with NeMo Curator's pipeline-based architecture: ```python from nemo_curator.pipeline import Pipeline from nemo_curator.stages.text.io.reader import JsonlReader from nemo_curator.stages.text.io.writer import JsonlWriter from nemo_curator.stages.text.modules.score_filter import ScoreFilter from nemo_curator.stages.text.filters import WordCountFilter, NonAlphaNumericFilter # Create a pipeline for text curation pipeline = Pipeline( name="text_curation_pipeline", description="Basic text quality filtering pipeline" ) # Add stages to the pipeline pipeline.add_stage( JsonlReader( file_paths="~/nemo_curator/data/sample/", files_per_partition=4, fields=["text", "id"] ) ) # Add quality filtering stages pipeline.add_stage( ScoreFilter( filter_obj=WordCountFilter(min_words=50, max_words=100000), text_field="text", score_field="word_count" ) ) pipeline.add_stage( ScoreFilter( filter_obj=NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25), text_field="text", score_field="non_alpha_score" ) ) # Write the curated results pipeline.add_stage( JsonlWriter("~/nemo_curator/data/curated") ) # Execute the pipeline results = pipeline.run() print(f"Pipeline completed successfully! Processed {len(results) if results else 0} tasks.") ``` ## Next Steps Explore the [Text Curation documentation](/curate-text) for more advanced filtering techniques, GPU acceleration options, and large-scale processing workflows.