---
description: >-
  Process text data using comprehensive filtering, deduplication, content
  processing, and specialized tools for high-quality datasets
categories:
  - workflows
tags:
  - data-processing
  - filtering
  - deduplication
  - content-processing
  - quality-assessment
  - distributed
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: workflow
modality: text-only
---

# Process Data for Text Curation

Process text data you've loaded through NeMo Curator's [pipeline architecture](/about/concepts/text/data/loading).

NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.

## How it Works

NeMo Curator's text processing capabilities are organized into five main categories:

1. **Language Management**: Handle multilingual content and language-specific processing
2. **Content Processing & Cleaning**: Clean, normalize, and transform text content
3. **Deduplication**: Remove duplicate and near-duplicate documents efficiently
4. **Quality Assessment & Filtering**: Score and remove low-quality content using heuristics and ML classifiers
5. **Specialized Processing**: Domain-specific processing for code and advanced curation tasks

Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.

***

## Language Management

Handle multilingual content and language-specific processing requirements.

* Identify document languages and separate multilingual datasets. Tags: `fasttext`, `176-languages`, `detection`
* Manage high-frequency words to enhance text extraction and content detection. Tags: `preprocessing`, `filtering`, `language-specific`

## Content Processing & Cleaning

Clean, normalize, and transform text content for high-quality training data.
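
As a rough illustration of what cleaning and normalizing a document can involve, the sketch below uses only the Python standard library. It is not NeMo Curator's API: `clean_text`, `URL_PATTERN`, and `WHITESPACE_PATTERN` are hypothetical names, and `unicodedata.normalize` with NFKC stands in for whatever Unicode repair the library itself performs.

```python
import re
import unicodedata

# Hypothetical helpers for illustration only; not part of NeMo Curator.
URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode, strip URLs, and collapse runs of whitespace."""
    # NFKC folds compatibility characters (such as the "ﬁ" ligature) into plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Remove http/https URLs; a production pattern would likely be stricter.
    text = URL_PATTERN.sub("", text)
    # Collapse tabs, newlines, and repeated spaces into single spaces.
    return WHITESPACE_PATTERN.sub(" ", text).strip()


raw = "The ﬁrst  result:\tsee https://example.com/page for details.\n"
print(clean_text(raw))  # The first result: see for details.
```

The order of the steps matters here: URLs are removed before whitespace is collapsed so that the gap a URL leaves behind does not survive as a double space.
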
* Fix Unicode issues, standardize spacing, and remove URLs. Tags: `unicode`, `normalization`, `preprocessing`

## Deduplication

Remove duplicate and near-duplicate documents efficiently from your text datasets. All deduplication methods support both identification (finding duplicates) and removal (filtering them out) workflows.

* Identify and remove character-for-character duplicates using MD5 hashing. Tags: `hashing`, `fast`, `gpu-accelerated`
* Identify and remove near-duplicates using MinHash and LSH similarity. Tags: `minhash`, `lsh`, `gpu-accelerated`
* Identify and remove semantically similar documents using embeddings and clustering. Tags: `embeddings`, `meaning-based`, `gpu-accelerated`

## Quality Assessment & Filtering

Score and remove low-quality content using heuristics and ML classifiers.

* Filter text using configurable rules and metrics. Tags: `rules`, `metrics`, `fast`
* Filter text using trained quality classifiers. Tags: `ml-models`, `quality`, `scoring`
* GPU-accelerated classification with pre-trained models. Tags: `gpu`, `distributed`, `scalable`

## Specialized Processing

Domain-specific processing for code and advanced curation tasks.

* Specialized filters for programming content and source code. Tags: `programming`, `syntax`, `comments`
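
To give a feel for what a comment-oriented filter for source code might look like, here is a minimal plain-Python sketch. It does not use NeMo Curator's own filters: the function names `comment_line_ratio` and `keep_code_document` and the threshold values are assumptions chosen for illustration only.

```python
def comment_line_ratio(source: str, comment_prefix: str = "#") -> float:
    """Return the fraction of non-blank lines that are whole-line comments."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comment_lines = sum(1 for line in lines if line.startswith(comment_prefix))
    return comment_lines / len(lines)


def keep_code_document(
    source: str, min_ratio: float = 0.01, max_ratio: float = 0.85
) -> bool:
    """Keep documents whose comment ratio falls inside a hypothetical range."""
    # Thresholds are illustrative assumptions, not values taken from NeMo Curator.
    return min_ratio <= comment_line_ratio(source) <= max_ratio


sample = "# add two numbers\ndef add(a, b):\n    return a + b\n"
all_comments = "# todo\n# todo\n"

print(keep_code_document(sample))        # True: one comment line out of three
print(keep_code_document(all_comments))  # False: every line is a comment
```

A two-sided range is used because both extremes can signal low-value training data: files with no comments at all and files that are almost entirely commented out.
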