Text Curation
- Downloading and Extracting Text
Downloading a massive public dataset is usually the first step in data curation, and it can be cumbersome due to the dataset's size and hosting method. This section describes how to download and extract large corpora efficiently. NeMo Curator supports multiple content extraction methods, including jusText, Resiliparse, and Trafilatura, to cleanly extract text from web content.
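To illustrate the kind of work these extractors do, here is a minimal sketch of pulling visible text out of HTML using only Python's standard library. This is not how jusText, Resiliparse, or Trafilatura are implemented; they additionally handle boilerplate detection, encoding issues, and malformed markup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style and non-empty.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

extractor = TextExtractor()
extractor.feed("<html><head><script>var x=1;</script></head>"
               "<body><p>Hello <b>world</b>.</p></body></html>")
extractor.text()  # → "Hello world ."
```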
- Working with DocumentDataset
DocumentDataset is the standard format for datasets in NeMo Curator. This section describes how to get datasets in and out of this format, as well as how DocumentDataset interacts with the modules.
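DocumentDataset is typically built from JSONL or Parquet files with one record per document. As a rough, stdlib-only sketch of that round trip (the actual class reads and writes through Dask dataframes, which this deliberately omits):

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    # One JSON object per line, the on-disk layout JSONL-based
    # dataset loaders expect.
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

docs = [{"id": "doc-0", "text": "First document."},
        {"id": "doc-1", "text": "Second document."}]
path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")
write_jsonl(docs, path)
assert read_jsonl(path) == docs
```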
- CPU and GPU Modules with Dask
NeMo Curator provides both CPU based modules and GPU based modules and supports methods for creating compatible Dask clusters and managing the dataset transfer between CPU and GPU.
- Document Filtering
This section describes how to use the 30+ heuristic and classifier filters available within NeMo Curator, and how to implement custom filters to apply to the documents in your corpora.
- Language Identification
Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages.
- Text Cleaning
Much of the text on the Internet is malformed or poorly formatted. NeMo Curator can fix many of these issues.
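A small taste of what such cleaning involves, using only the standard library: Unicode compatibility normalization plus whitespace collapsing. Full cleaning pipelines handle far more (mojibake repair, control characters, broken markup), so treat this as a sketch of the idea.

```python
import unicodedata

def clean_text(text):
    # NFKC folds compatibility characters (ligatures, full-width forms,
    # non-breaking spaces) into plain equivalents; split/join collapses
    # runs of whitespace.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

clean_text("ﬁle  name\u00a0here")  # → "file name here"
```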
- Stop Words in Text Processing
Stop words are common words that are often filtered out in NLP tasks because they typically don’t carry significant meaning. NeMo Curator provides built-in stop word lists for various languages to support text analysis and extraction.
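The basic operation is simple enough to sketch in a few lines. The word list below is a tiny illustrative sample, not one of NeMo Curator's built-in lists.

```python
# A tiny illustrative English stop-word set; real lists are much larger
# and language-specific.
ENGLISH_STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def remove_stop_words(text, stop_words=ENGLISH_STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

remove_stop_words("the cat is in the garden")  # → "cat garden"
```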
- GPU Accelerated Exact and Fuzzy Deduplication
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.
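Exact deduplication boils down to hashing each document and keeping the first occurrence of each hash. A minimal single-process sketch of that idea (the GPU implementation distributes this over cuDF partitions):

```python
import hashlib

def exact_dedup(docs):
    """Keep the first copy of each document; drop later copies whose
    normalized text hashes to an already-seen value."""
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

exact_dedup(["Hello world", "hello  WORLD", "something else"])
# → ["Hello world", "something else"]
```

Fuzzy deduplication replaces the single hash with MinHash signatures and locality-sensitive hashing so that near-duplicates also collide.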
- GPU Accelerated Semantic Deduplication
NeMo Curator provides scalable and GPU accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, crossfit and PyTorch.
- Distributed Data Classification
NeMo Curator provides a scalable and GPU accelerated module to help users run inference with pre-trained models on large volumes of text documents.
- Synthetic Data Generation
Synthetic data generation tools and example pipelines are available within NeMo Curator.
- Downstream Task Decontamination
After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets, there is a potential for leakage of this test data into the model’s training dataset. NeMo Curator allows you to remove sections of documents in your dataset that are present in downstream tasks.
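A common way to detect such leakage is n-gram overlap between training documents and benchmark test data. The sketch below only flags a contaminated document; NeMo Curator goes further and removes the overlapping spans. The n-gram size here is an arbitrary example.

```python
def ngrams(text, n=8):
    """All n-grams of whitespace tokens, lowercased."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(doc, test_ngrams, n=8):
    """True if the document shares any n-gram with downstream test data."""
    return bool(ngrams(doc, n) & test_ngrams)

test_ngrams = ngrams("what is the capital of france", n=3)
is_contaminated("someone asked what is the capital of france yesterday",
                test_ngrams, n=3)  # → True
```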
- Personally Identifiable Information Identification and Removal
The purpose of the personally identifiable information (PII) redaction tool is to help scrub sensitive data out of training datasets.
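For a flavor of what redaction looks like, here is a toy regex-based sketch covering two entity types. Production PII tooling relies on NER models plus many more pattern types (names, addresses, credit cards, and so on), so these patterns are illustrative only.

```python
import re

# Illustrative patterns only; deliberately simplistic.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text):
    """Replace matched emails and US-style phone numbers with tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

redact_pii("Mail jane@example.com or call 555-123-4567 today.")
# → "Mail [EMAIL] or call [PHONE] today."
```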