Important

NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Data Curation

Downloading and Extracting Text

Downloading a massive public dataset is usually the first step in data curation, and it can be cumbersome due to the dataset’s size and hosting method. This section describes how to download and extract large corpora efficiently.

Working with DocumentDataset

DocumentDataset is the standard format for datasets in NeMo Curator. This section describes how to get datasets in and out of this format, as well as how DocumentDataset interacts with the modules.

CPU and GPU Modules with Dask

NeMo Curator provides both CPU-based and GPU-based modules, and supports methods for creating compatible Dask clusters and managing dataset transfer between CPU and GPU.

Document Filtering

This section describes how to use the 30+ heuristic and classifier filters available in NeMo Curator, and how to implement custom filters to apply to the documents in a corpus.

Language Identification and Unicode Fixing

Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages and fix improperly decoded Unicode characters.

GPU Accelerated Exact and Fuzzy Deduplication

Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.
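To illustrate the idea behind exact deduplication only, here is a minimal sketch in plain Python. It is not NeMo Curator's API and does not use cuDF; the function name `exact_dedup` and the `id`/`text` record layout are assumptions made for this example. Each document's text is hashed, and only the first document with a given hash is kept.

```python
import hashlib


def exact_dedup(documents):
    """Keep the first occurrence of each distinct text, preserving input order.

    `documents` is a list of dicts with a "text" key (a hypothetical layout).
    """
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        # Identical texts always produce identical digests.
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique_docs.append(doc)
    return unique_docs


docs = [
    {"id": 1, "text": "The quick brown fox."},
    {"id": 2, "text": "A different document."},
    {"id": 3, "text": "The quick brown fox."},
]
deduped = exact_dedup(docs)
print([doc["id"] for doc in deduped])  # [1, 2]
```

Fuzzy deduplication, which targets near-duplicates rather than byte-identical texts, needs a different technique and is not covered by this sketch.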

GPU Accelerated Semantic Deduplication

NeMo Curator provides scalable, GPU-accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, crossfit, and PyTorch.

Synthetic Data Generation

Synthetic data generation tools and example pipelines are available within NeMo Curator.

Downstream Task Decontamination

After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets, there is a potential for leakage of this test data into the model’s training dataset. NeMo Curator allows you to remove sections of documents in your dataset that are present in downstream tasks.
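The leakage check described above can be sketched with word n-gram overlap. This is an illustrative plain-Python example, not NeMo Curator's implementation: the helper names, the whitespace tokenization, and the default of `n=5` are all assumptions. It only flags documents that share an n-gram with a task example, whereas the actual module removes the overlapping sections.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(documents, task_examples, n=5):
    """Return indices of documents sharing at least one n-gram with any task example."""
    task_ngrams = set()
    for example in task_examples:
        task_ngrams |= word_ngrams(example, n)
    return [
        index
        for index, doc in enumerate(documents)
        if word_ngrams(doc, n) & task_ngrams
    ]


task_examples = ["what is the capital of france and why"]
documents = [
    "Quiz: what is the capital of France and why does it matter",
    "an unrelated sentence about cooking pasta at home",
]
flagged = find_contaminated(documents, task_examples)
print(flagged)  # [0]
```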

Personally Identifiable Information Identification and Removal

The personally identifiable information (PII) redaction tool helps scrub sensitive data out of training datasets.
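As a rough illustration of what redaction means here, the following plain-Python sketch replaces email addresses with a placeholder. It is not the NeMo Curator PII tool: the regular expression, the function name `redact_emails`, and the `<EMAIL>` placeholder are assumptions for this example, and a simple pattern like this will miss many real-world PII formats.

```python
import re

# A deliberately simple email pattern, used for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text, placeholder="<EMAIL>"):
    """Replace every substring matching EMAIL_PATTERN with `placeholder`."""
    return EMAIL_PATTERN.sub(placeholder, text)


redacted = redact_emails("Contact jane.doe@example.com for details.")
print(redacted)  # Contact <EMAIL> for details.
```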

NeMo Curator on Kubernetes

A demonstration of how to run NeMo Curator on a Dask cluster deployed on top of Kubernetes.

Best Practices

A collection of suggestions on how to best use NeMo Curator to curate your dataset.

Next Steps

Now that you’ve curated your data, let’s discuss where to go next in the NeMo Framework to put it to good use.

Tutorials

To get started, you can explore the NeMo Curator GitHub repository and follow the available tutorials and notebooks. These resources cover various aspects of data curation, including training from scratch and Parameter-Efficient Fine-Tuning (PEFT).

API Docs

API documentation for all the modules in NeMo Curator.