Data Preparation

You must prepare several datasets for the NeMo framework to use, depending on the type of model you are using. Each individual model guide contains information about the specific data preparation settings needed. Here, we present some overall information about the datasets available, and how to prepare your own datasets.

  • The Pile: NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended by using the mix described in the paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling, by Gao et al.

    NVIDIA recommends that the NeMo-Megatron-Launcher repository and the datasets are stored in a file system shared by all of the nodes during training. A shared workspace with read-write permissions is recommended in the case of Base Command Platform-based clusters.

    The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour assuming a 30 MB/sec download speed. The data preparation can be parallelized by using up to 30 nodes.

  • mC4: The Multilingual C4 (mC4) dataset has 101 languages and is generated from 71 Common Crawl dumps. NVIDIA provides utilities to download and prepare the mC4 dataset (allen-ai version). NVIDIA recommends that this dataset be stored in a file system shared by all of the nodes. A shared workspace with read-write permissions is recommended in the case of Base Command Platform-based clusters.

    NVIDIA provides scripts that give you the option to download and preprocess any subset of the dataset’s 101 languages. A selection of 24 languages is included in the default language list, and you can restrict it to a smaller subset (see the example after this list). The raw size of the default language set is around 5 TB.

    Parallelization is enabled in the downloading and preprocessing scripts. It provides a significant speed-up by automatically distributing and balancing the work on multi-node systems. Downloading and preprocessing the default language list takes approximately 7 hours, assuming a 30 MB/sec download speed and parallelization using 20 nodes.

    The preprocessed dataset’s size is around 12 TB. NVIDIA recommends that you use a file system with more than 20 TB of free space to prepare the data.

    NVIDIA currently does not support training with more than 25 languages. (See Known Issues.)
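
To restrict the mC4 language list mentioned above, edit the mC4 data preparation config before running the data_preparation stage. The excerpt below is a sketch: it assumes the config exposes a languages field as a comma-separated list (for example in conf/data_preparation/mt5/download_mc4.yaml), so verify the exact file name and keys in your launcher version.

    # mC4 data preparation config (illustrative excerpt)
    languages: "en,de,fr,es"   # subset of the 101 available languages to download and preprocess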

All configuration files for data preparation are in the launcher_scripts directory, under conf/data_preparation.

The configuration used for data preparation of the Pile or mC4 dataset must be specified in the conf/config.yaml file, and data_preparation must be included in stages to run it.
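
For example, when preparing the Pile for a GPT-style model, the relevant parts of conf/config.yaml might look like the excerpt below. The config name gpt3/download_gpt3_pile (and mt5/download_mc4 for mC4) is illustrative of the names shipped with the launcher; check the conf/data_preparation directory for the exact options in your version.

    # conf/config.yaml (illustrative excerpt)
    defaults:
      - data_preparation: gpt3/download_gpt3_pile   # e.g., mt5/download_mc4 for the mC4 dataset

    stages:
      - data_preparation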

If you want to train models on your own dataset (which must already be filtered and cleaned), you must first convert the dataset files to JSONL files (with the extension .jsonl).
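
Each line of a .jsonl file must be a complete JSON object. Documents are commonly stored under a text field, as in the sketch below; the exact key expected depends on your preprocessing settings.

    {"text": "First document of the training corpus ..."}
    {"text": "Second document of the training corpus ..."}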

As discussed in earlier sections, the data_preparation setting in conf/config.yaml specifies which file to use for data preparation configuration. To prepare your own dataset, you must set data_preparation to generic/custom_dataset and include data_preparation in stages. The configuration file custom_dataset is at conf/data_preparation/generic/custom_dataset.yaml. NVIDIA provides scripts that you can use to train your own tokenizer and preprocess your dataset into a format that the NVIDIA training scripts can consume.

At present, custom dataset processing supports only SentencePiece tokenizers. You can either train a fresh SentencePiece tokenizer with the NVIDIA scripts or load an existing one.
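
For example, to reuse a previously trained SentencePiece model instead of training a new one, you could set the following in custom_dataset.yaml. The keys correspond to the Common configuration shown later in this section; the path is illustrative.

    train_tokenizer: False                                  # skip SentencePiece training
    tokenizer_model: /path/to/existing/sp_tokenizer.model   # use this existing SentencePiece model instead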

By default, the custom configuration parallelizes data preparation across 20 nodes, preprocessing the custom dataset files in parallel. You can increase parallelization by using more nodes, up to the number of dataset files to be processed or the number of nodes available.

Slurm

First, ensure that the cluster-related configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then, in custom_dataset.yaml, set time_limit and any other configurations that need to be updated. The data preparation can be parallelized by using nodes * workers_per_node workers (up to one worker per dataset file).
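
For example, to spread preprocessing over 8 nodes with 4 workers each (32 workers in total), you might update custom_dataset.yaml as follows. The values are illustrative; the keys match the Common configuration shown later in this section.

    run:
      time_limit: "24:00:00"
      node_array_size: 8    # number of nodes allocated for data preparation
      workers_per_node: 4   # total workers = node_array_size * workers_per_node = 32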

Example

To run only the data preparation pipeline and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

    stages:
      - data_preparation

Then enter:

    python3 main.py
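
Configuration values can also be overridden on the command line through Hydra instead of editing the YAML files, for example as follows. The override paths follow the Common configuration shown later in this section and are illustrative.

    python3 main.py data_preparation=generic/custom_dataset \
        data_preparation.run.node_array_size=8 \
        data_preparation.run.workers_per_node=4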

Base Command Platform

To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using Hydra.

By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the tokenizer model files in the same workspace for later use.

You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use two or more nodes, up to the number of custom dataset files.

To run the data preparation pipeline, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=<data_preparation> \
        cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
        data_dir=/mount/data \
        base_results_dir=/mount/results data_preparation=generic/custom_dataset \
        data_preparation.train_tokenizer_args.input=/path/to/text/file/for/training/tokenizer \
        data_preparation.raw_dataset_files=</path/to/custom_data_files> \
        >> /results/data_custom_dataset_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_custom_dataset_log.txt, which you can download from NGC. Any other required configuration can be added to the command line to modify the command’s behavior.

Kubernetes

First, ensure that the cluster-related configuration in conf/cluster/k8s.yaml is correct. Set the cluster and cluster_type parameters in conf/config.yaml to k8s. Then, in custom_dataset.yaml, set time_limit and any other parameters you need to change. The data preparation can be parallelized by using nodes * workers_per_node workers (up to one worker per dataset file).
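
As a rough sketch, conf/cluster/k8s.yaml holds the cluster-specific values that the generated Helm chart needs, such as the container registry pull secret and shared-storage settings. The field names below are assumptions based on typical launcher versions; verify them against the file in your installation.

    # conf/cluster/k8s.yaml (assumed field names; verify against your launcher version)
    pull_secret: null    # Kubernetes secret for pulling the container image
    shm_size: 512Gi      # shared memory size for the worker pods
    nfs_server: null     # hostname or IP of the shared NFS server
    nfs_path: null       # exported path on the NFS server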

Example

To run only the data preparation pipeline and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

    stages:
      - data_preparation

Then enter:

    python3 main.py

Once launched, a Helm chart is created based on the config files, and data preparation begins.
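
You can inspect the generated release and its pods with the standard Helm and kubectl tooling, for example as shown below. The release and pod names depend on your run configuration; <data-prep-pod> is a placeholder.

    helm list                      # list the chart release created by the launcher
    kubectl get pods               # list the data preparation worker pods
    kubectl logs <data-prep-pod>   # inspect the preprocessing log of one worker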

Common

Set the configuration for the custom data preparation job in the YAML file:

    run:
      name: custom_dataset
      results_dir: ${base_results_dir}/${.name}
      time_limit: "24:00:00"
      dependency: "singleton"
      node_array_size: 4
      cpus_per_node: 256
      workers_per_node: 4 # Number of workers per node in preprocessing step.

    dataset: custom_dataset
    custom_dataset_dir: ${data_dir}/custom_dataset
    train_tokenizer: True # True to train a sentence piece tokenizer
    train_tokenizer_args: # For all options please check: https://github.com/google/sentencepiece/blob/master/doc/options.md
      input: null # text file for training tokenizer
      input_format: "text" # text or tsv
      model_prefix: "custom_sp_tokenizer"
      model_type: "bpe" # model algorithm: unigram, bpe, word or char
      vocab_size: 8000 # Vocabulary size
      character_coverage: 0.9995 # character coverage to determine the minimum symbols
      unk_id: 1
      bos_id: 2
      eos_id: 3
      pad_id: 0
    bpe_save_dir: ${.custom_dataset_dir}/bpe # Dir to save sentence piece tokenizer model and vocab files

    preprocess_data: True # True to preprocess the data from json, jsonl or json.gz files, False otherwise.
    raw_dataset_files:
      - null # Each file should be input json, jsonl or json.gz file
    tokenizer_model: ${.bpe_save_dir}/${data_preparation.train_tokenizer_args.model_prefix}.model # trained SentencePiece tokenizer model
    preprocess_worker_mapping: ${.custom_dataset_dir}/preprocess_mapping
    preprocessed_dir: ${.custom_dataset_dir}/preprocessed
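
For example, raw_dataset_files is typically filled in with the actual shard files of your dataset; the paths below are illustrative.

    raw_dataset_files:
      - ${data_dir}/custom_dataset/shard_00.jsonl
      - ${data_dir}/custom_dataset/shard_01.jsonl
      - ${data_dir}/custom_dataset/shard_02.jsonl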

Note

Depending on the dataset and the system, preprocessing very large dataset shard files can exhaust system memory (OOM). The solution is to reduce the dataset shard sizes. If you encounter this issue, consider breaking up your json, jsonl, or json.gz files into smaller chunks before running preprocessing.
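
For example, a large .jsonl file can be split into chunks of one million lines each with the standard split utility before preprocessing; the file names are illustrative.

    # split big_corpus.jsonl into numbered 1,000,000-line chunks: big_corpus_part_00.jsonl, big_corpus_part_01.jsonl, ...
    split -l 1000000 -d --additional-suffix=.jsonl big_corpus.jsonl big_corpus_part_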
