Data Preparation
You must prepare several datasets for the NeMo framework to use, depending on the type of model you are using. Each individual model guide contains information about the specific data preparation settings needed. Here, we present some overall information about the datasets available, and how to prepare your own datasets.
Predefined Datasets
The Pile: NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended by using the mix described in the paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling, by Gao et al., mirrored here.
NVIDIA recommends that the NeMo-Framework-Launcher repository and the datasets be stored in a file system shared by all of the nodes during training. A shared workspace with read-write permissions is recommended in the case of Base Command Platform-based clusters.
The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour assuming a 30 MB/sec download speed. The data preparation can be parallelized by using up to 30 nodes.
mC4: The Multilingual C4 (mC4) dataset has 101 languages and is generated from 71 Common Crawl dumps. NVIDIA provides utilities to download and prepare the mC4 dataset (allen-ai version). NVIDIA recommends that this dataset be stored in a file system shared by all of the nodes. A shared workspace with read-write permissions is recommended in the case of Base Command Platform-based clusters.
NVIDIA provides scripts that give you the option to download and preprocess any subset of the dataset’s 101 languages. A selection of 24 languages are included in the default language list. The raw size of the default language set is around 5 TB.
Parallelization is enabled in the downloading and preprocessing scripts. It provides a significant speed-up by automatically distributing and balancing the work on multi-node systems. Downloading and preprocessing the default language list takes approximately 7 hours, assuming a 30 MB/sec download speed and parallelization using 20 nodes.
The preprocessed dataset’s size is around 12 TB. NVIDIA recommends that you use a file system with more than 20 TB of free space to prepare the data.
NVIDIA currently does not support training with more than 25 languages. (See Known Issues.)
All config files for data preparation are in launcher_scripts. The configuration used for data preparation for the Pile dataset or mC4 dataset must be specified in the conf/config.yaml file, and data_preparation must be included in stages to run it.
.. _modelguide-owndataset:
Bring Your Own Dataset
If you want to train models on your own dataset (which must already be filtered and cleaned), you must first convert the dataset files to JSONL files (with the extension .jsonl).
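As a minimal sketch of that conversion: a .jsonl file holds one JSON object per line. The "text" key used below is an assumption for illustration; check your specific model guide for the exact field names its preprocessing expects.

```python
# Minimal sketch: writing cleaned documents as a .jsonl file, one JSON
# object per line. The "text" key is an assumption here; the exact field
# names expected depend on the model's preprocessing settings.
import json

documents = [
    "First cleaned and filtered document.",
    "Second document.\nIt may span multiple lines.",
]

with open("my_dataset.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps({"text": doc}) + "\n")

# Reading it back: each line parses independently.
with open("my_dataset.jsonl") as f:
    records = [json.loads(line) for line in f]
print(len(records))
```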
As discussed in earlier sections, the data_preparation configuration in conf/config.yaml specifies which file to use for data preparation configuration. To run your own dataset, you must set data_preparation to generic/custom_dataset and include data_preparation in stages. The configuration file custom_dataset is at conf/data_preparation/generic/custom_dataset.yaml. NVIDIA provides scripts that you can use to train your own tokenizer and preprocess your dataset into a format that the NVIDIA training scripts can consume.
At present, custom dataset processing supports only SentencePiece tokenizers. You can either train a fresh SentencePiece tokenizer with NVIDIA scripts, or load an existing one.
By default, custom configurations parallelize data preparation by using 20 nodes to preprocess custom dataset files in parallel. You can increase parallelization by using more nodes, up to the number of dataset files to be processed or the number of nodes available.
Slurm
First, ensure that the cluster-related configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm.
Then set time_limit or any other configurations that need to be updated in custom_dataset.yaml. The data preparation can be parallelized by using nodes * workers_per_node workers (up to one worker per dataset file).
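The worker count works out as a simple cap. A small sketch, using the node_array_size (nodes) and workers_per_node fields shown later under "Common" (the helper name is illustrative):

```python
# Sketch: effective parallelism of the preprocessing step.
# Parallelism is nodes * workers_per_node, capped at one worker
# per dataset file.
def effective_workers(nodes, workers_per_node, num_dataset_files):
    return min(nodes * workers_per_node, num_dataset_files)

print(effective_workers(4, 4, 30))  # 16 workers for 30 files
print(effective_workers(4, 4, 10))  # capped at 10 workers for 10 files
```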
Example
To run only the data preparation pipeline and not the training,
evaluation, or inference pipelines, set the stages
section of conf/config.yaml
to:
stages:
- data_preparation
Then enter:
python3 main.py
Base Command Platform
To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using Hydra.
By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the tokenizer model files in the same workspace for later use.
You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use two or more nodes, up to the number of custom dataset files.
To run the data preparation pipeline, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py stages=<data_preparation> \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts \
data_dir=/mount/data \
base_results_dir=/mount/results data_preparation=generic/custom_dataset \
data_preparation.train_tokenizer_args.input=/path/to/text/file/for/training/tokenizer \
data_preparation.raw_dataset_files=</path/to/custom_data_files> \
>> /results/data_custom_dataset_log.txt 2>&1
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_custom_dataset_log.txt, which you can download from NGC. Any other required configuration may be added to modify the command's behavior.
Kubernetes
First, ensure that the cluster-related configuration in conf/cluster/k8s.yaml is correct. Set the cluster and cluster_type parameters in conf/config.yaml to k8s.
Then set time_limit or any other parameters you need to change in custom_dataset.yaml. The data preparation can be parallelized by using nodes * workers_per_node workers (up to one worker per dataset file).
Example
To run only the data preparation pipeline and not the training,
evaluation, or inference pipelines, set the stages
section of conf/config.yaml
to:
stages:
- data_preparation
Then enter:
python3 main.py
Once launched, a Helm chart will be created based on the config files and the data preparation will begin.
Common
Set the configuration for the custom data preparation job in the YAML file:
run:
  name: custom_dataset
  results_dir: ${base_results_dir}/${.name}
  time_limit: "24:00:00"
  dependency: "singleton"
  node_array_size: 4
  cpus_per_node: 256
  workers_per_node: 4 # Number of workers per node in preprocessing step.

dataset: custom_dataset
custom_dataset_dir: ${data_dir}/custom_dataset
train_tokenizer: True # True to train a SentencePiece tokenizer
train_tokenizer_args: # For all options please check: https://github.com/google/sentencepiece/blob/master/doc/options.md
  input: null # text file for training tokenizer
  input_format: "text" # text or tsv
  model_prefix: "custom_sp_tokenizer"
  model_type: "bpe" # model algorithm: unigram, bpe, word, or char
  vocab_size: 8000 # vocabulary size
  character_coverage: 0.9995 # character coverage to determine the minimum symbols
  unk_id: 1
  bos_id: 2
  eos_id: 3
  pad_id: 0
bpe_save_dir: ${.custom_dataset_dir}/bpe # Dir to save SentencePiece tokenizer model and vocab files

preprocess_data: True # True to preprocess the data from json, jsonl, or json.gz files; False otherwise.
raw_dataset_files:
  - null # Each file should be an input json, jsonl, or json.gz file
tokenizer_model: ${.bpe_save_dir}/${data_preparation.train_tokenizer_args.model_prefix}.model # trained SentencePiece tokenizer model
preprocess_worker_mapping: ${.custom_dataset_dir}/preprocess_mapping
preprocessed_dir: ${.custom_dataset_dir}/preprocessed
Note
Depending on the dataset and system, preprocessing very large dataset shard files can exhaust system memory (OOM). The solution is to reduce the shard sizes. If you encounter this issue, consider breaking up json, jsonl, or json.gz files into smaller chunks before running preprocessing.
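A minimal sketch of such a split for plain .jsonl shards (the helper name and chunk naming scheme are illustrative, not part of the launcher scripts):

```python
# Sketch: splitting a large .jsonl shard into smaller chunks before
# preprocessing, to avoid out-of-memory failures on very large files.
import os

def split_jsonl(path, lines_per_chunk):
    """Split `path` into numbered chunk files; return the chunk paths."""
    base, ext = os.path.splitext(path)
    chunks = []
    out = None
    with open(path) as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:  # start a new chunk
                if out is not None:
                    out.close()
                chunk_path = f"{base}_part{len(chunks):04d}{ext}"
                out = open(chunk_path, "w")
                chunks.append(chunk_path)
            out.write(line)
    if out is not None:
        out.close()
    return chunks

# Example: a 10-line shard split into chunks of 4 lines -> 3 files.
with open("big_shard.jsonl", "w") as f:
    for i in range(10):
        f.write('{"text": "doc %d"}\n' % i)
parts = split_jsonl("big_shard.jsonl", lines_per_chunk=4)
print(len(parts))  # 3
```

Because each JSONL record is a single line, splitting on line boundaries never breaks a record; the resulting chunk files can then be listed individually under raw_dataset_files.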