Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
Data Preparation
The data_preparation configuration in conf/config.yaml specifies the data preparation configuration file that will be used. Its default value is download_gpt3_pile, which corresponds to the file conf/data_preparation/download_gpt3_pile.yaml.
The configurations in the data preparation configuration file are used to download, extract, and preprocess the Pile dataset for GPT models. Modify these configurations to control data preparation tasks and to specify where to store the datasets, vocabulary, and merge files.
To download a reduced portion of the dataset to run tests, you can set the file_numbers configuration to download only one of the shards by changing “0-29” to “0”. The value must be a series of numbers separated by hyphens ('-') or commas (','). For example, this configuration value would download and prepare files 0, 3, 5, 6, and 7:
file_numbers="0,3,5-7"
Slurm
First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other configurations related to the job in download_gpt3_pile.yaml for GPT models.
You can parallelize data preparation by using up to 30 nodes to download all 30 files in parallel.
Example
To run only the data preparation pipeline, and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
  - data_preparation
Then enter:
python3 main.py
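The stage and cluster selections can also be passed as hydra overrides on the command line instead of being edited into the configuration files. The following minimal sketch assumes the default download_gpt3_pile data preparation configuration; the override names mirror the configuration keys described above:
# Run only the data preparation stage on a Slurm (bcm) cluster.
python3 main.py stages=[data_preparation] cluster=bcm cluster_type=bcm \
    data_preparation=download_gpt3_pile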
Base Command Platform
To run the data preparation script on Base Command Platform, set the cluster_type
configuration in conf/config.yaml
to bcp
.
This configuration can be overridden from the command line using hydra.
By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on.
Store the vocabulary and merge files in the same workspace for later use.
You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.
You can download the 700+ GB dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace's permissions.
To run the data preparation pipeline for GPT models, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount/data/bpe data_preparation.merges_save_dir=/mount/data/bpe >> /results/data_gpt3_log.txt 2>&1
The command above assumes that you want to prepare the entire dataset (files 0-29), that you mounted the data workspace in /mount/data, and that you mounted the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_gpt3_log.txt, which you can download from NGC. You can add any other configuration required to modify the command's behavior.
Kubernetes
To run data preparation on a Kubernetes cluster, set both the cluster and cluster_type parameters to k8s in conf/config.yaml. Additionally, set the launcher_scripts_path parameter to the location of the launcher scripts on the NFS filesystem. This must be the same path on all nodes in the cluster. Ensure that the stages parameter is set to data_preparation and that data_preparation in the defaults section points to the intended data preparation script.
The conf/cluster/k8s.yaml file also needs to be updated with the Kubernetes container registry secret, if one was created earlier (pull_secret); the shm_size, which determines how much local memory to allocate to each pod; and the NFS server and path where the launcher scripts are saved. These can all be overridden from the command line using hydra as well.
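For example, a minimal sketch of such overrides is shown below; the secret name, NFS server, paths, and the exact key names under the cluster group are illustrative assumptions and should be replaced with the values for your cluster:
# Hypothetical values; substitute your registry secret, NFS server, and paths.
python3 main.py cluster=k8s cluster_type=k8s stages=[data_preparation] \
    launcher_scripts_path=/nfs/NeMo-Framework-Launcher/launcher_scripts \
    cluster.pull_secret=ngc-registry-secret \
    cluster.shm_size=512Gi \
    cluster.nfs_server=nfs.example.com \
    cluster.nfs_path=/export/nemo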
Once all of the config files are updated, the data preparation can be launched from the controller node with:
python main.py
This will generate and launch a job via Helm in the default namespace, which can be viewed with helm show or kubectl get pods. The logs can be followed with kubectl logs <pod-name> for the first pod deployed for the job.
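For reference, a typical way to check on the job from the controller node (the pod name is a placeholder):
# List the pods created for the job, then follow the logs of the first one.
kubectl get pods
kubectl logs -f <pod-name>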
Common
Set the data preparation job’s configuration for GPT models in the YAML file:
run:
  name: download_gpt3_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.
dataset: pile
download_the_pile: True # Whether to download the pile dataset from the internet.
the_pile_url: "https://huggingface.co/datasets/monology/pile-uncopyrighted/tree/main/train/" # Source URL to download The Pile dataset from.
file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://huggingface.co/gpt2/resolve/main/vocab.json" # URL to download the vocab from.
download_merges_url: "https://huggingface.co/gpt2/resolve/main/merges.txt" # URL to download the merges from.
vocab_save_dir: ${data_dir}/bpe
merges_save_dir: ${data_dir}/bpe
tokenizer_type: GPT2BPETokenizer
rm_downloaded: True # Extract script will remove downloaded zst after extraction
rm_extracted: True # Preprocess script will remove extracted files after preproc.
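Any of these values can also be overridden from the command line with hydra, as in the Base Command Platform example above. For instance, a minimal sketch of a reduced run that keeps the downloaded archives for inspection:
# Process only shard 0 and keep the downloaded .zst archives after extraction.
python3 main.py stages=[data_preparation] \
    data_preparation.file_numbers='0' \
    data_preparation.rm_downloaded=False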