If you want to train GPT, T5, or mT5 models on your own dataset (which must already be filtered and cleaned), you must first convert the dataset files to JSONL files (with the extension .jsonl).
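For example, if each document in your corpus occupies one line of a plain-text file, a minimal conversion sketch could look like the following; it assumes the preprocessing step reads a text field from each JSON record, and the file names are illustrative.
import json

input_path = "my_corpus.txt"      # one document per line (illustrative name)
output_path = "my_corpus.jsonl"   # one JSON object per line

with open(input_path, encoding="utf-8") as src, open(output_path, "w", encoding="utf-8") as dst:
    for line in src:
        doc = line.strip()
        if not doc:
            continue  # skip empty lines
        # Write one JSON object per line; "text" is assumed to be the field
        # the preprocessing step reads.
        dst.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")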
As discussed in earlier sections, the data_preparation configuration in conf/config.yaml specifies which configuration file to use for data preparation. To use your own dataset, set data_preparation to generic/custom_dataset and include data_preparation in stages. The custom_dataset configuration file is at conf/data_preparation/generic/custom_dataset.yaml. NVIDIA provides scripts that you can use to train your own tokenizer and preprocess your dataset into a format that the NVIDIA training scripts can consume.
At present, custom dataset processing supports only SentencePiece tokenizers. You can either train a fresh SentencePiece tokenizer with the NVIDIA scripts or load an existing one.
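If you load an existing tokenizer, a quick way to sanity-check its .model file is with the sentencepiece Python package (a sketch; the model path is illustrative):
import sentencepiece as spm

# Load an existing SentencePiece model file (illustrative path).
sp = spm.SentencePieceProcessor(model_file="custom_sp_tokenizer.model")

sample = "NVIDIA NeMo custom dataset preprocessing"
print(sp.encode(sample, out_type=str))   # subword pieces
ids = sp.encode(sample, out_type=int)    # token IDs
print(sp.decode(ids))                    # round-trip back to text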
By default, custom configurations parallelize data preparation by using 20 nodes to preprocess custom dataset files in parallel. You can increase parallelization by using more nodes, up to the number of dataset files to be processed or the number of nodes available.
First, ensure that the cluster-related configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm.
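In many launcher versions, cluster and data_preparation are selected through the Hydra defaults list while cluster_type and stages are top-level keys; the relevant part of conf/config.yaml then looks roughly like this sketch (the exact layout may differ between launcher versions):
defaults:
  - cluster: bcm                              # reads conf/cluster/bcm.yaml
  - data_preparation: generic/custom_dataset  # reads conf/data_preparation/generic/custom_dataset.yaml

cluster_type: bcm
stages:
  - data_preparation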
Then set time_limit or any other configurations that need to be updated in custom_dataset.yaml. The data preparation can be parallelized across nodes * workers_per_node workers (up to one worker per dataset file).
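For instance, with the node_array_size and workers_per_node values shown in the configuration file later in this section, up to 4 * 4 = 16 dataset files are preprocessed in parallel:
run:
  node_array_size: 4   # number of nodes
  workers_per_node: 4  # workers per node, so up to 16 workers in total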
Example
To run only the data preparation pipeline and not the training,
evaluation, or inference pipelines, set the stages
section of conf/config.yaml
to:
stages:
- data_preparation
Then enter:
python3 main.py
To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using Hydra.
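For example, the following one-line override switches the cluster type without editing conf/config.yaml:
python3 main.py cluster_type=bcp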
By default, the data preparation script downloads the data into the data/
directory. NVIDIA recommends that you set the data_dir
configuration to a workspace, making the data visible across multiple jobs later on. Store the tokenizer model files in the same workspace for later use.
You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use two or more nodes, up to the number of custom dataset files.
To run the data preparation pipeline, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
data_dir=/mount/data \
base_results_dir=/mount/results data_preparation=generic/custom_dataset \
data_preparation.train_tokenizer_args.input=/path/to/text/file/for/training/tokenizer \
data_preparation.raw_dataset_files=</path/to/custom_data_files> \
>> /results/data_custom_dataset_log.txt 2>&1
The command above assumes that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_custom_dataset_log.txt, which you can download from NGC. Any other required configuration may be added to modify the command’s behavior.
First, ensure that the cluster-related configuration in conf/cluster/k8s.yaml is correct. Set the cluster and cluster_type parameters in conf/config.yaml to k8s.
Then set time_limit or any other parameters you need to change in custom_dataset.yaml. The data preparation can be parallelized across nodes * workers_per_node workers (up to one worker per dataset file).
Example
To run only the data preparation pipeline and not the training,
evaluation, or inference pipelines, set the stages
section of conf/config.yaml
to:
stages:
- data_preparation
Then enter:
python3 main.py
Once launched, a Helm chart is created based on the config files and the data preparation begins.
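You can monitor the resulting Helm release and its pods with standard Kubernetes tooling, for example (the pod name below is illustrative):
helm list
kubectl get pods
kubectl logs -f <data-preparation-pod-name>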
Set the configuration for the custom data preparation job in the YAML file:
run:
  name: custom_dataset
  results_dir: ${base_results_dir}/${.name}
  time_limit: "24:00:00"
  dependency: "singleton"
  node_array_size: 4
  cpus_per_node: 256
  workers_per_node: 4 # Number of workers per node in preprocessing step.

dataset: custom_dataset
custom_dataset_dir: ${data_dir}/custom_dataset
train_tokenizer: True # True to train a sentence piece tokenizer
train_tokenizer_args: # For all options please check: https://github.com/google/sentencepiece/blob/master/doc/options.md
  input: null # text file for training tokenizer
  input_format: "text" # text or tsv
  model_prefix: "custom_sp_tokenizer"
  model_type: "bpe" # model algorithm: unigram, bpe, word or char
  vocab_size: 8000 # Vocabulary size
  character_coverage: 0.9995 # character coverage to determine the minimum symbols
  unk_id: 1
  bos_id: 2
  eos_id: 3
  pad_id: 0
bpe_save_dir: ${.custom_dataset_dir}/bpe # Dir to save sentence piece tokenizer model and vocab files

preprocess_data: True # True to preprocess the data from json, jsonl or json.gz files, False otherwise.
raw_dataset_files:
  - null # Each file should be input json, jsonl or json.gz file
tokenizer_model: ${.bpe_save_dir}/${data_preparation.train_tokenizer_args.model_prefix}.model # trained SentencePiece tokenizer model
preprocess_worker_mapping: ${.custom_dataset_dir}/preprocess_mapping
preprocessed_dir: ${.custom_dataset_dir}/preprocessed
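For reference, the tokenizer-training step driven by train_tokenizer_args corresponds roughly to a SentencePiece training call like the sketch below (the input path is a placeholder; the launcher's own script may pass additional options):
import sentencepiece as spm

# Rough equivalent of the train_tokenizer_args above; the launcher's own
# tokenizer-training script may differ in details.
spm.SentencePieceTrainer.train(
    input="/path/to/text/file/for/training/tokenizer",  # placeholder path
    input_format="text",
    model_prefix="custom_sp_tokenizer",   # writes custom_sp_tokenizer.model / .vocab
    model_type="bpe",
    vocab_size=8000,
    character_coverage=0.9995,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_id=0,
)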
Depending on the dataset and system, preprocessing very large dataset shard files can exhaust system memory (OOM). The solution is to reduce the dataset shard sizes. If you encounter this issue, break the json, jsonl, or json.gz files into smaller chunks before running preprocessing.
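A minimal splitting sketch, assuming plain .jsonl input and a shard size chosen to fit your memory budget (the file names and shard size are illustrative):
# Split a large .jsonl file into shards of `lines_per_shard` lines each.
lines_per_shard = 500_000                 # tune to your memory budget
input_path = "large_corpus.jsonl"         # illustrative name

shard_idx, count, out = 0, 0, None
with open(input_path, encoding="utf-8") as src:
    for line in src:
        if out is None or count >= lines_per_shard:
            if out is not None:
                out.close()
            out = open(f"large_corpus_{shard_idx:04d}.jsonl", "w", encoding="utf-8")
            shard_idx, count = shard_idx + 1, 0
        out.write(line)
        count += 1
if out is not None:
    out.close()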