Bring Your Own Dataset

If you want to train GPT, T5, or mT5 models on your own dataset (which must already be filtered and cleaned), you must first convert the dataset files to JSONL files (with the extension .jsonl).
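
Each line of a JSONL file is a standalone JSON object. The sketch below assumes the document text is stored under a text key; use whichever key your preprocessing configuration expects:

{"text": "First document, already filtered and cleaned, on a single line."}
{"text": "Second document."}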

As discussed in earlier sections, the data_preparation configuration in conf/config.yaml specifies which configuration file to use for data preparation. To run your own dataset, set data_preparation to generic/custom_dataset and include data_preparation in stages. The configuration file custom_dataset is at conf/data_preparation/generic/custom_dataset.yaml. NVIDIA provides scripts that you can use to train your own tokenizer and preprocess your dataset into a format that the NVIDIA training scripts can consume.
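
A sketch of the relevant entries in conf/config.yaml (other entries omitted; data_preparation is assumed to be selected through the launcher's hydra defaults list):

defaults:
  - data_preparation: generic/custom_dataset

stages:
  - data_preparation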

At present, custom dataset processing supports only SentencePiece tokenizers. You can either train a fresh SentencePiece tokenizer with the NVIDIA scripts or load an existing one.
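
To reuse a tokenizer you have already trained, a sketch of the relevant settings in custom_dataset.yaml (the field names match the configuration shown later in this section; the path is a placeholder) is:

train_tokenizer: False                              # skip SentencePiece training
tokenizer_model: /path/to/existing/tokenizer.model  # existing SentencePiece model to use for preprocessing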

By default, custom configurations parallelize data preparation by using 20 nodes to preprocess custom dataset files in parallel. You can increase parallelization by using more nodes, up to the number of dataset files to be processed or the number of nodes available.

First, ensure that the cluster-related configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then set time_limit or any other configurations that need to be updated in custom_dataset.yaml. The data preparation can be parallelized by using nodes * workers_per_node number of workers (up to one worker per dataset file).
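
For example, to run preprocessing with 8 nodes and 4 workers per node (32 workers in total; the numbers are illustrative), adjust the run section of custom_dataset.yaml:

run:
  node_array_size: 8    # number of nodes used for preprocessing
  workers_per_node: 4   # workers launched on each node
  time_limit: "24:00:00"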

Example

To run only the data preparation pipeline and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - data_preparation

Then enter:

python3 main.py

To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra.
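
For example, rather than editing conf/config.yaml, you can pass the override when launching:

python3 main.py cluster_type=bcp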

By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the tokenizer model files in the same workspace for later use.

You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use two or more nodes, up to the number of custom dataset files.

To run the data preparation pipeline, enter:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[data_preparation] \
    cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
    data_dir=/mount/data \
    base_results_dir=/mount/results \
    data_preparation=generic/custom_dataset \
    data_preparation.train_tokenizer_args.input=/path/to/text/file/for/training/tokenizer \
    data_preparation.raw_dataset_files=</path/to/custom_data_files> \
    >> /results/data_custom_dataset_log.txt 2>&1

The command above assumes that you mounted the data workspace at /mount/data and the results workspace at /mount/results. stdout and stderr are redirected to the file /results/data_custom_dataset_log.txt, which you can download from NGC. You can add any other required configuration settings to modify the command’s behavior.

First, ensure that the cluster-related configuration in conf/cluster/k8s.yaml is correct. Set the cluster and cluster_type parameters in conf/config.yaml to k8s. Then set time_limit or any other parameters you need to change in custom_dataset.yaml. The data preparation can be parallelized by using nodes * workers_per_node number of workers (up to one worker per dataset file).
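
The corresponding entries in conf/config.yaml would then look similar to this sketch (assuming cluster is selected through the launcher's hydra defaults list, as for the other cluster types):

defaults:
  - cluster: k8s   # loads conf/cluster/k8s.yaml

cluster_type: k8s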

Example

To run only the data preparation pipeline and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - data_preparation

Then enter:

python3 main.py

Once launched, a Helm chart will be created based on the config files and the data preparation will begin.
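
You can monitor the job with standard Helm and Kubernetes tooling, for example:

helm list                    # shows the chart created by the launcher
kubectl get pods             # lists the data preparation pods and their status
kubectl logs -f <pod-name>   # follows the preprocessing logs of a pod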

Set the configuration for the custom data preparation job in conf/data_preparation/generic/custom_dataset.yaml:

run:
  name: custom_dataset
  results_dir: ${base_results_dir}/${.name}
  time_limit: "24:00:00"
  dependency: "singleton"
  node_array_size: 4
  cpus_per_node: 256
  workers_per_node: 4 # Number of workers per node in preprocessing step.

dataset: custom_dataset
custom_dataset_dir: ${data_dir}/custom_dataset
train_tokenizer: True # True to train a sentence piece tokenizer
train_tokenizer_args: # For all options please check: https://github.com/google/sentencepiece/blob/master/doc/options.md
  input: null # text file for training tokenizer
  input_format: "text" # text or tsv
  model_prefix: "custom_sp_tokenizer"
  model_type: "bpe" # model algorithm: unigram, bpe, word or char
  vocab_size: 8000 # Vocabulary size
  character_coverage: 0.9995 # character coverage to determine the minimum symbols
  unk_id: 1
  bos_id: 2
  eos_id: 3
  pad_id: 0
bpe_save_dir: ${.custom_dataset_dir}/bpe # Dir to save sentence piece tokenizer model and vocab files

preprocess_data: True # True to preprocess the data from json, jsonl or json.gz files, False otherwise.
raw_dataset_files:
  - null # Each file should be input json, jsonl or json.gz file
tokenizer_model: ${.bpe_save_dir}/${data_preparation.train_tokenizer_args.model_prefix}.model # trained SentencePiece tokenizer model
preprocess_worker_mapping: ${.custom_dataset_dir}/preprocess_mapping
preprocessed_dir: ${.custom_dataset_dir}/preprocessed
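
Any of these values can also be overridden from the command line instead of editing the file. For example (the vocabulary size and file paths are placeholders):

python3 main.py stages=[data_preparation] data_preparation=generic/custom_dataset \
    data_preparation.train_tokenizer_args.vocab_size=32000 \
    data_preparation.raw_dataset_files=[/path/to/file1.jsonl,/path/to/file2.jsonl]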

Note

Depending on the dataset and system, preprocessing very large dataset shard files can exhaust system memory (OOM). The solution is to reduce the size of the dataset shards. If you encounter this issue, consider breaking up the json, jsonl, or json.gz files into smaller chunks before running preprocessing.
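
Because JSONL is line-delimited, a large .jsonl file can be split safely on line boundaries, for example with the standard split utility (the 100000-line shard size is illustrative):

split -l 100000 -d --additional-suffix=.jsonl large_dataset.jsonl shard_

This produces shard_00.jsonl, shard_01.jsonl, and so on; list the resulting shards under raw_dataset_files.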
