NeMo Framework Single Node Pre-training
Quick Start
This playbook demonstrates how to train a GPT-style model with the NVIDIA NeMo Framework.
The full training run takes a considerable amount of time, but it can be completed on a single GPU with as little as 16 GB of GPU memory.
Set up the NeMo Framework container
Specify the NeMo Framework container image.
export NEMO_IMAGE="nvcr.io/nvidia/nemo:24.05"
Specify a common location where your data files are stored.
export DATA_DIR="/path/to/data"
Note
You’ll need approximately 30GB of available storage in the provided data directory.
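Before you start the downloads, you may want to confirm that the data directory has enough free space. The following is a minimal sketch, not part of the playbook itself: the helper names are hypothetical, and the 30 GB threshold is simply the approximate figure from the note above.

```python
import os
import shutil

REQUIRED_GB = 30  # approximate requirement quoted in the note above


def free_gigabytes(path: str) -> float:
    """Return free space, in GB (10**9 bytes), on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if `path` has at least `required_gb` GB available."""
    return free_gigabytes(path) >= required_gb


if __name__ == "__main__":
    # Assumption: DATA_DIR is exported as shown earlier; fall back to the current directory.
    data_dir = os.environ.get("DATA_DIR", ".")
    print(f"{data_dir}: {free_gigabytes(data_dir):.1f} GB free, "
          f"enough for the example: {has_enough_space(data_dir)}")
```

Run it on the host before starting the container, or inside the container against /data.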
Pull and run the NeMo Framework container.
Note
This example uses 8 GB of shared memory. You can allocate more shared memory as needed, provided you have the resources to do so.
docker run --gpus all -it --rm --shm-size 8G -v "${DATA_DIR}:/data" "$NEMO_IMAGE"
After the command runs, you should see output similar to the following:
root@abc123:/workspace#
Navigate to the data directory.
cd /data
Get the example scripts provided by the NeMo GitHub repository.
git clone https://github.com/NVIDIA/NeMo.git
You are now ready to move on to the data download and pre-processing stage.
Data download and pre-processing
Note
Depending on your Internet connection, it can take 3-5 hours to complete the data download, pre-processing, and tokenizer retrieval procedures in this example.
Step 1: Download data
The following step downloads around 10GB of Wikipedia data and can take several hours.
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Step 2: Extract raw data
pip install wikiextractor
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl
The train_data.jsonl file contains the training data in JSON Lines format: one JSON object per line. The data of interest is under the “text” field.
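To confirm that the extraction produced what the later steps expect, you can look at the first few records. The sketch below is an illustration only; `peek_jsonl` is a hypothetical helper, and it assumes each non-empty line is a JSON object with a "text" key, as described above.

```python
import json


def peek_jsonl(path, n=3, key="text"):
    """Return the `key` field of the first `n` non-empty records in a JSON Lines file."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(samples) >= n:
                break
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)  # raises ValueError on a malformed line
            samples.append(record[key])  # raises KeyError if the field is missing
    return samples
```

For example, `peek_jsonl("train_data.jsonl", n=2)` returns the article text of the first two records, which makes it easy to spot an empty or malformed file before you spend time on tokenization.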
Step 3: Get tokenizer files
This step uses the Hugging Face GPT2 tokenizer files. With this option, you download a pre-built vocabulary file and merges file for the BPE tokenizer.
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
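A quick sanity check on the two downloads can catch a truncated or corrupted file early. This is a sketch under stated assumptions: the vocabulary file is a JSON object that maps tokens to integer IDs, and the merges file holds one space-separated pair per line after a "#version" header. The function names are hypothetical.

```python
import json


def load_vocab(path):
    """Load the token-to-ID mapping from a GPT-2 style vocabulary JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_merges(path):
    """Return BPE merge rules as (left, right) pairs, skipping the header and blank lines."""
    merges = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#version"):
                continue
            left, right = line.split(" ")  # raises ValueError on a malformed rule
            merges.append((left, right))
    return merges


def tokenizer_file_sizes(vocab_path, merges_path):
    """Return (number of vocabulary entries, number of merge rules)."""
    return len(load_vocab(vocab_path)), len(load_merges(merges_path))
```

For the standard GPT-2 tokenizer, `tokenizer_file_sizes("gpt2-vocab.json", "gpt2-merges.txt")` should report 50,257 vocabulary entries and 50,000 merge rules; noticeably smaller numbers suggest an incomplete download.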
Step 4: Convert training data into memory map format
The memory map format makes training more efficient, especially with many nodes and GPUs. This step will also tokenize data using the tokenizer model from Step 3.
python NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=train_data.jsonl \
--json-keys=text \
--tokenizer-library=megatron \
--vocab gpt2-vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt \
--output-prefix=hfbpe_gpt_training_data \
--append-eod \
--workers=32
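The training command later in this playbook refers to the dataset by the prefix hfbpe_gpt_training_data_text_document. The sketch below checks that the pre-processing step left the expected pair of files for that prefix. It assumes the memory map format consists of one .bin file (token data) and one .idx file (index) per prefix; the helper names are hypothetical.

```python
from pathlib import Path

# Prefix used by model.data.data_prefix in the training command.
DATASET_PREFIX = "hfbpe_gpt_training_data_text_document"


def mmap_dataset_files(prefix, directory="."):
    """Return the .bin and .idx paths expected for a memory-mapped dataset prefix."""
    base = Path(directory)
    return [base / f"{prefix}.bin", base / f"{prefix}.idx"]


def missing_dataset_files(prefix, directory="."):
    """Return the expected dataset files that are not present on disk."""
    return [str(p) for p in mmap_dataset_files(prefix, directory) if not p.is_file()]
```

An empty list from `missing_dataset_files(DATASET_PREFIX, "/data")` means both files are in place; anything else usually means that pre-processing was interrupted or written to a different directory.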
Train GPT-style model
Once you have prepared the training data and tokenizer, you are ready to train the model.
The following configuration has about 124M parameters and should fit on a single 16GB GPU using float16.
python NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=conf/ \
--config-name=megatron_gpt_config \
trainer.devices=1 \
trainer.num_nodes=1 \
trainer.max_epochs=null \
trainer.max_steps=300000 \
trainer.val_check_interval=300 \
trainer.log_every_n_steps=50 \
trainer.limit_val_batches=50 \
trainer.limit_test_batches=50 \
trainer.accumulate_grad_batches=1 \
trainer.precision=16 \
model.micro_batch_size=6 \
model.global_batch_size=192 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.max_position_embeddings=1024 \
model.encoder_seq_length=1024 \
model.hidden_size=768 \
model.ffn_hidden_size=3072 \
model.num_layers=12 \
model.num_attention_heads=12 \
model.init_method_std=0.021 \
model.hidden_dropout=0.1 \
model.layernorm_epsilon=1e-5 \
model.tokenizer.vocab_file=gpt2-vocab.json \
model.tokenizer.merge_file=gpt2-merges.txt \
model.data.data_prefix=[1.0,hfbpe_gpt_training_data_text_document] \
model.data.num_workers=2 \
model.data.seq_length=1024 \
model.data.splits_string=\'980,10,10\' \
model.optim.name=fused_adam \
model.optim.lr=6e-4 \
model.optim.betas=[0.9,0.95] \
model.optim.weight_decay=0.1 \
model.optim.sched.name=CosineAnnealing \
model.optim.sched.warmup_steps=750 \
model.optim.sched.constant_steps=80000 \
model.optim.sched.min_lr=6e-5 \
exp_manager.resume_if_exists=True \
exp_manager.resume_ignore_no_checkpoint=True \
exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.monitor=val_loss \
exp_manager.checkpoint_callback_params.save_top_k=3 \
exp_manager.checkpoint_callback_params.mode=min \
exp_manager.checkpoint_callback_params.always_save_nemo=False
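The numbers in this command can be checked with some quick arithmetic. The sketch below is illustrative only and rests on several assumptions: a standard GPT-2 style layer layout with biases, input and output embeddings that share weights, and an unpadded vocabulary of 50,257 tokens (the framework may pad the vocabulary, which shifts the total slightly). It also assumes that the global batch is assembled from micro-batches across data-parallel ranks, which is why trainer.accumulate_grad_batches stays at 1. The function names are hypothetical.

```python
def gpt_parameter_count(num_layers, hidden_size, ffn_hidden_size, vocab_size, max_positions):
    """Estimate parameters of a GPT-2 style decoder with tied input/output embeddings."""
    embeddings = vocab_size * hidden_size + max_positions * hidden_size
    # QKV projection plus attention output projection, weights and biases.
    attention = 4 * hidden_size * hidden_size + 4 * hidden_size
    # Two feed-forward projections, weights and biases.
    mlp = 2 * hidden_size * ffn_hidden_size + ffn_hidden_size + hidden_size
    layer_norms = 2 * 2 * hidden_size  # two LayerNorms per layer, scale and bias
    final_layer_norm = 2 * hidden_size
    return embeddings + num_layers * (attention + mlp + layer_norms) + final_layer_norm


def micro_batches_per_step(global_batch_size, micro_batch_size, data_parallel_size):
    """Number of micro-batches each data-parallel rank processes per optimizer step."""
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError("global_batch_size must be a multiple of micro_batch_size * data_parallel_size")
    return global_batch_size // per_pass


if __name__ == "__main__":
    params = gpt_parameter_count(12, 768, 3072, vocab_size=50257, max_positions=1024)
    print(f"parameters: {params:,}")  # 124,439,808, i.e. about 124M
    print(f"micro-batches per step: {micro_batches_per_step(192, 6, 1)}")  # 32
    tokens_per_step = 192 * 1024  # global_batch_size * seq_length
    print(f"tokens per step: {tokens_per_step:,}")  # 196,608
    print(f"tokens over 300,000 steps: {tokens_per_step * 300_000:,}")  # 58,982,400,000
```

With one GPU and model.micro_batch_size=6, each optimizer step therefore covers 32 forward and backward passes. If you run out of memory, lowering model.micro_batch_size to another divisor of 192 keeps the global batch size unchanged.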
To monitor the training, launch TensorBoard from another terminal.
tensorboard --logdir nemo_experiments --bind_all