NeMo Framework Single Node Pre-training

Quick Start

This playbook demonstrates how to train a GPT-style model with the NVIDIA NeMo Framework.

The full training run takes a considerable amount of time, but it can be completed on a single GPU with as little as 16GB of GPU memory.

Set up the NeMo Framework container

Get the NeMo Framework container.

export NEMO_IMAGE="nvcr.io/nvidia/nemo:24.05"

Specify a location on the host where your data files will be stored. This directory is mounted into the container in the next step.

export DATA_DIR="/path/to/data"

Note

You’ll need approximately 30GB of available storage in the provided data directory.

Pull and run the NeMo Framework container.

Note

This example allocates 8G of shared memory. You can allocate more shared memory as needed, provided you have the resources to do so.

docker run --gpus all -it --rm --shm-size 8G -v ${DATA_DIR}:/data $NEMO_IMAGE

After the command runs, you should see output similar to the following:

root@abc123:/workspace#
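
Optionally, confirm that the GPU is visible from inside the container before continuing. A minimal check using PyTorch, which is included in the NeMo container:

import torch

# Confirm that CUDA is available and report how many GPUs the container can see.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())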

Navigate to the data directory.

cd /data

Clone the NeMo GitHub repository to get the example scripts.

git clone https://github.com/NVIDIA/NeMo.git

You are now ready to move on to the data download and pre-processing stage.

Data download and pre-processing

Note

Depending on your Internet connection, it can take 3-5 hours to complete the data download, pre-processing, and tokenizer retrieval procedures in this example.

Step 1: Download data

This step downloads around 10GB of Wikipedia data and can take several hours.

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Step 2: Extract raw data

pip install wikiextractor
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl

The train_data.jsonl file contains the training data in JSON Lines format, with one article per line. The data of interest is under the “text” field.
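
To spot-check the extraction, you can load the first record and print part of its “text” field, for example:

import json

# Each line of train_data.jsonl is one JSON object produced by wikiextractor;
# the article body is stored under the "text" key.
with open("train_data.jsonl") as f:
    first_doc = json.loads(f.readline())

print(first_doc.keys())
print(first_doc["text"][:200])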

Step 3: Get tokenizer files

This step uses the Hugging Face GPT-2 tokenizer files. With this option, you download a pre-built vocabulary file and a merges file for the BPE tokenizer.

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
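
As a quick sanity check, you can inspect the downloaded files; the GPT-2 BPE vocabulary contains roughly 50,000 entries:

import json

# gpt2-vocab.json maps BPE tokens to integer ids; gpt2-merges.txt lists the BPE merge rules.
with open("gpt2-vocab.json") as f:
    vocab = json.load(f)
with open("gpt2-merges.txt") as f:
    merges = f.readlines()

print("vocabulary size:", len(vocab))   # 50257 for GPT-2
print("merge rules:", len(merges))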

Step 4: Convert training data into memory map format

The memory-map format makes training more efficient, especially with many nodes and GPUs. This step also tokenizes the data using the tokenizer files from Step 3.

python NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=train_data.jsonl \
--json-keys=text \
--tokenizer-library=megatron \
--vocab gpt2-vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt \
--output-prefix=hfbpe_gpt_training_data \
--append-eod \
--workers=32
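
The script writes an indexed binary dataset. Based on the --output-prefix and --json-keys values above, the outputs should be hfbpe_gpt_training_data_text_document.bin (token ids) and hfbpe_gpt_training_data_text_document.idx (document index); this is the prefix referenced later by model.data.data_prefix. A quick check that the files were produced:

import os

# Verify that the memory-mapped dataset files exist and report their sizes.
prefix = "hfbpe_gpt_training_data_text_document"
for ext in (".bin", ".idx"):
    path = prefix + ext
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mb:.1f} MB")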

Train GPT-style model

Once you have prepared the training data and tokenizer, you are ready to train the model.

The following configuration has about 124M parameters and should fit on a single 16GB GPU using float16.
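
As a rough cross-check of that figure, the parameter count can be estimated from the hyperparameters in the command below (a back-of-the-envelope estimate that ignores biases and layer-norm weights):

# Approximate parameter count for the GPT config used below.
vocab_size = 50257      # GPT-2 BPE vocabulary size
hidden_size = 768
ffn_hidden_size = 3072
num_layers = 12
seq_length = 1024

embedding_params = vocab_size * hidden_size + seq_length * hidden_size      # token + position embeddings
per_layer_params = 4 * hidden_size**2 + 2 * hidden_size * ffn_hidden_size   # attention + MLP weights
total = embedding_params + num_layers * per_layer_params

print(f"approximately {total / 1e6:.0f}M parameters")   # ~124M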

python NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py  \
    --config-path=conf/ \
    --config-name=megatron_gpt_config \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.max_epochs=null \
    trainer.max_steps=300000 \
    trainer.val_check_interval=300 \
    trainer.log_every_n_steps=50 \
    trainer.limit_val_batches=50 \
    trainer.limit_test_batches=50 \
    trainer.accumulate_grad_batches=1 \
    trainer.precision=16 \
    model.micro_batch_size=6 \
    model.global_batch_size=192 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.max_position_embeddings=1024 \
    model.encoder_seq_length=1024 \
    model.hidden_size=768 \
    model.ffn_hidden_size=3072 \
    model.num_layers=12 \
    model.num_attention_heads=12 \
    model.init_method_std=0.021 \
    model.hidden_dropout=0.1 \
    model.layernorm_epsilon=1e-5 \
    model.tokenizer.vocab_file=gpt2-vocab.json \
    model.tokenizer.merge_file=gpt2-merges.txt \
    model.data.data_prefix=[1.0,hfbpe_gpt_training_data_text_document] \
    model.data.num_workers=2 \
    model.data.seq_length=1024 \
    model.data.splits_string=\'980,10,10\' \
    model.optim.name=fused_adam \
    model.optim.lr=6e-4 \
    model.optim.betas=[0.9,0.95] \
    model.optim.weight_decay=0.1 \
    model.optim.sched.name=CosineAnnealing \
    model.optim.sched.warmup_steps=750 \
    model.optim.sched.constant_steps=80000 \
    model.optim.sched.min_lr=6e-5 \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=val_loss \
    exp_manager.checkpoint_callback_params.save_top_k=3 \
    exp_manager.checkpoint_callback_params.mode=min \
    exp_manager.checkpoint_callback_params.always_save_nemo=False
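
Note that trainer.accumulate_grad_batches stays at 1: NeMo derives the number of gradient accumulation steps from the micro and global batch sizes. A sketch of the arithmetic for the settings above, assuming the data-parallel size equals devices * num_nodes (tensor and pipeline parallel sizes are both 1 here):

# Gradient accumulation implied by the batch-size settings above.
micro_batch_size = 6
global_batch_size = 192
devices = 1
num_nodes = 1

data_parallel_size = devices * num_nodes
accumulation_steps = global_batch_size // (micro_batch_size * data_parallel_size)
print(accumulation_steps)   # 32 micro-batches accumulated per optimizer step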

To monitor the training, launch TensorBoard from another terminal.

tensorboard --logdir nemo_experiments --bind_all