Megatron-LM for Downstream Tasks
Megatron-LM [NLP-MEGATRON1] is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. Unlike BERT, Megatron-LM swaps the positions of the layer normalization and the residual connection in the model architecture (similar to the GPT-2 architecture), which allows the models to continue to improve as they are scaled up. This model reaches higher scores than BERT on a range of Natural Language Processing (NLP) tasks. More details on efficient, model-parallel, and multi-node pre-training of GPT and BERT using mixed precision can be found in the Megatron-LM GitHub repo.
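The difference in ordering can be illustrated with a small, framework-free sketch. This is not the Megatron-LM implementation: the function names, the unparameterized layer normalization, and the toy sublayer are assumptions made for illustration only. It contrasts normalizing after the residual addition (BERT-style) with normalizing the sublayer input before the residual addition (GPT-2-style).

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """BERT-style ordering: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """GPT-2-style ordering: normalize the sublayer input, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [0.5 * v for v in x]


x = [1.0, 2.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
```

In the second ordering the input passes through the residual path without being normalized, so a sublayer that outputs zeros leaves the input unchanged.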
Fine-tuning
Pre-trained Megatron-LM (BERT) can be used for most of the NLP downstream tasks from NeMo/examples/nlp.
Specify the model.language_model.pretrained_model_name parameter:
model.language_model.pretrained_model_name=megatron-bert-345m-uncased
Available pre-trained Megatron-LM models:
megatron-bert-345m-cased
megatron-bert-345m-uncased
biomegatron-bert-345m-uncased
biomegatron-bert-345m-cased
For example, to fine-tune SQuAD v1.1 with Megatron-LM, run:
python question_answering_squad.py \
model.train_ds.file=<TRAIN_JSON_FILE> \
model.validation_ds.file=<VAL_JSON_FILE> \
model.language_model.pretrained_model_name=megatron-bert-345m-uncased
If you have a different checkpoint or model configuration (pre-trained with the Megatron-LM GitHub repo), use model.language_model.pretrained_model_name=megatron-bert-uncased or model.language_model.pretrained_model_name=megatron-bert-cased and specify --bert_config and --bert_checkpoint for your model.
Note
Megatron-LM has its own set of training arguments (including tokenizer) that are ignored during fine-tuning in NeMo. Use downstream task training scripts for all NeMo supported arguments.
BioMegatron
BioMegatron has the same network architecture as Megatron-LM but is pretrained on a different dataset: PubMed, a large biomedical text corpus. As a result, it achieves better performance on biomedical downstream tasks than the original Megatron-LM.
Examples of using BioMegatron on biomedical downstream tasks, which can be run in Google Colab, are available in NeMo/tutorials/nlp/Relation_Extraction-BioMegatron.ipynb and NeMo/tutorials/nlp/Token_Classification-BioMegatron.ipynb.
Model Parallelism
Megatron-LM is a highly optimized and efficient library for training large language models. With Megatron model parallelism, language models can be trained with billions of weights and then used in NeMo for downstream tasks.
NeMo handles pretrained model parallel checkpoints from Megatron-LM automatically, and model parallel models in NeMo have all the same features as other NeMo models.
Note
Currently, NeMo only supports tensor model parallelism.
Training
All of the necessary logic to train model parallel models in NeMo with PyTorch Lightning is contained in the NLPDDPPlugin. The NLPDDPPlugin subclasses the PyTorch Lightning training type plugin DDPPlugin. See the PyTorch Lightning plugins documentation for more information.
To enable model parallel training in NeMo:
trainer = Trainer(plugins=[NLPDDPPlugin()], **cfg.trainer)
Megatron-LM checkpoints have a specific format. One checkpoint is saved for each model parallel rank:
iter_0080000/
├── mp_rank_00
│ └── model_optim_rng.pt
└── mp_rank_01
  └── model_optim_rng.pt
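Because there is one mp_rank_XX directory per model parallel rank, the number of such directories gives the model parallel size a checkpoint was saved with. The following is a minimal sketch, not part of NeMo or Megatron-LM: the helper name find_rank_checkpoints is hypothetical, and it assumes only the directory layout shown above. The demo builds a throwaway directory with empty files rather than reading a real checkpoint.

```python
import re
import tempfile
from pathlib import Path

RANK_DIR = re.compile(r"^mp_rank_(\d+)$")


def find_rank_checkpoints(checkpoint_dir):
    """Map each model parallel rank to its model_optim_rng.pt path."""
    ranks = {}
    for child in Path(checkpoint_dir).iterdir():
        match = RANK_DIR.match(child.name)
        if match and child.is_dir():
            ranks[int(match.group(1))] = child / "model_optim_rng.pt"
    missing = sorted(r for r, path in ranks.items() if not path.is_file())
    if missing:
        raise FileNotFoundError(f"ranks without a checkpoint file: {missing}")
    if sorted(ranks) != list(range(len(ranks))):
        raise ValueError(f"rank directories are not contiguous: {sorted(ranks)}")
    return dict(sorted(ranks.items()))


# Demo on a temporary directory that mimics the layout (files are empty).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "iter_0080000"
    for rank in range(2):
        rank_dir = root / f"mp_rank_{rank:02d}"
        rank_dir.mkdir(parents=True)
        (rank_dir / "model_optim_rng.pt").touch()
    found = find_rank_checkpoints(root)
    model_parallel_size = len(found)
```

A check like this can catch an incomplete copy of a checkpoint directory before a fine-tuning job is launched.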
To start fine-tuning from a Megatron-LM checkpoint, simply pass the path to the Megatron-LM checkpoint via the language model config:
model.language_model.lm_checkpoint=/raid/megatron/bert/iter_0080000 \
We also need to provide the model configuration. This can be done via JSON:
{
"hidden-size": 1024,
"num-attention-heads": 16,
"num-layers": 24,
"max-seq-length": 512
}
The JSON file is then passed on the command line:
model.language_model.config_file=/raid/data/megatron/bert/config.json \
Or the model configuration can be input via YAML:
model:
  language_model:
    config:
      hidden_size: 1024
      num_attention_heads: 16
      num_layers: 24
      max_position_embeddings: 512
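The JSON and YAML examples describe the same model with different key names. The sketch below converts one form into the other. The key correspondence (in particular max-seq-length to max_position_embeddings) is an assumption read off the two examples above, and the function json_config_to_nested is a hypothetical helper, not a NeMo API.

```python
import json

# Assumed correspondence between the JSON keys and the YAML keys shown above.
KEY_MAP = {
    "hidden-size": "hidden_size",
    "num-attention-heads": "num_attention_heads",
    "num-layers": "num_layers",
    "max-seq-length": "max_position_embeddings",
}


def json_config_to_nested(json_text):
    """Turn a Megatron-style JSON config into the nested YAML-style structure."""
    config = json.loads(json_text)
    unknown = sorted(set(config) - set(KEY_MAP))
    if unknown:
        raise KeyError(f"no known YAML equivalent for: {unknown}")
    renamed = {KEY_MAP[key]: value for key, value in config.items()}
    return {"model": {"language_model": {"config": renamed}}}


example_json = """
{
  "hidden-size": 1024,
  "num-attention-heads": 16,
  "num-layers": 24,
  "max-seq-length": 512
}
"""
nested = json_config_to_nested(example_json)
```

Rejecting unknown keys, instead of passing them through, keeps a misspelled option from being silently ignored.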
Additionally, Megatron-LM requires a vocab file:
model.tokenizer.vocab_file=/path/to/vocab.txt
If you use the Megatron-LM default tokenizer for training BERT, the vocab file can be omitted:
# uncased model
model.tokenizer.tokenizer_name=megatron-bert-uncased
# cased model
model.tokenizer.tokenizer_name=megatron-bert-cased
Auto-Resume
Resuming training with the NeMo experiment manager and PyTorch Lightning (PTL) works exactly the same as for other NeMo models. While training with PTL, model parallel checkpoints are saved and loaded properly:
checkpoints/
├── mp_rank_00
│ ├── mp_autoresume-last.ckpt
│ ├── mp_autoresume---val_loss=0.35-epoch=0.ckpt
│ ├── mp_autoresume---val_loss=0.38-epoch=1.ckpt
│ └── mp_autoresume---val_loss=0.39-epoch=2.ckpt
└── mp_rank_01
  ├── mp_autoresume-last.ckpt
  ├── mp_autoresume---val_loss=0.35-epoch=0.ckpt
  ├── mp_autoresume---val_loss=0.38-epoch=1.ckpt
  └── mp_autoresume---val_loss=0.39-epoch=2.ckpt
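In the listing above, every rank directory holds the same set of checkpoint names. The following sketch parses those names and checks that the ranks agree. It assumes only the filename pattern visible in the listing; the regular expression and helper names are illustrative and are not taken from NeMo or PyTorch Lightning.

```python
import re

# Pattern inferred from names such as "mp_autoresume---val_loss=0.35-epoch=0.ckpt".
CKPT_NAME = re.compile(
    r"^(?P<name>.+?)-+val_loss=(?P<val_loss>\d+\.\d+)-epoch=(?P<epoch>\d+)\.ckpt$"
)


def parse_checkpoint(filename):
    """Return name, val_loss, and epoch, or None for names like '-last.ckpt'."""
    match = CKPT_NAME.match(filename)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "val_loss": float(match.group("val_loss")),
        "epoch": int(match.group("epoch")),
    }


def ranks_are_consistent(listing):
    """Check that every model parallel rank holds the same checkpoint names."""
    file_sets = [set(files) for files in listing.values()]
    return all(files == file_sets[0] for files in file_sets)


files = [
    "mp_autoresume-last.ckpt",
    "mp_autoresume---val_loss=0.35-epoch=0.ckpt",
    "mp_autoresume---val_loss=0.38-epoch=1.ckpt",
    "mp_autoresume---val_loss=0.39-epoch=2.ckpt",
]
listing = {"mp_rank_00": list(files), "mp_rank_01": list(files)}

parsed = [p for p in map(parse_checkpoint, files) if p is not None]
best = min(parsed, key=lambda p: p["val_loss"])
consistent = ranks_are_consistent(listing)
```

A mismatch between ranks would suggest that a save was interrupted, in which case resuming from that checkpoint may not restore every shard.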
Save and Restore
Model parallel .nemo files behave the same as all other .nemo files. Calling .save_to will save a checkpoint for each model parallel rank inside the .nemo file:
text_class_350m
├── megatron-bert-uncased_encoder_config.json
├── megatron_checkpoint_version.json
├── model_config.yaml
├── mp_rank_00
│ └── model_weights.ckpt
├── mp_rank_01
│ └── model_weights.ckpt
├── tokenizer_vocab_dict.json
└── tokenizer.vocab_file
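To inspect such a file without restoring the model, its contents can be listed directly. The sketch below assumes that a .nemo file is a tar archive (possibly gzip-compressed), which is not stated in this section, and the helper model_parallel_ranks is hypothetical. The demo writes a small stand-in archive with empty members instead of opening a real model file.

```python
import io
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def model_parallel_ranks(nemo_path):
    """List the mp_rank_XX directories found inside a .nemo archive."""
    ranks = set()
    # "r:*" lets tarfile detect the compression, if any.
    with tarfile.open(nemo_path, "r:*") as archive:
        for member in archive.getnames():
            for part in PurePosixPath(member).parts:
                if part.startswith("mp_rank_"):
                    ranks.add(part)
    return sorted(ranks)


# Demo: build a stand-in archive that mirrors part of the layout shown above.
members = [
    "model_config.yaml",
    "mp_rank_00/model_weights.ckpt",
    "mp_rank_01/model_weights.ckpt",
    "tokenizer.vocab_file",
]
with tempfile.TemporaryDirectory() as tmp:
    nemo_path = Path(tmp) / "text_class_350m.nemo"
    with tarfile.open(nemo_path, "w:gz") as archive:
        for name in members:
            info = tarfile.TarInfo(name)
            info.size = 0  # empty placeholder member
            archive.addfile(info, io.BytesIO(b""))
    ranks = model_parallel_ranks(nemo_path)
```

The number of rank directories found this way indicates how many GPUs are needed to restore the model.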
When restoring a model parallel .nemo file, we must pass in the Trainer, as model parallelism requires DDP:
model = TokenClassificationModel.restore_from(cfg.pretrained_model, trainer=trainer)
Evaluation
Since model parallel models always require more than one GPU, the Trainer is needed for evaluation:
trainer = pl.Trainer(plugins=[NLPDDPPlugin()], **cfg.trainer)
model = TextClassificationModel.restore_from(cfg.model.nemo_path, trainer=trainer)
model.setup_test_data(test_data_config=cfg.model.test_ds)
trainer.test(model=model, ckpt_path=None)
References
[NLP-MEGATRON1] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.