Megatron-LM for Downstream Tasks

Megatron-LM [NLP-MEGATRON1] is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. Unlike BERT, it swaps the positions of the layer normalization and the residual connections in the model architecture (similar to the GPT-2 architecture), which allows the models to keep improving as they are scaled up. As a result, Megatron-LM reaches higher scores than BERT on a range of Natural Language Processing (NLP) tasks. More details on efficient, model-parallel, multi-node pre-training of GPT and BERT with mixed precision can be found in the Megatron-LM GitHub repo.

Fine-tuning

A pre-trained Megatron-LM (BERT) model can be used for most of the NLP downstream tasks in NeMo/examples/nlp. To do so, specify the model.language_model.pretrained_model_name parameter as follows:

model.language_model.pretrained_model_name=megatron-bert-345m-uncased

Available pre-trained Megatron-LM models:

  • megatron-bert-345m-cased

  • megatron-bert-345m-uncased

  • biomegatron-bert-345m-uncased

  • biomegatron-bert-345m-cased
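Any of the names above can be passed to the other downstream task scripts in NeMo/examples/nlp in the same way. As a minimal sketch, assuming the token classification (NER) example script and its model.dataset.data_dir parameter (both should be verified against the script's --help), a named entity recognition run would look like:

python token_classification_train.py \
       model.dataset.data_dir=<NER_DATA_DIR> \
       model.language_model.pretrained_model_name=megatron-bert-345m-cased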

For example, to fine-tune on SQuAD v1.1 with Megatron-LM, run:

python question_answering_squad.py \
       model.train_ds.file=<TRAIN_JSON_FILE> \
       model.validation_ds.file=<VAL_JSON_FILE> \
       model.language_model.pretrained_model_name=megatron-bert-345m-uncased

If you have a different checkpoint or model configuration (pre-trained with the Megatron-LM GitHub repo), set model.language_model.pretrained_model_name=megatron-bert-uncased or model.language_model.pretrained_model_name=megatron-bert-cased, and specify --bert_config and --bert_checkpoint for your model.
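A sketch of such a run is shown below. The values expected by --bert_config (a path to the model configuration file) and --bert_checkpoint (a path to the pre-trained weights) are assumptions here and should be checked against the script's --help:

python question_answering_squad.py \
       model.train_ds.file=<TRAIN_JSON_FILE> \
       model.validation_ds.file=<VAL_JSON_FILE> \
       model.language_model.pretrained_model_name=megatron-bert-uncased \
       --bert_config=<PATH_TO_MEGATRON_CONFIG> \
       --bert_checkpoint=<PATH_TO_MEGATRON_CHECKPOINT>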

Note

Megatron-LM has its own set of training arguments (including the tokenizer) that are ignored during fine-tuning in NeMo. Please use the downstream task training scripts for all NeMo-supported arguments.

BioMegatron

BioMegatron has the same network architecture as Megatron-LM, but is pre-trained on a different dataset: PubMed, a large biomedical text corpus. As a result, it achieves better performance on biomedical downstream tasks than the original Megatron-LM.

Examples of using BioMegatron on biomedical downstream tasks can be found in NeMo/tutorials/nlp/Relation_Extraction-BioMegatron.ipynb and NeMo/tutorials/nlp/Token_Classification-BioMegatron.ipynb (both can be executed with Google Colab).
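Outside of the notebooks, the BioMegatron checkpoints are selected exactly like the general-domain ones. As a sketch, reusing the SQuAD-style question answering script from above with placeholder biomedical QA files in SQuAD format:

python question_answering_squad.py \
       model.train_ds.file=<BIOMEDICAL_TRAIN_JSON_FILE> \
       model.validation_ds.file=<BIOMEDICAL_VAL_JSON_FILE> \
       model.language_model.pretrained_model_name=biomegatron-bert-345m-cased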

References

NLP-MEGATRON1

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053, 2019.