NeMo Megatron

Megatron-LM [NLP-MEGATRON7] is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. NeMo Megatron currently supports three types of models:

  • GPT-style models (decoder only)

  • T5/BART-style models (encoder-decoder)

  • BERT-style models (encoder only)

Note

We recommend using NeMo Megatron containers for pre-training, tuning, and running inference with large (1B parameters and above) Megatron models.

Megatron-LM is a highly optimized and efficient library for training large language models. With Megatron model parallelism, language models can be trained with billions of weights and then used in NeMo for downstream tasks.

NeMo handles pretrained model parallel checkpoints from Megatron-LM automatically, and model parallel models in NeMo have all the same features as other NeMo models.

Note

Currently, NeMo only supports tensor model parallelism.
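To make the idea of tensor model parallelism concrete, the following toy sketch splits the columns of one weight matrix across two simulated ranks and checks that concatenating the per-rank outputs reproduces the unsharded result. It is plain Python for illustration only, not NeMo or Megatron-LM code, and all function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_ranks):
    """Give each simulated rank an equal, contiguous slice of the weight columns."""
    cols = len(w[0])
    assert cols % num_ranks == 0, "columns must divide evenly across ranks"
    width = cols // num_ranks
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(num_ranks)]


def column_parallel_matmul(x, w, num_ranks):
    """Compute x @ w by letting each rank multiply against its own column shard."""
    partial = [matmul(x, shard) for shard in split_columns(w, num_ranks)]
    # Concatenate the per-rank outputs along the column axis (an all-gather).
    return [sum((p[i] for p in partial), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
assert column_parallel_matmul(x, w, num_ranks=2) == matmul(x, w)
```

Each rank only ever holds its own shard of the weights, which is why one checkpoint per model parallel rank is needed.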

Training

All of the logic needed to train model parallel models in NeMo with PyTorch Lightning is contained in the NLPDDPStrategy, which subclasses the PyTorch Lightning strategy type DDPStrategy. See the PyTorch Lightning strategies documentation for more information.

To enable model parallel training in NeMo:

trainer = Trainer(strategy=NLPDDPStrategy(), **cfg.trainer)

Megatron-LM checkpoints have a specific format. One checkpoint is saved for each model parallel rank:

iter_0080000/
├── mp_rank_00
│   └── model_optim_rng.pt
└── mp_rank_01
    └── model_optim_rng.pt
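
Before launching a job, it can be useful to confirm that a checkpoint directory contains one file per model parallel rank. The following is a minimal sketch, assuming the mp_rank_XX/model_optim_rng.pt layout shown above; the helper name find_rank_checkpoints and its tensor_model_parallel_size argument are hypothetical and not part of NeMo or Megatron-LM.

```python
from pathlib import Path


def find_rank_checkpoints(checkpoint_dir, tensor_model_parallel_size):
    """Return the expected per-rank checkpoint files, raising if any is missing.

    Hypothetical helper: assumes one mp_rank_XX directory per model parallel
    rank, each holding a model_optim_rng.pt file.
    """
    root = Path(checkpoint_dir)
    paths = []
    for rank in range(tensor_model_parallel_size):
        path = root / f"mp_rank_{rank:02d}" / "model_optim_rng.pt"
        if not path.is_file():
            raise FileNotFoundError(f"Missing checkpoint for rank {rank}: {path}")
        paths.append(path)
    return paths
```

For the tree above, calling find_rank_checkpoints("iter_0080000", 2) would return the two model_optim_rng.pt paths in rank order.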

To start fine-tuning from a Megatron-LM checkpoint, simply pass the path to the Megatron-LM checkpoint via the language model config:

model.language_model.lm_checkpoint=/raid/megatron/bert/iter_0080000 \

We also need to provide the model configuration. This can be done with a JSON file:

{
  "hidden-size": 1024,
  "num-attention-heads": 16,
  "num-layers": 24,
  "max-seq-length": 512
}

The path to this file is then passed on the command line:

model.language_model.config_file=/raid/data/megatron/bert/config.json \

Or the model configuration can be input via YAML:

model:
  language_model:
    config:
      hidden_size: 1024
      num_attention_heads: 16
      num_layers: 24
      max_position_embeddings: 512
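
The JSON and YAML forms above describe the same model but use different key names. The sketch below illustrates that correspondence only; it is not how NeMo itself reads the file. The function name is hypothetical, and the rename of max-seq-length to max_position_embeddings is an assumption inferred from the two examples.

```python
import json

# Assumption: the only non-mechanical rename is max-seq-length ->
# max_position_embeddings, inferred from the JSON and YAML examples above.
RENAMED_KEYS = {"max_seq_length": "max_position_embeddings"}


def megatron_json_to_config(json_text):
    """Convert hyphenated Megatron-LM JSON keys to underscore-style config keys."""
    raw = json.loads(json_text)
    config = {}
    for key, value in raw.items():
        key = key.replace("-", "_")
        config[RENAMED_KEYS.get(key, key)] = value
    return config


example = (
    '{"hidden-size": 1024, "num-attention-heads": 16, '
    '"num-layers": 24, "max-seq-length": 512}'
)
config = megatron_json_to_config(example)
```

Here config holds hidden_size, num_attention_heads, num_layers and max_position_embeddings with the same values as the YAML example.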

Additionally, Megatron-LM requires a vocab file:

model.tokenizer.vocab_file=/path/to/vocab.txt
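
The dotted command-line overrides used in this section map onto the nested structure of the model configuration. NeMo example scripts are typically configured with Hydra, which resolves such overrides itself; the sketch below only illustrates the mapping in plain Python, keeps every value as a string, and uses a hypothetical function name.

```python
def apply_overrides(overrides):
    """Build a nested dict from dotted key=value strings (values stay strings)."""
    config = {}
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Descend into the nested section, creating it on first use.
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


overrides = [
    "model.language_model.lm_checkpoint=/raid/megatron/bert/iter_0080000",
    "model.language_model.config_file=/raid/data/megatron/bert/config.json",
    "model.tokenizer.vocab_file=/path/to/vocab.txt",
]
config = apply_overrides(overrides)
```

After this call, config["model"]["language_model"] contains both lm_checkpoint and config_file, and config["model"]["tokenizer"] contains vocab_file.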

If you use the Megatron-LM default tokenizer for training BERT, the vocab file can be omitted:

# uncased model
model.tokenizer.tokenizer_name=megatron-bert-uncased

# cased model
model.tokenizer.tokenizer_name=megatron-bert-cased

Auto-Resume

Resuming training with the NeMo experiment manager and PyTorch Lightning (PTL) works exactly the same as for other NeMo models. While training with PTL, model parallel checkpoints are saved and loaded properly, with one set of checkpoints per model parallel rank:

checkpoints/
├── mp_rank_00
│   ├── mp_autoresume-last.ckpt
│   ├── mp_autoresume---val_loss=0.35-epoch=0.ckpt
│   ├── mp_autoresume---val_loss=0.38-epoch=1.ckpt
│   └── mp_autoresume---val_loss=0.39-epoch=2.ckpt
└── mp_rank_01
    ├── mp_autoresume-last.ckpt
    ├── mp_autoresume---val_loss=0.35-epoch=0.ckpt
    ├── mp_autoresume---val_loss=0.38-epoch=1.ckpt
    └── mp_autoresume---val_loss=0.39-epoch=2.ckpt
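
The checkpoint file names above encode the validation loss and epoch, so the best checkpoint in a rank directory can be found by parsing them. The following is a minimal sketch, assuming the naming pattern shown in the tree; the function name and regular expression are hypothetical and are not part of the NeMo experiment manager.

```python
import re

# Matches names such as "mp_autoresume---val_loss=0.35-epoch=0.ckpt".
CKPT_PATTERN = re.compile(
    r"val_loss=(?P<loss>\d+(?:\.\d+)?)-epoch=(?P<epoch>\d+)\.ckpt$"
)


def best_checkpoint(filenames):
    """Return the file name with the lowest val_loss, or None if none match."""
    scored = []
    for name in filenames:
        match = CKPT_PATTERN.search(name)
        if match:  # "-last.ckpt" carries no val_loss and is skipped
            scored.append((float(match.group("loss")), name))
    return min(scored)[1] if scored else None


rank_files = [
    "mp_autoresume-last.ckpt",
    "mp_autoresume---val_loss=0.35-epoch=0.ckpt",
    "mp_autoresume---val_loss=0.38-epoch=1.ckpt",
    "mp_autoresume---val_loss=0.39-epoch=2.ckpt",
]
best = best_checkpoint(rank_files)
```

For the files listed above, best is the epoch 0 checkpoint, which has the lowest validation loss (0.35).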

Save and Restore

Model parallel .nemo files behave the same as all other .nemo files. Calling .save_to will save a checkpoint for each model parallel rank inside the .nemo file:

text_class_350m
├── megatron-bert-uncased_encoder_config.json
├── megatron_checkpoint_version.json
├── model_config.yaml
├── mp_rank_00
│   └── model_weights.ckpt
├── mp_rank_01
│   └── model_weights.ckpt
├── tokenizer_vocab_dict.json
└── tokenizer.vocab_file
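
To check which ranks a saved model contains without restoring it, the archive members can be listed directly. The sketch below assumes the .nemo file is a tar archive (optionally compressed) with the mp_rank_XX layout shown above; the function name is hypothetical and the code uses only the Python standard library, not the NeMo API.

```python
import tarfile
from collections import defaultdict


def list_rank_files(nemo_path):
    """Group archive members by their mp_rank_XX directory.

    Assumes the .nemo file is a tar archive; "r:*" lets tarfile detect
    the compression automatically.
    """
    by_rank = defaultdict(list)
    with tarfile.open(nemo_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            parts = member.name.split("/")
            # Look for an mp_rank_XX directory among the parent components.
            for i, part in enumerate(parts[:-1]):
                if part.startswith("mp_rank_"):
                    by_rank[part].append("/".join(parts[i + 1:]))
                    break
    return dict(by_rank)
```

For an archive laid out like the tree above, the result would map mp_rank_00 and mp_rank_01 each to a list containing model_weights.ckpt.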

When restoring a model parallel .nemo file, we must pass in the Trainer because model parallelism requires DDP:

model = TokenClassificationModel.restore_from(cfg.pretrained_model, trainer=trainer)

Evaluation

Since model parallel models always require more than one GPU, the Trainer is needed for evaluation:

trainer = pl.Trainer(strategy=NLPDDPStrategy(), **cfg.trainer)
model = TextClassificationModel.restore_from(cfg.model.nemo_path, trainer=trainer)
model.setup_test_data(test_data_config=cfg.model.test_ds)
trainer.test(model=model, ckpt_path=None)

BioMegatron

BioMegatron has the same network architecture as Megatron-LM but is pretrained on a different dataset: PubMed, a large biomedical text corpus. As a result, it achieves better performance on biomedical downstream tasks than the original Megatron-LM.

Examples of using BioMegatron on biomedical downstream tasks, which can be run in Google Colab, are available in NeMo/tutorials/nlp/Relation_Extraction-BioMegatron.ipynb and NeMo/tutorials/nlp/Token_Classification-BioMegatron.ipynb.

[NLP-MEGATRON1]

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. 2023. arXiv:2306.15595.

[NLP-MEGATRON2]

Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, and Alexander I. Rudnicky. KERPLE: kernelized relative positional embedding for length extrapolation. 2022. arXiv:2205.09921.

[NLP-MEGATRON3]

Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, and Peter J. Ramadge. Dissecting transformer length extrapolation via the lens of receptive field analysis. 2023. arXiv:2212.10356.

[NLP-MEGATRON4]

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: fast and memory-efficient exact attention with IO-awareness. 2022. arXiv:2205.14135.

[NLP-MEGATRON5]

Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: attention with linear biases enables input length extrapolation. 2022. arXiv:2108.12409.

[NLP-MEGATRON6]

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. 2018. arXiv:1803.02155.

[NLP-MEGATRON7]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: training multi-billion parameter language models using model parallelism. 2019. arXiv:1909.08053.

[NLP-MEGATRON8]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: enhanced transformer with rotary position embedding. 2022. arXiv:2104.09864.

[NLP-MEGATRON9]

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. 2022. arXiv:2212.10554.

[NLP-MEGATRON10]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2023. arXiv:1706.03762.

© Copyright 2023-2024, NVIDIA. Last updated on Apr 22, 2024.