BERT#
BERT is an autoencoding language model with a final loss composed of:
masked language model loss
next sentence prediction loss
The model architecture is published in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [NLP-BERT1]. The model was originally trained on English Wikipedia and BookCorpus. BERT is often used as a language model encoder for downstream tasks, for example, Token Classification with Named Entity Recognition (NER), Text Classification, and Question Answering. Domain-specific BERT models can be advantageous for a wide range of applications. A notable example is domain-specific BERT in a biomedical setting, e.g. BioBERT [NLP-BERT2] or its improved derivative BioMegatron [NLP-BERT3]. For the latter, refer to NeMo Megatron.
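As a rough illustration of how these two terms are combined during pretraining, the sketch below adds a cross-entropy loss over the masked positions to a binary next-sentence classification loss. This is a minimal, framework-level sketch, not NeMo's internal implementation; the tensor names, shapes, and the -100 ignore index are assumptions of the example.
import torch.nn.functional as F

# Illustrative shapes (assumptions for this sketch, not NeMo's API):
#   mlm_logits: [batch, seq_len, vocab_size]  - token predictions at every position
#   mlm_labels: [batch, seq_len]              - original token ids at masked positions, -100 elsewhere
#   nsp_logits: [batch, 2]                    - "sentence B follows sentence A" classification
#   nsp_labels: [batch]                       - 1 if B follows A, else 0
def bert_pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels, only_mlm_loss=False):
    # Masked language model loss: cross-entropy over the vocabulary at masked positions only
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1), ignore_index=-100
    )
    if only_mlm_loss:  # mirrors the model.only_mlm_loss option described later on this page
        return mlm_loss
    # Next sentence prediction loss: binary classification over the sentence pair
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss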
Quick Start Guide#
from nemo.collections.nlp.models import BERTLMModel
# List the available pre-trained models
BERTLMModel.list_available_models()
# Download and load the pre-trained BERT-base uncased model
model = BERTLMModel.from_pretrained("bertbaseuncased")
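The loaded model behaves like any other NeMo model. For example, it can be serialized to a single .nemo file and restored later; this is a minimal sketch using the standard NeMo save_to/restore_from helpers, with a placeholder file name.
# Serialize the weights and config to a single .nemo file (placeholder name)
model.save_to("bertbaseuncased.nemo")
# Restore the model from that file later
restored_model = BERTLMModel.restore_from("bertbaseuncased.nemo")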
Available Models#
Model | Pretrained Checkpoint
---|---
BERT-base uncased | https://ngc.nvidia.com/catalog/models/nvidia:nemo:bertbaseuncased
BERT-large uncased | https://ngc.nvidia.com/catalog/models/nvidia:nemo:bertlargeuncased
Data Input for the BERT model#
Data preprocessing can be done either on the fly during training or offline before training. The latter is optimized and recommended for large text corpora; it was also used in the original paper to train the model on Wikipedia and BookCorpus. For on-the-fly data processing, provide text files with sentences for training and validation, where words are separated by spaces, i.e.: [WORD] [SPACE] [WORD] [SPACE] [WORD].
To use this pipeline in training, use the dedicated configuration file NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml.
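For illustration, the following sketch writes a tiny training file in this format; the file name and sentences are placeholders for this example, not part of the NeMo pipeline.
# Illustrative only: create a small space-separated text file for on-the-fly preprocessing.
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "language models learn from large text corpora",
]
with open("train.txt", "w") as f:
    for sentence in sentences:
        f.write(sentence + "\n")  # one sentence per line, words separated by single spaces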
To process data offline in advance, refer to the BERT Quick Start Guide. To recreate the original Wikipedia and BookCorpus datasets, follow steps 1-5 in the Quick Start Guide and run the script ./data/create_datasets_from_start.sh inside the Docker container.
The downloaded folder should include two subfolders, lower_case_[0,1]_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5 and lower_case_[0,1]_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5, containing sequences of length 128 with a maximum of 20 masked tokens and sequences of length 512 with a maximum of 80 masked tokens, respectively. To use this pipeline in training, use the dedicated configuration file NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_preprocessed_config.yaml and specify the path to the created HDF5 files.
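Before training, it can be useful to confirm that both folders and their HDF5 shards are present. The sketch below simply lists them and prints the keys of one file; it assumes h5py is installed and that the shards use the .hdf5 extension, and the data path is a placeholder.
import glob
import h5py  # assumption: h5py is available for inspecting the generated files

# Placeholder path to the preprocessed data directory
data_dir = "<PATH_TO_PREPROCESSED_DATA>"
for folder in sorted(glob.glob(f"{data_dir}/lower_case_*_seq_len_*")):
    shards = sorted(glob.glob(f"{folder}/*.hdf5"))
    print(folder, f"({len(shards)} shards)")
    if shards:
        with h5py.File(shards[0], "r") as f:
            print("  keys:", list(f.keys()))  # e.g. token ids, masks, MLM/NSP labels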
Training the BERT model#
Example of model configuration for on-the-fly data preprocessing: NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml. Example of model configuration for offline data preprocessing: NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_preprocessed_config.yaml.
The specification can be grouped into three categories:
Parameters that describe the training process: trainer
Parameters that describe the datasets: model.train_ds, model.validation_ds
Parameters that describe the model: model, model.tokenizer, model.language_model
More details about parameters in the config file can be found below:
Parameter | Data Type | Description
---|---|---
model.only_mlm_loss | bool | Use only the masked language model loss, without next sentence prediction.
train_ds.data_file | string | Name of the text file or HDF5 data directory.
train_ds.num_samples | integer | Number of samples to use from the training dataset.
More details about parameters for offline data preprocessing can be found below:
Parameter | Data Type | Description
---|---|---
train_ds.max_predictions_per_seq | integer | Maximum number of masked tokens in a sequence in the preprocessed data.
More details about parameters for online data preprocessing can be found below:
Parameter | Data Type | Description
---|---|---
model.max_seq_length | integer | The maximum total input sequence length after tokenization.
model.mask_prob | float | Probability of masking a token in the input text during data processing.
model.short_seq_prob | float | Probability of having a sequence shorter than the maximum sequence length.
Note
For offline data preprocessing, model.tokenizer is null; for downstream tasks, use the same tokenizer that was used for offline preprocessing. For online data preprocessing, model.tokenizer needs to be specified. See also Model NLP for details.
Example of the command for training the model:
python bert_pretraining.py \
model.train_ds.data_file=<PATH_TO_DATA> \
trainer.max_epochs=<NUM_EPOCHS> \
trainer.devices=[<CHANGE_TO_GPU(s)_YOU_WANT_TO_USE>] \
trainer.accelerator='gpu'
Fine-tuning on Downstream Tasks#
To use a trained BERT model checkpoint on a NeMo NLP downstream task, e.g. Question Answering, specify model.language_model.lm_checkpoint=<PATH_TO_CHECKPOINT>.
References#
[NLP-BERT1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. 2018. arXiv:1810.04805.
[NLP-BERT2] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. 2019. arXiv:1901.08746.
[NLP-BERT3] Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi, and Raghav Mani. BioMegatron: larger biomedical domain language model. 2020. arXiv:2010.06060.