.. _bert_pretraining:

BERT
=====================================================

BERT is an autoencoding language model with a final loss composed of:

1. masked language model loss
2. next sentence prediction loss

The model architecture is published in `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ :cite:`nlp-bert-devlin2018bert`.
The model was originally trained on English Wikipedia and BookCorpus. BERT is often used as a language model encoder for downstream tasks, e.g. :ref:`token_classification`, :ref:`text_classification`, :ref:`question_answering`, etc.
Domain-specific BERT models can be advantageous for a wide range of applications. One notable example is domain-specific BERT in a biomedical setting, e.g. BioBERT :cite:`nlp-bert-lee2019biobert` or its improved derivative BioMegatron :cite:`nlp-bert-shin2020biomegatron`. For the latter, refer to :ref:`megatron_finetuning`.

Quick Start
---------------

.. code-block:: python

    from nemo.collections.nlp.models import BERTLMModel

    # to get the list of pre-trained models
    BERTLMModel.list_available_models()

    # Download and load the pre-trained BERT-based model
    model = BERTLMModel.from_pretrained("bertbaseuncased")

Available Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: *Pretrained Models*
   :widths: 5 10
   :header-rows: 1

   * - Model
     - Pretrained Checkpoint
   * - bertbaseuncased
     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:bertbaseuncased
   * - bertlargeuncased
     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:bertlargeuncased

.. _dataset_bert_pretraining:

Data Input for the BERT Model
-----------------------------------------

Data preprocessing can be done either on the fly during training or offline before training. Offline preprocessing is optimized and recommended for large text corpora; it was also used in the original paper to train the model on Wikipedia and BookCorpus.

For on-the-fly data processing, provide text files with sentences for training and validation, where words are separated by spaces, i.e.: [WORD] [SPACE] [WORD] [SPACE] [WORD]. To use this pipeline in training, use the dedicated configuration file ``NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml``.

To process data offline in advance, follow the instructions at https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#quick-start-guide.
To recreate the original Wikipedia and BookCorpus datasets, follow steps 1-5 in the Quick Start Guide and run the script ``./data/create_datasets_from_start.sh`` inside the Docker container.
The downloaded folder should include two subfolders, ``lower_case_[0,1]_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5`` and ``lower_case_[0,1]_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5``, containing sequences of length 128 with a maximum of 20 masked tokens and sequences of length 512 with a maximum of 80 masked tokens, respectively.
To use this pipeline in training, use the dedicated configuration file ``NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_preprocessed_config.yaml`` and specify the path to the created HDF5 files.

Training the BERT Model
-----------------------------------

Example of model configuration for on-the-fly data preprocessing: ``NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml``.
Example of model configuration for offline data preprocessing: ``NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_preprocessed_config.yaml``.
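The example script ``NeMo/examples/nlp/language_modeling/bert_pretraining.py`` essentially loads one of these configurations, builds a PyTorch Lightning trainer from the ``trainer`` section, and fits a ``BERTLMModel`` constructed from the ``model`` section. The snippet below is a minimal sketch of that workflow, assuming a NeMo source checkout and the standard OmegaConf / PyTorch Lightning APIs; the data file paths are placeholders and the exact trainer arguments depend on your configuration.

.. code-block:: python

    import pytorch_lightning as pl
    from omegaconf import OmegaConf

    from nemo.collections.nlp.models import BERTLMModel

    # Load the example configuration for on-the-fly text preprocessing
    # (path relative to a NeMo source checkout).
    cfg = OmegaConf.load(
        "examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml"
    )

    # Point the dataset sections at your own text files (placeholder paths).
    cfg.model.train_ds.data_file = "train.txt"
    cfg.model.validation_ds.data_file = "valid.txt"

    # Build the trainer from the ``trainer`` section of the config and train.
    trainer = pl.Trainer(**cfg.trainer)
    model = BERTLMModel(cfg.model, trainer=trainer)
    trainer.fit(model)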
The specification can be roughly grouped into three categories:

* Parameters that describe the training process: **trainer**
* Parameters that describe the datasets: **model.train_ds**, **model.validation_ds**
* Parameters that describe the model: **model**, **model.tokenizer**, **model.language_model**

More details about parameters in the config file:

.. list-table::
   :widths: 10 5 30
   :header-rows: 1

   * - **Parameter**
     - **Data Type**
     - **Description**
   * - model.only_mlm_loss
     - bool
     - Use only the masked language model loss, without next sentence prediction
   * - train_ds.data_file
     - string
     - Name of the text file or HDF5 data directory
   * - train_ds.num_samples
     - integer
     - Number of samples to use from the training dataset; ``-1`` means all

More details about parameters for offline data preprocessing:

.. list-table::
   :widths: 10 5 30
   :header-rows: 1

   * - **Parameter**
     - **Data Type**
     - **Description**
   * - train_ds.max_predictions_per_seq
     - integer
     - Maximum number of masked tokens in a sequence in the preprocessed data

More details about parameters for online data preprocessing:

.. list-table::
   :widths: 10 5 30
   :header-rows: 1

   * - **Parameter**
     - **Data Type**
     - **Description**
   * - model.max_seq_length
     - integer
     - The maximum total input sequence length after tokenization
   * - model.mask_prob
     - float
     - Probability of masking a token in the input text during data processing
   * - model.short_seq_prob
     - float
     - Probability of using a sequence shorter than the maximum sequence length
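For illustration, the parameters above can also be adjusted programmatically on a loaded configuration before the model is constructed. This is a minimal sketch building on the OmegaConf-loaded configuration from the previous example; the values shown are illustrative, not tuned recommendations, and equivalent command-line overrides are shown in the training command below.

.. code-block:: python

    from omegaconf import OmegaConf

    cfg = OmegaConf.load(
        "examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml"
    )

    # Online preprocessing: cap sequences at 128 tokens, mask 15% of them,
    # and occasionally emit sequences shorter than the maximum length.
    cfg.model.max_seq_length = 128
    cfg.model.mask_prob = 0.15
    cfg.model.short_seq_prob = 0.1

    # Use only the masked language model loss and the full training dataset.
    cfg.model.only_mlm_loss = True
    cfg.model.train_ds.num_samples = -1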
.. note::

    For offline data preprocessing, **model.tokenizer** is null; for downstream tasks, use the same tokenizer that was used for the offline preprocessing. For online data preprocessing, **model.tokenizer** needs to be specified. See also :ref:`nlp_model` for details.

Example of the command for training the model:

.. code::

    python bert_pretraining.py \
        model.train_ds.data_file=<PATH_TO_DATA_FILE> \
        trainer.max_epochs=<NUM_EPOCHS> \
        trainer.gpus=[<GPU_IDS>]

Fine-tuning on Downstream Tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use a trained BERT model checkpoint on a NeMo NLP downstream task, e.g. :ref:`question_answering`, specify :code:`model.language_model.lm_checkpoint=<PATH_TO_CHECKPOINT>`.

References
----------

.. bibliography:: nlp_all.bib
    :style: plain
    :labelprefix: NLP-BERT
    :keyprefix: nlp-bert-