.. _nlp_model:

Model NLP
=========

The config file of NLP models contain three main sections:

    - trainer: Trainer section contains the configs for PTL training and you may find more info at :doc:`../../introduction/core.html#model-training` and `PTL Trainer class API <https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#trainer-class-api>'.
    - exp_manager: the configs of experiment manager. You can find more info at :doc:`../../introduction/core.html#experiment-manager`
    - model: contains the configs of the datasets, model architecture, tokenizer, optimizer, scheduler, etc.

The following sub-sections of the model section are shared among most of the NLP models.
    - tokenizer: specifies the tokenizer
    - language_model: specifies the underlying model to be used as the encoder
    - optim: the configs of the optimizer and scheduler :doc:`../../introduction/core.html`

The 'tokenizer' and 'language_model' sections have the following parameters:

+-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| **Parameter**                             | **Data Type**   |  **Description**                                                                                             |
+-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| model.tokenizer.tokenizer_name            | string          | Tokenizer name, will be filled automatically based on model.language_model.pretrained_model_name             |
+-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| model.tokenizer.vocab_file                | string          | Path to tokenizer vocabulary                                                                                 |
+-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| model.tokenizer.tokenizer_model           | string          | Path to tokenizer model (only for sentencepiece tokenizer)                                                   |
+-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| model.language_model.pretrained_model_name| string          | Pre-trained language model name, for example: `bert-base-cased` or `bert-base-uncased`                       |
+-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| model.language_model.lm_checkpoint        | string          | Path to the pre-trained language model checkpoint                                                            |
+-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| model.language_model.config_file          | string          | Path to the pre-trained language model config file                                                           |
+-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| model.language_model.config               | dictionary      | Config of the pre-trained language model                                                                     |
+-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+

The parameter `model.language_model.pretrained_model_name` can be one of the following:
    - `megatron-bert-345m-uncased`, `megatron-bert-345m-cased`, `biomegatron-bert-345m-uncased`, `biomegatron-bert-345m-cased`, `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
    - `distilbert-base-uncased`, `distilbert-base-cased`,
    - `roberta-base`, `roberta-large`, `distilroberta-base`
    - `albert-base-v1`, `albert-large-v1`, `albert-xlarge-v1`, `albert-xxlarge-v1`, `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`