BERT#
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained deep learning model designed for natural language processing tasks, developed by Google in 2018. Unlike traditional models that read text sequentially (left-to-right or right-to-left), BERT reads text in both directions simultaneously, capturing context from both sides of a word. This bidirectional approach allows BERT to better understand nuances and meanings in language, improving performance on tasks like question answering, sentiment analysis, and language inference.
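For example, predicting a masked word requires context from both its left and right. The short sketch below is not a NeMo API; it uses the Hugging Face transformers fill-mask pipeline with the public bert-base-uncased checkpoint purely to illustrate bidirectional masked-token prediction:

from transformers import pipeline

# BERT predicts the [MASK] token using context on both sides of it.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))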
NeMo 2.0 Pretraining Recipes#
We provide recipes for pretraining BERT models of two sizes, Base (110M) and Large (340M), using NeMo 2.0 and NeMo-Run.
These recipes configure a run.Partial for one of the nemo.collections.llm API functions introduced in NeMo 2.0. The recipes are hosted in the bert_110m and bert_340m modules.
Note

The pretraining recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your own custom dataset. We provide an example below on how to invoke the default recipe and override the data argument:
from nemo.collections import llm

pretrain = llm.bert_110m.pretrain_recipe(
    name="bert_base_pretrain",
    dir="/path/to/checkpoints",
    num_nodes=2,
    num_gpus_per_node=8,
    bert_type="megatron",
)

# To override the data argument:
# dataloader = a_function_that_configures_your_custom_dataset(
#     global_batch_size=global_batch_size,
#     micro_batch_size=micro_batch_size,
#     seq_length=pretrain.model.config.seq_length,
# )
# pretrain.data = dataloader
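As a rough sketch of what such a function might look like, the example below wraps a hypothetical MyBertDataModule (a placeholder for whatever data module class wraps your dataset) in a NeMo-Run run.Config so it can be assigned to pretrain.data. The class name and its constructor arguments are assumptions; replace them with your own data module and parameters.

import nemo_run as run

def a_function_that_configures_your_custom_dataset(global_batch_size, micro_batch_size, seq_length):
    # MyBertDataModule is a placeholder for your own data module class.
    # Wrapping it in run.Config lets NeMo-Run serialize the configuration
    # instead of instantiating the data module immediately.
    return run.Config(
        MyBertDataModule,
        data_path="/path/to/your/tokenized/data",  # hypothetical argument
        seq_length=seq_length,
        global_batch_size=global_batch_size,
        micro_batch_size=micro_batch_size,
    )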
Note

The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.

bert_type can be either huggingface or megatron. huggingface refers to the model architecture on huggingface/google-bert, while megatron refers to the model architecture that strictly follows Megatron-LM. The major difference between the two is that Megatron uses Pre-LayerNorm, while Hugging Face uses Post-LayerNorm after the MLP and attention modules.
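To illustrate that difference (this is a toy sketch, not NeMo's actual implementation), the snippet below shows how one residual sublayer is composed in each variant; a linear layer stands in for the attention or MLP block:

import torch
import torch.nn as nn

def post_layernorm_block(x, sublayer, norm):
    # Hugging Face BERT (Post-LayerNorm): normalize after the residual addition.
    return norm(x + sublayer(x))

def pre_layernorm_block(x, sublayer, norm):
    # Megatron-LM BERT (Pre-LayerNorm): normalize the sublayer input,
    # then add the residual connection.
    return x + sublayer(norm(x))

# Toy usage with a linear layer standing in for the attention/MLP sublayer.
hidden = 16
x = torch.randn(2, 4, hidden)
sublayer = nn.Linear(hidden, hidden)
norm = nn.LayerNorm(hidden)
y_post = post_layernorm_block(x, sublayer, norm)
y_pre = pre_layernorm_block(x, sublayer, norm)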
Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which just runs the pretraining locally in a separate process. You can use it as follows:
import nemo_run as run
run.run(pretrain, executor=run.LocalExecutor())
Alternatively, you can run it directly in the same Python process as follows:
run.run(pretrain, direct=True)
A comprehensive list of pretraining recipes that we currently support or plan to support soon is provided below for reference:
Recipe | Status
---|---
Hugging Face BERT-Base (110M) | Yes
Hugging Face BERT-Large (340M) | Yes
Megatron BERT-Base (110M) | Yes
Megatron BERT-Large (340M) | Yes