ESM-2nv: Data Preprocessing and Model Training Using BioNeMo#

NOTE: This notebook has been tested on both an A1000 GPU and an A100, and is compatible with BioNeMo Framework v1.6, v1.7 and v1.8. The expected runtime is less than 1 hour on the A1000 and ~3 minutes on the A100.

Demo Objectives#

The purpose of this tutorial is to provide an example use case for training a BioNeMo large language model using the BioNeMo framework. In this tutorial, you will gain experience in:

  1. Preprocessing the UniRef50 and UniRef90 data for ESM-2nv.

  2. Pretraining and continuing training from a checkpoint for ESM-2nv.

  3. Performing inference with ESM-2nv.

Overview - ESM-2nv Model#

ESM-2nv is based on the public ESM-2 model, a BERT architecture trained on millions of protein sequences from the UniProt database. ESM-2nv learns the patterns and dependencies between amino acids that ultimately give rise to a protein's structure and properties. These include secondary-structure elements such as alpha helices and beta sheets, as well as cellular location, thermostability, solubility, and other protein properties. For more information, check the ESM-2nv model card.

This ESM-2nv model training example walkthrough will show how to utilize compute resources, download and preprocess datasets, and perform model training on single and multiple nodes.

The model was trained on UniRef50 and UniRef90 protein sequences, truncated to a maximum length of 1022 amino acids.

Setup#

Ensure that you have read through the Getting Started section, can run the BioNeMo Framework Docker container, and have configured the NGC Command Line Interface (CLI) within the container. It is assumed that this notebook is being executed from within the container. Additionally, this tutorial depends on the ESM-2nv model.

NOTE: Some of the cells below generate long text output. We use
%%capture --no-display --no-stderr cell_output
to suppress this output. Comment out or delete this line in those cells to restore full output.

Import and install all required packages#

import os
import warnings

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

Home Directory#

bionemo_home = "/workspace/bionemo"
os.environ['BIONEMO_HOME'] = bionemo_home
os.chdir(bionemo_home)

Download Model Checkpoints#

The following code will download the pretrained model esm2nv_650M_converted.nemo from the NGC registry.

In BioNeMo FW, there are numerous ESM models available, including ESM-1nv, ESM-2nv 8M with randomly initialized weights, ESM-2nv fine-tuned for secondary structure downstream prediction tasks with LoRA, ESM-2nv 650M, and ESM-2nv 3B. We also have a configuration file for training ESM-2nv 15B available at examples/protein/esm2nv/conf/pretrain_esm2_15B.yaml if needed.

For demo purposes, we have chosen to showcase the ESM-2nv 650M model. For more details on the ESM-1nv or ESM-2nv, consult the corresponding model cards. To find the model names and checkpoint names, please see the artifacts_paths.yaml file.

# Define the NGC CLI API KEY and ORG for the model download
# If these variables are not already set in the container, uncomment below
# to define and set with your API KEY and ORG
#api_key = <your_api_key>
#ngc_cli_org = <ngc_cli_org>
# Update the environment variable
#os.environ['NGC_CLI_API_KEY'] = api_key
#os.environ['NGC_CLI_ORG'] = ngc_cli_org

# Set variables and paths for model and checkpoint
model_name = "esm2nv" # change to esm1nv for ESM1
model_version = "esm2nv_650m" # change to esm1nv for ESM1
actual_checkpoint_name = "esm2nv_650M_converted.nemo" #  change to esm1nv.nemo for ESM1
model_path = os.path.join(bionemo_home, 'models')
checkpoint_path = os.path.join(model_path, actual_checkpoint_name)
os.environ['MODEL_PATH'] = model_path
%%capture --no-display --no-stderr cell_output
if not os.path.exists(checkpoint_path):
    !cd /workspace/bionemo && \
    python download_artifacts.py --model_dir models --models {model_version}
else:
    print(f"Model {model_version} already exists at {model_path}.")

Preprocess Data for ESM-2nv#

To briefly showcase the model training capabilities of the BioNeMo Framework, we will use the UniRef50 and UniRef90 datasets to provide a diverse yet non-redundant set of protein sequences. By using both, the model can learn from a wide range of sequence variants while avoiding redundancy. This helps in capturing diverse features and patterns that are relevant for protein function and structure prediction, while also preventing overfitting and improving generalization. For demo purposes, a portion of the sample datasets is located in ${bionemo_home}/examples/tests/test_data/uniref202104_esm2_qc.

The data is stored in a zip file, so run the following command to extract the raw FASTA files and a cluster mapping file:

%%capture --no-display --no-stderr cell_output
# Define the path to the extracted directory
datapath_dir = os.path.join(bionemo_home, 'examples/tests/test_data/protein/uniref50_90')

# Define the path to the zip file
zip_file = f"{datapath_dir}.zip"

# Check if the directory already exists
if not os.path.exists(datapath_dir):
    ! unzip {zip_file} -d {bionemo_home}/examples/tests/test_data/

The cluster mapping file (mapping.tsv) is used to associate protein sequences with their respective clusters. This helps to reduce redundancy, organize the data, and evaluate model performance by tracking sequence similarity and ensuring diverse training data.

train_uf50_fasta = os.path.join(datapath_dir, 'uniref50_train_filt.fasta')
train_uf90_fasta = os.path.join(datapath_dir, 'ur90_ur50_sampler.fasta')
train_cluster_mapping_tsv = os.path.join(datapath_dir, 'mapping.tsv')

From the unzipped contents, the preprocessing step first creates the /train, /val, and /test folders, organizing the protein sequences into batched CSV files. It is important to use both datasets if you want to train ESM-2 as it was originally trained. However, if you bring your own data, as demonstrated in this notebook, you can opt for a single data source.

The same approach applies to the clustering mapping file. The ESM2Preprocess class can handle clustering indirectly as part of the dataset preparation process. It leverages UniRef50 to UniRef90 clustering mappings to organize protein sequences, ensuring that data is appropriately clustered for training and validation.

Please note that this script does not perform clustering itself but relies on pre-defined clustering mappings provided in a TSV file format to organize protein sequences. The expected format is a TSV file where the first column represents the cluster ID (FASTA header in UniRef50) and the second column lists the members separated by commas. The members correspond to entries in the UniRef90 FASTA file.
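
For illustration, here is a minimal sketch of how such a mapping file can be loaded in Python. It assumes two tab-separated columns with no header row, per the format described above; adjust the parsing if your file differs.

import csv

# Load the UniRef50 -> UniRef90 cluster mapping (cluster ID, comma-separated members)
# Assumption: no header row; skip or adapt if your mapping.tsv contains one
cluster_members = {}
with open(train_cluster_mapping_tsv) as handle:
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        cluster_members[row[0]] = row[1].split(",")

print(f"Loaded {len(cluster_members)} clusters")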

The preprocessing steps are:

  1. Download the dataset from a specified URL or NGC registry.

  2. Extract and decompress the downloaded data if necessary.

  3. Index the FASTA file using pyfastx to facilitate data access.

  4. Split the dataset into training, validation, and test sets.

  5. Convert the FASTA sequences into CSV format, dividing them into multiple files if needed.

  6. Generate additional files like memmaps or sorted FASTA files if required for specific use cases.

For more details about the preprocessing steps, please consult the .../bionemo/data/preprocess/protein/preprocess.py file and the documentation found here.
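
As a quick, optional illustration of step 3 above, the sketch below uses pyfastx to index the UniRef50 FASTA file defined earlier and print a couple of basic statistics. This is only a sanity check; the preprocessing invoked by pretrain.py handles indexing for you.

import pyfastx

# Build (or load) a pyfastx index for the UniRef50 training FASTA file
fasta = pyfastx.Fasta(train_uf50_fasta)
print(f"Number of sequences: {len(fasta)}")

# Inspect the first record: header and sequence length
first_record = fasta[0]
print(first_record.name, len(first_record.seq))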

To preprocess the data defined in the previous cell, use the pretrain.py script and set the do_training parameter to False, as shown below:

%%capture --no-display --no-stderr cell_output
! python examples/protein/esm2nv/pretrain.py \
    --config-path=conf \
    --config-name=pretrain_esm2_650M \
    ++do_training=False \
    ++do_preprocessing=True \
    ++model.data.val_size=500 \
    ++model.data.test_size=100 \
    ++model.data.train.uf50_datapath={train_uf50_fasta} \
    ++model.data.train.uf90_datapath={train_uf90_fasta} \
    ++model.data.train.cluster_mapping_tsv={train_cluster_mapping_tsv} \
    ++model.data.dataset_path={datapath_dir}

Command Line and YAML Configuration for pretrain.py#

Parameters starting with -- are passed as command line arguments to pretrain.py. Below are examples of such parameters:

  • --config-path and --config-name:
    These specify the folder and the YAML file name for the configuration. The path is relative to pretrain.py. For instance:

    • config-path: Refers to the configuration folder, e.g., examples/protein/esm2nv/conf.

    • config-name: Refers to the YAML configuration file, e.g., pretrain_esm2_650M.yaml.

    The full path for the configuration file in this example would be:
    examples/protein/esm2nv/conf/pretrain_esm2_650M.yaml.

Parameters starting with ++ are configurable within the YAML file. Below are some examples of such parameters found in the pretrain_esm2_650M.yaml file, which inherits from base_config.yaml:

  • do_training:
    Set to False if you only want to preprocess the data without initiating training.

  • model.data.val_size and model.data.test_size:
    These specify the sizes of the validation and test datasets, respectively.

  • model.data.train.uf50_datapath:
    Specifies the path to the UniRef50 FASTA file.

  • model.data.train.uf90_datapath:
    Specifies the path to the UniRef90 FASTA file.

  • model.data.train.cluster_mapping_tsv:
    Specifies the path to the mapping file that maps UniRef50 clusters to UniRef90 sequences.

  • model.data.dataset_path:
    Specifies the path to the output directory for the preprocessed UniRef50 and UniRef90 data. After processing, the following directories will be created:

    • uf50:
      Contains train/test/val splits, each with files like x000.csv.

    • uf90:
      Contains a folder named uf90_csvs, with files like x000.csv. Note that there will be no train/test/val splits in this directory, as UniRef90 is only used during training.

Changes can also be made directly to the YAML file instead of overwriting arguments through the command line.
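
Once the preprocessing command above has finished, the sketch below can be used to sanity-check the output layout described for model.data.dataset_path. The folder and shard names (train/val/test, x000.csv) follow the description above and may differ in other releases.

import glob
import os
import pandas as pd

# Count the CSV shards produced for each UniRef50 split
for split in ["train", "val", "test"]:
    csvs = sorted(glob.glob(os.path.join(datapath_dir, "uf50", split, "x*.csv")))
    print(f"uf50/{split}: {len(csvs)} CSV file(s)")

# UniRef90 shards live in a single folder and are only used during training
uf90_csvs = sorted(glob.glob(os.path.join(datapath_dir, "uf90", "uf90_csvs", "x*.csv")))
print(f"uf90/uf90_csvs: {len(uf90_csvs)} CSV file(s)")

# Peek at the first rows of one training shard (column names may vary)
train_csvs = sorted(glob.glob(os.path.join(datapath_dir, "uf50", "train", "x*.csv")))
if train_csvs:
    print(pd.read_csv(train_csvs[0], nrows=3))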

Pretrain from Scratch#

Now we will perform pretraining of ESM-2 from scratch using our prepared data and the parameters provided in the pretrain_esm2_650M.yaml config file located in the ${bionemo_home}/examples/protein/esm2nv/conf folder.

For the purpose of this demo example, we will shorten the time required for training by setting the following parameters: ++trainer.max_steps=1 and ++trainer.val_check_interval=1. Users can update these parameters by editing the .yaml config file or by overriding config arguments at runtime using Hydra, as shown in the example below.

  • trainer.devices: Specifies the number of GPUs to use.

  • trainer.max_steps: Sets the maximum number of training steps.

  • trainer.val_check_interval: Determines how often to run validation.

  • trainer.limit_train_batches and trainer.limit_val_batches: Limit the number of batches for training and validation respectively.

  • model.micro_batch_size: The number of samples processed per GPU in a single forward/backward pass; the effective (global) batch size, and therefore how often weights are updated, also depends on the number of devices and any gradient accumulation.

Lastly, you can change the configuration used to pretrain_esm2_8M if you have hardware constraints.

%%capture --no-display --no-stderr cell_output
! python examples/protein/esm2nv/pretrain.py \
  --config-path=conf \
  --config-name=pretrain_esm2_650M \
  name={model_name}_pretrain \
  ++do_training=True \
  ++model.data.dataset_path={datapath_dir} \
  ++trainer.devices=1 \
  ++model.micro_batch_size=1 \
  ++trainer.max_steps=1 \
  ++trainer.val_check_interval=1 \
  ++exp_manager.create_wandb_logger=False \
  ++trainer.limit_train_batches=1 \
  ++trainer.limit_val_batches=1

As the ESM-2nv model training job is launched, BioNeMo will print out details related to compute resources, model training configuration, and the dataset being used for training. As the job progresses, it will also print various details related to the test/train/validation steps and accuracy metrics at set intervals.

Upon completion of the training process, it will print details related to log files, model checkpoints, and so on, which will also be saved in the directory as configured (usually /result).
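
To confirm where these artifacts were written, the minimal sketch below lists checkpoints and exported .nemo files. The /result path is an assumption based on the default mentioned above; adjust it if your exp_manager configuration points elsewhere.

import glob
import os

# Look for checkpoints and exported models under the configured results directory
# (/result is the usual default; change this if the experiment directory was overridden)
results_dir = "/result"
checkpoints = glob.glob(os.path.join(results_dir, "**", "*.ckpt"), recursive=True)
nemo_files = glob.glob(os.path.join(results_dir, "**", "*.nemo"), recursive=True)
print("Checkpoints:", checkpoints)
print("Exported .nemo files:", nemo_files)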

Finally, if Weights and Biases logging was enabled (for example, ++exp_manager.create_wandb_logger=True), you can visualize the model training progress and resulting metrics. The summary will also be printed in the terminal at the end of the training job.

Continue Training from an Existing Checkpoint#

To continue the pretraining of the foundation model, use the pretrain.py script and set exp_manager.resume_if_exists=True to load the model weights and previous run’s metadata. This configuration also picks up the learning rate from the scheduler at the end of the previous run.

%%capture --no-display --no-stderr cell_output
! python examples/protein/esm2nv/pretrain.py \
  --config-path=conf \
  --config-name=pretrain_esm2_650M \
  name={model_name}_continued_training \
  ++do_training=True \
  ++model.data.dataset_path={datapath_dir} \
  ++trainer.devices=1 \
  ++model.micro_batch_size=1 \
  ++trainer.max_steps=2 \
  ++trainer.val_check_interval=1 \
  ++exp_manager.create_wandb_logger=False \
  ++exp_manager.resume_if_exists=True \
  ++trainer.limit_train_batches=1 \
  ++trainer.limit_val_batches=1

If Weights and Biases logging was enabled (for example, ++exp_manager.create_wandb_logger=True), you can also visualize the model training progress and resulting metrics. The summary will also be printed in the terminal at the end of the training job.

In other notebooks, you can explore how to perform inference on your own data, cluster such embeddings, bring and preprocess your own data for training your own ESM model, and finetune existing ESM models.