ESM-2nv: Data Preprocessing and Model Training Using BioNeMo#
Demo Objectives#
The purpose of this tutorial is to provide an example use case for training a BioNeMo large language model using the BioNeMo framework. In this tutorial, you will gain experience in:
Preprocessing the UniRef50 and UniRef90 data for ESM-2nv.
Pretraining and continuing training from a checkpoint for ESM-2nv.
Performing inference with ESM-2nv.
Overview - ESM-2nv Model#
ESM-2nv is based on the public ESM-2 model, which is a BERT architecture trained on millions of protein sequences from the UniProt database. ESM-2nv learns the patterns and dependencies between amino acids that ultimately give rise to a protein's structure. These can include secondary structure elements such as alpha helices and beta sheets, as well as cellular location, thermostability, solubility, and other protein properties. For more information, check the ESM-2nv model card.
This ESM-2nv model training example walkthrough will show how to utilize compute resources, download and preprocess datasets, and perform model training on single and multiple nodes.
The model was trained on UniRef50 and UniRef90 protein sequences, truncated to a maximum length of 1022 amino acids.
Setup#
Ensure that you have read through the Getting Started section, can run the BioNeMo Framework Docker container, and have configured the NGC Command Line Interface (CLI) within the container. It is assumed that this notebook is being executed from within the container. Additionally, this tutorial depends on the ESM-2nv model.
Some cells below use %%capture --no-display --no-stderr cell_output to suppress their output. Comment out or delete this line in those cells to restore the full output.
Import and install all required packages#
import os
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
Home Directory#
bionemo_home = "/workspace/bionemo"
os.environ['BIONEMO_HOME'] = bionemo_home
os.chdir(bionemo_home)
Download Model Checkpoints#
The following code will download the pretrained model esm2nv_650M_converted.nemo from the NGC registry.
In BioNeMo FW, there are numerous ESM models available, including ESM-1nv, ESM-2nv 8M with randomly initialized weights, ESM-2nv fine-tuned for secondary structure downstream prediction tasks with LoRA, ESM-2nv 650M, and ESM-2nv 3B. We also have a configuration file for training ESM-2nv 15B available at examples/protein/esm2nv/conf/pretrain_esm2_15B.yaml if needed.
For demo purposes, we have chosen to showcase the ESM-2nv 650M model. For more details on ESM-1nv or ESM-2nv, consult the corresponding model cards. To find the model names and checkpoint names, please see the artifacts_paths.yaml file.
# Define the NGC CLI API KEY and ORG for the model download
# If these variables are not already set in the container, uncomment below
# to define and set with your API KEY and ORG
#api_key = <your_api_key>
#ngc_cli_org = <ngc_cli_org>
# Update the environment variable
#os.environ['NGC_CLI_API_KEY'] = api_key
#os.environ['NGC_CLI_ORG'] = ngc_cli_org
# Set variables and paths for model and checkpoint
model_name = "esm2nv" # change to esm1nv for ESM1
model_version = "esm2nv_650m" # change to esm1nv for ESM1
actual_checkpoint_name = "esm2nv_650M_converted.nemo" # change to esm1nv.nemo for ESM1
model_path = os.path.join(bionemo_home, 'models')
checkpoint_path = os.path.join(model_path, actual_checkpoint_name)
os.environ['MODEL_PATH'] = model_path
%%capture --no-display --no-stderr cell_output
if not os.path.exists(checkpoint_path):
    !cd /workspace/bionemo && \
        python download_artifacts.py --model_dir models --models {model_version}
else:
    print(f"Model {model_version} already exists at {model_path}.")
Preprocess Data for ESM-2nv#
To briefly showcase the model training capabilities of the BioNeMo Framework, we will use the UniRef50 and UniRef90 datasets to provide a diverse yet non-redundant set of protein sequences. By using both, the model can learn from a wide range of sequence variants while avoiding redundancy. This helps in capturing diverse features and patterns that are relevant for protein function and structure prediction, while also preventing overfitting and improving generalization. For demo purposes, a portion of the sample datasets is located in ${bionemo_home}/examples/tests/test_data/uniref202104_esm2_qc.
The data is stored in a zip file, so run the following command to extract the raw FASTA files and a cluster mapping file:
%%capture --no-display --no-stderr cell_output
# Define the path to the extracted directory
datapath_dir = os.path.join(bionemo_home, 'examples/tests/test_data/protein/uniref50_90')
# Define the path to the zip file
zip_file = f"{datapath_dir}.zip"
# Check if the directory already exists
if not os.path.exists(datapath_dir):
    ! unzip {zip_file} -d {bionemo_home}/examples/tests/test_data/
else:
    pass
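As a quick sanity check, you can list the extracted directory and confirm that the FASTA files and the cluster mapping TSV referenced in the cells below are present (a minimal inspection snippet, not part of the preprocessing workflow itself):
# List the extracted sample dataset; expect the UniRef50/UniRef90 FASTA files
# and the cluster mapping TSV used in the cells below.
print(sorted(os.listdir(datapath_dir)))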
The mapping.tsv file is used to associate protein sequences with their respective clusters. This helps to reduce redundancy, organize data, and evaluate model performance by tracking sequence similarity and ensuring diverse training data.
# Raw inputs for preprocessing: UniRef50 FASTA, UniRef90 FASTA, and the UniRef50-to-UniRef90 cluster mapping
train_uf50_fasta = os.path.join(datapath_dir, 'uniref50_train_filt.fasta')
train_uf90_fasta = os.path.join(datapath_dir, 'ur90_ur50_sampler.fasta')
train_cluster_mapping_tsv = os.path.join(datapath_dir, 'mapping.tsv')
Using the unzipped contents of this file, we first create the preprocessed /train, /val, and /test folders, organizing protein sequences into batch CSV files. It is important to utilize both datasets if you plan to use ESM-2 as originally created. However, if you use your own data, as demonstrated in this notebook, you can opt to use a single data source.
The same approach applies to the clustering mapping file. The ESM2Preprocess class can handle clustering indirectly as part of the dataset preparation process. It leverages UniRef50-to-UniRef90 clustering mappings to organize protein sequences, ensuring that data is appropriately clustered for training and validation.
Please note that this script does not perform clustering itself but relies on pre-defined clustering mappings provided in a TSV file format to organize protein sequences. The expected format is a TSV file where the first column represents the cluster ID (the FASTA header in UniRef50) and the second column lists the members separated by commas. The members correspond to entries in the UniRef90 FASTA file.
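If you plan to bring your own clustering file, a quick peek at the first few rows of the sample mapping TSV shows this layout in practice (a minimal sketch that only prints lines; it does not validate them):
# Print the first few rows of the cluster mapping file.
# Expected layout per row: <UniRef50 cluster ID>\t<comma-separated UniRef90 member IDs>
with open(train_cluster_mapping_tsv) as handle:
    for _, line in zip(range(3), handle):
        print(line.rstrip())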
The preprocessing steps are:
Download the dataset from a specified URL or NGC registry.
Extract and decompress the downloaded data if necessary.
Index the FASTA file using pyfastx to facilitate data access (see the short sketch after this list).
Split the dataset into training, validation, and test sets.
Convert the FASTA sequences into CSV format, dividing them into multiple files if needed.
Generate additional files like memmaps or sorted FASTA files if required for specific use cases.
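As a point of reference for the indexing step above, the following is a minimal, standalone sketch of pyfastx indexing. It assumes pyfastx is available in the container (it is used by the preprocessing code); pretrain.py performs this step for you, and note that pyfastx writes an .fxi index file next to the FASTA:
import pyfastx

# Build (or reuse) an index for the sample UniRef50 FASTA and inspect one record.
fasta = pyfastx.Fasta(train_uf50_fasta)
print(f"{len(fasta)} sequences indexed")
first_record = fasta[0]
print(first_record.name, len(first_record.seq))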
For more details about the preprocessing steps, please consult the .../bionemo/data/preprocess/protein/preprocess.py file and the documentation found here.
To preprocess the data defined in the previous cell, use the pretrain.py script and set the do_training parameter to False, as shown below:
%%capture --no-display --no-stderr cell_output
! python examples/protein/esm2nv/pretrain.py \
--config-path=conf \
--config-name=pretrain_esm2_650M \
++do_training=False \
++do_preprocessing=True \
++model.data.val_size=500 \
++model.data.test_size=100 \
++model.data.train.uf50_datapath={train_uf50_fasta} \
++model.data.train.uf90_datapath={train_uf90_fasta} \
++model.data.train.cluster_mapping_tsv={train_cluster_mapping_tsv} \
++model.data.dataset_path={datapath_dir}
Command Line and YAML Configuration for pretrain.py#
Parameters starting with -- are passed as command line arguments to pretrain.py. Below are examples of such parameters:
--config-path and --config-name: These specify the folder and the YAML file name for the configuration. The path is relative to pretrain.py. For instance:
config-path: Refers to the configuration folder, e.g., examples/protein/esm2nv/conf.
config-name: Refers to the YAML configuration file, e.g., pretrain_esm2_650M.yaml.
The full path for the configuration file in this example would be: examples/protein/esm2nv/conf/pretrain_esm2_650M.yaml.
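If you want to inspect the launch configuration before running the script, you can load the YAML file directly. This is a minimal sketch using OmegaConf (which ships with Hydra); note that the Hydra defaults list, including the inheritance from base_config.yaml, is only resolved at launch time, so this prints only what is defined in that one file:
from omegaconf import OmegaConf

# Load and print the raw 650M pretraining config (Hydra defaults are not resolved here).
cfg = OmegaConf.load("examples/protein/esm2nv/conf/pretrain_esm2_650M.yaml")
print(OmegaConf.to_yaml(cfg))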
Parameters starting with ++ are configurable within the YAML file. Below are some examples of such parameters found in the pretrain_esm2_650M.yaml file, which inherits from base_config.yaml:
do_training: Set to False if you only want to preprocess the data without initiating training.
model.data.val_size and model.data.test_size: These specify the sizes of the validation and test datasets, respectively.
model.data.train.uf50_datapath: Specifies the path to the UniRef50 FASTA file.
model.data.train.uf90_datapath: Specifies the path to the UniRef90 FASTA file.
model.data.train.cluster_mapping_tsv: Specifies the path to the mapping file that maps UniRef50 clusters to UniRef90 sequences.
model.data.dataset_path: Specifies the path to the output directory for the preprocessed UniRef50 and UniRef90 data. After processing, the following directories will be created:
uf50: Contains train/test/val splits, each with files like x000.csv.
uf90: Contains a folder named uf90_csvs, with files like x000.csv. Note that there will be no train/test/val splits in this directory, as UniRef90 is only used during training.
Changes can also be made directly to the YAML file instead of overwriting arguments through the command line.
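After the preprocessing cell has completed, a quick way to confirm that these directories were created is to count the CSV shards per split (a minimal sketch; the number of x000.csv-style files depends on the dataset and configuration):
import glob
import os

# Count the CSV shards produced for each UniRef50 split and for UniRef90.
for split in ("train", "val", "test"):
    shards = glob.glob(os.path.join(datapath_dir, "uf50", split, "*.csv"))
    print(f"uf50/{split}: {len(shards)} csv file(s)")
uf90_shards = glob.glob(os.path.join(datapath_dir, "uf90", "uf90_csvs", "*.csv"))
print(f"uf90/uf90_csvs: {len(uf90_shards)} csv file(s)")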
Pretrain from Scratch#
Now we will perform pretraining of ESM-2 from scratch using our prepared data and the parameters provided in the pretrain_esm2_650M.yaml config file located in the ${bionemo_home}/examples/protein/esm2nv/conf folder.
For the purposes of this demo, we will shorten the time required for training by setting ++trainer.max_steps=1 and ++trainer.val_check_interval=1. You can update these parameters by editing the .yaml config file or by overriding config arguments at runtime using Hydra, as shown in the example below.
trainer.devices: Specifies the number of GPUs to use.
trainer.max_steps: Sets the maximum number of training steps.
trainer.val_check_interval: Determines how often to run validation.
trainer.limit_train_batches and trainer.limit_val_batches: Limit the number of batches for training and validation, respectively.
micro_batch_size: Refers to the number of samples processed in a single forward/backward pass before performing a weight update.
Lastly, you can change the configuration to pretrain_esm2_8M if you have hardware constraints.
%%capture --no-display --no-stderr cell_output
! python examples/protein/esm2nv/pretrain.py \
--config-path=conf \
--config-name=pretrain_esm2_650M \
name={model_name}_pretrain \
++do_training=True \
++model.data.dataset_path={datapath_dir} \
++trainer.devices=1 \
++model.micro_batch_size=1 \
++trainer.max_steps=1 \
++trainer.val_check_interval=1 \
++exp_manager.create_wandb_logger=False \
++trainer.limit_train_batches=1 \
++trainer.limit_val_batches=1
As the ESM-2nv model training job is launched, BioNeMo will print out details related to compute resources, model training configuration, and the dataset being used for training. As the job progresses, it will also print various details related to the test/train/validation steps and accuracy metrics at set intervals.
Upon completion of the training process, it will print details related to log files, model checkpoints, and so on, which will also be saved in the configured directory (usually /result).
Finally, if Weights and Biases logging was enabled (for example, ++exp_manager.create_wandb_logger=True), you can visualize the model training progress and resulting metrics. The summary will also be printed in the terminal at the end of the training job.
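To locate these artifacts programmatically after the run, you can search the results directory. This is a minimal sketch; the exact output location depends on your exp_manager settings, so adjust results_dir to the directory printed at the end of your run:
import glob
import os

# The default noted above is /result; change this if exp_manager was configured differently.
results_dir = "/result"
for pattern in ("**/*.nemo", "**/*.ckpt", "**/*.log"):
    for path in glob.glob(os.path.join(results_dir, pattern), recursive=True):
        print(path)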
Continue Training from an Existing Checkpoint#
To continue the pretraining of the foundation model, use the pretrain.py script and set exp_manager.resume_if_exists=True to load the model weights and the previous run's metadata. This configuration also picks up the learning rate from the scheduler at the end of the previous run.
%%capture --no-display --no-stderr cell_output
! python examples/protein/esm2nv/pretrain.py \
--config-path=conf \
--config-name=pretrain_esm2_650M \
name={model_name}_continued_training \
++do_training=True \
++model.data.dataset_path={datapath_dir} \
++trainer.devices=1 \
++model.micro_batch_size=1 \
++trainer.max_steps=2 \
++trainer.val_check_interval=1 \
++exp_manager.create_wandb_logger=False \
++exp_manager.resume_if_exists=True \
++trainer.limit_train_batches=1 \
++trainer.limit_val_batches=1
If Weights and Biases logging was enabled (for example, ++exp_manager.create_wandb_logger=True), you can also visualize the model training progress and resulting metrics. The summary will also be printed in the terminal at the end of the training job.
In other notebooks, you can explore how to perform inference on your own data, cluster such embeddings, bring and preprocess your own data for training your own ESM model, and finetune existing ESM models.