Performing inference on OAS sequences with ESM-2nv#
Demo Objectives:#
Learn how to bring your own dataset for ESM-2nv inference.
Load a pretrained ESM-2nv model and perform inference on the input prepared in the previous step.
Relevance:
Antibodies are among the most successful therapeutics in clinical trials and on the market. They have demonstrated high efficacy and specificity in treating a variety of diseases and have become a dominant class of bio-pharmaceuticals, with several blockbuster drugs generating substantial revenue. For instance, monoclonal antibodies used in oncology, autoimmune diseases, and infectious diseases have achieved widespread clinical success and market penetration.
Their success is reflected in their ability to specifically target disease-causing agents or cells, reducing side effects compared to traditional treatments. Market reports consistently highlight antibodies as a leading category in bio-pharmaceuticals, underscoring their pivotal role in modern medicine’s therapeutic landscape.
We will use ESM-2nv to create embeddings of heavy-chain variable domain (VH) sequences of antibodies found in the OAS database.
Setup#
Ensure that you have read through the Getting Started section, can run the BioNeMo Framework Docker container, and have configured the NGC Command Line Interface (CLI) within the container. It is assumed that this notebook is being executed from within the container.
Some cells below use %%capture --no-display --no-stderr cell_output to suppress their output. Comment out or delete this line in those cells to restore full output.
You can use this notebook for both ESM-2nv and ESM-1nv by making minor code changes.
Import and install all required packages#
import os
import gzip
import shutil
import warnings
import pandas as pd
import pickle as pkl
import urllib.request
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
Home Directory#
bionemo_home = "/workspace/bionemo"
os.environ['BIONEMO_HOME'] = bionemo_home
os.chdir(bionemo_home)
Download Model Checkpoints#
The following code will download the pretrained model esm2nv_650M_converted.nemo from the NGC registry.
In BioNeMo FW, there are numerous ESM models available, including ESM-1nv, ESM-2nv 8M with randomly initialized weights, ESM-2nv fine-tuned for the secondary structure downstream prediction task with LoRA, ESM-2nv 650M, and ESM-2nv 3B. We also provide a configuration file for training ESM-2nv 15B at examples/protein/esm2nv/conf/pretrain_esm2_15B.yaml if needed.
For demo purposes, we have chosen to showcase the ESM-2nv 650M model. For more details on ESM-1nv or ESM-2nv, consult the corresponding model cards. To find the model names and checkpoint names, please see the artifacts_paths.yaml file.
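If you want to check which model names are valid before downloading, one option is to load that YAML file directly. This is only a sketch: it assumes artifacts_paths.yaml sits at the repository root (${BIONEMO_HOME}) and that PyYAML is available in the container, and it simply prints the top-level keys.
import yaml

# Optional sketch: list the entries defined in artifacts_paths.yaml.
# Assumes the file sits at the repository root and PyYAML is installed;
# the exact layout may differ between BioNeMo releases.
with open(os.path.join(bionemo_home, 'artifacts_paths.yaml')) as f:
    artifacts = yaml.safe_load(f)

print(list(artifacts.keys()))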
# Define the NGC CLI API KEY and ORG for the model download
# If these variables are not already set in the container, uncomment below
# to define and set with your API KEY and ORG
# api_key = <YOUR_API_KEY>
# ngc_cli_org = <YOUR_ORG>
# Update the environment variable
# os.environ['NGC_CLI_API_KEY'] = api_key
# os.environ['NGC_CLI_ORG'] = ngc_cli_org
# Set variables and paths for model and checkpoint
model_name = "esm2nv" # for esm1nv change this to "esm1nv"
model_version = "esm2nv_650m" # for esm1nv change this to "esm1nv"
actual_checkpoint_name = "esm2nv_650M_converted.nemo" # for esm1nv change this to "esm1nv_converted.nemo"
model_path = os.path.join(bionemo_home, 'models')
checkpoint_path = os.path.join(model_path, actual_checkpoint_name)
os.environ['MODEL_PATH'] = model_path
%%capture --no-display --no-stderr cell_output
if not os.path.exists(checkpoint_path):
    !cd /workspace/bionemo && \
    python download_artifacts.py --model_dir models --models {model_version}
else:
    print(f"Model {model_version} already exists at {model_path}.")
Dataset preparation#
Here we will download the dataset and unzip it. The data was sourced from the OAS database.
data_links = [
    'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz'
]

base_data_dir = os.path.join(bionemo_home, 'data', 'OAS_paired')
if not os.path.exists(base_data_dir):
    os.makedirs(base_data_dir)

for file in data_links:
    data_file = os.path.join(base_data_dir, os.path.basename(file))
    if not os.path.isfile(data_file):
        # Download the compressed CSV from the OAS server
        urllib.request.urlretrieve(file, data_file)
        # Unzip the file
        try:
            with gzip.open(data_file, 'rb') as f_in:
                with open(data_file[:-3], 'wb') as f_out:  # remove the .gz extension for the output file
                    shutil.copyfileobj(f_in, f_out)
        except OSError as e:
            print(f"Error opening the file {data_file}: {e}")
Create a BioNeMo-Compatible Data Format for Inference#
Now that we have the raw data, we need to process it to be compatible with BioNeMo’s expected input. We will convert the data into a CSV file. As we are not performing training in this demo, we will not need to create the typical train, validation, and test splits.
def transform_and_save_csv(input_data_path: str, output_data_path: str, columns_to_keep: list) -> None:
    """Transforms the input CSV by keeping only the specified columns and saves the result to the output path."""
    try:
        # Read the CSV with only the specified columns, skipping the
        # OAS metadata line that precedes the real column header
        df = pd.read_csv(input_data_path, skiprows=[0], usecols=columns_to_keep)
        # Write the filtered data to a new CSV file
        df.to_csv(output_data_path, index=False)
        print(f"Columns {columns_to_keep} have been selected and saved to {output_data_path}")
    except Exception as e:
        print(f"Error occurred: {e}")
        raise
data_path = f'{base_data_dir}/SRR10358524_paired.csv'
columns_to_keep = ['sequence_id_heavy', 'sequence_alignment_aa_heavy']
filtered_data_path = f'{base_data_dir}/filtered_csv'
! mkdir -p {filtered_data_path}
processed_data_file = f'{filtered_data_path}/filtered_data.csv'
transform_and_save_csv(data_path, processed_data_file, columns_to_keep)
Columns ['sequence_id_heavy', 'sequence_alignment_aa_heavy'] have been selected and saved to /workspace/bionemo/data/OAS_paired/filtered_csv/filtered_data.csv
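As a quick check, you can reload the filtered CSV and confirm the two-column layout (id at index 0, sequence at index 1) that the inference script expects:
# Confirm the filtered file has the expected two-column layout:
# column 0 = sequence id, column 1 = amino-acid sequence
df_check = pd.read_csv(processed_data_file)
print(df_check.shape)
print(df_check.head())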
Perform Inference Using the Existing ESM Model#
To perform inference on the antibody sequences using the infer.py script, ensure that the following requirements are met:
Create and prepare a designated output directory to store the results.
Specify a file path within this directory where the embeddings will be saved. This can be in pkl or h5 format.
If the output format is pkl, the predictions (embeddings and/or hidden states) are serialized with the pickle module: the file contains one record per input sequence, each holding the sequence, its identifier, and the corresponding predicted output, as demonstrated when we load the results below.
model_config_path = os.path.join(bionemo_home, f'examples/protein/{model_name}/conf')
output_dir = f'{base_data_dir}/filtered_csv/inference_output' # where we want to save the output
! mkdir -p {output_dir}
inference_results = f'{output_dir}/{model_name}_oas.pkl' # the name of the output file
The input file is expected to have an id column at index 0 and a sequence column at index 1, as specified by default in the /bionemo/examples/conf/base_infer_config.yaml config file. Change these parameters to suit your data files in your .yaml file or by using Hydra to override the default settings.
It is also important to specify which model we want to use for inference by setting model.downstream_task.restore_path to the checkpoint_path variable.
The results will be saved in the specified output_dir, where a log folder for the experiment run will be created along with the inference_results file.
%%capture --no-display --no-stderr cell_output
! python /workspace/bionemo/bionemo/model/infer.py \
--config-dir {model_config_path} \
--config-name infer \
++name={model_name}_Inference_OAS \
++model.downstream_task.restore_path={checkpoint_path} \
++model.data.dataset_path={processed_data_file} \
++exp_manager.exp_dir={output_dir} \
++model.inference_output_file={inference_results} \
++model.data.output_fname={inference_results}
We can now access the embeddings saved in the pkl format as follows:
# Load the pickled model predictions
with open(inference_results, 'rb') as fd:
    infer_results = pkl.load(fd)

print(f"This is a sequence: {infer_results[-1]['sequence']}")  # the sequence that was embedded
print(f"The number of features for a single embedded sequence: {infer_results[-1]['embeddings'].shape}")
print(f"Inspecting the features vector for a sequence: {infer_results[-1]['embeddings']}")
This is a sequence: QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLEWMGWINPNSGGTNYAQKFQGRVTMTRDTSISTAYMELSRLRSDDTAVYYCARESQIVVVPAAIEDYYYYGMDVWGQGTTVTVSS
The number of features for a single embedded sequence: (1280,)
Inspecting the features vector for a sequence: [-0.01926641 -0.04979213 -0.10104819 ... -0.18573356 0.03264603
0.140413 ]
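For downstream analysis, it is often convenient to stack the per-sequence embeddings into a single matrix. A small sketch, assuming infer_results is the list of records loaded above:
import numpy as np

# Stack all per-sequence embeddings into one (num_sequences, 1280) matrix
# for downstream analysis such as clustering or UMAP projection
embedding_matrix = np.stack([record['embeddings'] for record in infer_results])
print(embedding_matrix.shape)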
Call to Action#
In your own time:
Generate embeddings for the UniRef50 data.
Use the antibody embeddings generated here.
Cluster both sets of embeddings (proteins and antibodies) using UMAP and see if you can identify any patterns; a minimal sketch follows below. You might gain inspiration from the protein clustering notebook.
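As a starting point for the clustering exercise, here is a minimal sketch. It assumes the umap-learn package is installed (pip install umap-learn) and that embedding_matrix was built as shown above; it is not part of the original workflow.
import umap

# A minimal UMAP sketch for the call to action above: project the stacked
# embeddings down to 2-D, one point per antibody sequence, ready to plot
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embedding_matrix)
print(coords.shape)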