Performing inference on OAS sequences with ESM-2nv#

NOTE This notebook was tested on a single A1000 GPU and is compatible with BioNeMo Framework v1.6, v1.7, and v1.8, with an expected runtime of less than one hour.

Demo Objectives:#

  1. Learn how to bring your own dataset for ESM-2nv inference.

  2. Load a pretrained ESM-2nv model and perform inference on the input prepared in the previous step.

Relevance:

Antibodies are among the most successful therapeutics in clinical trials and on the market. They have demonstrated high efficacy and specificity in treating a variety of diseases and have become a dominant class of biopharmaceuticals in recent years, with several blockbuster drugs generating substantial revenue. For instance, monoclonal antibodies used in oncology, autoimmune diseases, and infectious diseases have achieved widespread clinical success and market penetration.

Their success reflects their ability to specifically target disease-causing agents or cells, reducing side effects compared to traditional treatments. Market reports consistently rank antibodies as a leading category of biopharmaceuticals, underscoring their pivotal role in modern medicine's therapeutic landscape.

We will use ESM-2nv to create embeddings of heavy-chain variable domain (VH) sequences of antibodies found in the Observed Antibody Space (OAS) database.

Setup#

Ensure that you have read through the Getting Started section, can run the BioNeMo Framework Docker container, and have configured the NGC Command Line Interface (CLI) within the container. It is assumed that this notebook is being executed from within the container.

NOTE Some of the cells below generate long text output. We're using
%%capture --no-display --no-stderr cell_output
to suppress this output. Comment or delete this line in the cells below to restore full output.

You can use this notebook for both ESM-2nv and ESM-1nv by making minor code changes.

Import and install all required packages#

import os
import gzip
import shutil
import warnings

import pandas as pd
import pickle as pkl
import urllib.request

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

Home Directory#

bionemo_home = "/workspace/bionemo"
os.environ['BIONEMO_HOME'] = bionemo_home
os.chdir(bionemo_home)

Download Model Checkpoints#

The following code will download the pretrained model esm2nv_650M_converted.nemo from the NGC registry.

In BioNeMo FW, there are numerous ESM models available, including ESM-1nv, ESM-2nv 8M with randomly initialized weights, an ESM-2nv model fine-tuned with LoRA for secondary-structure prediction, ESM-2nv 650M, and ESM-2nv 3B. We also provide a configuration file for training ESM-2nv 15B at examples/protein/esm2nv/conf/pretrain_esm2_15B.yaml if needed.

For demo purposes, we have chosen to showcase the ESM-2nv 650M model. For more details on the ESM-1nv or ESM-2nv, consult the corresponding model cards. To find the model names and checkpoint names, please see the artifacts_paths.yaml file.
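If you prefer to inspect the available model and checkpoint names programmatically, a minimal sketch is shown below; the location of artifacts_paths.yaml is an assumption based on the standard container layout, so adjust the path if your setup differs.

import yaml

# NOTE: this location is an assumption; adjust it if artifacts_paths.yaml
# lives elsewhere in your container.
artifacts_yaml = os.path.join(bionemo_home, "artifacts_paths.yaml")

if os.path.exists(artifacts_yaml):
    with open(artifacts_yaml) as f:
        artifacts = yaml.safe_load(f)
    print(list(artifacts.keys()))  # model entries available for download
else:
    print(f"artifacts_paths.yaml not found at {artifacts_yaml}")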

# Define the NGC CLI API KEY and ORG for the model download
# If these variables are not already set in the container, uncomment below
# to define and set with your API KEY and ORG
# api_key = <YOUR_API_KEY>
# ngc_cli_org = <YOUR_ORG>
# Update the environment variable
# os.environ['NGC_CLI_API_KEY'] = api_key
# os.environ['NGC_CLI_ORG'] = ngc_cli_org

# Set variables and paths for model and checkpoint
model_name = "esm2nv" # for esm1nv change this to "esm1nv"
model_version = "esm2nv_650m" # for esm1nv change this to "esm1nv"
actual_checkpoint_name = "esm2nv_650M_converted.nemo" # for esm1nv change this to "esm1nv_converted.nemo"
model_path = os.path.join(bionemo_home, 'models')
checkpoint_path = os.path.join(model_path, actual_checkpoint_name)
os.environ['MODEL_PATH'] = model_path

%%capture --no-display --no-stderr cell_output
if not os.path.exists(checkpoint_path):
    !cd /workspace/bionemo && \
    python download_artifacts.py --model_dir models --models {model_version}
else:
    print(f"Model {model_version} already exists at {model_path}.")

Dataset preparation#

Here we will download the dataset and unzip it. The data was sourced from the OAS database.

data_links = [
    'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz']

base_data_dir = os.path.join(bionemo_home, 'data', 'OAS_paired')
if not os.path.exists(base_data_dir):
    os.makedirs(base_data_dir)

for file in data_links:
    data_file = os.path.join(base_data_dir, os.path.basename(file))
    if not os.path.isfile(data_file):
        # Download the file if it is not already present
        urllib.request.urlretrieve(file, data_file)

    # Unzip the file
    try:
        with gzip.open(data_file, 'rb') as f_in:
            with open(data_file[:-3], 'wb') as f_out:  # Remove .gz extension for the output file
                shutil.copyfileobj(f_in, f_out)
    except OSError as e:
        print(f"Error opening the file {data_file}: {e}")

Create a BioNeMo-Compatible Data Format for Inference#

Now that we have the raw data, we need to process it to be compatible with BioNeMo’s expected input. We will convert the data into a CSV file. As we are not performing training in this demo, we will not need to create the typical train, validation, and test splits.

def transform_and_save_csv(input_data_path: str, output_data_path: str, columns_to_keep: list) -> None:
    """
    Transforms the input CSV by keeping only specified columns and saves the result to output path.
    """
    try:
        # Skip the OAS metadata line (row 0) and read only the specified columns
        df = pd.read_csv(input_data_path, skiprows=[0], usecols=columns_to_keep)
        
        # Write the filtered data to a new CSV file
        df.to_csv(output_data_path, index=False)
        
        print(f"Columns {columns_to_keep} have been selected and saved to {output_data_path}")
    
    except Exception as e:
        print(f"Error occurred: {e}")
        raise

data_path = f'{base_data_dir}/SRR10358524_paired.csv'
columns_to_keep = ['sequence_id_heavy', 'sequence_alignment_aa_heavy']

filtered_data_path = f'{base_data_dir}/filtered_csv'
! mkdir -p {filtered_data_path}
processed_data_file = f'{filtered_data_path}/filtered_data.csv'
transform_and_save_csv(data_path, processed_data_file, columns_to_keep)
Columns ['sequence_id_heavy', 'sequence_alignment_aa_heavy'] have been selected and saved to /workspace/bionemo/data/OAS_paired/filtered_csv/filtered_data.csv
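A quick optional check confirms that the filtered file contains one identifier and one amino-acid sequence per row:

# Optional: inspect the filtered CSV that will be fed to the inference script
df_check = pd.read_csv(processed_data_file)
print(df_check.shape)
df_check.head()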

Perform Inference Using the Existing ESM Model#

To perform inference on the antibody sequences using the infer.py script, ensure that the following requirements are met:

  1. Create and prepare a designated output directory to store the results.

  2. Specify a file path within this directory where the embeddings will be saved. This can be in pkl or h5 format.

If the output format is pkl, the predictions (embeddings and/or hidden states) are serialized with the pickle module. The resulting file contains a list of dictionaries, one per input sequence, each holding the sequence itself along with its predicted outputs, as we will see when loading the results below.

model_config_path = os.path.join(bionemo_home, f'examples/protein/{model_name}/conf')
output_dir = f'{base_data_dir}/filtered_csv/inference_output' # where we want to save the output 
! mkdir -p {output_dir}
inference_results = f'{output_dir}/{model_name}_oas.pkl' # the name of the output file

The input file is expected to have an id column at index 0 and a sequence column at index 1, as specified by default in the examples/conf/base_infer_config.yaml config file (relative to $BIONEMO_HOME). Change these parameters to suit your data files in your .yaml file, or override the defaults with Hydra.
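If you want to see all of the defaults this notebook relies on, you can print the base inference config; the path below assumes the standard $BIONEMO_HOME layout.

# Print the shared inference defaults (path assumes the standard layout)
base_infer_config = os.path.join(bionemo_home, 'examples/conf/base_infer_config.yaml')
with open(base_infer_config) as f:
    print(f.read())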

It is also important to specify which model we want to use for inference by setting model.downstream_task.restore_path to the checkpoint_path variable. The results will be saved in the output_dir specified, and a log folder will be created for the experiment run and the inference_results file.

%%capture --no-display --no-stderr cell_output
! python /workspace/bionemo/bionemo/model/infer.py \
    --config-dir {model_config_path} \
    --config-name infer \
    ++name={model_name}_Inference_OAS \
    ++model.downstream_task.restore_path={checkpoint_path} \
    ++model.data.dataset_path={processed_data_file} \
    ++exp_manager.exp_dir={output_dir} \
    ++model.inference_output_file={inference_results} \
    ++model.data.output_fname={inference_results}
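Once the run finishes, it is worth confirming that the expected pickle file was written before attempting to load it:

# Confirm the inference run produced the expected output file
assert os.path.exists(inference_results), \
    f"No output found at {inference_results}; check the logs under {output_dir}."
print(f"Inference results written to {inference_results}")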

We can now access the embeddings saved in the pkl format as follows:

# Load the pickled model predictions
with open(inference_results, 'rb') as fd:
    infer_results = pkl.load(fd)
print(f"This is a sequence: {infer_results[-1]['sequence']}") # the sequence that was embedded
print(f"The number of features for a single embedded sequence: {infer_results[-1]['embeddings'].shape}") # the number of features for a single embedded sequence
print(f"Inspecting the features vector for a sequence: {infer_results[-1]['embeddings']}") # inspecting the features
This is a sequence: QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLEWMGWINPNSGGTNYAQKFQGRVTMTRDTSISTAYMELSRLRSDDTAVYYCARESQIVVVPAAIEDYYYYGMDVWGQGTTVTVSS
The number of features for a single embedded sequence: (1280,)
Inspecting the features vector for a sequence: [-0.01926641 -0.04979213 -0.10104819 ... -0.18573356  0.03264603
  0.140413  ]
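For downstream analysis it is convenient to collect all embeddings into a single matrix. A minimal sketch, assuming every record carries a fixed-length 'embeddings' vector as shown above:

import numpy as np

# Stack the per-sequence embeddings into an (n_sequences, n_features) matrix
# and keep the sequences alongside for reference
embeddings = np.stack([r['embeddings'] for r in infer_results])
sequences = [r['sequence'] for r in infer_results]
print(embeddings.shape)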

Call to Action#

In your own time:

  • Generate embeddings on the UniRef50 data.

  • Use the embeddings generated here.

  • Cluster both sets of embeddings (proteins and antibodies) using UMAP and see if you can identify any patterns; a minimal starting sketch is given below. You might gain inspiration from the protein clustering notebook.
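As a starting point for the clustering exercise above, here is a minimal UMAP sketch. It assumes the umap-learn and matplotlib packages are available (pip install umap-learn matplotlib) and reuses the embeddings matrix built earlier:

# Minimal UMAP sketch over the antibody embeddings (assumes `embeddings`
# from the earlier cell and `pip install umap-learn matplotlib`)
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.title("UMAP projection of ESM-2nv antibody VH embeddings")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.show()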