Generating Embeddings for Protein Clustering#

In this example we will use the BioNeMo Framework to show how to

  • Set up the essential BioNeMo building blocks for ML inference

    • Launch a BioNeMo Docker container

    • Prepare data

    • Download pretrained models

    • Launch an inference server for a pretrained model

  • Run inference on protein sequence data

  • Walk through a protein sequence clustering task with example data

    • Extract embeddings

    • Build simple visualizations

    • Perform a clustering task with the embeddings

Prerequisites#

Before diving in, ensure you have all necessary prerequisites. If this is your first time using BioNeMo, we recommend following the quickstart guide first.

Additionally, this tutorial depends on:

  • The ESM-2 model

  • The local inference server

Launch a Docker container#

Once you have successfully set up the BioNeMo Framework Docker image following the ‘Setup’ instructions in the Quickstart Guide, identify the image ID by running ‘docker images’ and finding the BioNeMo image you downloaded.
With the required model name and model path details, you can now run the container locally with docker run -it -p 8888:8888 --gpus all -v <model directory>:<mount destination> <BioNeMo_Image> <model>. This command starts a ‘Jupyter Lab’ instance that can be reached at localhost:8888.

This example was built using a container running on a local machine with 2 x RTX A6000 GPUs.

Prepare data#

ESM2-nv can be used to run inference on protein sequence data, producing embeddings.
In this example, we will work with a sample CSV file containing sequence data from the UniRef50 dataset. Our sample dataset consists of 5,001 proteins and is available in the BioNeMo Framework GitHub repository at the following location: ./examples/tests/test_data/uniref202104_esm2_qc

The data is stored in a zip file, so run the following command to extract the pre-training and inference files.

!unzip -o /workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc.zip -d /workspace/bionemo/examples/tests/test_data/
Archive:  /workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc.zip
  inflating: /workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc/README.md  
  inflating: /workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc/ur90_ur50_sampler.fasta  
  inflating: /workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc/uniref50_train_filt.fasta  
  inflating: /workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc/mapping.tsv  
  inflating: /workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc/uniref50_inference_example.csv  

For this inference example, we simply use the uniref50_inference_example.csv file. This is a two-column, comma-separated file where the first column contains the sequence IDs and the second column contains the protein sequences.
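If you want to confirm this format, an optional quick peek at the first few lines works as follows (the path assumes the /workspace/bionemo mount used throughout this tutorial):

# Optional: print the header and first two records of the inference file
csv_path = "/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc/uniref50_inference_example.csv"
with open(csv_path) as f:
    for _ in range(3):
        print(f.readline().strip())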

Download pre-trained Models#

The BioNeMo Framework comes with the basic building blocks for ML inference, namely NeMo-optimized pre-trained models and the framework to use these models for pre-training and inference. We also provide a set of scripts to download these models from NVIDIA NGC.
As of the pre-release version (01 NOV 2023), we have the following models available:

  • Molecule/MegaMolBART

  • Protein/ESM-1nv

  • Protein/ESM-2nv: 650M and 3B variants.

  • Protein/ProtT5nv

In your active Docker container, download the pre-trained models with

./launch.sh download

The command above will invoke a routine that downloads the models from published addresses in download_bionemo_models.

Once execution is complete, you will see files with the .nemo extension within the folder specified in the DOCKER_MODELS_PATH configuration of your .env file.
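As an optional sanity check from inside the container, you can list the downloaded checkpoints. Note that /model below is only an assumed mount destination; adjust it to whatever your DOCKER_MODELS_PATH maps to.

# List any .nemo checkpoints under the (assumed) /model mount destination
import glob
print(glob.glob("/model/**/*.nemo", recursive=True))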

Refer to the Quickstart Guide for configuring your Docker container and for more info about the .env file.

Configuring paths in your active container#

.nemo Files
Make sure the downloaded models are within the desired project mount directory, such as /models/protein/esm2nv; otherwise, you may have to move the .nemo files (for example, esm2nv_650M_converted.nemo) from wherever they were downloaded to that location.

Jupyter Lab’s Workspace
If you type pwd in the Jupyter Lab terminal, it may not show BioNeMo’s original directory on your machine but a different directory such as /workspace, depending on the project mount directory you set in the .env file.

Setting up the notebook in your active container#

In this example, we will be working with the ESM2-nv model and a subset of the UniRef50 data (5k proteins) available in the BioNeMo framework.

ESM-2nv was developed using the BioNeMo Framework. The model uses the Bidirectional Encoder Representations from Transformers (BERT) architecture and is based on the ESM-2 model.

You can run the code provided in the cells below in a Jupyter notebook launched from the JupyterLab interface running inside the container. To do so, first run the following two steps in your terminal:

  1. Copy the notebook to the example directory of ESM2-nv:

cp protein-esm2nv-clustering.ipynb /workspace/bionemo/examples/esm2nv/nbs

  2. Launch a gRPC client with the ESM2-nv-650M model:

python3 -m bionemo.model.protein.esm1nv.grpc.service --model esm2nv_650M

Run inference on protein sequence data#

Once the gRPC client is launched with your pre-trained model of choice, generating embeddings the BioNeMo way is simple:

from infer import ESMInferenceWrapper
# Initialize the inference wrapper that will invoke the pre-trained model
esm = ESMInferenceWrapper()
# Work with a set of sequences (these are not real proteins, although they contain only valid amino acids)
sequence_examples = ["TRANSCENDINGLIMITS", "MASTERMINDING"]
# Generate embeddings
embeddings = esm.seq_to_embedding(sequence_examples)

# Print the shape of embeddings
print(f"Shape of embeddings (2, 1280): {embeddings.shape}")
Shape of embeddings (2, 1280): torch.Size([2, 1280])

This requires fewer lines of code than, for example, Rostlab’s example on Hugging Face, because the ESMInferenceWrapper BioNeMo class takes care of the tokenization process and handles the configuration of the hidden states.
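For context only, below is a rough sketch of what an equivalent Hugging Face workflow might look like. It is purely illustrative: the checkpoint name facebook/esm2_t33_650M_UR50D and the mean-pooling step are assumptions, not part of this tutorial.

# Illustrative sketch only; not used elsewhere in this tutorial
from transformers import AutoModel, AutoTokenizer
import torch

hf_name = "facebook/esm2_t33_650M_UR50D"        # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(hf_name)
model = AutoModel.from_pretrained(hf_name)

inputs = tokenizer(sequence_examples, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state   # (batch, tokens, hidden_dim)
hf_embeddings = hidden_states.mean(dim=1)                # naive mean pooling over tokens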

Protein sequence clustering example#

Now that we are familiar with the basic building blocks the BioNeMo Framework provides for generating protein embeddings, let’s tackle an example that includes data visualization and clustering with the UniRef50 sample data.

To do this efficiently, we will use open-source, CUDA-accelerated data science and machine learning libraries such as cuML, cuDF, and CuPy. For more information, check out NVIDIA RAPIDS.

cuDF provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming. Similarly, cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs. In most cases, cuML’s Python API matches the API from scikit-learn.
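As a tiny illustration of that API parity (a toy example, not part of the clustering workflow below):

import pandas as pd
import cudf

# Toy dataframe to show that the cuDF API mirrors pandas
pdf = pd.DataFrame({"sequence_length": [39, 47, 407]})
gdf = cudf.from_pandas(pdf)             # copy the dataframe to GPU memory
print(gdf["sequence_length"].mean())    # same call as in pandas, executed on the GPU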

Note

The following cells containing Python code snippets can be copied directly into and executed in a Python environment, such as a Jupyter notebook running in a BioNeMo container.

!which python
/bin/python
%%time
import cudf                     # cuDF from NVIDIA RAPIDS for GPU dataframes (alternatively, pandas can be used for simple dataframe manipulations)
import cupy as cp               # To process arrays on the GPU
import cuml                     # We will use DBSCAN for clustering with unknown labels and cluster counts
import numpy as np              # NumPy for simple array manipulations
import matplotlib.pyplot as plt # Matplotlib for data visualization
import pandas as pd             # Simple dataframe manipulation
import torch                    # To deal with tensors
from cuml.manifold import TSNE as cumlTSNE
NOTE! Installing ujson may make loading annotations faster.
CPU times: user 3.76 s, sys: 4.61 s, total: 8.38 s
Wall time: 8.03 s

For this example, we will use a small dataset of 5k proteins that ships with the BioNeMo Framework and can be found at /examples/tests/test_data/uniref202104_esm2_qc.

%%time
import os

data_path = "/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc"
sequence_df = pd.read_csv(os.path.join(data_path, "uniref50_inference_example.csv"), sep=",", header=0)
CPU times: user 13.8 ms, sys: 0 ns, total: 13.8 ms
Wall time: 12.8 ms
sequence_df.head()
record_name sequence
0 UniRef50_A0A007 MGYIHTALKSAGFHHVIQVDTPALGLDSEGLRKLLADFEPDLVGVS...
1 UniRef50_A0A009DWD5 MGAKLTNHDPNKMALIESINYFNKYYQQDIKIYEAVLKP
2 UniRef50_A0A009DWJ5 MAMQHVTDRATRMIHHSGRGSQYCSELYQSALRHYGVCPSMTDGGDC
3 UniRef50_A0A009DY31 MEHAREQRVKRTQRDYSFAFKMMVVHEVEKGQITYKQAQAKYGNLSSI
4 UniRef50_A0A009DYA3 MISETCYRYQAKLSDDNTFIAEQLIELTEENPDWSFGLCFSYLQHV...

Let’s have a look at the sample dataset.

%%time
# Define the bin function
def bin_data(s, start, end, step):
    # Create the bins
    bins = list(range(start, end+1, step)) + [np.inf]

    # Create the labels
    labels = [f'{i+1}-{i+step}' for i in range(start, end, step)] + ['1001+']
    
    # Cut the Series
    s_cut = pd.cut(s, bins=bins, labels=labels, include_lowest=True, right=False)
    
    return s_cut

# Let's create a column to capture the sequence lengths  
sequence_df['sequence_length'] = [len(m) for m in sequence_df.sequence]  

# we'll work with the sequences and their lengths
s = sequence_df.sequence_length.to_numpy()

# Bin the data
s_binned = bin_data(s, start=0, end=1000, step=100)

# Count the frequency of each bin
s_counts = s_binned.value_counts()

# Generate colors from the viridis colormap
colors = plt.cm.viridis(np.linspace(0, 1, len(s_counts)))

# Create a 1x2 grid of subplots
fig, axs = plt.subplots(1, 2, figsize=(16, 6))

# Create the bar plot on the first subplot
axs[0].bar(s_counts.index, s_counts.values, color=colors)
axs[0].set_title('Sequence Lengths - bin size = 100')
axs[0].set_xlabel('Quantity of Residues')
axs[0].set_ylabel('Frequency')
axs[0].tick_params(axis='x', rotation=45)

# Create the pie chart on the second subplot
axs[1].pie(s_counts.values, labels=s_counts.index, autopct='%1.1f%%', colors=colors)
axs[1].set_title('Residues Lengths Relative to the Dataset')

# Display the plots
plt.tight_layout()
plt.show()
[Output figure: bar chart of sequence-length bins (bin size = 100) and pie chart of their relative proportions]
CPU times: user 894 ms, sys: 4.04 s, total: 4.93 s
Wall time: 208 ms

ESM-2nv limits#

The current limit for ESM-2nv is 1,023 residues, based on the NeMo checkpoints, so we will have to filter the dataset accordingly.
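As a quick, illustrative check (the actual filtering used in this example is shown in the Data Cleaning section below), you can count how many sequences exceed that limit:

MAX_RESIDUES = 1023  # current ESM-2nv checkpoint limit
n_too_long = int((sequence_df.sequence_length > MAX_RESIDUES).sum())
print(f"{n_too_long} of {len(sequence_df)} sequences exceed the {MAX_RESIDUES}-residue limit")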

Data Cleaning#

We can also ignore the test and validation sets for this example (since we’re not training a model).

In this example, we will use two groups of proteins:

  1. Up to 100 amino acids, which can include functional types such as

    • Regulatory Peptides: For example, insulin, with 51 amino acids.

    • Neuropeptides: Small protein-like molecules used by neurons to communicate with each other. Substance P, a neuropeptide involved in pain perception, contains 11 amino acids.

    • Antimicrobial Peptides: These are an essential part of the innate immune response and have antibiotic properties. An example is the defensins, which typically contain 18-45 amino acids.

    • Toxins: Many toxins produced by plants, animals, and bacteria are small peptides. For example, melittin, the main toxin in bee venom, has 26 amino acids.

    • Cytokines: Some small cytokines, often referred to as chemokines, have fewer than 100 amino acids. An example is Interleukin-8 (IL-8), a chemokine produced by macrophages, with 99 amino acids.

  2. Between 400 and 1,000 amino acids, with functional types such as

    • G-Protein Coupled Receptors (GPCRs): This is a large family of proteins that span the cell membrane and receive signals from outside the cell. GPCRs are involved in a vast number of physiological processes and they typically have around 350-1000 amino acids.

    • Ion Channels: Ion channels are a class of proteins that allow ions to pass through the membrane in response to a signal. The length of these proteins varies. Some are within the 400-1000 amino acid range, while others are up to 4,000 amino acids.

    • Cytochrome P450 Enzymes: These enzymes are involved in the metabolism of various molecules and chemicals within cells. They are typically around 400-500 amino acids long.

    • Serine/Threonine Protein Kinases: This is a large family of enzymes that catalyze the addition of a phosphate group to serine or threonine residues in other proteins. While the length varies significantly, many members of this family are within the 1,000 amino acid range.

    • GTPases: These are a family of hydrolase enzymes that can bind and hydrolyze guanosine triphosphate (GTP). The proteins are critical regulators of a wide variety of cellular processes. Smaller GTPase proteins are 400-500 amino acids in length, with larger GTPases having lengths around 800 amino acids.

%%time
sequence_df = sequence_df[(sequence_df.sequence_length < 100) | ((sequence_df.sequence_length > 400) &  (sequence_df.sequence_length < 1000))]
print("Max seq. length (sample dataset)", sequence_df.sequence_length.max())
print("Length of dataset", len(sequence_df))
Max seq. length (sample dataset) 997
Length of dataset 2323
CPU times: user 702 µs, sys: 1.44 ms, total: 2.14 ms
Wall time: 1.83 ms
sequence_df.head()
record_name sequence sequence_length
0 UniRef50_A0A007 MGYIHTALKSAGFHHVIQVDTPALGLDSEGLRKLLADFEPDLVGVS... 407
1 UniRef50_A0A009DWD5 MGAKLTNHDPNKMALIESINYFNKYYQQDIKIYEAVLKP 39
2 UniRef50_A0A009DWJ5 MAMQHVTDRATRMIHHSGRGSQYCSELYQSALRHYGVCPSMTDGGDC 47
3 UniRef50_A0A009DY31 MEHAREQRVKRTQRDYSFAFKMMVVHEVEKGQITYKQAQAKYGNLSSI 48
4 UniRef50_A0A009DYA3 MISETCYRYQAKLSDDNTFIAEQLIELTEENPDWSFGLCFSYLQHV... 50
%%time
# Define the bin function
def bin_data(s, start, end, step):
    # Create the bins
    bins = list(range(start, end+1, step)) + [np.inf]

    # Create the labels
    labels = [f'{i+1}-{i+step}' for i in range(start, end, step)] + ['1000+']
    
    # Cut the Series
    s_cut = pd.cut(s, bins=bins, labels=labels, include_lowest=True, right=False)
    
    return s_cut

# we'll work with the sequences and their lengths
s = sequence_df.sequence_length.to_numpy()

# Bin the data
s_binned = bin_data(s, start=0, end=1000, step=100)

# Count the frequency of each bin
s_counts = s_binned.value_counts()

# Generate colors from the viridis colormap
colors = plt.cm.viridis(np.linspace(0, 1, len(s_counts)))

# Create a 1x2 grid of subplots
fig, axs = plt.subplots(1, 2, figsize=(16, 6))

# Create the bar plot on the first subplot
axs[0].bar(s_counts.index, s_counts.values, color=colors)
axs[0].set_title('Sequence Lengths - bin size = 100')
axs[0].set_xlabel('Quantity of Residues')
axs[0].set_ylabel('Frequency')
axs[0].tick_params(axis='x', rotation=45)

# Create the pie chart on the second subplot
axs[1].pie(s_counts.values, labels=s_counts.index, autopct='%1.1f%%', colors=colors)
axs[1].set_title('Residues Lengths Relative to the Dataset')

# Display the plots
plt.tight_layout()
plt.show()
[Output figure: bar chart and pie chart of sequence-length bins after filtering]
CPU times: user 190 ms, sys: 150 ms, total: 340 ms
Wall time: 171 ms

Using the Inference Wrapper

We will obtain the embeddings for the desired sequences based on the pre-trained models we have previously downloaded.

To avoid memory issues, we’ll submit sequences individually.

In this example, we create embedding vectors to be clustered. We know the vector dimension from the ESM-2nv model specifications (1,280 for the 650M variant).

Building the Embeddings Matrix

%%time
# Initialize the inference wrapper that will invoke the pre-trained model
esm = ESMInferenceWrapper()

# Define the number of samples per amino acid group
samples = 500
# Set the hidden dimension of the ESM2nv-650M model
hidden_dim = 1280

# Create a balanced sample: pick positions of short (< 100) and long (> 400) sequences separately
short_positions = np.where(sequence_df.sequence_length.to_numpy() < 100)[0]
long_positions = np.where(sequence_df.sequence_length.to_numpy() > 400)[0]
indexes_100 = np.random.choice(short_positions, samples, replace=False)
indexes_400 = np.random.choice(long_positions, samples, replace=False)
balanced_samples = [*indexes_100, *indexes_400]

# Initialize an embeddings tensor
prot_embeddings = torch.empty(len(balanced_samples), hidden_dim, dtype=torch.float32).cuda()

# Build the embeddings tensor based on mean pooling, submitting one sequence at a time
sequences = sequence_df.sequence.to_list()
for row, i in enumerate(balanced_samples):
    # Obtain the embedding for a single sequence
    embeddings = esm.seq_to_embedding(sequences[i])
    prot_embeddings[row, :] = embeddings.mean(dim=0)
CPU times: user 15.4 s, sys: 1.57 s, total: 17 s
Wall time: 1min 15s
# Auxiliary matrix shape (to be used in clustering) must be (2 x samples, 1280)
prot_embeddings.shape
torch.Size([1000, 1280])
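If you want to reuse these embeddings later without re-running inference, you can optionally persist them to disk; the file name below is arbitrary:

# Optional: save the embedding matrix for later reuse (file name is arbitrary)
torch.save(prot_embeddings.cpu(), "uniref50_sample_embeddings.pt")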

DBSCAN

Script extracted from RAPIDS API tutorial.

More information available at: https://docs.rapids.ai/api/cuml/stable/api/#clustering

%%time
ary = cp.asarray(prot_embeddings)
prev_output_type = cuml.global_settings.output_type
cuml.set_global_output_type('cudf')
dbscan_float = cuml.DBSCAN(eps=.56, min_samples=50)
dbscan_float.fit(ary)
cuml.set_global_output_type(prev_output_type)
print("Num. Of Clusters: ", dbscan_float.labels_.nunique())

import matplotlib.ticker as ticker
labels = dbscan_float.labels_.values.get()

# Count the number of samples in each cluster
unique_labels, counts = np.unique(labels, return_counts=True)

# Plot the histogram
plt.figure(figsize=(10, 6))
plt.bar(unique_labels, counts, color=plt.cm.viridis(counts / float(max(counts))))
plt.xlabel('Cluster Labels')
plt.ylabel('Embedding Samples')
plt.title('Embeddings per Cluster')

# Ensure all labels are displayed on x-axis and they are integers
ax = plt.gca()
ax.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))
ax.set_xticks(unique_labels)

plt.show()
Num. Of Clusters:  2
[Output figure: histogram of embedding samples per DBSCAN cluster]
CPU times: user 273 ms, sys: 100 ms, total: 373 ms
Wall time: 291 ms
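DBSCAN is sensitive to its eps and min_samples hyperparameters. A purely illustrative sweep over a few arbitrary eps values, reusing the pattern above, can help gauge how stable the clustering is:

# Illustrative only: vary eps and observe how the number of labels changes
cuml.set_global_output_type('cudf')
for eps in (0.4, 0.5, 0.56, 0.6, 0.7):
    labels_eps = cuml.DBSCAN(eps=eps, min_samples=50).fit_predict(ary)
    print(f"eps={eps}: {labels_eps.nunique()} unique labels (label -1, if present, marks noise)")
cuml.set_global_output_type(prev_output_type)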

Using t-SNE to reduce dimensionality so that we can visualize the clusters of data

%%time
# Hyperparameters picked arbitrarily for this example
tsne = cumlTSNE(n_components=2, perplexity=6, learning_rate=100, random_state=42)
tsne_embedding = cudf.DataFrame(tsne.fit_transform(ary))

# Create scatterplot
colors = dbscan_float.labels_.values.get()
plt.scatter(tsne_embedding.iloc[:,0].to_numpy(), tsne_embedding.iloc[:,1].to_numpy(), c=colors)
# Add a colorbar
plt.colorbar(label='Color scale')
# Show plot
plt.show()
[Output figure: t-SNE scatter plot of the embeddings colored by DBSCAN cluster label]
CPU times: user 578 ms, sys: 135 ms, total: 714 ms
Wall time: 562 ms

The plot above is for illustration purposes only, based on a random subset of proteins, and should not be taken as representative of a full application.

Conclusion#

In this example, you saw how to run inference with ESM2-nv in BioNeMo and integrate it with classic clustering and dataframe manipulation tasks. We also walked through the modular process of setting up a BioNeMo Docker container, launching a gRPC inference client, and generating embeddings.