Fine-Tune ESM-2nv on FLIP Data for Sequence-Level Classification, Regression, Token-Level Classification, and with LoRA Adapters#
Demo Objectives#
Downstream Head Fine-Tuning
- Objective: Use fine-tuned ESM-2nv models with an additional prediction head to predict protein properties.
- Steps: Collect and preprocess the data using the existing BioNeMo scripts, then use the existing downstream prediction-head training scripts for sequence-level classification, sequence-level regression, token-level classification, and fine-tuning with LoRA adapters.
For these purposes, we will use the Fitness Landscape Inference for Proteins (FLIP) evaluation dataset. The FLIP datasets are used to evaluate the performance of protein language models on five specific downstream tasks related to proteins. These tasks include secondary structure prediction, conservation analysis, subcellular localization, meltome analysis, and GB1 activity measurement.
Setup#
Ensure that you have read through the Getting Started section, can run the BioNeMo Framework Docker container, and have configured the NGC Command Line Interface (CLI) within the container. It is assumed that this notebook is being executed from within the container.
Some cells below use `%%capture --no-display --no-stderr cell_output` to suppress long output. Comment out or delete that line in those cells to restore the full output.
You can use this notebook for both ESM-2nv and ESM-1nv (except for LoRA) by making minor code changes.
Import and Install All Required Packages#
import os
import pandas as pd
import warnings
from bionemo.data import FLIPPreprocess
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
Home Directory#
# Use BIONEMO_HOME if set in the container; otherwise fall back to the default path
bionemo_home = os.environ.get('BIONEMO_HOME', '/workspace/bionemo')
os.chdir(bionemo_home)
Download Model Checkpoints#
The following code will download the pretrained model `esm2nv_650M_converted.nemo` from the NGC registry.
In BioNeMo FW, there are numerous ESM models available, including ESM-1nv, ESM-2nv 8M with randomly initialized weights, ESM-2nv fine-tuned with LoRA for the secondary structure downstream prediction task, ESM-2nv 650M, and ESM-2nv 3B. We also have a configuration file for training ESM-2nv 15B available at `examples/protein/esm2nv/conf/pretrain_esm2_15B.yaml` if needed.
For demo purposes, we have chosen to showcase the ESM-2nv 650M model. For more details on ESM-1nv or ESM-2nv, consult the corresponding model cards. To find the model names and checkpoint names, see the `artifacts_paths.yaml` file.
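If you want to list the registered artifact names programmatically, the short snippet below is one way to do it. It is not part of the original notebook: it assumes `artifacts_paths.yaml` sits at the repository root inside the container and that PyYAML is available; adjust the path if your copy lives elsewhere.
import yaml

artifacts_file = os.path.join(bionemo_home, 'artifacts_paths.yaml')  # assumed location; adjust if needed
if os.path.exists(artifacts_file):
    with open(artifacts_file) as f:
        artifacts = yaml.safe_load(f)
    print(list(artifacts)[:10])  # preview the first few artifact entries
else:
    print(f"{artifacts_file} not found; check the documentation for its location.")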
# Define the NGC CLI API KEY and ORG for the model download
# If these variables are not already set in the container, uncomment below
# to define and set with your API KEY and ORG
#api_key = <your_api_key>
#ngc_cli_org = <ngc_cli_org>
# Update the environment variable
#os.environ['NGC_CLI_API_KEY'] = api_key
#os.environ['NGC_CLI_ORG'] = ngc_cli_org
# Set variables and paths for model and checkpoint
model_name = "esm2nv" # change to esm1 for ESM1
model_version = "esm2nv_650m" # change to esm1nv for ESM1
actual_checkpoint_name = "esm2nv_650M_converted.nemo" # change to esm1nv.nemo for ESM1
model_path = os.path.join(bionemo_home, 'models')
checkpoint_path = os.path.join(model_path, actual_checkpoint_name)
os.environ['MODEL_PATH'] = model_path
%%capture --no-display --no-stderr cell_output
if not os.path.exists(checkpoint_path):
    !cd /workspace/bionemo && \
    python download_artifacts.py --model_dir models --models {model_version}
else:
    print(f"Model {model_version} already exists at {model_path}.")
Download the FLIP Data and Preprocess It#
The code below uses the FLIP preprocessing method to download and preprocess the public FLIP data into a BioNeMo-compatible format. It will create a `data/FLIP` folder with subdirectories containing the data.
In this demo, we are going to predict various properties of protein sequences:
- The protein’s subcellular localization (`scl`).
- The melting temperature of a protein (`meltome`).
- The secondary structure of an amino acid (`secondary_structure`).
preprocessor = FLIPPreprocess()
for task in ["scl", "meltome", "secondary_structure"]:
    task_dir = f'{bionemo_home}/data/FLIP/{task}'
    preprocessor.prepare_dataset(task_name=task, output_dir=task_dir)
[NeMo I 2024-08-28 13:27:30 flip_preprocess:114] mixed_soft.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:30 flip_preprocess:237] FLIP data download complete.
[NeMo I 2024-08-28 13:27:30 flip_preprocess:239] Processing FLIP dataset.
[NeMo I 2024-08-28 13:27:30 flip_preprocess:245] Writing processed dataset files to /workspace/bionemo/data/FLIP/scl...
[NeMo I 2024-08-28 13:27:30 flip_preprocess:159] Saving train split...
[NeMo I 2024-08-28 13:27:30 flip_preprocess:159] Saving val split...
[NeMo I 2024-08-28 13:27:30 flip_preprocess:159] Saving test split...
[NeMo I 2024-08-28 13:27:30 flip_preprocess:257] FLIP dataset preprocessing completed
[NeMo I 2024-08-28 13:27:32 flip_preprocess:114] mixed_split.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:32 flip_preprocess:237] FLIP data download complete.
[NeMo I 2024-08-28 13:27:32 flip_preprocess:239] Processing FLIP dataset.
[NeMo I 2024-08-28 13:27:32 flip_preprocess:245] Writing processed dataset files to /workspace/bionemo/data/FLIP/meltome...
[NeMo I 2024-08-28 13:27:33 flip_preprocess:159] Saving train split...
[NeMo I 2024-08-28 13:27:33 flip_preprocess:159] Saving val split...
[NeMo I 2024-08-28 13:27:33 flip_preprocess:159] Saving test split...
[NeMo I 2024-08-28 13:27:33 flip_preprocess:257] FLIP dataset preprocessing completed
[NeMo I 2024-08-28 13:27:34 flip_preprocess:114] sequences.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:35 flip_preprocess:114] sampled.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:36 flip_preprocess:114] resolved.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:36 flip_preprocess:237] FLIP data download complete.
[NeMo I 2024-08-28 13:27:36 flip_preprocess:239] Processing FLIP dataset.
[NeMo I 2024-08-28 13:27:36 flip_preprocess:245] Writing processed dataset files to /workspace/bionemo/data/FLIP/secondary_structure...
[NeMo I 2024-08-28 13:27:36 flip_preprocess:159] Saving train split...
[NeMo I 2024-08-28 13:27:36 flip_preprocess:159] Saving val split...
[NeMo I 2024-08-28 13:27:36 flip_preprocess:159] Saving test split...
[NeMo I 2024-08-28 13:27:36 flip_preprocess:257] FLIP dataset preprocessing completed
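As a quick, optional check (not part of the original notebook), you can verify that each task now has the train/val/test layout that the downstream scripts expect:
# Confirm the expected train/val/test folders and CSV shards were created for each task
for task in ["scl", "meltome", "secondary_structure"]:
    for split in ["train", "val", "test"]:
        split_dir = os.path.join(bionemo_home, "data", "FLIP", task, split)
        files = sorted(os.listdir(split_dir)) if os.path.isdir(split_dir) else "missing"
        print(f"{task}/{split}: {files}")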
For demo purposes, we will subsample the datasets to enable faster execution of the notebook.
def subsample_data(path: str, subsample_type: str, column: str = None, subset_fraction: float = 0.05, random_seed: int = 0) -> None:
    """
    Subsamples the dataset based on the specified subsample type.
    """
    # Load the data
    df = pd.read_csv(path)

    # Function to sample a subset of the data by class
    def sample_by_class(df, fraction):
        return df.groupby(column, group_keys=False).apply(lambda x: x.sample(frac=fraction, random_state=random_seed))

    # Function to sample a subset of the data by continuous variable
    def sample_by_continuous(df, fraction):
        stratify_bins = pd.qcut(df[column], q=10, duplicates='drop')
        return df.groupby(stratify_bins, group_keys=False).apply(lambda x: x.sample(frac=fraction, random_state=random_seed))

    # Perform the appropriate subsampling based on the specified type
    if subsample_type == 'class':
        subset = sample_by_class(df, subset_fraction)
    elif subsample_type == 'continuous':
        subset = sample_by_continuous(df, subset_fraction)
    elif subsample_type == 'random':
        subset = df.sample(frac=subset_fraction, random_state=random_seed)
    else:
        raise ValueError("Invalid subsample_type. Choose from 'class', 'continuous', or 'random'.")

    # Save the subset to the original CSV file
    subset.to_csv(path, index=False)

    # Print statement to confirm subsampling
    print(f"File has been subsampled and saved: {path}")
# Example usage
for data_set in ["train", "val", "test"]:
subsample_data(path=f"data/FLIP/scl/{data_set}/x000.csv", subsample_type='class', column='scl_label', random_seed=42)
subsample_data(path=f"data/FLIP/meltome/{data_set}/x000.csv", subsample_type='continuous', column='target', random_seed=42)
subsample_data(path=f"data/FLIP/secondary_structure/{data_set}/x000.csv", subsample_type='random', random_seed=42)
File has been subsampled and saved: data/FLIP/scl/train/x000.csv
File has been subsampled and saved: data/FLIP/meltome/train/x000.csv
File has been subsampled and saved: data/FLIP/secondary_structure/train/x000.csv
File has been subsampled and saved: data/FLIP/scl/val/x000.csv
File has been subsampled and saved: data/FLIP/meltome/val/x000.csv
File has been subsampled and saved: data/FLIP/secondary_structure/val/x000.csv
File has been subsampled and saved: data/FLIP/scl/test/x000.csv
File has been subsampled and saved: data/FLIP/meltome/test/x000.csv
File has been subsampled and saved: data/FLIP/secondary_structure/test/x000.csv
Fine-tuning#
The BioNeMo framework supports easy fine-tuning on downstream tasks by loading the pretrained model, which can be frozen or unfrozen, and adding a task-specific head. BioNeMo also provides example config files for downstream task fine-tuning of ESM-2nv and ESM-1nv on some FLIP tasks.
A pretrained ESM model can be provided as a path to a NeMo checkpoint (via `restore_encoder_path`). This is done by either:
- Adding `model.restore_encoder_path` to the config YAML, or
- Passing `model.restore_encoder_path` as a command-line argument to your script.
Both options are sketched below.
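The snippet below is illustrative only: the checkpoint path is an example, and it uses OmegaConf (the configuration library behind Hydra/NeMo configs), which is assumed to be available in the container.
from omegaconf import OmegaConf

# Example checkpoint path (adjust to your own .nemo file)
pretrained_nemo = os.path.join(bionemo_home, "models", "esm2nv_650M_converted.nemo")

# Option 1: add the key to the YAML config; this prints the equivalent YAML fragment
yaml_fragment = OmegaConf.create({"model": {"restore_encoder_path": pretrained_nemo}})
print(OmegaConf.to_yaml(yaml_fragment))

# Option 2: pass it as a Hydra override on the command line, e.g.
#   python examples/protein/downstream/downstream_flip.py ++model.restore_encoder_path=/path/to/esm2nv_650M_converted.nemo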
Method 1: Standard Fine-Tuning for Sequence-Level Classification Tasks#
In this example, we will predict the 10 subcellular localization sites of proteins as described in the FLIP dataset. Under the `data/FLIP/scl` folder, you will see the structure BioNeMo expects:
data/path/
    train/
        x000.csv
    val/
        x000.csv
    test/
        x000.csv
By inspecting the file, you will see three columns:
- `id`: The sequence ID.
- `sequence`: The protein sequence.
- `scl_label`: The corresponding class of the sequence (the target column for this task).
The CSV files should be named exactly `x000.csv`. You can provide a list of such files by specifying a range in the config file. For instance, if you have 50 CSV files, you can specify `x[000..049]` to use the files named `x000.csv` through `x049.csv`.
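If you shard your own data into several CSVs, a quick way (not in the original notebook) to see which shard files a split currently contains is:
import glob

# List the CSV shards present for the SCL training split; after preprocessing there is only x000.csv
print(sorted(glob.glob(os.path.join(bionemo_home, "data/FLIP/scl/train/x*.csv"))))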
To run this downstream task, we have included an example `downstream_flip_scl` configuration file. For your own custom downstream tasks, you can create your own YAML file or override existing ones with Hydra by specifying the following fields:
- `restore_from_path`: Set to the path of the pretrained model checkpoint `.nemo` file.
- `trainer.devices`, `trainer.num_nodes`: Set to the number of GPUs and nodes, respectively.
- `trainer.max_epochs`: Set to the number of epochs you want to train.
- `trainer.val_check_interval`: Set to the number of steps between validation runs.
- `model.micro_batch_size`: Set to the micro batch size for training.
- `data.task_name`: Can be anything.
- `data.task_type`: The current options are `token-level-classification`, `classification` (sequence level), and `regression` (sequence level).
- `preprocessed_data_path`: Set to the path of the parent folder of `dataset_path`. See `dataset_path` below for how this variable is used.
- `dataset_path`: Set to the folder that contains the `train`/`val`/`test` folders.
- `dataset.train`, `dataset.val`, `dataset.test`: Set to the CSV file name or range of file names.
- `sequence_column`: Set to the name of the column containing the sequence, e.g. `sequence` in this example.
- `target_column`: Set to the name of the column containing the target, e.g. `scl_label` in this example.
- `target_size`: The number of classes in each label for classification (a quick way to count them is shown after this list).
- `num_classes`: Set to `target_size`.
- `encoder_frozen`: Used to set the encoder trainable or frozen; `True` by default.
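When filling in `target_size` and `num_classes`, it can help to count the label classes directly from the training CSV. This small check is not part of the original notebook:
# Count the subcellular-localization classes present in the training split
scl_train = pd.read_csv(os.path.join(bionemo_home, "data/FLIP/scl/train/x000.csv"))
print(scl_train["scl_label"].nunique(), "classes:", sorted(scl_train["scl_label"].unique()))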
This task will use `CrossEntropyLoss` and add an `MLPModel` task head with `ReLU` as the activation function, `LayerNorm`, and `Dropout` set to `0.25`, as specified in the `../model/core/mlp_model.py` file.
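For readers who want a mental model of that head, the following is a conceptual sketch only; the layer sizes, names, and ordering are assumptions, not the actual `mlp_model.py` implementation. It maps pooled sequence embeddings to class logits with ReLU, LayerNorm, and Dropout(0.25).
import torch.nn as nn

class SimpleMLPHead(nn.Module):
    """Illustrative sequence-level classification head; not BioNeMo's MLPModel."""
    def __init__(self, hidden_size: int, num_classes: int, dropout: float = 0.25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, embeddings):
        # embeddings: [batch, hidden_size] pooled sequence representations from the encoder
        return self.net(embeddings)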
For the purpose of this demo, we shorten training by setting `++trainer.max_steps=1` and `++val_check_interval=1`, plus `++limit_val_batches=1` and `++limit_test_batches=1` to reduce the number of batches used for validation and testing to 1. You can update these parameters by editing the `.yaml` config file or by overriding config arguments at runtime with Hydra, as shown in the example below.
configuration_folder = os.path.join(bionemo_home, f'examples/protein/{model_name}/conf')
scl_df = pd.read_csv(f'{bionemo_home}/data/FLIP/scl/train/x000.csv')
scl_df.head()
|   | id | sequence | scl_label |
|---|---|---|---|
| 0 | Sequence867 | MQGSKGVENPAFVPSSPDTPRRASASPSQVEVSAVASRNQNGGSQP... | Cell_membrane |
| 1 | Sequence439 | MNVSHASVHPVEDPPAAATEVENPPRVRMDDMEGMPGTLLGLALRF... | Cell_membrane |
| 2 | Sequence342 | MKMASSLAFLLLNFHVSLFLVQLLTPCSAQFSVLGPSGPILAMVGE... | Cell_membrane |
| 3 | Sequence735 | MENPPNETEAKQIQTNEGKKTKGGIITMPFIIANEAFEKVASYGLL... | Cell_membrane |
| 4 | Sequence784 | MKSFNTEGHNHSTAESGDAYTVSDPTKNVDEDGREKRTGTWLTASA... | Cell_membrane |
%%capture --no-display --no-stderr cell_output
!cd {bionemo_home} && python examples/protein/downstream/downstream_flip.py \
--config-path={configuration_folder} \
--config-name=downstream_flip_scl \
name={model_name}-finetuned-scl \
++trainer.devices=1 \
++trainer.max_epochs=1 \
++trainer.val_check_interval=1 \
++trainer.limit_test_batches=1 \
++trainer.limit_val_batches=1 \
++model.micro_batch_size=1 \
++trainer.max_steps=1 \
++exp_manager.create_wandb_logger=false
Method 2: Standard Fine-Tuning for Sequence-Level Regression Tasks#
In this example, we will predict the melting temperature of proteins as described in the FLIP dataset.
As before, we need our files to be in the correct format with the appropriate naming. Thanks to the preprocessing steps we carried out at the beginning of this notebook, the data is already in the right format. Again, we have a custom `downstream_flip_meltome.yaml` configuration file with all the correct settings.
Inside it, you should pay attention to:
- `loss_func`: This time it is `MSELoss`.
- `task_name`: `meltome`.
- `sequence_column`: The name of the column where the protein sequence is located.
- `target_column`: The target column in our file, which is called `target`.
- `target_sizes`: The number of classes in each label; in this case it is 1, as this is a regression task.
meltome_df = pd.read_csv(f'{bionemo_home}/data/FLIP/meltome/train/x000.csv')
meltome_df.head()
|   | id | sequence | target |
|---|---|---|---|
| 0 | Sequence10771 | MSWPTLTVRLQQKVIRYLDYESRCNLRICSKDDKDSVDSVKFNPKT... | 35.898600 |
| 1 | Sequence3319 | MRLVKQEYVLDGLDCSNCARKIENGVKGIKGINGCAVNFAASTLTV... | 38.732746 |
| 2 | Sequence3630 | MSSFDRRIEAACKFDDERYYKQYHRYFDVLAQVHSVVETINGAQML... | 39.144778 |
| 3 | Sequence8135 | MDAEDGFDPTLLKKKKKKKTTFDLDAALGLEDDTKKEDPQDEASAE... | 40.476706 |
| 4 | Sequence3437 | MSYYNKRNQEPLPKEDVSTWECTKEDCNGWTRKNFASSDTPLCPLC... | 36.754142 |
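Since this is a regression task, the label is a continuous melting temperature. An optional, quick look (not in the original notebook) at its range in the subsampled training split:
# Summary statistics of the regression target
print(meltome_df["target"].describe())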
%%capture --no-display --no-stderr cell_output
!cd {bionemo_home} && python examples/protein/downstream/downstream_flip.py \
--config-path={configuration_folder} \
--config-name=downstream_flip_meltome \
name={model_name}-finetuned-meltome \
++trainer.devices=1 \
++trainer.max_epochs=1 \
++trainer.val_check_interval=1 \
++model.micro_batch_size=1 \
++trainer.max_steps=1 \
++exp_manager.create_wandb_logger=false
Method 3: Standard Fine-Tuning for Token-Level Classification Tasks#
In this example, we will predict three structure states of proteins, as described in the FLIP dataset. For each amino acid in the sequence, the model predicts whether it is found in a helix, sheet, or coil structure.
- For the target column (e.g. `3state`), use a sequence of the same length as the protein sequence. Each character in the sequence represents a class (e.g. `C` for coil, `H` for helix, `E` for sheet).
- You can also apply a mask column. For example, the `resolved` column uses a sequence of 1s and 0s that is the same length as the protein sequence: 1 = experimentally resolved, 0 = not resolved.
- The loss is only calculated for the resolved positions, as specified under `mask_column`.
- The `loss_fn` no longer needs to be set, as it is pre-built in `/model/protein/downstream/protein_model_finetuning.py` under `build_loss_fn`. The `PerTokenMaskedCrossEntropyLoss` used in this function is further defined in `/model/core/cnn.py`.
You can have multiple target columns in the same dataset by setting them as lists under `target_column`. For instance, for this task you could have:
- `target_column`: ["3state", "8state"]
- `target_size`: [3, 8]
- `mask_column`: ["resolved", "resolved"]
In doing so, the loss will be calculated for both columns.
You can remove the masking by setting `mask_column: [null]`.
In this instance, as we are doing a token-level classification task, we will attach a `ConvNet` head based on `../bionemo/model/core/cnn.py`, which uses the `PerTokenMaskedCrossEntropyLoss` class as the loss function with `ReLU` as the activation function.
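To make the masking behavior concrete, here is a conceptual sketch of a masked per-token cross-entropy (illustrative only, not the actual `PerTokenMaskedCrossEntropyLoss` implementation): the loss is computed for every token and then averaged only over positions whose mask is 1.
import torch
import torch.nn.functional as F

def masked_token_ce(logits, targets, mask):
    """logits: [batch, seq_len, num_classes]; targets: [batch, seq_len]; mask: [batch, seq_len] of 0/1."""
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # [batch, seq_len]
    mask = mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)  # average over resolved positions only

# Toy example: batch of 2 sequences, length 5, 3 secondary-structure classes
logits = torch.randn(2, 5, 3)
targets = torch.randint(0, 3, (2, 5))
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
print(masked_token_ce(logits, targets, mask))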
secondary_structure_df = pd.read_csv(f'{bionemo_home}/data/FLIP/secondary_structure/train/x000.csv')
secondary_structure_df.head()
|   | id | sequence | 3state | resolved |
|---|---|---|---|---|
| 0 | 5kar-A | DRHHHHHHKLQLGRFWHISDLHLDPNYTVSKDPLQVCPSAGSQPVL... | CCCCCCCCCCCCEEEEEECCCCECCCCCCCCCCCCCCHHHCCCCCC... | 0000000000011111111111111111111111111111111111... |
| 1 | 1j2j-B | NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK | CCCCCCHHHHHHHHHHHCCCCHHHHHHHHHHHHHHHHHHHCCCCC | 001111111111111111111111111111111111111111100 |
| 2 | 5x6s-B | SGSLQQVTDFGDNPTNVGMYIYVPNNLASNPGIVVAIHYCTGTGPG... | CCEEEEECCCCCCCCCCEEEEEECCCCCCCCCEEEEECCCCCCHHH... | 0111111111111111111111111111111111111111111111... |
| 3 | 5jrc-C | MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGRGSMLPNLDNLKEEY... | CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHH... | 0000000000000000000000000000000001111111111111... |
| 4 | 2rjo-A | MSLGQTTLACSFRSLTNPYYTAFNKGAQSFAKSVGLPYVPLTTEGS... | CCCCCCEEEEEECCCCCHHHHHHHHHHHHHHHHHCCCEEEEECCCC... | 0011111111111111111111111111111111111111111111... |
%%capture --no-display --no-stderr cell_output
!cd {bionemo_home} && python examples/protein/downstream/downstream_flip.py \
--config-path={configuration_folder} \
--config-name=downstream_flip_sec_str \
name={model_name}-finetuned-sec-str \
++trainer.devices=1 \
++trainer.max_epochs=1 \
++trainer.val_check_interval=1 \
++model.micro_batch_size=1 \
++trainer.max_steps=1 \
++exp_manager.create_wandb_logger=false
Method 4: LoRA Fine-Tuning for Token-Level Classification Task#
In this example, we will replicate the fine-tuning using the 3-state structure of proteins with LoRA adapters.
Low-Rank Adaptation (LoRA) is a parameter-efficient strategy designed to adapt large pretrained language models to downstream tasks while avoiding challenges associated with full fine-tuning. Unlike traditional fine-tuning approaches that adjust all parameters within a pretrained model, LoRA maintains the core weights of the pretrained model as frozen. Instead, it introduces trainable rank decomposition matrices, known as LoRA adapters, into each layer of the Transformer architecture. These adapters are smaller matrices that approximate the original weight matrices, thereby reducing the number of trainable parameters.
In the context of antibody sequences, where data availability may be limited, LoRA offers several advantages. By focusing on adapting these smaller adapter matrices rather than the entire model, LoRA makes fine-tuning more efficient and less susceptible to overfitting. This is particularly beneficial for tasks requiring adaptation to specific protein sequences, where preserving the learned features of the pretrained ESM-2nv model is crucial.
By integrating LoRA into BioNeMo’s fine-tuning pipeline for ESM-2nv models, you can leverage the robustness of pretrained models while tailoring them to the unique characteristics of antibody sequences. This extension not only enhances model performance but also ensures adaptability and efficiency in handling specialized protein sequence data.
Key Adjustments in the YAML File:
- `model.peft.enabled=True`: Enables the PEFT (Parameter-Efficient Fine-Tuning) technique, specifically using LoRA (`lora`).
- `model.peft.peft_scheme="lora"`: Specifies that LoRA is used as the adaptation method.
- `++model.peft.lora_tuning.adapter_dim=32`: Sets the dimensionality of the adapter layers used in LoRA.
- `++model.peft.lora_tuning.adapter_dropout=0.0`: Specifies the dropout rate for the adapter layers in LoRA.
- `++model.peft.lora_tuning.column_init_method="xavier"`: Defines the initialization method for the column weights of the adapter layers in LoRA.
- `++model.peft.lora_tuning.row_init_method="zero"`: Specifies the initialization method for the row weights of the adapter layers in LoRA.
- `++model.peft.lora_tuning.layer_selection=null`: Determines which layers to apply LoRA adapters to. If `null`, adapters are applied to all layers.
- `++model.peft.lora_tuning.weight_tying=False`: Specifies whether weight tying is used in LoRA.
- `++model.peft.lora_tuning.position_embedding_strategy=null`: Used only when `weight_tying` is `True`; specifies the strategy for position embeddings in LoRA.
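To illustrate what these settings control, here is a conceptual sketch of a single LoRA-adapted linear layer. It is an illustration only, not NeMo's or BioNeMo's implementation: the pretrained weight is frozen, and a trainable low-rank update of rank `adapter_dim` is added, with xavier/zero initialization mirroring the column/row init methods above.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA adapter around a frozen linear layer."""
    def __init__(self, in_features: int, out_features: int, rank: int = 32, dropout: float = 0.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.empty(rank, in_features))   # "column" factor, xavier init
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # "row" factor, zero init, so no change at step 0
        nn.init.xavier_uniform_(self.lora_a)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank update B @ A
        return self.base(x) + self.dropout(x) @ (self.lora_b @ self.lora_a).T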
By following these instructions and reimplementing the `ESM2nvLoRAModel` class in the `bionemo/model/protein/esm1nv/esm1nv_model.py` script for ESM-1nv, you can also perform LoRA fine-tuning with ESM-1nv.
%%capture --no-display --no-stderr cell_output
!cd {bionemo_home} && python examples/protein/downstream/downstream_flip.py \
--config-path={configuration_folder} \
--config-name=downstream_sec_str_LORA \
name={model_name}-finetuned-sec-str_LORA \
++trainer.devices=1 \
++trainer.max_epochs=1 \
++trainer.val_check_interval=1 \
++model.micro_batch_size=1 \
++trainer.max_steps=1 \
++exp_manager.create_wandb_logger=false \
++exp_manager.resume_if_exists=false
In this demo, we have learned how to use the existing preprocessing script for the FLIP dataset, perform fine-tuning for different downstream tasks, and apply LoRA adapters on the FLIP data. In another notebook, you will see how to bring your own data for fine-tuning purposes.