Fine-Tune ESM-2nv on FLIP Data for Sequence-Level Classification, Regression, Token-Level Classification, and with LoRA Adapters#

NOTE: This notebook has been tested on both an A1000 GPU and an A100, and is compatible with BioNeMo Framework versions 1.6, 1.7, and 1.8. The expected runtime is approximately 2 hours on the A1000 and 10 minutes on the A100. Both tests were performed with the esm2nv-650M model.

Demo Objectives#

Downstream Head Fine-Tuning

  • Objective: Fine-tune ESM-2nv models with an additional prediction head to predict protein properties from sequence.

  • Steps: Download and preprocess the data using the existing BioNeMo preprocessing scripts, then use the existing downstream prediction head training scripts for sequence-level classification, sequence-level regression, token-level classification, and LoRA-based fine-tuning.

For these purposes, we will use the Fitness Landscape Inference for Proteins (FLIP) evaluation dataset. The FLIP datasets are used to evaluate the performance of protein language models on five specific downstream tasks related to proteins. These tasks include secondary structure prediction, conservation analysis, subcellular localization, meltome analysis, and GB1 activity measurement.

Setup#

Ensure that you have read through the Getting Started section, can run the BioNeMo Framework Docker container, and have configured the NGC Command Line Interface (CLI) within the container. It is assumed that this notebook is being executed from within the container.

NOTE: Some of the cells below generate long text output. We're using
%%capture --no-display --no-stderr cell_output
to suppress this output. Comment or delete this line in the cells below to restore full output.

You can use this notebook for both ESM-2nv and ESM-1nv (except for LoRA) by making minor code changes.

Import and Install All Required Packages#

import os
import pandas as pd
import warnings
from bionemo.data import FLIPPreprocess

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

Home Directory#

bionemo_home = os.environ.get('BIONEMO_HOME', '/workspace/bionemo')
os.chdir(bionemo_home)

Download Model Checkpoints#

The following code will download the pretrained model esm2nv_650M_converted.nemo from the NGC registry.

In BioNeMo FW, there are numerous ESM models available, including ESM-1nv, ESM-2nv 8M with randomly initialized weights, ESM-2nv fine-tuned to secondary structure downstream prediction tasks with LoRA, ESM-2nv 650M, and ESM-2nv 3B. We also have a configuration file for training ESM-2nv 15B available at examples/protein/esm2nv/conf/pretrain_esm2_15B.yaml if needed.

For demo purposes, we have chosen to showcase the ESM-2nv 650M model. For more details on ESM-1nv or ESM-2nv, consult the corresponding model cards. To find the model and checkpoint names, see the artifacts_paths.yaml file.
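If you want to inspect the available artifact names programmatically, a minimal sketch like the following can be used. It assumes the artifacts file lives in the repository root (bionemo_home) under the name mentioned above; adjust the path if your framework version stores it elsewhere.

import yaml

# Hedged sketch: the location and structure of the artifacts file may differ between releases.
artifacts_file = os.path.join(bionemo_home, "artifacts_paths.yaml")

if os.path.exists(artifacts_file):
    with open(artifacts_file) as f:
        artifacts = yaml.safe_load(f)
    # Top-level keys typically correspond to downloadable models/checkpoints.
    print(list(artifacts.keys()))
else:
    print(f"Could not find {artifacts_file}; check the documentation for its location.")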

# Define the NGC CLI API KEY and ORG for the model download
# If these variables are not already set in the container, uncomment below
# to define and set with your API KEY and ORG
#api_key = <your_api_key>
#ngc_cli_org = <ngc_cli_org>
# Update the environment variable
#os.environ['NGC_CLI_API_KEY'] = api_key
#os.environ['NGC_CLI_ORG'] = ngc_cli_org

# Set variables and paths for model and checkpoint
model_name = "esm2nv" # change to esm1 for ESM1
model_version = "esm2nv_650m" # change to esm1nv for ESM1
actual_checkpoint_name = "esm2nv_650M_converted.nemo" #  change to esm1nv.nemo for ESM1
model_path = os.path.join(bionemo_home, 'models')
checkpoint_path = os.path.join(model_path, actual_checkpoint_name)
os.environ['MODEL_PATH'] = model_path
%%capture --no-display --no-stderr cell_output
if not os.path.exists(checkpoint_path):
    !cd {bionemo_home} && \
    python download_artifacts.py --model_dir models --models {model_version}
else:
    print(f"Model {model_version} already exists at {model_path}.")

Download the FLIP Data and Preprocess It#

The code below uses the FLIP preprocessing method to download and preprocess the public FLIP data into a BioNeMo-compatible format. It will create a folder data/FLIP with subdirectories containing the data.

In this demo, we are going to predict various properties of protein sequences:

  1. The protein’s subcellular localization (scl).

  2. The melting temperature of a protein (meltome).

  3. The secondary structure of an amino acid (secondary_structure).

preprocessor = FLIPPreprocess()
for task in ["scl", "meltome", "secondary_structure"]:
    task_dir = f'{bionemo_home}/data/FLIP/{task}'
    preprocessor.prepare_dataset(task_name=task, output_dir=task_dir)
[NeMo I 2024-08-28 13:27:30 flip_preprocess:114] mixed_soft.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:30 flip_preprocess:237] FLIP data download complete.
[NeMo I 2024-08-28 13:27:30 flip_preprocess:239] Processing FLIP dataset.
[NeMo I 2024-08-28 13:27:30 flip_preprocess:245] Writing processed dataset files to /workspace/bionemo/data/FLIP/scl...
[NeMo I 2024-08-28 13:27:30 flip_preprocess:159] Saving train split...
[NeMo I 2024-08-28 13:27:30 flip_preprocess:159] Saving val split...
[NeMo I 2024-08-28 13:27:30 flip_preprocess:159] Saving test split...
[NeMo I 2024-08-28 13:27:30 flip_preprocess:257] FLIP dataset preprocessing completed
[NeMo I 2024-08-28 13:27:32 flip_preprocess:114] mixed_split.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:32 flip_preprocess:237] FLIP data download complete.
[NeMo I 2024-08-28 13:27:32 flip_preprocess:239] Processing FLIP dataset.
[NeMo I 2024-08-28 13:27:32 flip_preprocess:245] Writing processed dataset files to /workspace/bionemo/data/FLIP/meltome...
[NeMo I 2024-08-28 13:27:33 flip_preprocess:159] Saving train split...
[NeMo I 2024-08-28 13:27:33 flip_preprocess:159] Saving val split...
[NeMo I 2024-08-28 13:27:33 flip_preprocess:159] Saving test split...
[NeMo I 2024-08-28 13:27:33 flip_preprocess:257] FLIP dataset preprocessing completed
[NeMo I 2024-08-28 13:27:34 flip_preprocess:114] sequences.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:35 flip_preprocess:114] sampled.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:36 flip_preprocess:114] resolved.fasta downloaded successfully!
[NeMo I 2024-08-28 13:27:36 flip_preprocess:237] FLIP data download complete.
[NeMo I 2024-08-28 13:27:36 flip_preprocess:239] Processing FLIP dataset.
[NeMo I 2024-08-28 13:27:36 flip_preprocess:245] Writing processed dataset files to /workspace/bionemo/data/FLIP/secondary_structure...
[NeMo I 2024-08-28 13:27:36 flip_preprocess:159] Saving train split...
[NeMo I 2024-08-28 13:27:36 flip_preprocess:159] Saving val split...
[NeMo I 2024-08-28 13:27:36 flip_preprocess:159] Saving test split...
[NeMo I 2024-08-28 13:27:36 flip_preprocess:257] FLIP dataset preprocessing completed

For demo purposes, we will subsample the datasets to enable faster execution of the notebook.

def subsample_data(path: str, subsample_type: str, column: str = None, subset_fraction: float = 0.05, random_seed: int = 0) -> None:
    """
    Subsamples the CSV dataset at `path` in place. `subsample_type` is one of 'class' (stratified by a
    categorical column), 'continuous' (stratified by quantile bins of a numeric column), or 'random'.
    """
    # Load the data
    df = pd.read_csv(path)

    # Function to sample a subset of the data by class
    def sample_by_class(df, fraction):
        return df.groupby(column, group_keys=False).apply(lambda x: x.sample(frac=fraction, random_state=random_seed))
    
    # Function to sample a subset of the data by continuous variable
    def sample_by_continuous(df, fraction):
        stratify_bins = pd.qcut(df[column], q=10, duplicates='drop')
        return df.groupby(stratify_bins, group_keys=False).apply(lambda x: x.sample(frac=fraction, random_state=random_seed))
    
    # Perform the appropriate subsampling based on the specified type
    if subsample_type == 'class':
        subset = sample_by_class(df, subset_fraction)
    elif subsample_type == 'continuous':
        subset = sample_by_continuous(df, subset_fraction)
    elif subsample_type == 'random':
        subset = df.sample(frac=subset_fraction, random_state=random_seed)
    else:
        raise ValueError("Invalid subsample_type. Choose from 'class', 'continuous', or 'random'.")

    # Save the subset to the original CSV file
    subset.to_csv(path, index=False)
    
    # Print statement to confirm subsampling
    print(f"File has been subsampled and saved: {path}")

# Example usage
for data_set in ["train", "val", "test"]:
    subsample_data(path=f"data/FLIP/scl/{data_set}/x000.csv", subsample_type='class', column='scl_label', random_seed=42)
    subsample_data(path=f"data/FLIP/meltome/{data_set}/x000.csv", subsample_type='continuous', column='target', random_seed=42)
    subsample_data(path=f"data/FLIP/secondary_structure/{data_set}/x000.csv", subsample_type='random', random_seed=42)
File has been subsampled and saved: data/FLIP/scl/train/x000.csv
File has been subsampled and saved: data/FLIP/meltome/train/x000.csv
File has been subsampled and saved: data/FLIP/secondary_structure/train/x000.csv
File has been subsampled and saved: data/FLIP/scl/val/x000.csv
File has been subsampled and saved: data/FLIP/meltome/val/x000.csv
File has been subsampled and saved: data/FLIP/secondary_structure/val/x000.csv
File has been subsampled and saved: data/FLIP/scl/test/x000.csv
File has been subsampled and saved: data/FLIP/meltome/test/x000.csv
File has been subsampled and saved: data/FLIP/secondary_structure/test/x000.csv

Fine-tuning#

The BioNeMo framework supports easy fine-tuning on downstream tasks by loading the pretrained model, which can be frozen or unfrozen, and adding a task-specific head. BioNeMo also provides example config files for downstream task fine-tuning of ESM-2nv and ESM-1nv on some FLIP tasks.

A pretrained ESM model can be provided via a path to a NeMo checkpoint (restore_encoder_path). This can be done by:

  • Adding model.restore_encoder_path: to the config YAML file, or

  • Passing model.restore_encoder_path as a command-line argument to your script, as illustrated below.
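For example, the command-line form is a single Hydra override pointing at the .nemo checkpoint downloaded above. The snippet below only builds and prints the override string for illustration; it does not launch training.

# Illustrative only: the Hydra override that points a downstream run at the pretrained encoder.
# Append it to the downstream_flip.py commands used later in this notebook.
restore_override = f"++model.restore_encoder_path={checkpoint_path}"
print(restore_override)  # e.g. ++model.restore_encoder_path=/workspace/bionemo/models/esm2nv_650M_converted.nemo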

Method 1: Standard Fine-Tuning for Sequence Level Classification Tasks#

In this example, we will predict the 10 subcellular localization sites of proteins as described in the FLIP dataset. Under the data/FLIP/scl folder, you will see the correct expected structure for BioNeMo:

data/path/
    train/
        x000.csv
    val/
        x000.csv
    test/
        x000.csv

By inspecting the file, you will see three columns:

  • id: The sequence ID.

  • sequence: The protein sequence.

  • scl_label: The corresponding subcellular localization class of the sequence (the target column for this task).

The CSV files should be named exactly x000.csv. You can provide multiple such files by specifying a range in the config file. For instance, if you have 50 CSV files, you can set the value to x[000..049] to use files x000.csv through x049.csv.

To run this downstream task, we have included an example downstream_flip_scl configuration file. For your own custom downstream tasks, you can create your own YAML file or override existing ones with Hydra by specifying the following fields:

  • restore_from_path: Set to the path of the pretrained model checkpoint .nemo file.

  • trainer.devices, trainer.num_nodes: Set to the number of GPUs and nodes, respectively.

  • trainer.max_epochs: Set to the number of epochs you want to train.

  • trainer.val_check_interval: Set to the number of training steps between validation runs.

  • model.micro_batch_size: Set to the micro batch size for training.

  • data.task_name: Can be anything.

  • data.task_type: The current options are token-level-classification, classification (sequence level), and regression (sequence level).

  • preprocessed_data_path: Set to the path of the parent folder of dataset_path (see dataset_path below for how it is used).

  • dataset_path: Set to the folder that contains train/val/test folders.

  • dataset.train, dataset.val, dataset.test: Set to the CSV name or ranges.

  • sequence_column: Set to the name of the column containing the sequence, e.g. sequence in this example.

  • target_column: Set to the name of the column containing the target, e.g. scl_label in this example.

  • target_size: Number of classes in each label for classification.

  • num_classes: Set to target_size.

  • encoder_frozen: Sets whether the encoder is frozen (True by default) or trainable.

This task uses CrossEntropyLoss and adds an MLPModel task head with a ReLU activation function, LayerNorm, and Dropout of 0.25, as specified in the ../model/core/mlp_model.py file; a rough sketch of such a head is shown below.
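Purely as an illustration of the shape of such a head (not the BioNeMo MLPModel itself, and with made-up hyperparameters), a comparable PyTorch module could look like this:

import torch.nn as nn

# Rough sketch of a sequence-level classification head of the kind described above.
# Hidden size and layer ordering are illustrative, not the exact BioNeMo implementation.
class SketchMLPHead(nn.Module):
    def __init__(self, embedding_dim: int, num_classes: int, hidden_dim: int = 256, dropout: float = 0.25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, pooled_embedding):
        # pooled_embedding: (batch_size, embedding_dim) sequence-level ESM-2nv representation
        return self.net(pooled_embedding)  # logits consumed by CrossEntropyLoss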

For the purpose of this demo example, we will shorten the time required for training by setting the following parameters: ++trainer.max_steps=1, ++trainer.val_check_interval=1, ++trainer.limit_val_batches=1, and ++trainer.limit_test_batches=1, reducing the number of batches for validation and testing to 1. Users can update these parameters by editing the .yaml config file or by overriding config arguments at runtime using Hydra, as shown in the example below.

configuration_folder = os.path.join(bionemo_home, f'examples/protein/{model_name}/conf')
scl_df = pd.read_csv(f'{bionemo_home}/data/FLIP/scl/train/x000.csv')
scl_df.head()
id sequence scl_label
0 Sequence867 MQGSKGVENPAFVPSSPDTPRRASASPSQVEVSAVASRNQNGGSQP... Cell_membrane
1 Sequence439 MNVSHASVHPVEDPPAAATEVENPPRVRMDDMEGMPGTLLGLALRF... Cell_membrane
2 Sequence342 MKMASSLAFLLLNFHVSLFLVQLLTPCSAQFSVLGPSGPILAMVGE... Cell_membrane
3 Sequence735 MENPPNETEAKQIQTNEGKKTKGGIITMPFIIANEAFEKVASYGLL... Cell_membrane
4 Sequence784 MKSFNTEGHNHSTAESGDAYTVSDPTKNVDEDGREKRTGTWLTASA... Cell_membrane
%%capture --no-display --no-stderr cell_output
!cd {bionemo_home} && python examples/protein/downstream/downstream_flip.py \
    --config-path={configuration_folder} \
    --config-name=downstream_flip_scl \
    name={model_name}-finetuned-scl \
    ++trainer.devices=1 \
    ++trainer.max_epochs=1 \
    ++trainer.val_check_interval=1 \
    ++trainer.limit_test_batches=1 \
    ++trainer.limit_val_batches=1 \
    ++model.micro_batch_size=1 \
    ++trainer.max_steps=1 \
    ++exp_manager.create_wandb_logger=false

Method 2: Standard Fine-Tuning for Sequence-Level Regression Tasks#

In this example, we will predict the melting temperature of proteins as described in the FLIP dataset.

As before, we need our files to be in the correct format with the appropriate naming. Thanks to the preprocessing steps we carried out at the beginning of this notebook, the data is already in the right format. Again, we have a custom downstream_flip_meltome.yaml configuration file with all the correct settings.

Inside it, you should pay attention to:

  • loss_func: This time it is MSELoss.

  • task_name: meltome.

  • sequence_column: The name of the column where the protein sequence is located.

  • target_column: This is the target column in our file, which is called target.

  • target_sizes: The number of outputs per target; here it is 1, as this is a regression task (a minimal illustration follows below).
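Conceptually, the only change relative to the classification setup above is a single-output head trained with MSELoss. Below is a small, self-contained sketch with made-up shapes; the 1280 hidden size corresponds to ESM-2 650M.

import torch
import torch.nn as nn

# Sketch: single-output regression head trained with MSELoss (illustrative values only).
embedding_dim = 1280                              # hidden size of ESM-2 650M
head = nn.Linear(embedding_dim, 1)
loss_fn = nn.MSELoss()

dummy_embeddings = torch.randn(4, embedding_dim)  # batch of pooled sequence embeddings
dummy_targets = torch.randn(4, 1)                 # e.g. melting temperatures
loss = loss_fn(head(dummy_embeddings), dummy_targets)
print(loss.item())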

meltome_df = pd.read_csv(f'{bionemo_home}/data/FLIP/meltome/train/x000.csv')
meltome_df.head()
id sequence target
0 Sequence10771 MSWPTLTVRLQQKVIRYLDYESRCNLRICSKDDKDSVDSVKFNPKT... 35.898600
1 Sequence3319 MRLVKQEYVLDGLDCSNCARKIENGVKGIKGINGCAVNFAASTLTV... 38.732746
2 Sequence3630 MSSFDRRIEAACKFDDERYYKQYHRYFDVLAQVHSVVETINGAQML... 39.144778
3 Sequence8135 MDAEDGFDPTLLKKKKKKKTTFDLDAALGLEDDTKKEDPQDEASAE... 40.476706
4 Sequence3437 MSYYNKRNQEPLPKEDVSTWECTKEDCNGWTRKNFASSDTPLCPLC... 36.754142
%%capture --no-display --no-stderr cell_output
!cd {bionemo_home} && python examples/protein/downstream/downstream_flip.py \
    --config-path={configuration_folder} \
    --config-name=downstream_flip_meltome \
    name={model_name}-finetuned-meltome \
    ++trainer.devices=1 \
    ++trainer.max_epochs=1 \
    ++trainer.val_check_interval=1 \
    ++model.micro_batch_size=1 \
    ++trainer.max_steps=1 \
    ++exp_manager.create_wandb_logger=false

Method 3: Standard Fine-Tuning for Token-Level Classification Tasks#

In this example, we will predict the three-state secondary structure of proteins, as described in the FLIP dataset. For each amino acid in the sequence, the model predicts whether it is found in a helix, sheet, or coil structure.

  • For the target column (e.g. 3state), use a sequence of the same length as the protein sequence, where each character represents a class (e.g. C for coil, H for helix, E for sheet).

  • You can also apply a mask column. For example, the resolved column uses a sequence of 1s and 0s of the same length as the protein sequence, where 1 = experimentally resolved and 0 = not resolved.

  • The loss is only calculated for the resolved positions, as specified under the mask_column.

  • The loss_fn no longer needs to be set, as it is pre-built under build_loss_fn in /model/protein/downstream/protein_model_finetuning.py. The PerTokenMaskedCrossEntropyLoss used in this function is further defined in /model/core/cnn.py.

You can have multiple target columns in the same dataset by setting them as a list under the target_column, for instance for this task you can have:

  • target_column: ["3state", "8state"]

  • target_size: [3, 8]

  • mask_column: ["resolved", "resolved"]

In doing so, the loss will be calculated for both columns.

You can remove the masking by setting mask_column: [null].

In this instance, as we are doing a token-level classification task, we will attach a ConvNet head based on ../bionemo/model/core/cnn.py, which uses the PerTokenMaskedCrossEntropyLoss class as the loss function and ReLU as the activation function. A minimal illustration of the masked per-token loss is sketched below.
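The following is an illustrative sketch only (not the BioNeMo PerTokenMaskedCrossEntropyLoss implementation): it computes a cross-entropy per token and averages it over the positions marked as resolved.

import torch
import torch.nn.functional as F

# Sketch: per-token cross-entropy that only counts positions where the mask (e.g. 'resolved') is 1.
# Shapes and values are illustrative only.
logits = torch.randn(2, 7, 3)                  # (batch, sequence_length, num_classes), e.g. 3-state
targets = torch.randint(0, 3, (2, 7))          # per-token class indices
mask = torch.tensor([[1, 1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 1, 0]])   # 1 = experimentally resolved, 0 = not resolved

per_token_loss = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (batch, seq_len)
masked_loss = (per_token_loss * mask).sum() / mask.sum()  # average over resolved positions only
print(masked_loss.item())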

secondary_structure_df = pd.read_csv(f'{bionemo_home}/data/FLIP/secondary_structure/train/x000.csv')
secondary_structure_df.head()
id sequence 3state resolved
0 5kar-A DRHHHHHHKLQLGRFWHISDLHLDPNYTVSKDPLQVCPSAGSQPVL... CCCCCCCCCCCCEEEEEECCCCECCCCCCCCCCCCCCHHHCCCCCC... 0000000000011111111111111111111111111111111111...
1 1j2j-B NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK CCCCCCHHHHHHHHHHHCCCCHHHHHHHHHHHHHHHHHHHCCCCC 001111111111111111111111111111111111111111100
2 5x6s-B SGSLQQVTDFGDNPTNVGMYIYVPNNLASNPGIVVAIHYCTGTGPG... CCEEEEECCCCCCCCCCEEEEEECCCCCCCCCEEEEECCCCCCHHH... 0111111111111111111111111111111111111111111111...
3 5jrc-C MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGRGSMLPNLDNLKEEY... CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHH... 0000000000000000000000000000000001111111111111...
4 2rjo-A MSLGQTTLACSFRSLTNPYYTAFNKGAQSFAKSVGLPYVPLTTEGS... CCCCCCEEEEEECCCCCHHHHHHHHHHHHHHHHHCCCEEEEECCCC... 0011111111111111111111111111111111111111111111...
%%capture --no-display --no-stderr cell_output
!cd {bionemo_home} && python examples/protein/downstream/downstream_flip.py \
    --config-path={configuration_folder} \
    --config-name=downstream_flip_sec_str \
    name={model_name}-finetuned-sec-str \
    ++trainer.devices=1 \
    ++trainer.max_epochs=1 \
    ++trainer.val_check_interval=1 \
    ++model.micro_batch_size=1 \
    ++trainer.max_steps=1 \
    ++exp_manager.create_wandb_logger=false

Method 4: LoRA Fine-Tuning for Token-Level Classification Task#

In this example, we will replicate the 3-state secondary structure fine-tuning, this time using LoRA adapters.

Low-Rank Adaptation (LoRA) is a parameter-efficient strategy designed to adapt large pretrained language models to downstream tasks while avoiding challenges associated with full fine-tuning. Unlike traditional fine-tuning approaches that adjust all parameters within a pretrained model, LoRA maintains the core weights of the pretrained model as frozen. Instead, it introduces trainable rank decomposition matrices, known as LoRA adapters, into each layer of the Transformer architecture. These adapters are smaller matrices that approximate the original weight matrices, thereby reducing the number of trainable parameters.
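As a rough, self-contained illustration of the idea (not the NeMo/BioNeMo adapter implementation), a LoRA-adapted linear layer keeps the pretrained weight frozen and learns a low-rank update B @ A; the xavier/zero initializations below mirror the column_init_method and row_init_method options listed further down.

import torch
import torch.nn as nn

# Sketch of a LoRA-adapted linear layer: frozen base weight plus a trainable low-rank update B @ A.
# Illustrative only; hyperparameters and structure do not match the NeMo/BioNeMo implementation.
class SketchLoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 32, dropout: float = 0.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # pretrained weights stay frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.empty(rank, in_features))   # xavier-initialized factor
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # zero init => no change at start
        nn.init.xavier_uniform_(self.lora_a)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Frozen path plus low-rank update; only lora_a and lora_b receive gradients.
        return self.base(x) + self.dropout(x) @ (self.lora_b @ self.lora_a).T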

In the context of antibody sequences, where data availability may be limited, LoRA offers several advantages. By focusing on adapting these smaller adapter matrices rather than the entire model, LoRA makes fine-tuning more efficient and less susceptible to overfitting. This is particularly beneficial for tasks requiring adaptation to specific protein sequences, where preserving the learned features of the pretrained ESM-2nv model is crucial.

By integrating LoRA into BioNeMo’s fine-tuning pipeline for ESM-2nv models, you can leverage the robustness of pretrained models while tailoring them to the unique characteristics of antibody sequences. This extension not only enhances model performance but also ensures adaptability and efficiency in handling specialized protein sequence data.

Key Adjustments in the YAML File:

  • model.peft.enabled=True: Enables the PEFT (Parameter-Efficient Fine-Tuning) technique, specifically using LoRA (lora).

  • model.peft.peft_scheme="lora": Specifies that LoRA is used as the adaptation method.

  • ++model.peft.lora_tuning.adapter_dim=32: Sets the dimensionality of the adapter layers used in LoRA.

  • ++model.peft.lora_tuning.adapter_dropout=0.0: Specifies the dropout rate for the adapter layers in LoRA.

  • ++model.peft.lora_tuning.column_init_method="xavier": Defines the initialization method for the column weights of the adapter layers in LoRA.

  • ++model.peft.lora_tuning.row_init_method="zero": Specifies the initialization method for the row weights of the adapter layers in LoRA.

  • ++model.peft.lora_tuning.layer_selection=null: Determines which layers to apply LoRA adapters to. If null, adapters are applied to all layers.

  • ++model.peft.lora_tuning.weight_tying=False: Specifies whether weight tying is used in LoRA.

  • ++model.peft.lora_tuning.position_embedding_strategy=null: Used only when weight_tying is True. Specifies the strategy for position embeddings in LoRA.

NOTE: LoRA is currently not supported for ESM-1nv.

To apply LoRA to ESM-1nv, you would need to follow these instructions and reimplement the equivalent of the ESM2nvLoRAModel class in the bionemo/model/protein/esm1nv/esm1nv_model.py script.

%%capture --no-display --no-stderr cell_output
!cd {bionemo_home} && python examples/protein/downstream/downstream_flip.py \
    --config-path={configuration_folder} \
    --config-name=downstream_sec_str_LORA \
    name={model_name}-finetuned-sec-str_LORA \
    ++trainer.devices=1 \
    ++trainer.max_epochs=1 \
    ++trainer.val_check_interval=1 \
    ++model.micro_batch_size=1 \
    ++trainer.max_steps=1 \
    ++exp_manager.create_wandb_logger=false \
    ++exp_manager.resume_if_exists=false

In this demo, we have learned how to use the existing preprocessing script for the FLIP dataset, perform fine-tuning for different downstream tasks, and apply LoRA adapters on the FLIP data. In a separate notebook, you will see how to bring your own data for fine-tuning purposes.