Pretrain, Fine-tune, and Perform Inference with DNABERT for Splice Site Prediction#

NOTE: This notebook was tested on a single NVIDIA RTX 5880 Ada Generation GPU using BioNeMo Framework v1.8, with an expected runtime of under 20 minutes.

Demo Objectives#

  1. Preprocess Data

    • Objective: Download and prepare genomic data (FASTA and GFF3 files) for training and evaluation.

    • Steps:

      • Download and process FASTA files.

      • Extract splice site information from GFF3 files.

      • Organize data into train, validation, and test sets.

  2. Pretrain DNABERT

    • Objective: Pretrain the DNABERT model on the processed genomic data.

    • Steps:

      • Load and preprocess genomic data.

      • Configure the DNABERT pretraining environment.

      • Execute pretraining and save the model checkpoint.

      • Implement further pretraining from a checkpoint.

  3. Fine-tune DNABERT for Splice Site Prediction

    • Objective: Fine-tune DNABERT for accurate splice site prediction.

    • Steps:

      • Prepare train, validation, and test datasets for the splice site prediction task.

      • Load the pretrained model.

      • Set up the fine-tuning environment.

      • Train on splice site data and evaluate performance.

Setup#

Ensure that you have read through the Getting Started section, can run the BioNeMo Framework docker container, and have configured the NGC Command Line Interface (CLI) within the container. It is assumed that this notebook is being executed from within the container.
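
If you want to quickly confirm the NGC CLI is configured inside the container, you can print the active configuration. This is an optional check, and it assumes the ngc binary is on your PATH.

# Optional: display the current NGC CLI configuration (org, team, API key status).
# Assumes the `ngc` binary is available on PATH inside the container.
! ngc config current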

NOTE: Some of the cells below can generate long text output. We're using:
%%capture --no-display --no-stderr cell_output
to suppress this output. Comment or delete this line in the cells below to restore full output.

Import and install all required packages#

import os
import warnings

from bionemo.data.preprocess.dna.preprocess import (
    GRCh38Ensembl99FastaResourcePreprocessor,
    GRCh38Ensembl99GFF3ResourcePreprocessor
)

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

Home Directory#

Set the home directory as follows:

bionemo_home = "/workspace/bionemo"
os.environ['BIONEMO_HOME'] = bionemo_home
os.chdir(bionemo_home)

Download Model Checkpoints#

The following code will download the pretrained model dnabert-86M.nemo from the NGC registry.

# Define the NGC CLI API KEY and ORG for the model download
# If these variables are not already set in the container, uncomment below
# to define and set with your API KEY and ORG

# api_key = <YOUR_API_KEY>
# ngc_cli_org = <YOUR_NGC_ORG>
# Update the environment variables 
# os.environ['NGC_CLI_API_KEY'] = api_key
# os.environ['NGC_CLI_ORG'] = ngc_cli_org
# Set variables and paths for model and checkpoint
model_name = "dnabert" 
model_version = "dnabert-86M" 
actual_checkpoint_name = "dnabert-86M.nemo" 
model_path = os.path.join(bionemo_home, 'models')
checkpoint_path = os.path.join(model_path, actual_checkpoint_name)
os.environ['MODEL_PATH'] = model_path
%%capture --no-display --no-stderr cell_output
if not os.path.exists(checkpoint_path):
    !cd {bionemo_home} && \
    python download_artifacts.py --model_dir models --models {model_name}
else:
    print(f"Model {model_name} already exists at {model_path}.")

A Small Note About Config Files and Model Options#

When working with different variants of DNABERT, such as the regular version or the xsmall version, it’s important to ensure that the configuration files used for pretraining, fine-tuning, and downstream tasks match the model architecture. If you pretrain with the regular DNABERT, you must use the corresponding configuration file for fine-tuning and other tasks; likewise, if you opt for the xsmall variant, adjust your configuration files accordingly. The configuration files for these models can be found in the examples/dna/dnabert/conf/ directory. Mismatched configurations can lead to errors, particularly size mismatches in model layers, because the architectures differ between variants. For the purposes of this tutorial, we use the dnabert_xsmall config, which makes the tutorial easily executable on a single GPU: the dnabert_xsmall model has only 8.1M parameters, compared to the regular DNABERT’s 86M.
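
If you want to double-check which architecture a config file describes before launching a run, you can load it directly. The snippet below is a minimal sketch: OmegaConf ships with NeMo, but the exact field names (model.hidden_size, model.num_layers) are assumptions based on typical NeMo BERT configs and may differ in your version.

from omegaconf import OmegaConf

# Hedged sketch: load the xsmall config and print its architecture fields.
# Field names are assumptions based on typical NeMo BERT configs.
cfg = OmegaConf.load(os.path.join(bionemo_home, "examples/dna/dnabert/conf/dnabert_xsmall.yaml"))
print("hidden size:", cfg.model.hidden_size)
print("num layers:", cfg.model.num_layers)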

1. Preprocessing data#

The dnabert-86M.nemo model was pretrained on the GRCh38 human genome assembly downloaded from NCBI. From this pretrained model, you can perform downstream tasks such as splice site prediction, which is demonstrated later in this tutorial.

The script below runs the necessary preprocessing steps before you begin training DNABERT on the GRCh38 human genome assembly: it downloads the genomic data, chunks the sequences, and organizes the data into structured FASTA/CSV files for easy access during training. The preprocessed data is stored in the specified directory, which is used in subsequent training and downstream tasks. The script also splits the preprocessed genome into training, validation, and test sets by chromosome: chromosomes 1 through 19 form the training set, chromosome 20 the validation set, and chromosome 21 the test set.

%%capture --no-display --no-stderr cell_output
processed_pretraining_data = os.path.join(bionemo_home, 'data', 'GRCh38.p13')
config_dir = os.path.join(bionemo_home, 'examples', 'dna', 'dnabert', 'conf')

# Run the preprocessing step
!cd {bionemo_home} && python examples/dna/dnabert/pretrain.py \
  --config-path={config_dir} \
  --config-name=dnabert_xsmall \
  ++do_training=False \
  ++model.data.dataset_path={processed_pretraining_data}
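
As a quick, optional sanity check (not part of the original tutorial), you can list the chunked FASTA files that preprocessing produced and confirm the per-chromosome split described above; the *.chunked.fa naming matches the file patterns used in the pretraining command in the next section.

import glob

# List the chunked FASTA files produced by preprocessing; the recursive glob
# covers both flat and nested output layouts.
for path in sorted(glob.glob(os.path.join(processed_pretraining_data, "**", "*.chunked.fa"), recursive=True)):
    print(path)
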

Now, once we have obtained our preprocessed sequences, we can continue with the pretraining step.

2. Pretrain DNABERT#

%%capture --no-display --no-stderr cell_output
! cd {bionemo_home} && python examples/dna/dnabert/pretrain.py \
  --config-path={config_dir} \
  --config-name=dnabert_xsmall \
  ++trainer.devices=1 \
  ++trainer.max_steps=1 \
  ++trainer.val_check_interval=1 \
  ++model.data.dataset_path={processed_pretraining_data} \
  ++model.data.dataset.train=chr\\[1..19\\].fna.gz.chunked.fa \
  ++model.data.dataset.val=chr20.fna.gz.chunked.fa \
  ++model.data.dataset.test=chr21.fna.gz.chunked.fa \
  ++exp_manager.create_wandb_logger=false
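
Pretraining writes checkpoints through NeMo's exp_manager. Before resuming in the next section, you can confirm where they landed with the hedged sketch below; the results root directory is an assumption based on BioNeMo defaults, so adjust the path if your exp_manager config writes elsewhere.

import glob

# Hedged sketch: locate checkpoints written by exp_manager. The "results"
# root directory is an assumption based on BioNeMo defaults.
ckpts = glob.glob(os.path.join(bionemo_home, "results", "**", "*.ckpt"), recursive=True)
print("\n".join(ckpts) if ckpts else "No checkpoints found yet.")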

Further pretraining from a checkpoint#

In this section, we explore how to resume training from a pre-existing checkpoint. The script below initiates further training from the latest checkpoint, so your model benefits from all previous training effort. To continue training, you must increase the max_steps value: the trainer retains the metadata from the previous pretraining run and will stop immediately if max_steps has already been reached.

%%capture --no-display --no-stderr cell_output
# Run the pretraining script with these paths
! cd {bionemo_home} && python examples/dna/dnabert/pretrain.py \
  --config-path={config_dir} \
  --config-name=dnabert_xsmall \
  ++do_training=True \
  ++trainer.devices=1 \
  ++trainer.max_steps=2 \
  ++trainer.val_check_interval=1 \
  ++model.data.dataset_path={processed_pretraining_data} \
  ++exp_manager.create_wandb_logger=false \
  ++exp_manager.resume_if_exists=True

3. Fine-tuning DNABERT: Splice site prediction task#

In this task, we use the DNABERT model to predict splice sites within the human genome. Splice sites are critical regions in the DNA sequence where introns are removed and exons are joined together during gene expression. We will work with a dataset prepared using the GRCh38Ensembl99FastaResourcePreprocessor, which provides the necessary .fa.gz files containing raw DNA sequences. These files are preprocessed to create training, validation, and test datasets tailored to the splice site prediction task.

Data Format for Splice Site Prediction Task#

For the splice site prediction task, the data is formatted in CSV files where each row corresponds to a genomic sequence centered around a potential splice site. The key columns in these CSV files include the following:

  • id: This column represents a unique identifier for each row in the dataset.

  • coord: This column specifies the exact coordinate or position on the chromosome where the splice site is located. The coordinate is typically the center of the sequence window that is being analyzed.

  • kind: This column indicates the type of splice site, where:

    • 0 = Donor site

    • 1 = Acceptor site

    • 2 = Negative example (non-splice site region)

  • transcript: This column contains detailed transcript information, including the identifier for the transcript and potentially the chromosome number.

The sequences themselves are extracted from FASTA files based on the coordinates provided in the CSV and are processed into k-mer representations suitable for input into DNABERT. Each sequence is typically centered on a candidate splice site, allowing the model to learn the characteristics that distinguish true splice sites from non-functional regions.
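
To make the k-mer representation concrete, here is a small illustrative sketch of overlapping k-mer extraction (the DNABERT publication uses k=6). This is plain Python for intuition only, not the tokenizer BioNeMo applies internally.

# Illustrative only: split a DNA sequence into overlapping k-mers, the input
# representation DNABERT expects. Not the BioNeMo tokenizer itself.
def to_kmers(seq: str, k: int = 6) -> list:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(to_kmers("ACGTACGTAC"))
# ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC']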

After running the preprocessor, the preprocessed data files are moved to their respective directories (train, val, test) so they are correctly organized for the subsequent steps in the pipeline. This ensures that the fine-tuning process has access to properly formatted and correctly placed data.

Preprocessors Overview#

The preprocess.py file contains multiple classes that handle different types of genomic data, depending on the task and dataset. These preprocessors are designed to handle the retrieval, preparation, and formatting of data required for various types of genomic analyses:

  1. GRCh38p13_ResourcePreprocessor:

    • Description: This preprocessor is tailored for the GRCh38.p13 human genome assembly, specifically designed to download all primary chromosomes from this version. It handles the preparation of a set of files, ensuring that each chromosome’s sequence is correctly retrieved and stored for further genomic analyses.

  2. Hg38chromResourcePreprocessor:

    • Description: This preprocessor is designed to download the hg38 chromosome sequences from the UCSC Genome Browser. It is closely tied to specific datasets and provides a structured way to obtain and prepare these sequences for downstream analyses.

  3. GRCh38Ensembl99FastaResourcePreprocessor:

    • Description: This preprocessor is intended for downloading and preparing the FASTA files for the GRCh38 Ensembl release 99. It focuses on retrieving the chromosome sequences in the .fa.gz format, ensuring they are correctly formatted for tasks like sequence analysis or prediction.

  4. GRCh38Ensembl99GFF3ResourcePreprocessor:

    • Description: This preprocessor is used for downloading GFF3 files from Ensembl release 99, which contain annotations and features required for splice site prediction tasks, as utilized in the DNABERT publication. It ensures that the correct genomic annotations are available for these analyses.

  5. DNABERTPreprocessorDataClass:

    • Description: This class provides a structured way to initialize and configure the DNABERT preprocessing pipeline. It includes necessary configurations like genome directory paths, tokenizer models, and dataset configuration, essential for setting up the DNABERT model’s preprocessing phase.

  6. CorePromoterResourcePreparer:

    • Description: This preprocessor focuses on downloading the necessary files for core promoter prediction. It is tightly coupled with specific datasets, such as those from the EPDnew database, to ensure the correct files are prepared for analyzing promoter regions in the genome.

  7. BasenjiDatasetPreprocessor:

    • Description: This preprocessor is responsible for downloading the Basenji2 dataset in its original TFRecord format. It then converts the dataset to WebDataset format and reorganizes metadata. This preprocessor is essential for tasks involving the Basenji2 dataset, particularly in genomic prediction models.

Each of these preprocessors is designed for a specific type of genomic data or task, ensuring that the data is correctly retrieved, prepared, and formatted for downstream analyses.

%%capture --no-display --no-stderr cell_output
# Setting paths
fasta_directory = os.path.join(bionemo_home, 'examples/dna/data/splice-site-prediction/GRCh38.ensembl.99')

# Instantiating the preprocessor
preprocessor = GRCh38Ensembl99FastaResourcePreprocessor(root_directory=bionemo_home, dest_directory=fasta_directory)

# Running the preprocessor to download and prepare the dataset
downloaded_files = preprocessor.prepare()

# Output paths for reference
print("Downloaded Files:")
for file in downloaded_files:
    print(file)

Now, we need to use the GRCh38Ensembl99GFF3ResourcePreprocessor to generate the train, test, and val .csv files that follow the aforementioned format for the task.

%%capture --no-display --no-stderr cell_output
finetuning_dataset_dir = os.path.join(bionemo_home, 'examples/dna/data/splice-site-prediction/finetuning_data')

# Ensuring the target directory exists
os.makedirs(finetuning_dataset_dir, exist_ok=True)

# Instantiating the GFF3 preprocessor
gff3_preprocessor = GRCh38Ensembl99GFF3ResourcePreprocessor(
    root_directory=bionemo_home, 
    dest_directory=finetuning_dataset_dir  
)

# Run the preprocessor to download and prepare the dataset (train, val, test CSV files)
csv_files = gff3_preprocessor.prepare()

print("Generated CSV Files:")
for file in csv_files:
    print(file)
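
Optionally, you can peek at the generated CSVs to confirm they match the format described earlier (id, coord, kind, transcript). This sketch assumes pandas is available in the container.

import pandas as pd

# Load the generated training CSV and inspect the columns and label balance.
train_df = pd.read_csv(os.path.join(finetuning_dataset_dir, 'train.csv'))
print(train_df.head())
print(train_df['kind'].value_counts())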

Running splice site prediction task#

Below, we set up the file paths for the command line arguments. We set do_prediction=True in the command line arguments to get a .txt file of predictions that can be used for evaluation purposes.

train_file = os.path.join(finetuning_dataset_dir, 'train.csv')
val_file = os.path.join(finetuning_dataset_dir, 'val.csv')
test_file = os.path.join(finetuning_dataset_dir, 'test.csv')

# Print to verify the paths
print("Config Directory:", config_dir)
print("Dataset Directory:", finetuning_dataset_dir)
print("Pretrained Model Path:", checkpoint_path)
print("Fasta Directory:", fasta_directory)
Config Directory: /workspace/bionemo/examples/dna/dnabert/conf
Dataset Directory: /workspace/bionemo/examples/dna/data/splice-site-prediction/finetuning_data
Pretrained Model Path: /workspace/bionemo/models/dnabert-86M.nemo
Fasta Directory: /workspace/bionemo/examples/dna/data/splice-site-prediction/GRCh38.ensembl.99
%%capture --no-display --no-stderr cell_output
! cd {bionemo_home} && python examples/dna/dnabert/downstream_splice_site.py \
  --config-path={config_dir} \
  --config-name=dnabert_config_splice_site \
  ++do_training=True \
  ++do_prediction=True \
  ++trainer.devices=1 \
  ++trainer.max_steps=3 \
  ++trainer.max_epochs=1 \
  ++trainer.val_check_interval=1 \
  ++model.encoder_frozen=False \
  ++model.data.dataset_path={finetuning_dataset_dir} \
  ++model.data.train_file={train_file} \
  ++model.data.val_file={val_file} \
  ++model.data.predict_file={test_file} \
  ++model.restore_encoder_path={checkpoint_path} \
  ++exp_manager.create_wandb_logger=false \
  ++exp_manager.resume_if_exists=False 

In this tutorial, we demonstrated how to fine-tune the DNABERT model for splice site prediction. We covered data preprocessing, model pretraining, and fine-tuning on splice site data. Finally, we generated predictions on a test dataset, which are saved in results/nemo_experiments/dnabert-splicesite/dnabert-splicesite. These predictions can be analyzed further or used as a foundation for additional evaluations, depending on the needs of your project.
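
If you want to inspect those predictions programmatically, the hedged sketch below globs for .txt outputs under the results directory named above, since the exact filename can vary between versions.

import glob

# Hedged sketch: locate the predictions .txt written by downstream_splice_site.py.
results_dir = os.path.join(bionemo_home, 'results/nemo_experiments/dnabert-splicesite')
for pred_file in glob.glob(os.path.join(results_dir, '**', '*.txt'), recursive=True):
    print(pred_file)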