Adding the OAS Dataset: Downloading and Preprocessing#

Adding a new dataset to BioNeMo is a common task. This tutorial will show the developer how to accomplish this objective. The Observed Antibody Space (OAS) dataset will be used for this example. The OAS dataset is a database of antibody sequences containing over one billion sequences from 80 different studies for use in large scale analysis.

The task of adding a new dataset can be broken into three development steps, each of which can make use of associated base and helper classes in BioNeMo and NeMo. In this series, the dataset will be added to the ESM1-nv pre-training pipeline as follows:

  1. Preprocessing includes download of the raw data and any additional preparation steps, such as extracting the files. It also includes dividing the data into train, validation, and test splits. The preprocessing step can make use of two BioNeMo base classes, RemoteResource and ResourcePreprocessor, from bionemo.utils.remote and bionemo.data.preprocess, respectively. Their use is optional but they provide some basic functionality which can accelerate development. This step is covered by the current tutorial.

  2. Development of the new dataset class. Here, the NeMo dataset class CSVMemMapDataset will be used. This task will be covered by the next tutorial, Modifying the Dataset Class.

  3. Modification of the dataloader classes. This task will be covered by the third tutorial, Adding a Custom Dataloader.

Accessing OAS Dataset#

The paired sequence subset of the data will be used for this tutorial. The tutorial requires a shell script containing the URLs of the appropriate files. This script cannot be downloaded directly from the website; it must be generated from the paired sequences search page by selecting “search” without choosing any attributes, as described here.

The full dataset currently contains links to 158 sequence files. This tutorial will use a subset of the data – the first ten files. The contents of the file are shown below and, if preferred, can be copied directly instead of downloading from OAS. Save this file to $BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_paired_subset_download.sh. The contents of oas_paired_subset_download.sh should look like this:

wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz

Downloading and Verifying Data#

The RemoteResource class is used to create the download location (if it does not already exist), download a file, and verify its checksum. If the dataset contains multiple files (as is the case with OAS data), then multiple RemoteResources will be used. In practice, this class is rarely interacted with directly. Instead, it is usually called as part of the second class, ResourcePreprocessor. ResourcePreprocessor will be used as the base class for creation of the OAS preprocessing class.
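
The three responsibilities described above (create the location, download, verify) can be illustrated with a standard-library sketch. This is not the RemoteResource implementation: the function name fetch and its behavior are assumptions made for illustration, and a local file:// URL is used so the example runs without network access.

```python
import hashlib
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path
from typing import Optional


def fetch(url: str, root_directory: str, dest_directory: str,
          dest_filename: str, checksum: Optional[str] = None) -> str:
    """Hypothetical helper: create the destination, download, verify the MD5."""
    target_dir = os.path.join(root_directory, dest_directory)
    os.makedirs(target_dir, exist_ok=True)  # create the location only if needed
    target = os.path.join(target_dir, dest_filename)

    # Download only when the file is not already present.
    if not os.path.exists(target):
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)

    # Without an expected checksum there is nothing to compare against.
    if checksum is not None:
        with open(target, "rb") as fh:
            actual = hashlib.md5(fh.read()).hexdigest()
        if actual != checksum:
            raise ValueError(f"Checksum mismatch for {target}: {actual}")
    return target


# Demonstrate with a local file:// URL standing in for a remote file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    source.write_bytes(b"example payload")
    expected = hashlib.md5(b"example payload").hexdigest()
    fetched = fetch(source.as_uri(), tmp, "raw", "copy.txt", checksum=expected)
    fetched_bytes = Path(fetched).read_bytes()
```

The argument names mirror the keyword arguments passed to RemoteResource later in this tutorial; everything else is a stand-in.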

The creation of the OAS preprocessing class will require the implementation of two methods:

  1. get_remote_resources, which creates a RemoteResource for each file, so that each file can be downloaded and its checksum verified; and

  2. prepare, which performs any preprocessing on the data and splits into train, val, and test datasets.

Data Preprocessing Class#

First, let’s create the functionality to download the files. In the same directory as oas_paired_subset_download.sh, create a file called oas_preprocess.py. The path will be $BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_preprocess.py. In this file, create a class based on ResourcePreprocessor that parses the URLs in the download script, returns a RemoteResource for each of the URLs, and downloads the files referenced by the URLs.

Here is an example of such a class for oas_preprocess.py. This class saves the downloaded files to /data/OASpaired/raw.

from bionemo.data.preprocess import ResourcePreprocessor
from bionemo.utils.remote import RemoteResource
from nemo.utils import logging
from dataclasses import dataclass
from typing import List
import re
import os

__all__ = ['OASPairedPreprocess']

# BIONEMO_HOME = os.getenv('BIONEMO_HOME', '/workspace/bionemo')
BIONEMO_HOME = '/workspace/bionemo' # FIXME
OAS_DOWNLOAD_LINKS_PATH = f'{BIONEMO_HOME}/bionemo/data/preprocess/protein/oas_paired_subset_download.sh'

@dataclass
class OASPairedPreprocess(ResourcePreprocessor):
    """OASPairedPreprocess to download and preprocess OAS paired antibody heavy chain data."""
    root_directory: str = '/data/OASpaired'
    dest_directory: str = 'raw'

    def get_remote_resources(self, download_script_path:str = OAS_DOWNLOAD_LINKS_PATH) -> List[RemoteResource]:
        """Download and verify each file from the file at the provided download script path."""
        
        # Load the download script and parse the urls
        with open(download_script_path, 'r') as fh:
            # The URL is the last whitespace-separated token on each non-blank line
            url_list = [re.split(r'\s+', x.strip())[-1] for x in fh.readlines() if x.strip()]
            logging.info(f"The following URLs were parsed: {url_list}")
        
        # Checksums will be added later
        checksums = {'SRR11528761_paired.csv.gz': None, 
                     'SRR11528762_paired.csv.gz': None, 
                     'SRR10358523_paired.csv.gz': None, 
                     'SRR10358524_paired.csv.gz': None, 
                     'SRR10358525_paired.csv.gz': None, 
                     'SRR9179273_paired.csv.gz':  None, 
                     'SRR9179274_paired.csv.gz':  None, 
                     'SRR9179275_paired.csv.gz':  None, 
                     'SRR9179276_paired.csv.gz':  None, 
                     'SRR9179277_paired.csv.gz':  None}

        resources = list()
        for url in url_list:
            filename = os.path.basename(url)
            resource = RemoteResource(
                dest_directory=self.dest_directory,
                dest_filename=filename,
                root_directory=self.root_directory,
                checksum=checksums.get(filename),
                url=url
            )
            resources.append(resource)
            
        return resources

    def prepare(self):
        pass
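
The list comprehension in get_remote_resources keeps the last whitespace-separated token of each line in the download script. The standalone sketch below shows that parsing step on a few in-memory lines and adds a guard for blank or non-wget lines. The helper name parse_urls and the shebang line in the sample are assumptions for illustration, not part of BioNeMo or of the OAS script.

```python
import os
import re
from typing import Iterable, List


def parse_urls(lines: Iterable[str]) -> List[str]:
    """Return the last whitespace-separated token of every 'wget' line."""
    urls = []
    for line in lines:
        tokens = re.split(r"\s+", line.strip())
        # Skip blank lines, comments, and anything that is not a wget command.
        if len(tokens) < 2 or tokens[0] != "wget":
            continue
        urls.append(tokens[-1])
    return urls


# In-memory stand-in for oas_paired_subset_download.sh
script_lines = [
    "#!/bin/bash\n",
    "\n",
    "wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz\n",
]
urls = parse_urls(script_lines)

# The base name of each URL is what the checksum dictionary is keyed on.
filenames = [os.path.basename(url) for url in urls]
print(filenames)
```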

Custom YAML Config#

A custom YAML configuration file is useful for making changes to the model and training configuration parameters. Copy the file $BIONEMO_WORKSPACE/examples/protein/esm1nv/conf/pretrain_small.yaml to examples/protein/esm1nv/conf/pretrain_oas.yaml.

To this new file, make the following modifications:

  • Delete the entire downstream task validation portion in the model section (model.dwnstr_task_validation). This can be reintroduced in the future to enable this functionality, but for now removing it will simplify working with the configuration file.

  • Give the training a new name – here esm1nv-oas has been chosen.

  • Set do_training to False, since the focus is currently data preprocessing.

  • Disable Weights and Biases logging for now, since it won’t be used for preprocessing, by creating an exp_manager section and setting create_wandb_logger to False.

Here is what the new yaml config file looks like.

defaults:
  - base_config


###### Begin OAS Related Additions ######

name: esm1nv-oas ### Add OAS to the name
do_training: False ### Set to False for preprocessing or True for training
exp_manager: 
  create_wandb_logger: False ### Disable Weights and Biases logger for demo

###### End OAS Related Additions ######

restore_from_path: null # used when starting from a .nemo file

model:
  tokenizer:
    library: 'sentencepiece'
    type: null
    model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
    vocab_file: /tokenizers/vocab/protein_sequence_sentencepiece.vocab
  data:
    dataset_path: /data/uniref2022_05 # parent directory for data, contains train / val / test folders. Needs to be writeable for index creation.
    dataset: # inclusive range of data files to load, e.g. x[000..049], or a single file, e.g. x000
      train: x[000..049]
      test: x[000..049]
      val: x[000..049]
    micro_batch_size: ${model.micro_batch_size}
    num_workers: 10
    data_impl_kwargs:
      csv_mmap:
        data_col: 3 # 0-based
    modify_percent: 0.1 # Percentage of characters in a protein sequence to modify. (Modification means replacing with another amino acid or with a mask token)
    perturb_percent: 0.5 # Of the modify_percent, what percentage of characters are to be replaced with another amino acid.

Python Execution Script#

A Python script to execute the job will also need to be created. In the directory examples/protein/esm1nv, copy the existing pre-training script pretrain.py to pretrain_oas.py. This will be the file that performs preprocessing and, once the pipeline is completed, runs the pre-training.

Make the following changes to the new pre-training file:

  • Remove the imports for UniRef50Preprocess and FLIPPreprocess

  • Add an import for OASPairedPreprocess from bionemo.data.preprocess.protein.oas_preprocess

  • Modify the section that logs Starting Preprocessing so that it downloads the data and calculates the MD5 checksum of each OAS file.

Here is an example:

from omegaconf.omegaconf import OmegaConf
from nemo.core.config import hydra_runner
from nemo.utils import logging
from bionemo.model.protein.esm1nv import ESM1nvModel
from bionemo.model.utils import setup_trainer
from bionemo.utils import BioNeMoSaveRestoreConnector
from bionemo.utils.callbacks.callback_utils import setup_dwnstr_task_validation_callbacks

from bionemo.data.preprocess.protein.oas_preprocess import OASPairedPreprocess ### Import OAS preprocessor
import os, hashlib ### Used for checksum verification

@hydra_runner(config_path="conf", config_name="pretrain_oas") ### Custom YAML config file
def main(cfg) -> None:
    logging.info("\n\n************** Experiment configuration ***********")
    logging.info(f'\n{OmegaConf.to_yaml(cfg)}')

    callbacks = setup_dwnstr_task_validation_callbacks(cfg)

    trainer = setup_trainer(cfg, callbacks=callbacks)
    if cfg.do_training:
        logging.info("************** Starting Training ***********")
        model = ESM1nvModel(cfg.model, trainer)
        trainer.fit(model)
        logging.info("************** Finished Training ***********")
    else:
        logging.info("************** Calculating Checksums ***********")

        ### Changes to calculate checksums
        oas_filepaths = [resource.download_resource(overwrite=True) 
                        for resource in OASPairedPreprocess().get_remote_resources()]

        for fully_qualified_dest_filename in oas_filepaths:    
            with open(fully_qualified_dest_filename, 'rb') as fh:
                filename = os.path.basename(fully_qualified_dest_filename)
                logging.info(f"\"{filename}\": \"{hashlib.md5(fh.read()).hexdigest()}\"")


if __name__ == '__main__':
    main()
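
The script above hashes each file by reading it fully into memory, which is fine for the ten files used here. For larger files, the digest can be computed in chunks instead. The following is a minimal standard-library sketch; the helper names md5_of_file and checksum_table are hypothetical and not part of BioNeMo.

```python
import hashlib
import os
import tempfile
from typing import Dict, Iterable


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file without loading it all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_table(paths: Iterable[str]) -> Dict[str, str]:
    """Map each file's base name to its MD5 digest."""
    return {os.path.basename(path): md5_of_file(path) for path in paths}


# Demonstrate on a temporary file that spans several chunks.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_paired.csv.gz")
    with open(demo_path, "wb") as fh:
        fh.write(b"a" * 3_000_000)
    table = checksum_table([demo_path])
print(table)
```

The resulting dictionary has the same shape as the checksums dictionary in get_remote_resources, so its entries can be pasted there once computed.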

Testing#

Run the pipeline with the following command:

cd examples/protein/esm1nv
python pretrain_oas.py

The end of the logged output is shown below:

[NeMo I 2023-08-17 16:44:39 pretrain_oas:26] ************** Calculating Checksums ***********
[NeMo I 2023-08-17 16:44:39 oas_preprocess:27] The following URLs were parsed: ['https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz']
[NeMo I 2023-08-17 16:44:39 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:44:44 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:44:44 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:45:18 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:18 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:45:20 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:20 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:45:22 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:22 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:45:24 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:24 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:45:31 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:31 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:45:56 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:56 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:46:05 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:46:05 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:46:55 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:46:55 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:47:59 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR11528761_paired.csv.gz": "3b671ee3d376445fdafd89932cb4687e"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR11528762_paired.csv.gz": "69988520b12162b1f0613a55236d13a7"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR10358523_paired.csv.gz": "fb5f7242f1f2b555c0bb798da449454e"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR10358524_paired.csv.gz": "5db80ccbf8f47c855daa0d4d13d13d59"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR10358525_paired.csv.gz": "d9df83c314bc426bb7ad00a3375a5994"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179273_paired.csv.gz": "8e91cb8b719c4a2d30b2467cc5f6f080"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179274_paired.csv.gz": "cc7b54cf168a86012773073bc6016cd2"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179275_paired.csv.gz": "1660af663e0bdcf9e4a2dc0d8f79bfae"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179276_paired.csv.gz": "56336af0a20afde929c263e628be6828"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179277_paired.csv.gz": "48ac0e0f4ded0df1e345c7fbbb161601"

Results#

This should have downloaded the ten sequence files and calculated their checksums. The files are found in the path (root_directory/dest_directory) as defined in the preprocessing class (/data/OASpaired/raw):

SRR10358523_paired.csv.gz  SRR11528762_paired.csv.gz  SRR9179276_paired.csv.gz
SRR10358524_paired.csv.gz  SRR9179273_paired.csv.gz   SRR9179277_paired.csv.gz
SRR10358525_paired.csv.gz  SRR9179274_paired.csv.gz
SRR11528761_paired.csv.gz  SRR9179275_paired.csv.gz

Decompressing OAS Sequence Files#

Data Preprocessing Class#

Now, the functionality to finish processing the data will be added. The following edits should be made to bionemo/data/preprocess/protein/oas_preprocess.py:

  • Add the checksum list to the checksum dictionary in get_remote_resources

  • Create a method prepare_resource that downloads each file, performs any additional processing (such as unzipping the files), and returns the final, full path of each file.

  • Create a method prepare that runs prepare_resource for each file.

from bionemo.data.preprocess import ResourcePreprocessor
from bionemo.utils.remote import RemoteResource
from nemo.utils import logging
from dataclasses import dataclass
from typing import List
import re
import os
import gzip
import shutil

__all__ = ['OASPairedPreprocess']

# BIONEMO_HOME = os.getenv('BIONEMO_HOME', '/workspace/bionemo')
BIONEMO_HOME = '/workspace/bionemo' # FIXME
OAS_DOWNLOAD_LINKS_PATH = f'{BIONEMO_HOME}/bionemo/data/preprocess/protein/oas_paired_subset_download.sh'

@dataclass
class OASPairedPreprocess(ResourcePreprocessor):
    """OASPairedPreprocess to download and preprocess OAS paired antibody heavy chain data."""
    root_directory: str = '/data'
    dest_directory: str = 'OASpaired/raw'
    

    def get_remote_resources(self, download_script_path:str = OAS_DOWNLOAD_LINKS_PATH) -> List[RemoteResource]:
        """Download and verify each file from the download script path."""
        
        with open(download_script_path, 'r') as fh:
            # The URL is the last whitespace-separated token on each non-blank line
            url_list = [re.split(r'\s+', x.strip())[-1] for x in fh.readlines() if x.strip()]
        
        # Add calculated checksums
        checksums = {"SRR11528761_paired.csv.gz": "3b671ee3d376445fdafd89932cb4687e",
                     "SRR11528762_paired.csv.gz": "69988520b12162b1f0613a55236d13a7",
                     "SRR10358523_paired.csv.gz": "fb5f7242f1f2b555c0bb798da449454e",
                     "SRR10358524_paired.csv.gz": "5db80ccbf8f47c855daa0d4d13d13d59",
                     "SRR10358525_paired.csv.gz": "d9df83c314bc426bb7ad00a3375a5994",
                     "SRR9179273_paired.csv.gz": "8e91cb8b719c4a2d30b2467cc5f6f080",
                     "SRR9179274_paired.csv.gz": "cc7b54cf168a86012773073bc6016cd2",
                     "SRR9179275_paired.csv.gz": "1660af663e0bdcf9e4a2dc0d8f79bfae",
                     "SRR9179276_paired.csv.gz": "56336af0a20afde929c263e628be6828",
                     "SRR9179277_paired.csv.gz": "48ac0e0f4ded0df1e345c7fbbb161601"}

        resources = list()
        for url in url_list:
            filename = os.path.basename(url)
            resource = RemoteResource(
                dest_directory=self.dest_directory,
                dest_filename=filename,
                root_directory=self.root_directory,
                checksum=checksums.get(filename),
                url=url
            )
            resources.append(resource)
            
        return resources

    def prepare_resource(self, 
                         resource: RemoteResource, 
                         delete_gzipped: bool = False) -> str:
        """Download the passed resource and extract the gzipped file.

        resource: RemoteResource - Resource to be prepared.
        delete_gzipped: boolean, default: False - Delete gzipped file once extracted.

        Returns - the absolute destination path of the extracted file
        """
        logging.info(f"Downloading {resource.url}")
        fully_qualified_gz_filename = resource.download_resource(overwrite=False)
        

        logging.info("Extracting the gzipped file")
        fully_qualified_dest_filename = os.path.splitext(fully_qualified_gz_filename)[0]
        with gzip.open(fully_qualified_gz_filename, 'rb') as f_gz:
            with open(fully_qualified_dest_filename, 'wb') as f_ext:
                shutil.copyfileobj(f_gz, f_ext)
                
        if delete_gzipped:
            os.remove(fully_qualified_gz_filename)  # delete the single .gz file
            
        return fully_qualified_dest_filename
        
    
    def prepare(self):
        return [
            self.prepare_resource(resource) for resource in self.get_remote_resources()
        ]
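
The extraction step in prepare_resource can be tried in isolation with the standard library. The sketch below assumes a hypothetical helper named extract_gzip and a small temporary file in place of a real OAS download. It uses os.remove for the optional cleanup, since that is the standard-library call for deleting a single file.

```python
import gzip
import os
import shutil
import tempfile


def extract_gzip(gz_path: str, delete_gzipped: bool = False) -> str:
    """Decompress gz_path next to itself and return the extracted path."""
    dest_path, extension = os.path.splitext(gz_path)
    if extension != ".gz":
        raise ValueError(f"Expected a .gz file, got {gz_path}")
    # Stream the decompressed bytes to disk rather than holding them in memory.
    with gzip.open(gz_path, "rb") as f_gz, open(dest_path, "wb") as f_ext:
        shutil.copyfileobj(f_gz, f_ext)
    if delete_gzipped:
        os.remove(gz_path)
    return dest_path


with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "demo_paired.csv.gz")
    with gzip.open(gz_path, "wb") as fh:
        fh.write(b"sequence_id_heavy,sequence_heavy\n")

    csv_path = extract_gzip(gz_path, delete_gzipped=True)
    extracted_name = os.path.basename(csv_path)
    with open(csv_path, "rb") as fh:
        extracted_bytes = fh.read()
    gz_still_exists = os.path.exists(gz_path)
```

Because os.path.splitext only strips the final extension, demo_paired.csv.gz becomes demo_paired.csv, matching the file names produced by the preprocessing class.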

Python Execution Script#

Now, modify the Python pre-training script so that it creates an instance of the class and runs the prepare method. This is the final set of changes that needs to be made to the pre-training script.

# ## Simplified ESM1nv training config to demonstrate new dataset

from omegaconf.omegaconf import OmegaConf
from bionemo.model.protein.esm1nv import ESM1nvModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from bionemo.model.utils import setup_trainer
from bionemo.utils.callbacks.callback_utils import setup_dwnstr_task_validation_callbacks

from bionemo.data.preprocess.protein.oas_preprocess import OASPairedPreprocess ### Import new preprocessor

@hydra_runner(config_path="conf", config_name="pretrain_oas") ### Custom YAML config file
def main(cfg) -> None:
    logging.info("\n\n************** Experiment configuration ***********")
    logging.info(f'\n{OmegaConf.to_yaml(cfg)}')

    callbacks = setup_dwnstr_task_validation_callbacks(cfg)

    trainer = setup_trainer(cfg, callbacks=callbacks)
    if cfg.do_training:
        logging.info("************** Starting Training ***********")
        model = ESM1nvModel(cfg.model, trainer)
        trainer.fit(model)
        logging.info("************** Finished Training ***********")
    else:
        logging.info("************** Starting Preprocessing ***********")
        preprocessor = OASPairedPreprocess() ### Create instance of preprocess class
        preprocessor.prepare() ### Prepare data


if __name__ == '__main__':
    main()

Testing#

Execute the pre-train script as before:

cd examples/protein/esm1nv
python pretrain_oas.py

Below is the relevant portion of the log statements:

[NeMo I 2023-08-17 16:48:11 pretrain_oas:27] ************** Starting Preprocessing ***********
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file

Results#

The original files already existed, so they were not downloaded again. Each file has now been extracted alongside its gzipped original.

SRR10358523_paired.csv	   SRR11528761_paired.csv.gz  SRR9179275_paired.csv
SRR10358523_paired.csv.gz  SRR11528762_paired.csv     SRR9179275_paired.csv.gz
SRR10358524_paired.csv	   SRR11528762_paired.csv.gz  SRR9179276_paired.csv
SRR10358524_paired.csv.gz  SRR9179273_paired.csv      SRR9179276_paired.csv.gz
SRR10358525_paired.csv	   SRR9179273_paired.csv.gz   SRR9179277_paired.csv
SRR10358525_paired.csv.gz  SRR9179274_paired.csv      SRR9179277_paired.csv.gz
SRR11528761_paired.csv	   SRR9179274_paired.csv.gz

The CSV files contain an extra metadata row at the top and many additional columns. These extra columns will increase seek time during training, so they should be removed. The files also need to be split into training, validation, and test sets, with the files in each split numbered consecutively. The first three lines of SRR10358523_paired.csv are shown below as an example (the third line is truncated).

"{""Run"": ""SRR10358523"", ""Link"": ""https://doi.org/10.1016/j.celrep.2019.12.027"", ""Author"": ""Eccles et al., 2020"", ""Species"": ""human"", ""Age"": ""33"", ""BSource"": ""PBMC"", ""BType"": ""RV+B-Cells"", ""Vaccine"": ""None"", ""Disease"": ""None"", ""Subject"": ""Healthy-1"", ""Longitudinal"": ""no"", ""Unique sequences"": 100, ""Isotype"": ""All"", ""Chain"": ""Paired""}"
sequence_id_heavy,sequence_heavy,locus_heavy,stop_codon_heavy,vj_in_frame_heavy,productive_heavy,rev_comp_heavy,v_call_heavy,d_call_heavy,j_call_heavy,sequence_alignment_heavy,germline_alignment_heavy,sequence_alignment_aa_heavy,germline_alignment_aa_heavy,v_alignment_start_heavy,v_alignment_end_heavy,d_alignment_start_heavy,d_alignment_end_heavy,j_alignment_start_heavy,j_alignment_end_heavy,v_sequence_alignment_heavy,v_sequence_alignment_aa_heavy,v_germline_alignment_heavy,v_germline_alignment_aa_heavy,d_sequence_alignment_heavy,d_sequence_alignment_aa_heavy,d_germline_alignment_heavy,d_germline_alignment_aa_heavy,j_sequence_alignment_heavy,j_sequence_alignment_aa_heavy,j_germline_alignment_heavy,j_germline_alignment_aa_heavy,fwr1_heavy,fwr1_aa_heavy,cdr1_heavy,cdr1_aa_heavy,fwr2_heavy,fwr2_aa_heavy,cdr2_heavy,cdr2_aa_heavy,fwr3_heavy,fwr3_aa_heavy,cdr3_heavy,cdr3_aa_heavy,junction_heavy,junction_length_heavy,junction_aa_heavy,junction_aa_length_heavy,v_score_heavy,d_score_heavy,j_score_heavy,v_cigar_heavy,d_cigar_heavy,j_cigar_heavy,v_support_heavy,d_support_heavy,j_support_heavy,v_identity_heavy,d_identity_heavy,j_identity_heavy,v_sequence_start_heavy,v_sequence_end_heavy,v_germline_start_heavy,v_germline_end_heavy,d_sequence_start_heavy,d_sequence_end_heavy,d_germline_start_heavy,d_germline_end_heavy,j_sequence_start_heavy,j_sequence_end_heavy,j_germline_start_heavy,j_germline_end_heavy,fwr1_start_heavy,fwr1_end_heavy,cdr1_start_heavy,cdr1_end_heavy,fwr2_start_heavy,fwr2_end_heavy,cdr2_start_heavy,cdr2_end_heavy,fwr3_start_heavy,fwr3_end_heavy,cdr3_start_heavy,cdr3_end_heavy,np1_heavy,np1_length_heavy,np2_heavy,np2_length_heavy,sequence_id_light,sequence_light,locus_light,stop_codon_light,vj_in_frame_light,productive_light,rev_comp_light,v_call_light,d_call_light,j_call_light,sequence_alignment_light,germline_alignment_light,sequence_alignment_aa_light,germline_alignment_aa_light,v_alignment_start_light,v_alignment_end_light,d_alignment_start_light,d_alignment_end_light,j_alignment_start_light,j_alignment_end_light,v_sequence_alignment_light,v_sequence_alignment_aa_light,v_germline_alignment_light,v_germline_alignment_aa_light,d_sequence_alignment_light,d_sequence_alignment_aa_light,d_germline_alignment_light,d_germline_alignment_aa_light,j_sequence_alignment_light,j_sequence_alignment_aa_light,j_germline_alignment_light,j_germline_alignment_aa_light,fwr1_light,fwr1_aa_light,cdr1_light,cdr1_aa_light,fwr2_light,fwr2_aa_light,cdr2_light,cdr2_aa_light,fwr3_light,fwr3_aa_light,cdr3_light,cdr3_aa_light,junction_light,junction_length_light,junction_aa_light,junction_aa_length_light,v_score_light,d_score_light,j_score_light,v_cigar_light,d_cigar_light,j_cigar_light,v_support_light,d_support_light,j_support_light,v_identity_light,d_identity_light,j_identity_light,v_sequence_start_light,v_sequence_end_light,v_germline_start_light,v_germline_end_light,d_sequence_start_light,d_sequence_end_light,d_germline_start_light,d_germline_end_light,j_sequence_start_light,j_sequence_end_light,j_germline_start_light,j_germline_end_light,fwr1_start_light,fwr1_end_light,cdr1_start_light,cdr1_end_light,fwr2_start_light,fwr2_end_light,cdr2_start_light,cdr2_end_light,fwr3_start_light,fwr3_end_light,cdr3_start_light,cdr3_end_light,np1_light,np1_length_light,np2_light,np2_length_light,ANARCI_numbering_light,ANARCI_numbering_heavy,ANARCI_status_light,ANARCI_status_heavy
AAACCTGGTCCGTCAG-1_contig_2,AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCATTTGGTGATCAGGACTGAACAGAGAGAACTCACCATGGAGTTTGGGCTGAGCTGGCTTTTTCTTGTGGCTATTTTAAAAGGTGTCCAGTGTGAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGATTGGCCGTTTTGGCAGTGGCTGGTAAGAAGGGGGGAGCGGTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGATACGAGCAGCGTG,H,F,T,T,F,IGHV3-23*01,IGHD6-19*01,IGHJ4*02,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGATTGGCCGTTTTGGCAGTGGCTGGTAAGAAGGGGGGAGCGGTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGANNNNNNNNNNNNGCAGTGGCTGGTANNNNNNNNNNNNNNNTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG,EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDWPFWQWLVRRGERFDYWGQGTLVTVSS,EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKXXXXXQWLVXXXXXFDYWGQGTLVTVSS,1.0,296.0,309.0,321.0,337.0,379.0,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAA
GA,EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAK,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGA,EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAK,GCAGTGGCTGGTA,QWLV,GCAGTGGCTGGTA,QWLV,TTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG,FDYWGQGTLVTVSS,TTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG,FDYWGQGTLVTVSS,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCT,EVQLLESGGGLVQPGGSLRLSCAAS,GGATTCACCTTTAGCAGCTATGCC,GFTFSSYA,ATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCT,MSWVRQAPGKGLEWVSA,ATTAGTGGTAGTGGTGGTAGCACA,ISGSGGST,TACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGT,YYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYC,GCGAAAGATTGGCCGTTTTGGCAGTGGCTGGTAAGAAGGGGGGAGCGGTTTGACTAC,AKDWPFWQWLVRRGERFDY,TGTGCGAAAGATTGGCCGTTTTGGCAGTGGCTGGTAAGAAGGGGGGAGCGGTTTGACTACTGG,63.0,CAKDWPFWQWLVRRGERFDYW,21.0,463.037,25.682,83.363,136S296M154S,444S7N13M129S1N,472S5N43M71S,2.7109999999999982e-132,0.004878,4.5910000000000006e-20,100.0,100.0,100.0,137.0,432.0,1.0,296.0,445.0,457.0,8.0,20.0,473.0,515.0,6.0,48.0,137.0,211.0,212.0,235.0,236.0,286.0,287.0,310.0,311.0,424.0,425.0,481.0,TTGGCCGTTTTG,12.0,AGAAGGGGGGAGCGG,15.0,AAACCTGGTCCGTCAG-1_contig_1,TGGGGAGGAATCAGTCCCACTCAGGACACAGCATGGACATGAGGGTCCCCGCTCAGCTCCTGGGGCTCCTGCTGCTCTGGTTCCCAGGTGCCAGGTGTGACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTCGAACGTTCGGCCAAGGGACCAAGGTGGAAATCAAACGAACTGTGGCTGCACCA
TCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGC,K,F,T,T,F,IGKV1-17*01,,IGKJ1*01,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTCGAACGTTCGGCCAAGGGACCAAGGTGGAAATCAAAC,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTCNNACGTTCGGCCAAGGGACCAAGGTGGAAATCAAAC,DIQMTQSPSSLSASVGDRVTITCRASQGIRNDLGWYQQKPGKAPKRLIYAASSLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYCLQHNSYPRTFGQGTKVEIK,DIQMTQSPSSLSASVGDRVTITCRASQGIRNDLGWYQQKPGKAPKRLIYAASSLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYCLQHNSYPXTFGQGTKVEIK,1,286,,,289.0,322.0,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTC,DIQMTQSPSSLSASVGDRVTITCRASQGIRNDLGWYQQKPGKAPKRLIYAASSLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYCLQHNSYP,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTC,DIQMTQSPSSLSASVGDRVTITCRASQGIRNDLGWYQQKPGKAPKRLIYAASSLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYCLQHNSYP,,,,,ACGTTCGGCCAAGGGACCAAGGTGGAAATCAAAC,TFGQGTKVEIK,ACGTTCGGCCAAGGGACCAAGGTGGAAATCAAAC,TFGQGTKVEIK,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGT,DIQMTQSPSSLSASVGDRVTITC
RAS,CAGGGCATTAGAAATGAT,QGIRND,TTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTAT,LGWYQQKPGKAPKRLIY,GCTGCATCC,AAS,AGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGT,SLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYC,CTACAGCATAATAGTTACCCTCGAACG,LQHNSYPRT,TGTCTACAGCATAATAGTTACCCTCGAACGTTC,33.0,CLQHNSYPRTF,11.0,447.456,,66.059,98S286M172S1N,,386S4N34M136S,1.2569999999999987e-127,,7.0420000000000004e-15,100.0,,100.0,99,384,1,286,,,,,387.0,420.0,5.0,38.0,99.0,176.0,177.0,194.0,195.0,245.0,246.0,254.0,255.0,362.0,363.0,389.0,GA,2.0,,,"{'fwk1': {'1 ': 'D', '2 ': 'I', '3 ': 'Q', '4 ': 'M', '5 ': 'T', '6 ': 'Q', '7 ': 'S', '8 ': 'P', '9 ': 'S', '10 ': 'S', '11 ': 'L', '12 ': 'S', '13 ': 'A', '14 ': 'S', '15 ': 'V', '16 ': 'G', '17 ': 'D', '18 ': 'R', '19 ': 'V', '20 ': 'T', '21 ': 'I', '22 ': 'T', '23 ': 'C', '24 ': 'R', '25 ': 'A', '26 ': 'S'}, 'cdrk1': {'27 ': 'Q', '28 ': 'G', '29 ': 'I', '36 ': 'R', '37 ': 'N', '38 ': 'D'}, 'fwk2': {'39 ': 'L', '40 ': 'G', '41 ': 'W', '42 ': 'Y', '43 ': 'Q', '44 ': 'Q', '45 ': 'K', '46 ': 'P', '47 ': 'G', '48 ': 'K', '49 ': 'A', '50 ': 'P', '51 ': 'K', '52 ': 'R', '53 ': 'L', '54 ': 'I', '55 ': 'Y'}, 'cdrk2': {'56 ': 'A', '57 ': 'A', '65 ': 'S'}, 'fwk3': {'66 ': 'S', '67 ': 'L', '68 ': 'Q', '69 ': 'S', '70 ': 'G', '71 ': 'V', '72 ': 'P', '74 ': 'S', '75 ': 'R', '76 ': 'F', '77 ': 'S', '78 ': 'G', '79 ': 'S', '80 ': 'G', '83 ': 'S', '84 ': 'G', '85 ': 'T', '86 ': 'E', '87 ': 'F', '88 ': 'T', '89 ': 'L', '90 ': 'T', '91 ': 'I', '92 ': 'S', '93 ': 'S', '94 ': 'L', '95 ': 'Q', '96 ': 'P', '97 ': 'E', '98 ': 'D', '99 ': 'F', '100 ': 'A', '101 ': 'T', '102 ': 'Y', '103 ': 'Y', '104 ': 'C'}, 'cdrk3': {'105 ': 'L', '106 ': 'Q', '107 ': 'H', '108 ': 'N', '109 ': 'S', '114 ': 'Y', '115 ': 'P', '116 ': 'R', '117 ': 'T'}, 'fwk4': {'118 ': 'F', '119 ': 'G', '120 ': 'Q', '121 ': 'G', '122 ': 'T', '123 ': 'K', '124 ': 'V', '125 ': 'E', '126 ': 'I', '127 ': 'K'}}","{'fwh1': {'1 ': 'E', '2 ': 'V', '3 ': 'Q', '4 
': 'L', '5 ': 'L', '6 ': 'E', '7 ': 'S', '8 ': 'G', '9 ': 'G', '11 ': 'G', '12 ': 'L', '13 ': 'V', '14 ': 'Q', '15 ': 'P', '16 ': 'G', '17 ': 'G', '18 ': 'S', '19 ': 'L', '20 ': 'R', '21 ': 'L', '22 ': 'S', '23 ': 'C', '24 ': 'A', '25 ': 'A', '26 ': 'S'}, 'cdrh1': {'27 ': 'G', '28 ': 'F', '29 ': 'T', '30 ': 'F', '35 ': 'S', '36 ': 'S', '37 ': 'Y', '38 ': 'A'}, 'fwh2': {'39 ': 'M', '40 ': 'S', '41 ': 'W', '42 ': 'V', '43 ': 'R', '44 ': 'Q', '45 ': 'A', '46 ': 'P', '47 ': 'G', '48 ': 'K', '49 ': 'G', '50 ': 'L', '51 ': 'E', '52 ': 'W', '53 ': 'V', '54 ': 'S', '55 ': 'A'}, 'cdrh2': {'56 ': 'I', '57 ': 'S', '58 ': 'G', '59 ': 'S', '62 ': 'G', '63 ': 'G', '64 ': 'S', '65 ': 'T'}, 'fwh3': {'66 ': 'Y', '67 ': 'Y', '68 ': 'A', '69 ': 'D', '70 ': 'S', '71 ': 'V', '72 ': 'K', '74 ': 'G', '75 ': 'R', '76 ': 'F', '77 ': 'T', '78 ': 'I', '79 ': 'S', '80 ': 'R', '81 ': 'D', '82 ': 'N', '83 ': 'S', '84 ': 'K', '85 ': 'N', '86 ': 'T', '87 ': 'L', '88 ': 'Y', '89 ': 'L', '90 ': 'Q', '91 ': 'M', '92 ': 'N', '93 ': 'S', '94 ': 'L', '95 ': 'R', '96 ': 'A', '97 ': 'E', '98 ': 'D', '99 ': 'T', '100 ': 'A', '101 ': 'V', '102 ': 'Y', '103 ': 'Y', '104 ': 'C'}, 'cdrh3': {'105 ': 'A', '106 ': 'K', '107 ': 'D', '108 ': 'W', '109 ': 'P', '110 ': 'F', '111 ': 'W', '111A': 'Q', '111B': 'W', '111C': 'L', '112C': 'V', '112B': 'R', '112A': 'R', '112 ': 'G', '113 ': 'E', '114 ': 'R', '115 ': 'F', '116 ': 'D', '117 ': 'Y'}, 'fwh4': {'118 ': 'W', '119 ': 'G', '120 ': 'Q', '121 ': 'G', '122 ': 'T', '123 ': 'L', '124 ': 'V', '125 ': 'T', '126 ': 'V', '127 ': 'S', '128 ': 'S'}}",|||||,"|Deletions: 10, 73||||"

Cleaning and Splitting OAS Sequence Files#

Data Preprocessing Class#

A new method, process_files, will be created to clean up the files and create train, validation, and test splits. For this exercise, only the columns containing the sequence ID and the sequence for the antibody heavy chain will be retained (sequence_id_heavy, sequence_heavy).
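The column selection can be illustrated without any BioNeMo code. The sketch below uses only the Python standard library and a made-up miniature file (the metadata line, IDs, and sequences are hypothetical). It assumes, as the skiprows=[0] argument used later in the class implies, that the first line of each file is not part of the table.

```python
import csv
import io

# Hypothetical miniature of a paired file: one leading non-table line,
# then a CSV header and two rows. All values are invented for illustration.
RAW = (
    "# metadata line (hypothetical)\n"
    "sequence_id_heavy,sequence_heavy,locus_heavy,sequence_id_light,sequence_light\n"
    "cell1_contig_2,ACGT,H,cell1_contig_1,TTGA\n"
    "cell2_contig_2,GGCA,H,cell2_contig_1,CCAT\n"
)


def select_columns(text, columns):
    """Skip the first line, then keep only the requested columns of each row."""
    handle = io.StringIO(text)
    next(handle)  # plays the role of skiprows=[0]
    reader = csv.DictReader(handle)
    return [{name: row[name] for name in columns} for row in reader]


heavy_rows = select_columns(RAW, ["sequence_id_heavy", "sequence_heavy"])
print(heavy_rows[0])
```

The class itself does the same thing with pandas, which is more convenient for files of realistic size.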

Edit the file $BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_preprocess.py to add this functionality. These are the final edits that will need to be made to the preprocessing class. Here is an example of such a file:

from bionemo.data.preprocess import ResourcePreprocessor
from bionemo.utils.remote import RemoteResource
from nemo.utils import logging
from dataclasses import dataclass
from typing import List
import re
import os
import gzip
import shutil
import random
import pandas as pd

__all__ = ['OASPairedPreprocess']

# BIONEMO_HOME = os.getenv('BIONEMO_HOME', '/workspace/bionemo')
BIONEMO_HOME = '/workspace/bionemo' # FIXME
OAS_DOWNLOAD_LINKS_PATH = f'{BIONEMO_HOME}/bionemo/data/preprocess/protein/oas_paired_subset_download.sh'

@dataclass
class OASPairedPreprocess(ResourcePreprocessor):
    """OASPairedPreprocess to download and preprocess OAS paired antibody data for heavy chains."""
    random_seed: int = 0
    root_directory: str = '/data'
    dest_directory: str = 'OASpaired/raw'
    processed_directory: str = 'OASpaired/processed/heavy'
    columns_to_keep = ['sequence_id_heavy', 'sequence_heavy']
    num_val_files = 2
    num_test_files = 2
    

    def get_remote_resources(self, download_script_path:str = OAS_DOWNLOAD_LINKS_PATH) -> List[RemoteResource]:
        """Download and verify each file from the download script path."""
        
        with open(download_script_path, 'r') as fh:
            # Each line is expected to look like "<command> <url>"; blank lines are skipped
            url_list = [re.split(r'\s+', x.strip())[1] for x in fh if x.strip()]
        
        # Add calculated checksums
        checksums = {"SRR11528761_paired.csv.gz": "3b671ee3d376445fdafd89932cb4687e",
                     "SRR11528762_paired.csv.gz": "69988520b12162b1f0613a55236d13a7",
                     "SRR10358523_paired.csv.gz": "fb5f7242f1f2b555c0bb798da449454e",
                     "SRR10358524_paired.csv.gz": "5db80ccbf8f47c855daa0d4d13d13d59",
                     "SRR10358525_paired.csv.gz": "d9df83c314bc426bb7ad00a3375a5994",
                     "SRR9179273_paired.csv.gz": "8e91cb8b719c4a2d30b2467cc5f6f080",
                     "SRR9179274_paired.csv.gz": "cc7b54cf168a86012773073bc6016cd2",
                     "SRR9179275_paired.csv.gz": "1660af663e0bdcf9e4a2dc0d8f79bfae",
                     "SRR9179276_paired.csv.gz": "56336af0a20afde929c263e628be6828",
                     "SRR9179277_paired.csv.gz": "48ac0e0f4ded0df1e345c7fbbb161601"}

        resources = list()
        for url in url_list:
            filename = os.path.basename(url)
            resource = RemoteResource(
                dest_directory=self.dest_directory,
                dest_filename=filename,
                root_directory=self.root_directory,
                checksum=checksums.get(filename),
                url=url
            )
            resources.append(resource)
            
        return resources

    
    def prepare_resource(self, 
                         resource: RemoteResource, 
                         delete_gzipped: bool = False) -> str:
        """Logs and downloads the passed resource.

        resource: RemoteResource - Resource to be prepared.
        delete_gzipped: boolean, default: False - Delete gzipped file once extracted.
        
        Returns - the absolute destination path for the downloaded resource
        """
        logging.info(f"Downloading {resource.url}")
        fully_qualified_gz_filename = resource.download_resource(overwrite=False)
        

        logging.info("Extracting the gzipped file")
        fully_qualified_dest_filename = os.path.splitext(fully_qualified_gz_filename)[0]
        with gzip.open(fully_qualified_gz_filename, 'rb') as f_gz:
            with open(fully_qualified_dest_filename, 'wb') as f_ext:
                shutil.copyfileobj(f_gz, f_ext)
                
        if delete_gzipped:
            # The archive is a single file, so os.remove is used rather than shutil.rmtree
            os.remove(fully_qualified_gz_filename)
            
        return fully_qualified_dest_filename
        
        
    def process_files(self, filepaths: List[str]):
        """Keep only columns_to_keep from each file and write train, val, and test splits."""
        file_fill_size = 3
        full_processed_path = os.path.join(self.root_directory, self.processed_directory)
        os.makedirs(full_processed_path, exist_ok=True)
        
        # Assign num_val_files to validation, num_test_files to test, and the rest to train
        shuffled_paths = filepaths.copy()
        random.seed(self.random_seed)  # call seed(); assigning to it would leave the shuffle unseeded
        random.shuffle(shuffled_paths)
        
        val_paths = shuffled_paths[:self.num_val_files]
        test_paths = shuffled_paths[self.num_val_files:self.num_val_files+self.num_test_files]
        train_paths = shuffled_paths[self.num_val_files+self.num_test_files:]

        # Split the data and clean up extra information
        for split, filenames in zip(['train', 'val', 'test'], [train_paths, val_paths, test_paths]):
            split_path = os.path.join(full_processed_path, split)
            os.makedirs(split_path, exist_ok=True)
            
            for num,filename in enumerate(filenames):
                output_name = os.path.join(split_path, f'x{str(num).zfill(file_fill_size)}.csv')
                logging.info(f"Converting {filename} to {output_name}")
                data = pd.read_csv(filename, skiprows=[0], usecols=self.columns_to_keep)
                data.to_csv(output_name, index=False)
    
    
    def prepare(self):
        filepaths = [self.prepare_resource(resource) 
                         for resource in self.get_remote_resources()]
        
        self.process_files(filepaths)
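The checksums dictionary in get_remote_resources has to be computed once from files that are known to be good. The sketch below shows one way to do that, assuming the values are MD5 digests (the 32-character hexadecimal strings are consistent with that, but this is an assumption about RemoteResource, not something stated above). The helper name md5_of_file and the temporary demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a small temporary file rather than a real download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_digest = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_digest)
```

Running the helper over each downloaded .csv.gz file would give values to place in the dictionary, keyed by file name.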

Testing#

As before, execute the pre-train script:

cd examples/protein/esm1nv
python pretrain_oas.py

This is what the end of the log looks like once preprocessing has started:

[NeMo I 2023-08-17 16:48:23 pretrain_oas:26] ************** Starting Preprocessing ***********
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:24 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:48:24 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:48:24 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:24 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:48:24 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:48:24 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179273_paired.csv to /data/OASpaired/processed/heavy/train/x000.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR11528761_paired.csv to /data/OASpaired/processed/heavy/train/x001.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179277_paired.csv to /data/OASpaired/processed/heavy/train/x002.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR10358525_paired.csv to /data/OASpaired/processed/heavy/train/x003.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179276_paired.csv to /data/OASpaired/processed/heavy/train/x004.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR11528762_paired.csv to /data/OASpaired/processed/heavy/train/x005.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR10358524_paired.csv to /data/OASpaired/processed/heavy/val/x000.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR10358523_paired.csv to /data/OASpaired/processed/heavy/val/x001.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179275_paired.csv to /data/OASpaired/processed/heavy/test/x000.csv
[NeMo I 2023-08-17 16:48:25 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179274_paired.csv to /data/OASpaired/processed/heavy/test/x001.csv

Results#

Preprocessing has written the cleaned data into train, val, and test directories:

/data/OASpaired/processed/heavy/test:
x000.csv  x001.csv

/data/OASpaired/processed/heavy/train:
x000.csv  x001.csv  x002.csv  x003.csv  x004.csv  x005.csv

/data/OASpaired/processed/heavy/val:
x000.csv  x001.csv
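The file counts in each directory follow directly from the slicing in process_files. The standalone sketch below reproduces that logic on ten placeholder file names. The function name split_files is hypothetical, and it uses a private random.Random(seed) instance instead of the module-level seed, so the particular files assigned to each split will not necessarily match the log above.

```python
import random


def split_files(filepaths, num_val=2, num_test=2, seed=0):
    """Shuffle file paths reproducibly and slice them into val, test, and train."""
    shuffled = list(filepaths)
    random.Random(seed).shuffle(shuffled)  # private generator, no global state touched
    val = shuffled[:num_val]
    test = shuffled[num_val:num_val + num_test]
    train = shuffled[num_val + num_test:]
    return {"train": train, "val": val, "test": test}


files = [f"file_{i}.csv" for i in range(10)]  # placeholder names
splits = split_files(files)
print({name: len(paths) for name, paths in splits.items()})
# prints {'train': 6, 'val': 2, 'test': 2}
```

Because the split is made per file, no sequencing run is shared between the training, validation, and test sets.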

This is what the first five lines of one of the training files look like:

sequence_id_heavy,sequence_heavy
AAACCTGAGACTTGAA-1_contig_1,GGGAGAGGAGGCCTGTCCTGGATTCGATTCCCAGTTCCTCACATTCAGTCAGCACTGAACACGGACCCCTCACCATGAACTTCGGGCTCAGCTTGATTTTCCTTGTCCTTGTTTTAAAAGGTGTCCAGTGTGAAGTGATGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTAGCTATGCCATGTCTTGGGTTCGCCAGACTCCGGAGAAGAGGCTGGAGTGGGTCGCAACCATTAGTAGTGGTGGTAGTTACACCTACTATCCAGACAGTGTGAAGGGGCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTACCTGCAAATGAGCAGTCTGAGGTCTGAGGACACGGCCATGTATTACTGTGCAAGACGGGGGAATGATGGTTACTACGAAGACTACTGGGGCCAAGGCACCACTCTCACAGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGG
AAACCTGAGCGCCTTG-1_contig_2,GAGCTCTGACAGAGGAGGCCAGTCCTGGAATTGATTCCCAGTTCCTCACGTTCAGTGATGAGCACTGAACACAGACACCTCACCATGAACTTTGGGCTCAGATTGATTTTCCTTGTCCTTACTTTAAAAGGTGTGAAGTGTGAAGTGCAGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCGCTTTCAGTAGCTATGACATGTCTTGGGTTCGCCAGACTCCGGAGAAGAGGCTGGAGTGGGTCGCATACATTAGTAGTGGTGGTGGTATCACCTACTATCCAGACACTGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCATGTATTACTGTGCAAGGCCCCCGGGACGGGGCTACTGGTACTTCGATGTCTGGGGCGCAGGGACCACGGTCACCGTCTCCTCAGCCAAAACAACAGCCCCATCGGTCTATCCACTGGCCCCTGTGTGTGGAGATACAACTGGCTCCTCGGTGACTCTAGGGTGCCTGGTCAAGGATTATT
AAACCTGCAAGTTAAG-1_contig_1,AACATATGTCCAATGTCCTCTCCACAGACACTGAACACACTGACTCTAACCATGGGATGGAGCTGGATCTTTCTCTTCCTCCTGTCAGGAACTGCAGGCGTCCACTCTGAGGTCCAGCTTCAGCAGTCAGGACCTGAGCTGGTGAAACCTGGGGCCTCAGTGAAGATATCCTGCAAGGCTTCTGGATACACATTCACTGACTACAACATGCACTGGGTGAAGCAGAGCCATGGAAAGAGCCTTGAGTGGATTGGATATATTTATCCTTACAATGGTGGTACTGGCTACAACCAGAAGTTCAAGAGCAAGGCCACATTGACTGTAGACAATTCCTCCAGCACAGCCTACATGGAGCTCCGCAGCCTGACATCTGAGGACTCTGCAGTCTATTACTGTGCAAGATGGGGGCTAACTGGTGATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGG
AAACCTGGTATCTGCA-1_contig_1,GACATAACAGCAAGAGAGTGTCCGGTTAGTCTCAAGGAAGACTGAGACACAGTCTTAGATATCATGGAATGGCTGTGGAACTTGCTATTTCTCATGGCAGCAGCTCAAAGTATCCAAGCACAGATCCAGTTGGTGCAGTCTGGACCTGAGCTGAAGAAGCCTGGAGAGACAGTCAGGATCTCCTGCAAGGCTTCTGGGTATACCTTCACAACTGCTGGAATGCAGTGGGTGCAAAAGATGCCAGGAAAGGGTTTGAAGTGGATTGGCTGGATAAACACCCACTCTGGAGTGCCAAAATATGCAGAAGACTTCAAGGGACGGTTTGCCTTCTCTTTGGAAACCTCTGCCAGCACTGCATATTTACAGATAAGCAACCTCAAAAATGAGGACACGGCTACGTATTTCTGTGCGAGATCAGGTTACGACGCCTTTGACTACTGGGGCCAAGGCACCACTCTCACAGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGG
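A quick check that every processed file has exactly the expected header can catch a wrong columns_to_keep setting early. The following is a minimal sketch using only the standard library; summarize_split is a hypothetical helper, and the demo runs on a temporary directory with invented rows instead of /data/OASpaired/processed/heavy/train.

```python
import csv
import os
import tempfile


def summarize_split(directory, expected_columns):
    """Verify the header of every CSV in a directory and count its data rows."""
    counts = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(directory, name), newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)  # None if the file is empty
            if header != list(expected_columns):
                raise ValueError(f"{name}: unexpected header {header}")
            counts[name] = sum(1 for _ in reader)
    return counts


# Demo on a throwaway directory with invented content.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "x000.csv"), "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence_id_heavy", "sequence_heavy"])
        writer.writerow(["cell1_contig_2", "ACGT"])
        writer.writerow(["cell2_contig_2", "GGCA"])
    demo_counts = summarize_split(demo_dir, ["sequence_id_heavy", "sequence_heavy"])

print(demo_counts)
```

Pointing the helper at each of the three split directories gives a per-file row count as a side effect.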

Optional Variation: Process the Light Chain Data#

What if instead the light chain columns (sequence_id_light and sequence_light) were desired? How could the existing class be subclassed to create a preprocessing class for light chains?

Starting with the existing OASPairedPreprocess class, the only additional changes that would need to be made are:

  • Change the columns_to_keep to preserve the light chains instead of the heavy ones

  • Optionally change the directory for the processed files

Here is an example of a class that could be added to $BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_preprocess.py to accomplish this:

@dataclass
class OASPairedLightPreprocessor(OASPairedPreprocess):
    """OASPairedLightPreprocessor to download and preprocess OAS paired antibody light chain data."""
    processed_directory: str = 'OASpaired/processed/light'
    columns_to_keep = ['sequence_id_light', 'sequence_light']
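The reason two overridden attributes are enough is ordinary attribute lookup: inherited methods read self.columns_to_keep and self.processed_directory, so they pick up the subclass values. The sketch below shows this with stand-in classes that do not depend on BioNeMo. HeavyStandIn, LightStandIn, and describe are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeavyStandIn:
    """Stand-in for the heavy-chain preprocessing class (not the real one)."""
    processed_directory: str = "OASpaired/processed/heavy"  # dataclass field
    columns_to_keep = ["sequence_id_heavy", "sequence_heavy"]  # plain class attribute

    def describe(self):
        # Inherited unchanged; reads whatever the concrete class defines.
        return f"{', '.join(self.columns_to_keep)} -> {self.processed_directory}"


@dataclass
class LightStandIn(HeavyStandIn):
    """Overrides only the two attributes, as in the light-chain class above."""
    processed_directory: str = "OASpaired/processed/light"
    columns_to_keep = ["sequence_id_light", "sequence_light"]


light = LightStandIn()
print(light.describe())
```

Note that columns_to_keep has no type annotation, so the dataclass decorator leaves it as a shared class attribute; it is overridden by redefinition in the subclass and cannot be passed to the constructor.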