Adding the OAS Dataset: Downloading and Preprocessing#
Adding a new dataset to BioNeMo is a common task. This tutorial shows how to accomplish it, using the Observed Antibody Space (OAS) dataset as an example. OAS is a database of antibody sequences containing over one billion sequences from 80 different studies, curated for large-scale analysis.
Adding a new dataset breaks down into three development tasks, each of which can make use of associated base and helper classes in BioNeMo and NeMo. Here, the dataset will be added to the ESM1-nv pre-training pipeline:
1. Preprocessing, which includes downloading the raw data and any additional preparation steps, such as extracting the files. It also includes dividing the data into train, validation, and test splits. This step can make use of two BioNeMo base classes, RemoteResource and ResourcePreprocessor, from bionemo.utils.remote and bionemo.data.preprocess, respectively. Their use is optional, but they provide basic functionality that can accelerate development. This step is covered by the current tutorial.
2. Development of the new dataset class. Here, the NeMo dataset class CSVMemMapDataset will be used. This task is covered by the next tutorial, Modifying the Dataset Class.
3. Modification of the dataloader classes. This task is covered by the third tutorial, Adding a Custom Dataloader.
Setup and Assumptions#
This tutorial assumes that a copy of the BioNeMo framework repo exists on a workstation or server and has been mounted inside the container at /workspace/bionemo, as described in the Code Development section of the Quickstart Guide. This path will be referred to by the variable BIONEMO_WORKSPACE throughout the tutorial.
All commands should be executed inside the BioNeMo docker container.
Accessing OAS Dataset#
The paired-sequence subset of the data will be used for this tutorial, which requires a shell script containing the URLs for the appropriate files. This script cannot be downloaded directly from the website; it must be generated from the paired sequences search page by selecting “search” without choosing any attributes, as described here.
The full dataset currently contains links to 158 sequence files. This tutorial uses a subset of the data: the first ten files. The contents of the script are shown below and, if preferred, can be copied directly instead of downloading from OAS. Save this file to $BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_paired_subset_download.sh. The contents of oas_paired_subset_download.sh should look like this:
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
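As a quick sanity check, assuming the script keeps this `wget <url>` format, the URLs and target filenames can be parsed with nothing but the standard library (the two sample lines below are taken from the script above):

```python
import os
import re

# Two lines in the same "wget <url>" format as oas_paired_subset_download.sh
script_lines = [
    "wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz",
    "wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz",
]

# The URL is the last whitespace-separated token on each line
urls = [re.split(r"\s+", line.strip())[-1] for line in script_lines]
filenames = [os.path.basename(url) for url in urls]
print(filenames)  # ['SRR11528761_paired.csv.gz', 'SRR10358523_paired.csv.gz']
```

The preprocessing class developed below uses this same split-and-take-last-token pattern.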
Downloading and Verifying Data#
The RemoteResource class is used to create the download location (if needed), download a file, and verify its checksum. If the dataset contains multiple files (as is the case with OAS data), multiple RemoteResource instances are used. In practice, this class is rarely interacted with directly; it is usually called as part of the second class, ResourcePreprocessor, which will be used as the base class for the OAS preprocessing class.
Creating the OAS preprocessing class requires the implementation of two methods:
- get_remote_resources, which creates a RemoteResource (URL plus checksum) for each file
- prepare, which downloads and preprocesses the data and splits it into train, validation, and test datasets
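RemoteResource compares a downloaded file against an MD5 checksum. When the checksums for new files are not yet known, they can be computed with Python's standard library alone; `md5_checksum` below is an illustrative helper, not part of BioNeMo:

```python
import hashlib

def md5_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small temporary file
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
print(md5_checksum(tmp.name))  # 5d41402abc4b2a76b9719d911017c592
os.remove(tmp.name)
```

Later in this tutorial, the same digests are computed inline with `hashlib.md5(fh.read()).hexdigest()`; the chunked variant here is simply safer for multi-gigabyte files.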
Data Preprocessing Class#
First, let’s create the functionality to download the files. In the same directory as oas_paired_subset_download.sh, create a file called oas_preprocess.py. The path will be $BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_preprocess.py. In this file, create a class based on ResourcePreprocessor that parses the URLs in the download script, returns a RemoteResource for each URL, and downloads the files referenced by the URLs.
Here is an example of such a class for oas_preprocess.py. This class saves the downloaded files to /data/OASpaired/raw.
from bionemo.data.preprocess import ResourcePreprocessor
from bionemo.utils.remote import RemoteResource
from nemo.utils import logging
from dataclasses import dataclass
from typing import List
import re
import os

__all__ = ['OASPairedPreprocess']

# BIONEMO_HOME = os.getenv('BIONEMO_HOME', '/workspace/bionemo')
BIONEMO_HOME = '/workspace/bionemo'  # FIXME
OAS_DOWNLOAD_LINKS_PATH = f'{BIONEMO_HOME}/bionemo/data/preprocess/protein/oas_paired_subset_download.sh'


@dataclass
class OASPairedPreprocess(ResourcePreprocessor):
    """OASPairedPreprocess to download and preprocess OAS paired antibody sequence data."""
    root_directory: str = '/data/OASpaired'
    dest_directory: str = 'raw'

    def get_remote_resources(self, download_script_path: str = OAS_DOWNLOAD_LINKS_PATH) -> List[RemoteResource]:
        """Build a RemoteResource for each URL listed in the download script."""
        # Load the download script and parse the URLs (the URL is the last token on each line)
        with open(download_script_path, 'r') as fh:
            url_list = [re.split(r'\s+', x.strip())[-1] for x in fh.readlines()]
        logging.info(f"The following URLs were parsed: {url_list}")

        # Checksums will be added later
        checksums = {'SRR11528761_paired.csv.gz': None,
                     'SRR11528762_paired.csv.gz': None,
                     'SRR10358523_paired.csv.gz': None,
                     'SRR10358524_paired.csv.gz': None,
                     'SRR10358525_paired.csv.gz': None,
                     'SRR9179273_paired.csv.gz': None,
                     'SRR9179274_paired.csv.gz': None,
                     'SRR9179275_paired.csv.gz': None,
                     'SRR9179276_paired.csv.gz': None,
                     'SRR9179277_paired.csv.gz': None}

        resources = list()
        for url in url_list:
            filename = os.path.basename(url)
            resource = RemoteResource(
                dest_directory=self.dest_directory,
                dest_filename=filename,
                root_directory=self.root_directory,
                checksum=checksums.get(filename),
                url=url
            )
            resources.append(resource)
        return resources

    def prepare(self):
        pass
Custom YAML Config#
A custom YAML configuration file is useful for making changes to the model and training configuration parameters. Copy the file $BIONEMO_WORKSPACE/examples/protein/esm1nv/conf/pretrain_small.yaml to examples/protein/esm1nv/conf/pretrain_oas.yaml.
Make the following modifications to the new file:
- Delete the entire downstream task validation portion of the model section (model.dwnstr_task_validation). This can be reintroduced in the future to enable that functionality, but for now removing it simplifies working with the configuration file.
- Give the training run a new name; here esm1nv-oas has been chosen.
- Set do_training to False, since the focus is currently data preprocessing.
- Disable Weights and Biases logging, since it won't be used for preprocessing, by creating an exp_manager section and setting create_wandb_logger to False.
Here is what the new YAML config file looks like:
defaults:
  - base_config

###### Begin OAS Related Additions ######
name: esm1nv-oas  ### Add OAS to the name
do_training: False  ### Set to False for preprocessing or True for training
exp_manager:
  create_wandb_logger: False  ### Disable Weights and Biases logger for demo
###### End OAS Related Additions ######

restore_from_path: null # used when starting from a .nemo file

model:
  tokenizer:
    library: 'sentencepiece'
    type: null
    model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
    vocab_file: /tokenizers/vocab/protein_sequence_sentencepiece.vocab
  data:
    dataset_path: /data/uniref2022_05 # parent directory for data, contains train / val / test folders. Needs to be writeable for index creation.
    dataset: # inclusive range of data files to load x[000..049] or can be a single file, e.g. x000
      train: x[000..049]
      test: x[000..049]
      val: x[000..049]
    micro_batch_size: ${model.micro_batch_size}
    num_workers: 10
    data_impl_kwargs:
      csv_mmap:
        data_col: 3 # 0-based
    modify_percent: 0.1 # Percentage of characters in a protein sequence to modify. (Modification means replacing with another amino acid or with a mask token)
    perturb_percent: 0.5 # Of the modify_percent, what percentage of characters are to be replaced with another amino acid.
Python Execution Script#
A Python script to execute the job will also need to be created. In the directory examples/protein/esm1nv, copy the existing pre-training script pretrain.py to pretrain_oas.py. This file will perform the preprocessing and, once the pipeline is completed, run the pre-training.
Make the following changes to the new pre-training file:
- Remove the imports for UniRef50Preprocess and FLIPPreprocess.
- Add an import for OASPairedPreprocess from bionemo.data.preprocess.protein.oas_preprocess.
- Modify the section with the log Starting Preprocessing so that it downloads the data and calculates the MD5 checksum for each of the OAS files.
Here is an example:
from omegaconf.omegaconf import OmegaConf
from nemo.core.config import hydra_runner
from nemo.utils import logging
from bionemo.model.protein.esm1nv import ESM1nvModel
from bionemo.model.utils import setup_trainer
from bionemo.utils import BioNeMoSaveRestoreConnector
from bionemo.utils.callbacks.callback_utils import setup_callbacks
from bionemo.data.preprocess.protein.oas_preprocess import OASPairedPreprocess  ### Import OAS preprocessor
import os, hashlib  ### Used for checksum calculation


@hydra_runner(config_path="conf", config_name="pretrain_oas")  ### Custom YAML config file
def main(cfg) -> None:
    logging.info("\n\n************** Experiment configuration ***********")
    logging.info(f'\n{OmegaConf.to_yaml(cfg)}')

    callbacks = setup_callbacks(cfg)
    trainer = setup_trainer(cfg, callbacks=callbacks)

    if cfg.do_training:
        logging.info("************** Starting Training ***********")
        model = ESM1nvModel(cfg.model, trainer)
        trainer.fit(model)
        logging.info("************** Finished Training ***********")
    else:
        logging.info("************** Calculating Checksums ***********")
        ### Changes to calculate checksums
        oas_filepaths = [resource.download_resource(overwrite=True)
                         for resource in OASPairedPreprocess().get_remote_resources()]
        for fully_qualified_dest_filename in oas_filepaths:
            with open(fully_qualified_dest_filename, 'rb') as fh:
                filename = os.path.basename(fully_qualified_dest_filename)
                logging.info(f"\"{filename}\": \"{hashlib.md5(fh.read()).hexdigest()}\"")


if __name__ == '__main__':
    main()
Testing#
Run the pipeline with the following command:
cd examples/protein/esm1nv
python pretrain_oas.py
The end of the logged output is shown below:
[NeMo I 2023-08-17 16:44:39 pretrain_oas:26] ************** Calculating Checksums ***********
[NeMo I 2023-08-17 16:44:39 oas_preprocess:27] The following URLs were parsed: ['https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz']
[NeMo I 2023-08-17 16:44:39 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:44:44 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:44:44 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:45:18 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:18 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:45:20 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:20 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:45:22 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:22 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:45:24 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:24 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:45:31 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:31 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:45:56 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:45:56 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:46:05 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:46:05 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:46:55 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:46:55 remote:109] Downloading resource: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:47:59 remote:129] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR11528761_paired.csv.gz": "3b671ee3d376445fdafd89932cb4687e"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR11528762_paired.csv.gz": "69988520b12162b1f0613a55236d13a7"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR10358523_paired.csv.gz": "fb5f7242f1f2b555c0bb798da449454e"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR10358524_paired.csv.gz": "5db80ccbf8f47c855daa0d4d13d13d59"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR10358525_paired.csv.gz": "d9df83c314bc426bb7ad00a3375a5994"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179273_paired.csv.gz": "8e91cb8b719c4a2d30b2467cc5f6f080"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179274_paired.csv.gz": "cc7b54cf168a86012773073bc6016cd2"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179275_paired.csv.gz": "1660af663e0bdcf9e4a2dc0d8f79bfae"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179276_paired.csv.gz": "56336af0a20afde929c263e628be6828"
[NeMo I 2023-08-17 16:47:59 pretrain_oas:35] "SRR9179277_paired.csv.gz": "48ac0e0f4ded0df1e345c7fbbb161601"
Results#
This should have downloaded the ten sequence files and calculated their checksums. The files are found in the path (root_directory/dest_directory) defined in the preprocessing class, i.e. /data/OASpaired/raw:
SRR10358523_paired.csv.gz SRR11528762_paired.csv.gz SRR9179276_paired.csv.gz
SRR10358524_paired.csv.gz SRR9179273_paired.csv.gz SRR9179277_paired.csv.gz
SRR10358525_paired.csv.gz SRR9179274_paired.csv.gz
SRR11528761_paired.csv.gz SRR9179275_paired.csv.gz
Decompressing OAS Sequence Files#
Data Preprocessing Class#
Now, the functionality to finish processing the data will be added. Make the following edits to bionemo/data/preprocess/protein/oas_preprocess.py:
- Add the calculated checksums to the checksums dictionary in get_remote_resources.
- Create a method prepare_resource that downloads each file, performs any additional processing (such as unzipping the files), and returns the final, full path of each file.
- Create a method prepare that runs prepare_resource for each file.
from bionemo.data.preprocess import ResourcePreprocessor
from bionemo.utils.remote import RemoteResource
from nemo.utils import logging
from dataclasses import dataclass
from typing import List
import re
import os
import gzip
import shutil

__all__ = ['OASPairedPreprocess']

# BIONEMO_HOME = os.getenv('BIONEMO_HOME', '/workspace/bionemo')
BIONEMO_HOME = '/workspace/bionemo'  # FIXME
OAS_DOWNLOAD_LINKS_PATH = f'{BIONEMO_HOME}/bionemo/data/preprocess/protein/oas_paired_subset_download.sh'


@dataclass
class OASPairedPreprocess(ResourcePreprocessor):
    """OASPairedPreprocess to download and preprocess OAS paired antibody sequence data."""
    root_directory: str = '/data'
    dest_directory: str = 'OASpaired/raw'

    def get_remote_resources(self, download_script_path: str = OAS_DOWNLOAD_LINKS_PATH) -> List[RemoteResource]:
        """Build a RemoteResource, with its expected checksum, for each URL in the download script."""
        with open(download_script_path, 'r') as fh:
            url_list = [re.split(r'\s+', x.strip())[-1] for x in fh.readlines()]

        # Add calculated checksums
        checksums = {"SRR11528761_paired.csv.gz": "3b671ee3d376445fdafd89932cb4687e",
                     "SRR11528762_paired.csv.gz": "69988520b12162b1f0613a55236d13a7",
                     "SRR10358523_paired.csv.gz": "fb5f7242f1f2b555c0bb798da449454e",
                     "SRR10358524_paired.csv.gz": "5db80ccbf8f47c855daa0d4d13d13d59",
                     "SRR10358525_paired.csv.gz": "d9df83c314bc426bb7ad00a3375a5994",
                     "SRR9179273_paired.csv.gz": "8e91cb8b719c4a2d30b2467cc5f6f080",
                     "SRR9179274_paired.csv.gz": "cc7b54cf168a86012773073bc6016cd2",
                     "SRR9179275_paired.csv.gz": "1660af663e0bdcf9e4a2dc0d8f79bfae",
                     "SRR9179276_paired.csv.gz": "56336af0a20afde929c263e628be6828",
                     "SRR9179277_paired.csv.gz": "48ac0e0f4ded0df1e345c7fbbb161601"}

        resources = list()
        for url in url_list:
            filename = os.path.basename(url)
            resource = RemoteResource(
                dest_directory=self.dest_directory,
                dest_filename=filename,
                root_directory=self.root_directory,
                checksum=checksums.get(filename),
                url=url
            )
            resources.append(resource)
        return resources

    def prepare_resource(self,
                         resource: RemoteResource,
                         delete_gzipped: bool = False) -> str:
        """Logs and downloads the passed resource.

        resource: RemoteResource - Resource to be prepared.
        delete_gzipped: boolean, default: False - Delete the gzipped file once extracted.

        Returns - the absolute destination path for the downloaded resource
        """
        logging.info(f"Downloading {resource.url}")
        fully_qualified_gz_filename = resource.download_resource(overwrite=False)

        logging.info("Extracting the gzipped file")
        fully_qualified_dest_filename = os.path.splitext(fully_qualified_gz_filename)[0]
        with gzip.open(fully_qualified_gz_filename, 'rb') as f_gz:
            with open(fully_qualified_dest_filename, 'wb') as f_ext:
                shutil.copyfileobj(f_gz, f_ext)

        if delete_gzipped:
            os.remove(fully_qualified_gz_filename)  # the gzipped artifact is a file, so os.remove rather than shutil.rmtree

        return fully_qualified_dest_filename

    def prepare(self):
        return [
            self.prepare_resource(resource) for resource in self.get_remote_resources()
        ]
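The extraction step in prepare_resource (gzip.open streamed into the destination through shutil.copyfileobj) can be exercised on its own with only the standard library. This sketch round-trips a small file in a temporary directory; all paths and contents here are illustrative:

```python
import gzip
import os
import shutil
import tempfile

# Create a small gzipped file to stand in for a downloaded *_paired.csv.gz
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "example_paired.csv.gz")
with gzip.open(gz_path, "wb") as f_gz:
    f_gz.write(b"sequence\nEVQLLESGGG\n")

# Same pattern as prepare_resource: strip the ".gz" suffix and stream-decompress
dest_path = os.path.splitext(gz_path)[0]  # -> .../example_paired.csv
with gzip.open(gz_path, "rb") as f_gz, open(dest_path, "wb") as f_ext:
    shutil.copyfileobj(f_gz, f_ext)

with open(dest_path, "rb") as fh:
    print(fh.read())  # b'sequence\nEVQLLESGGG\n'

os.remove(gz_path)  # a single file, so os.remove (shutil.rmtree is for directories)
```

Streaming with copyfileobj keeps memory usage flat regardless of file size, which matters for the larger OAS files.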
Python Execution Script#
Now, modify the Python pre-training script so that it creates an instance of the class and runs the prepare method. This is the final set of changes that need to be made to the pre-training script.
### Simplified ESM1nv pre-training script to demonstrate the new dataset
from omegaconf.omegaconf import OmegaConf
from bionemo.model.protein.esm1nv import ESM1nvModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from bionemo.model.utils import setup_trainer
from bionemo.utils.callbacks.callback_utils import setup_callbacks
from bionemo.data.preprocess.protein.oas_preprocess import OASPairedPreprocess  ### Import new preprocessor


@hydra_runner(config_path="conf", config_name="pretrain_oas")  ### Custom YAML config file
def main(cfg) -> None:
    logging.info("\n\n************** Experiment configuration ***********")
    logging.info(f'\n{OmegaConf.to_yaml(cfg)}')

    callbacks = setup_callbacks(cfg)
    trainer = setup_trainer(cfg, callbacks=callbacks)

    if cfg.do_training:
        logging.info("************** Starting Training ***********")
        model = ESM1nvModel(cfg.model, trainer)
        trainer.fit(model)
        logging.info("************** Finished Training ***********")
    else:
        logging.info("************** Starting Preprocessing ***********")
        preprocessor = OASPairedPreprocess()  ### Create instance of preprocess class
        preprocessor.prepare()  ### Prepare data


if __name__ == '__main__':
    main()
Testing#
Execute the pre-train script as before:
cd examples/protein/esm1nv
python pretrain_oas.py
Below is the relevant portion of the log statements:
[NeMo I 2023-08-17 16:48:11 pretrain_oas:27] ************** Starting Preprocessing ***********
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
Results#
The original files already existed, so they were not downloaded again, but each of the files has now been extracted.
SRR10358523_paired.csv SRR11528761_paired.csv.gz SRR9179275_paired.csv
SRR10358523_paired.csv.gz SRR11528762_paired.csv SRR9179275_paired.csv.gz
SRR10358524_paired.csv SRR11528762_paired.csv.gz SRR9179276_paired.csv
SRR10358524_paired.csv.gz SRR9179273_paired.csv SRR9179276_paired.csv.gz
SRR10358525_paired.csv SRR9179273_paired.csv.gz SRR9179277_paired.csv
SRR10358525_paired.csv.gz SRR9179274_paired.csv SRR9179277_paired.csv.gz
SRR11528761_paired.csv SRR9179274_paired.csv.gz
The CSV files contain an extra metadata row at the top and many additional columns; the first three lines of SRR10358523_paired.csv are shown below as an example. The extra columns will increase seek time during training, so they should be removed. The files also need to be divided into consecutively numbered training, validation, and test splits.
"{""Run"": ""SRR10358523"", ""Link"": ""https://doi.org/10.1016/j.celrep.2019.12.027"", ""Author"": ""Eccles et al., 2020"", ""Species"": ""human"", ""Age"": ""33"", ""BSource"": ""PBMC"", ""BType"": ""RV+B-Cells"", ""Vaccine"": ""None"", ""Disease"": ""None"", ""Subject"": ""Healthy-1"", ""Longitudinal"": ""no"", ""Unique sequences"": 100, ""Isotype"": ""All"", ""Chain"": ""Paired""}"
sequence_id_heavy,sequence_heavy,locus_heavy,stop_codon_heavy,vj_in_frame_heavy,productive_heavy,rev_comp_heavy,v_call_heavy,d_call_heavy,j_call_heavy,sequence_alignment_heavy,germline_alignment_heavy,sequence_alignment_aa_heavy,germline_alignment_aa_heavy,v_alignment_start_heavy,v_alignment_end_heavy,d_alignment_start_heavy,d_alignment_end_heavy,j_alignment_start_heavy,j_alignment_end_heavy,v_sequence_alignment_heavy,v_sequence_alignment_aa_heavy,v_germline_alignment_heavy,v_germline_alignment_aa_heavy,d_sequence_alignment_heavy,d_sequence_alignment_aa_heavy,d_germline_alignment_heavy,d_germline_alignment_aa_heavy,j_sequence_alignment_heavy,j_sequence_alignment_aa_heavy,j_germline_alignment_heavy,j_germline_alignment_aa_heavy,fwr1_heavy,fwr1_aa_heavy,cdr1_heavy,cdr1_aa_heavy,fwr2_heavy,fwr2_aa_heavy,cdr2_heavy,cdr2_aa_heavy,fwr3_heavy,fwr3_aa_heavy,cdr3_heavy,cdr3_aa_heavy,junction_heavy,junction_length_heavy,junction_aa_heavy,junction_aa_length_heavy,v_score_heavy,d_score_heavy,j_score_heavy,v_cigar_heavy,d_cigar_heavy,j_cigar_heavy,v_support_heavy,d_support_heavy,j_support_heavy,v_identity_heavy,d_identity_heavy,j_identity_heavy,v_sequence_start_heavy,v_sequence_end_heavy,v_germline_start_heavy,v_germline_end_heavy,d_sequence_start_heavy,d_sequence_end_heavy,d_germline_start_heavy,d_germline_end_heavy,j_sequence_start_heavy,j_sequence_end_heavy,j_germline_start_heavy,j_germline_end_heavy,fwr1_start_heavy,fwr1_end_heavy,cdr1_start_heavy,cdr1_end_heavy,fwr2_start_heavy,fwr2_end_heavy,cdr2_start_heavy,cdr2_end_heavy,fwr3_start_heavy,fwr3_end_heavy,cdr3_start_heavy,cdr3_end_heavy,np1_heavy,np1_length_heavy,np2_heavy,np2_length_heavy,sequence_id_light,sequence_light,locus_light,stop_codon_light,vj_in_frame_light,productive_light,rev_comp_light,v_call_light,d_call_light,j_call_light,sequence_alignment_light,germline_alignment_light,sequence_alignment_aa_light,germline_alignment_aa_light,v_alignment_start_light,v_alignment_end_light,d_alignment_start_light,d_alignment_e
nd_light,j_alignment_start_light,j_alignment_end_light,v_sequence_alignment_light,v_sequence_alignment_aa_light,v_germline_alignment_light,v_germline_alignment_aa_light,d_sequence_alignment_light,d_sequence_alignment_aa_light,d_germline_alignment_light,d_germline_alignment_aa_light,j_sequence_alignment_light,j_sequence_alignment_aa_light,j_germline_alignment_light,j_germline_alignment_aa_light,fwr1_light,fwr1_aa_light,cdr1_light,cdr1_aa_light,fwr2_light,fwr2_aa_light,cdr2_light,cdr2_aa_light,fwr3_light,fwr3_aa_light,cdr3_light,cdr3_aa_light,junction_light,junction_length_light,junction_aa_light,junction_aa_length_light,v_score_light,d_score_light,j_score_light,v_cigar_light,d_cigar_light,j_cigar_light,v_support_light,d_support_light,j_support_light,v_identity_light,d_identity_light,j_identity_light,v_sequence_start_light,v_sequence_end_light,v_germline_start_light,v_germline_end_light,d_sequence_start_light,d_sequence_end_light,d_germline_start_light,d_germline_end_light,j_sequence_start_light,j_sequence_end_light,j_germline_start_light,j_germline_end_light,fwr1_start_light,fwr1_end_light,cdr1_start_light,cdr1_end_light,fwr2_start_light,fwr2_end_light,cdr2_start_light,cdr2_end_light,fwr3_start_light,fwr3_end_light,cdr3_start_light,cdr3_end_light,np1_light,np1_length_light,np2_light,np2_length_light,ANARCI_numbering_light,ANARCI_numbering_heavy,ANARCI_status_light,ANARCI_status_heavy
AAACCTGGTCCGTCAG-1_contig_2,AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCATTTGGTGATCAGGACTGAACAGAGAGAACTCACCATGGAGTTTGGGCTGAGCTGGCTTTTTCTTGTGGCTATTTTAAAAGGTGTCCAGTGTGAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGATTGGCCGTTTTGGCAGTGGCTGGTAAGAAGGGGGGAGCGGTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGATACGAGCAGCGTG,H,F,T,T,F,IGHV3-23*01,IGHD6-19*01,IGHJ4*02,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGATTGGCCGTTTTGGCAGTGGCTGGTAAGAAGGGGGGAGCGGTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGANNNNNNNNNNNNGCAGTGGCTGGTANNNNNNNNNNNNNNNTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG,EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDWPFWQWLVRRGERFDYWGQGTLVTVSS,EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKXXXXXQWLVXXXXXFDYWGQGTLVTVSS,1.0,296.0,309.0,321.0,337.0,379.0,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAA
GA,EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAK,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGA,EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAK,GCAGTGGCTGGTA,QWLV,GCAGTGGCTGGTA,QWLV,TTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG,FDYWGQGTLVTVSS,TTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG,FDYWGQGTLVTVSS,GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCT,EVQLLESGGGLVQPGGSLRLSCAAS,GGATTCACCTTTAGCAGCTATGCC,GFTFSSYA,ATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCT,MSWVRQAPGKGLEWVSA,ATTAGTGGTAGTGGTGGTAGCACA,ISGSGGST,TACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGT,YYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYC,GCGAAAGATTGGCCGTTTTGGCAGTGGCTGGTAAGAAGGGGGGAGCGGTTTGACTAC,AKDWPFWQWLVRRGERFDY,TGTGCGAAAGATTGGCCGTTTTGGCAGTGGCTGGTAAGAAGGGGGGAGCGGTTTGACTACTGG,63.0,CAKDWPFWQWLVRRGERFDYW,21.0,463.037,25.682,83.363,136S296M154S,444S7N13M129S1N,472S5N43M71S,2.7109999999999982e-132,0.004878,4.5910000000000006e-20,100.0,100.0,100.0,137.0,432.0,1.0,296.0,445.0,457.0,8.0,20.0,473.0,515.0,6.0,48.0,137.0,211.0,212.0,235.0,236.0,286.0,287.0,310.0,311.0,424.0,425.0,481.0,TTGGCCGTTTTG,12.0,AGAAGGGGGGAGCGG,15.0,AAACCTGGTCCGTCAG-1_contig_1,TGGGGAGGAATCAGTCCCACTCAGGACACAGCATGGACATGAGGGTCCCCGCTCAGCTCCTGGGGCTCCTGCTGCTCTGGTTCCCAGGTGCCAGGTGTGACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTCGAACGTTCGGCCAAGGGACCAAGGTGGAAATCAAACGAACTGTGGCTGCACCA
TCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGC,K,F,T,T,F,IGKV1-17*01,,IGKJ1*01,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTCGAACGTTCGGCCAAGGGACCAAGGTGGAAATCAAAC,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTCNNACGTTCGGCCAAGGGACCAAGGTGGAAATCAAAC,DIQMTQSPSSLSASVGDRVTITCRASQGIRNDLGWYQQKPGKAPKRLIYAASSLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYCLQHNSYPRTFGQGTKVEIK,DIQMTQSPSSLSASVGDRVTITCRASQGIRNDLGWYQQKPGKAPKRLIYAASSLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYCLQHNSYPXTFGQGTKVEIK,1,286,,,289.0,322.0,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTC,DIQMTQSPSSLSASVGDRVTITCRASQGIRNDLGWYQQKPGKAPKRLIYAASSLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYCLQHNSYP,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGTCAGGGCATTAGAAATGATTTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTATGCTGCATCCAGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTC,DIQMTQSPSSLSASVGDRVTITCRASQGIRNDLGWYQQKPGKAPKRLIYAASSLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYCLQHNSYP,,,,,ACGTTCGGCCAAGGGACCAAGGTGGAAATCAAAC,TFGQGTKVEIK,ACGTTCGGCCAAGGGACCAAGGTGGAAATCAAAC,TFGQGTKVEIK,GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCAAGT,DIQMTQSPSSLSASVGDRVTITC
RAS,CAGGGCATTAGAAATGAT,QGIRND,TTAGGCTGGTATCAGCAGAAACCAGGGAAAGCCCCTAAGCGCCTGATCTAT,LGWYQQKPGKAPKRLIY,GCTGCATCC,AAS,AGTTTGCAAAGTGGGGTCCCATCAAGGTTCAGCGGCAGTGGATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGT,SLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYC,CTACAGCATAATAGTTACCCTCGAACG,LQHNSYPRT,TGTCTACAGCATAATAGTTACCCTCGAACGTTC,33.0,CLQHNSYPRTF,11.0,447.456,,66.059,98S286M172S1N,,386S4N34M136S,1.2569999999999987e-127,,7.0420000000000004e-15,100.0,,100.0,99,384,1,286,,,,,387.0,420.0,5.0,38.0,99.0,176.0,177.0,194.0,195.0,245.0,246.0,254.0,255.0,362.0,363.0,389.0,GA,2.0,,,"{'fwk1': {'1 ': 'D', '2 ': 'I', '3 ': 'Q', '4 ': 'M', '5 ': 'T', '6 ': 'Q', '7 ': 'S', '8 ': 'P', '9 ': 'S', '10 ': 'S', '11 ': 'L', '12 ': 'S', '13 ': 'A', '14 ': 'S', '15 ': 'V', '16 ': 'G', '17 ': 'D', '18 ': 'R', '19 ': 'V', '20 ': 'T', '21 ': 'I', '22 ': 'T', '23 ': 'C', '24 ': 'R', '25 ': 'A', '26 ': 'S'}, 'cdrk1': {'27 ': 'Q', '28 ': 'G', '29 ': 'I', '36 ': 'R', '37 ': 'N', '38 ': 'D'}, 'fwk2': {'39 ': 'L', '40 ': 'G', '41 ': 'W', '42 ': 'Y', '43 ': 'Q', '44 ': 'Q', '45 ': 'K', '46 ': 'P', '47 ': 'G', '48 ': 'K', '49 ': 'A', '50 ': 'P', '51 ': 'K', '52 ': 'R', '53 ': 'L', '54 ': 'I', '55 ': 'Y'}, 'cdrk2': {'56 ': 'A', '57 ': 'A', '65 ': 'S'}, 'fwk3': {'66 ': 'S', '67 ': 'L', '68 ': 'Q', '69 ': 'S', '70 ': 'G', '71 ': 'V', '72 ': 'P', '74 ': 'S', '75 ': 'R', '76 ': 'F', '77 ': 'S', '78 ': 'G', '79 ': 'S', '80 ': 'G', '83 ': 'S', '84 ': 'G', '85 ': 'T', '86 ': 'E', '87 ': 'F', '88 ': 'T', '89 ': 'L', '90 ': 'T', '91 ': 'I', '92 ': 'S', '93 ': 'S', '94 ': 'L', '95 ': 'Q', '96 ': 'P', '97 ': 'E', '98 ': 'D', '99 ': 'F', '100 ': 'A', '101 ': 'T', '102 ': 'Y', '103 ': 'Y', '104 ': 'C'}, 'cdrk3': {'105 ': 'L', '106 ': 'Q', '107 ': 'H', '108 ': 'N', '109 ': 'S', '114 ': 'Y', '115 ': 'P', '116 ': 'R', '117 ': 'T'}, 'fwk4': {'118 ': 'F', '119 ': 'G', '120 ': 'Q', '121 ': 'G', '122 ': 'T', '123 ': 'K', '124 ': 'V', '125 ': 'E', '126 ': 'I', '127 ': 'K'}}","{'fwh1': {'1 ': 'E', '2 ': 'V', '3 ': 'Q', '4 
': 'L', '5 ': 'L', '6 ': 'E', '7 ': 'S', '8 ': 'G', '9 ': 'G', '11 ': 'G', '12 ': 'L', '13 ': 'V', '14 ': 'Q', '15 ': 'P', '16 ': 'G', '17 ': 'G', '18 ': 'S', '19 ': 'L', '20 ': 'R', '21 ': 'L', '22 ': 'S', '23 ': 'C', '24 ': 'A', '25 ': 'A', '26 ': 'S'}, 'cdrh1': {'27 ': 'G', '28 ': 'F', '29 ': 'T', '30 ': 'F', '35 ': 'S', '36 ': 'S', '37 ': 'Y', '38 ': 'A'}, 'fwh2': {'39 ': 'M', '40 ': 'S', '41 ': 'W', '42 ': 'V', '43 ': 'R', '44 ': 'Q', '45 ': 'A', '46 ': 'P', '47 ': 'G', '48 ': 'K', '49 ': 'G', '50 ': 'L', '51 ': 'E', '52 ': 'W', '53 ': 'V', '54 ': 'S', '55 ': 'A'}, 'cdrh2': {'56 ': 'I', '57 ': 'S', '58 ': 'G', '59 ': 'S', '62 ': 'G', '63 ': 'G', '64 ': 'S', '65 ': 'T'}, 'fwh3': {'66 ': 'Y', '67 ': 'Y', '68 ': 'A', '69 ': 'D', '70 ': 'S', '71 ': 'V', '72 ': 'K', '74 ': 'G', '75 ': 'R', '76 ': 'F', '77 ': 'T', '78 ': 'I', '79 ': 'S', '80 ': 'R', '81 ': 'D', '82 ': 'N', '83 ': 'S', '84 ': 'K', '85 ': 'N', '86 ': 'T', '87 ': 'L', '88 ': 'Y', '89 ': 'L', '90 ': 'Q', '91 ': 'M', '92 ': 'N', '93 ': 'S', '94 ': 'L', '95 ': 'R', '96 ': 'A', '97 ': 'E', '98 ': 'D', '99 ': 'T', '100 ': 'A', '101 ': 'V', '102 ': 'Y', '103 ': 'Y', '104 ': 'C'}, 'cdrh3': {'105 ': 'A', '106 ': 'K', '107 ': 'D', '108 ': 'W', '109 ': 'P', '110 ': 'F', '111 ': 'W', '111A': 'Q', '111B': 'W', '111C': 'L', '112C': 'V', '112B': 'R', '112A': 'R', '112 ': 'G', '113 ': 'E', '114 ': 'R', '115 ': 'F', '116 ': 'D', '117 ': 'Y'}, 'fwh4': {'118 ': 'W', '119 ': 'G', '120 ': 'Q', '121 ': 'G', '122 ': 'T', '123 ': 'L', '124 ': 'V', '125 ': 'T', '126 ': 'V', '127 ': 'S', '128 ': 'S'}}",|||||,"|Deletions: 10, 73||||"
Cleaning and Splitting OAS Sequence Files#
Data Preprocessing Class#
A new method, process_files, will be created to clean up the files and create train, validation, and test splits. For this exercise, only the columns containing the sequence ID and the sequence for the antibody heavy chain will be retained (sequence_id_heavy, sequence_heavy).
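The core of this cleanup is a single pandas read with skiprows and usecols. Below is a minimal sketch of that pattern on a hypothetical in-memory stand-in for a raw file (the raw files carry an extra first line before the column header, which is why the class passes skiprows=[0]; the filenames and sequences here are illustrative only):

```python
import io

import pandas as pd

# Hypothetical stand-in for a raw paired CSV: an extra metadata line,
# then the header, then the records (sequences shortened for brevity).
raw = (
    '{"Run": "SRR0000000", "Chain": "Paired"}\n'
    "sequence_id_heavy,sequence_heavy,sequence_id_light,sequence_light\n"
    "contig_1,GAGGTGCAGCTG,contig_2,GACATCCAGATG\n"
)

# skiprows=[0] drops the metadata line; usecols keeps only the heavy-chain columns.
columns_to_keep = ["sequence_id_heavy", "sequence_heavy"]
data = pd.read_csv(io.StringIO(raw), skiprows=[0], usecols=columns_to_keep)
print(list(data.columns))  # -> ['sequence_id_heavy', 'sequence_heavy']
```

The same two calls, read_csv followed by to_csv with index=False, are all that process_files needs per file.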
Edit the file $BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_preprocess.py
to add this functionality. These are the final edits that will need to be made to the preprocessing class. Here is an example of such a file:
from bionemo.data.preprocess import ResourcePreprocessor
from bionemo.utils.remote import RemoteResource
from nemo.utils import logging
from dataclasses import dataclass
from typing import List
import re
import os
import gzip
import shutil
import random
import pandas as pd
__all__ = ['OASPairedPreprocess']
BIONEMO_HOME = os.getenv('BIONEMO_HOME', '/workspace/bionemo')
OAS_DOWNLOAD_LINKS_PATH = f'{BIONEMO_HOME}/bionemo/data/preprocess/protein/oas_paired_subset_download.sh'
@dataclass
class OASPairedPreprocess(ResourcePreprocessor):
"""OASPairedPreprocess to download and preprocess OAS paired antibody data for heavy chains."""
random_seed: int = 0
root_directory: str = '/data'
dest_directory: str = 'OASpaired/raw'
processed_directory: str = 'OASpaired/processed/heavy'
columns_to_keep = ['sequence_id_heavy', 'sequence_heavy']
num_val_files = 2
num_test_files = 2
def get_remote_resources(self, download_script_path:str = OAS_DOWNLOAD_LINKS_PATH) -> List[RemoteResource]:
        """Build a RemoteResource, with its checksum, for each URL in the download script."""
with open(download_script_path, 'r') as fh:
            url_list = [re.split(r'\s+', x.strip())[1] for x in fh.readlines()]
# Add calculated checksums
checksums = {"SRR11528761_paired.csv.gz": "3b671ee3d376445fdafd89932cb4687e",
"SRR11528762_paired.csv.gz": "69988520b12162b1f0613a55236d13a7",
"SRR10358523_paired.csv.gz": "fb5f7242f1f2b555c0bb798da449454e",
"SRR10358524_paired.csv.gz": "5db80ccbf8f47c855daa0d4d13d13d59",
"SRR10358525_paired.csv.gz": "d9df83c314bc426bb7ad00a3375a5994",
"SRR9179273_paired.csv.gz": "8e91cb8b719c4a2d30b2467cc5f6f080",
"SRR9179274_paired.csv.gz": "cc7b54cf168a86012773073bc6016cd2",
"SRR9179275_paired.csv.gz": "1660af663e0bdcf9e4a2dc0d8f79bfae",
"SRR9179276_paired.csv.gz": "56336af0a20afde929c263e628be6828",
"SRR9179277_paired.csv.gz": "48ac0e0f4ded0df1e345c7fbbb161601"}
resources = list()
for url in url_list:
filename = os.path.basename(url)
resource = RemoteResource(
dest_directory=self.dest_directory,
dest_filename=filename,
root_directory=self.root_directory,
checksum=checksums.get(filename),
url=url
)
resources.append(resource)
return resources
def prepare_resource(self,
resource: RemoteResource,
delete_gzipped: bool = False) -> str:
"""Logs and downloads the passed resource.
resource: RemoteResource - Resource to be prepared.
        delete_gzipped: boolean, default: False - Delete gzipped file once extracted.
Returns - the absolute destination path for the downloaded resource
"""
logging.info(f"Downloading {resource.url}")
fully_qualified_gz_filename = resource.download_resource(overwrite=False)
logging.info(f"Extracting the gzipped file")
fully_qualified_dest_filename = os.path.splitext(fully_qualified_gz_filename)[0]
with gzip.open(fully_qualified_gz_filename, 'rb') as f_gz:
with open(fully_qualified_dest_filename, 'wb') as f_ext:
shutil.copyfileobj(f_gz, f_ext)
if delete_gzipped:
            os.remove(fully_qualified_gz_filename)
return fully_qualified_dest_filename
    def process_files(self, filepaths: List[str]):
file_fill_size = 3
full_processed_path = os.path.join(self.root_directory, self.processed_directory)
os.makedirs(full_processed_path, exist_ok=True)
# Assign two files to validation, two to test, and the rest to train
shuffled_paths = filepaths.copy()
        random.seed(self.random_seed)
random.shuffle(shuffled_paths)
val_paths = shuffled_paths[:self.num_val_files]
test_paths = shuffled_paths[self.num_val_files:self.num_val_files+self.num_test_files]
train_paths = shuffled_paths[self.num_val_files+self.num_test_files:]
# Split the data and clean up extra information
for split, filenames in zip(['train', 'val', 'test'], [train_paths, val_paths, test_paths]):
split_path = os.path.join(full_processed_path, split)
os.makedirs(split_path, exist_ok=True)
            for num, filename in enumerate(filenames):
                output_name = os.path.join(split_path, f'x{str(num).zfill(file_fill_size)}.csv')
                logging.info(f"Converting {filename} to {output_name}")
data = pd.read_csv(filename, skiprows=[0], usecols=self.columns_to_keep)
data.to_csv(output_name, index=False)
def prepare(self):
filepaths = [self.prepare_resource(resource)
for resource in self.get_remote_resources()]
self.process_files(filepaths)
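The deterministic split logic in process_files can be sanity-checked in isolation. Here is a minimal sketch using hypothetical filenames in place of the downloaded files; note that random.seed must be called as a function for the shuffle to be reproducible:

```python
import random

# Hypothetical filenames standing in for the ten downloaded OAS csv files.
filepaths = [f"SRR{i:07d}_paired.csv" for i in range(10)]
num_val_files, num_test_files = 2, 2

shuffled_paths = filepaths.copy()
random.seed(0)  # seed must be *called*; assigning to random.seed has no effect on shuffling
random.shuffle(shuffled_paths)

# Same slicing as process_files: validation first, then test, then the rest for train.
val_paths = shuffled_paths[:num_val_files]
test_paths = shuffled_paths[num_val_files:num_val_files + num_test_files]
train_paths = shuffled_paths[num_val_files + num_test_files:]

print(len(train_paths), len(val_paths), len(test_paths))  # -> 6 2 2
```

Because the shuffle is seeded, re-running preprocessing assigns the same files to the same splits, which keeps the test set stable across runs.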
Testing#
As before, execute the pre-train script:
cd examples/protein/esm1nv
python pretrain_oas.py
This is what the end of the log looks like once preprocessing has started:
[NeMo I 2023-08-17 16:48:23 pretrain_oas:26] ************** Starting Preprocessing ***********
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:24 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:48:24 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179276_paired.csv.gz
[NeMo I 2023-08-17 16:48:24 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:24 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:48:24 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179277_paired.csv.gz
[NeMo I 2023-08-17 16:48:24 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179273_paired.csv to /data/OASpaired/processed/heavy/train/x000.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR11528761_paired.csv to /data/OASpaired/processed/heavy/train/x001.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179277_paired.csv to /data/OASpaired/processed/heavy/train/x002.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR10358525_paired.csv to /data/OASpaired/processed/heavy/train/x003.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179276_paired.csv to /data/OASpaired/processed/heavy/train/x004.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR11528762_paired.csv to /data/OASpaired/processed/heavy/train/x005.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR10358524_paired.csv to /data/OASpaired/processed/heavy/val/x000.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR10358523_paired.csv to /data/OASpaired/processed/heavy/val/x001.csv
[NeMo I 2023-08-17 16:48:24 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179275_paired.csv to /data/OASpaired/processed/heavy/test/x000.csv
[NeMo I 2023-08-17 16:48:25 oas_preprocess:111] Converting /data/OASpaired/raw/SRR9179274_paired.csv to /data/OASpaired/processed/heavy/test/x001.csv
Results#
This has split the data into train, val, and test directories and cleaned up the data:
/data/OASpaired/processed/heavy/test:
x000.csv x001.csv
/data/OASpaired/processed/heavy/train:
x000.csv x001.csv x002.csv x003.csv x004.csv x005.csv
/data/OASpaired/processed/heavy/val:
x000.csv x001.csv
This is what the first five lines of one of the training files look like:
sequence_id_heavy,sequence_heavy
AAACCTGAGACTTGAA-1_contig_1,GGGAGAGGAGGCCTGTCCTGGATTCGATTCCCAGTTCCTCACATTCAGTCAGCACTGAACACGGACCCCTCACCATGAACTTCGGGCTCAGCTTGATTTTCCTTGTCCTTGTTTTAAAAGGTGTCCAGTGTGAAGTGATGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTAGCTATGCCATGTCTTGGGTTCGCCAGACTCCGGAGAAGAGGCTGGAGTGGGTCGCAACCATTAGTAGTGGTGGTAGTTACACCTACTATCCAGACAGTGTGAAGGGGCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTACCTGCAAATGAGCAGTCTGAGGTCTGAGGACACGGCCATGTATTACTGTGCAAGACGGGGGAATGATGGTTACTACGAAGACTACTGGGGCCAAGGCACCACTCTCACAGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGG
AAACCTGAGCGCCTTG-1_contig_2,GAGCTCTGACAGAGGAGGCCAGTCCTGGAATTGATTCCCAGTTCCTCACGTTCAGTGATGAGCACTGAACACAGACACCTCACCATGAACTTTGGGCTCAGATTGATTTTCCTTGTCCTTACTTTAAAAGGTGTGAAGTGTGAAGTGCAGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCGCTTTCAGTAGCTATGACATGTCTTGGGTTCGCCAGACTCCGGAGAAGAGGCTGGAGTGGGTCGCATACATTAGTAGTGGTGGTGGTATCACCTACTATCCAGACACTGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTACCTGCAAATGAGCAGTCTGAAGTCTGAGGACACAGCCATGTATTACTGTGCAAGGCCCCCGGGACGGGGCTACTGGTACTTCGATGTCTGGGGCGCAGGGACCACGGTCACCGTCTCCTCAGCCAAAACAACAGCCCCATCGGTCTATCCACTGGCCCCTGTGTGTGGAGATACAACTGGCTCCTCGGTGACTCTAGGGTGCCTGGTCAAGGATTATT
AAACCTGCAAGTTAAG-1_contig_1,AACATATGTCCAATGTCCTCTCCACAGACACTGAACACACTGACTCTAACCATGGGATGGAGCTGGATCTTTCTCTTCCTCCTGTCAGGAACTGCAGGCGTCCACTCTGAGGTCCAGCTTCAGCAGTCAGGACCTGAGCTGGTGAAACCTGGGGCCTCAGTGAAGATATCCTGCAAGGCTTCTGGATACACATTCACTGACTACAACATGCACTGGGTGAAGCAGAGCCATGGAAAGAGCCTTGAGTGGATTGGATATATTTATCCTTACAATGGTGGTACTGGCTACAACCAGAAGTTCAAGAGCAAGGCCACATTGACTGTAGACAATTCCTCCAGCACAGCCTACATGGAGCTCCGCAGCCTGACATCTGAGGACTCTGCAGTCTATTACTGTGCAAGATGGGGGCTAACTGGTGATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGG
AAACCTGGTATCTGCA-1_contig_1,GACATAACAGCAAGAGAGTGTCCGGTTAGTCTCAAGGAAGACTGAGACACAGTCTTAGATATCATGGAATGGCTGTGGAACTTGCTATTTCTCATGGCAGCAGCTCAAAGTATCCAAGCACAGATCCAGTTGGTGCAGTCTGGACCTGAGCTGAAGAAGCCTGGAGAGACAGTCAGGATCTCCTGCAAGGCTTCTGGGTATACCTTCACAACTGCTGGAATGCAGTGGGTGCAAAAGATGCCAGGAAAGGGTTTGAAGTGGATTGGCTGGATAAACACCCACTCTGGAGTGCCAAAATATGCAGAAGACTTCAAGGGACGGTTTGCCTTCTCTTTGGAAACCTCTGCCAGCACTGCATATTTACAGATAAGCAACCTCAAAAATGAGGACACGGCTACGTATTTCTGTGCGAGATCAGGTTACGACGCCTTTGACTACTGGGGCCAAGGCACCACTCTCACAGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGG
Optional Variation: Process the Light Chain Data#
What if instead the light chain columns (sequence_id_light
and sequence_light
) were desired? How could the existing class be subclassed to create a preprocessing class for light chains?
Starting with the existing OASPairedPreprocess
class, the only additional changes that would need to be made are:
Change the columns_to_keep to preserve the light chains instead of the heavy ones
Optionally change the directory for the processed files
Here is an example of a class that could be added to $BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_preprocess.py
to accomplish this:
@dataclass
class OASPairedLightPreprocessor(OASPairedPreprocess):
"""OASPairedLightPreprocessor to download and preprocess OAS paired antibody light chain data."""
processed_directory: str = 'OASpaired/processed/light'
columns_to_keep = ['sequence_id_light', 'sequence_light']
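Because OASPairedPreprocess is a dataclass, the subclass only restates the defaults it changes; everything else, including prepare(), is inherited. The same override pattern can be illustrated standalone with hypothetical stand-in classes (note that columns_to_keep, having no type annotation, is a plain class attribute rather than a dataclass field, which is why redefining it in the subclass is sufficient):

```python
from dataclasses import dataclass


@dataclass
class HeavyPreprocess:
    # dataclass field with a default; subclasses may restate it to override
    processed_directory: str = "OASpaired/processed/heavy"
    # plain (unannotated) class attribute: shared unless redefined in a subclass
    columns_to_keep = ["sequence_id_heavy", "sequence_heavy"]


@dataclass
class LightPreprocess(HeavyPreprocess):
    processed_directory: str = "OASpaired/processed/light"
    columns_to_keep = ["sequence_id_light", "sequence_light"]


light = LightPreprocess()
print(light.processed_directory)  # -> OASpaired/processed/light
print(light.columns_to_keep)      # -> ['sequence_id_light', 'sequence_light']
```

Running the light-chain preprocessing is then just a matter of instantiating the subclass instead of the parent before calling prepare().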