ZINC Training Dataset Setup for small molecule language models
Preparation of a drug-like subset of the ZINC15 dataset#
ZINC is a free database of commercially available compounds for virtual screening (J. Chem. Inf. Model. 2015, 55, 11, 2324–2337). In this notebook, we walk through the steps to create a drug-like subset of the ZINC15 dataset ready for training or inference with BioNeMo small molecule language models such as MolMIM and MegaMolBART.
These steps involve:
input validation
SMILES canonicalisation
filtering by maximum token length of input (per model constraint)
filtering based on tokenizer vocabulary
filtering by druglikeness criteria (compliance with Lipinski's rule of 5 and a QED cutoff)
NOTE: this notebook was developed and tested for BioNeMo framework container 1.7
Tested GPUs: A1000, A6000
NOTE: some cells below use %%capture --no-display --no-stderr cell_output to suppress verbose output. Comment out or delete this line in those cells to restore full output.
First import all required packages
import os
import numpy as np
import pandas as pd
from pathlib import Path
from rdkit import Chem
from rdkit.Chem import Descriptors
from nemo.collections.common.tokenizers.regex_tokenizer import RegExTokenizer
bionemo_home = os.environ['BIONEMO_HOME']
1. Download the data#
We will download ZINC data from https://files.docking.org/
NOTE: if the download below fails with a certificate error, it can be retried as wget -O {raw_data_file} --no-check-certificate https://files.docking.org/2D/AB/ABAC.txt. This is a potential security risk; use with caution.
# We will download just 1 file for illustration purposes
raw_data_dir = f"{bionemo_home}/data/raw"
! mkdir -p {raw_data_dir}
raw_data_file = f"{raw_data_dir}/ZINC_sample_raw_data.txt"
! wget -O {raw_data_file} https://files.docking.org/2D/AB/ABAC.txt
--2024-09-03 14:41:24-- https://files.docking.org/2D/AB/ABAC.txt
Resolving files.docking.org (files.docking.org)... 169.230.75.4
Connecting to files.docking.org (files.docking.org)|169.230.75.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 443705 (433K) [text/plain]
Saving to: ‘/workspace/bionemo/data/raw/ZINC_sample_raw_data.txt’
/workspace/bionemo/ 100%[===================>] 433.31K 516KB/s in 0.8s
2024-09-03 14:41:25 (516 KB/s) - ‘/workspace/bionemo/data/raw/ZINC_sample_raw_data.txt’ saved [443705/443705]
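As a quick, optional check, we can preview the first few rows of the downloaded file; it is tab-separated and contains the smiles and zinc_id columns that are read in the next section.
# Optional: preview the first few rows of the raw, tab-separated ZINC file
pd.read_csv(raw_data_file, sep="\t", nrows=5)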
2. Process the data#
2.1 SMILES validation and canonicalisation#
We will drop input SMILES that cannot generate a valid mol object and obtain the canonical SMILES string for each input molecule.
data_df = pd.read_csv(raw_data_file, sep="\t", usecols=["zinc_id", "smiles"])
def canonicalise_smiles(smiles: str) -> str:
"""Returns the canonical SMILES string for the input SMILES string
or np.nan if it was not possible to generate a valid mol object from the input string
"""
mol = Chem.MolFromSmiles(smiles)
return np.nan if mol is None else Chem.MolToSmiles(mol)
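As an illustrative check (not required for the pipeline), a valid SMILES string is returned in canonical form, while an unparseable string returns np.nan:
# Illustrative check: Kekulé benzene canonicalises to the aromatic form; invalid input returns np.nan
print(canonicalise_smiles("C1=CC=CC=C1"))   # expected: c1ccccc1
print(canonicalise_smiles("not_a_smiles"))  # expected: nan (RDKit also logs a parse error)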
print("Generating canonical SMILES strings from the provided SMILES...")
data_df["canonical_smiles"] = data_df["smiles"].map(canonicalise_smiles)
data_df.dropna(subset=["canonical_smiles"], inplace=True)
Generating canonical SMILES strings from the provided SMILES...
2.2 Filter for vocabulary compliance#
Next, we exclude SMILES strings exceeding the model’s maximum token limit and molecules with tokens absent from the model’s vocabulary. To accomplish this, we import the model’s existing vocabulary files, using MolMIM as an illustrative example.
model_name = "molmim"
# Note: the maximum token length of a tokenized SMILES string should be 2 less than the max_seq_length specified in the model config.
# This is to account for the extra <BOS> and <EOS> tokens.
MAX_TOKEN_LENGTH = 126
# Provide path to files containing allowed vocabulary
# Note: This file can be changed according to the user's needs
tokenizer_path = bionemo_home + "/tokenizers/molecule/{model_name}/vocab/{model_name}.{extension}"
tokenizer = RegExTokenizer().load_tokenizer(regex_file=tokenizer_path.format(model_name=model_name, extension="model"),
vocab_file=tokenizer_path.format(model_name=model_name, extension="vocab"))
[NeMo I 2024-09-03 14:41:26 regex_tokenizer:240] Loading vocabulary from file = /workspace/bionemo/tokenizers/molecule/molmim/vocab/molmim.vocab
[NeMo I 2024-09-03 14:41:26 regex_tokenizer:254] Loading regex from file = /workspace/bionemo/tokenizers/molecule/molmim/vocab/molmim.model
def vocab_compliance_check(smiles: str, tokenizer: RegExTokenizer, max_token_length: int) -> bool:
"""Checks if the SMILES string only contains vocabulary in the tokenizer's vocabulary
and if the token length is less than or equal to `max_token_length"""
tokens = tokenizer.text_to_tokens(smiles)
vocab_allowed = tokenizer.vocab.keys()
return set(tokens).issubset(set(vocab_allowed)) and len(tokens) <= max_token_length
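Before applying the filter to the whole dataframe, we can optionally inspect the loaded vocabulary and the tokenization of a single example molecule (aspirin is used here purely as an illustration; the exact tokens depend on the vocabulary files loaded above):
# Optional: inspect the vocabulary size and the tokens produced for one example molecule
print(f"Vocabulary size: {len(tokenizer.vocab)}")
example_smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used only as an illustration
print(tokenizer.text_to_tokens(example_smiles))
print(vocab_compliance_check(example_smiles, tokenizer, MAX_TOKEN_LENGTH))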
print(f"Filtering out molecules which are not present in the {model_name} tokenizer vocabulary or with max token length greater than {MAX_TOKEN_LENGTH}...")
data_df["vocab_compliant"] = data_df["canonical_smiles"].apply(lambda smi: vocab_compliance_check(smi, tokenizer, MAX_TOKEN_LENGTH))
# Select only molecules which are vocab compliant
vocab_compliant_df = data_df.loc[data_df['vocab_compliant']].copy()  # copy to avoid SettingWithCopyWarning when adding columns later
print(f"{len(data_df) - len(vocab_compliant_df)} molecules removed.")
Filtering out molecules which are not present in the molmim tokenizer vocabulary or with max token length greater than 126...
5 molecules removed.
2.3 Filter out undesirable molecules#
In this step, we filter out non-druglike molecules, where druglikeness is estimated using the following criteria:
molecular weight ≤ 500 Da
calculated logP (cLogP) ≤ 5
number of hydrogen bond donors ≤ 5
number of hydrogen bond acceptors ≤ 10
Quantitative Estimate of Druglikeness (QED score) ≥ 0.5
def determine_druglikeness(smiles: str, property: str) -> bool:
"""Calculates the specified property for the input smiles
and returns a boolean indicating whether the calculated property passes Lipinski's rule of 5"""
mol = Chem.MolFromSmiles(smiles)
if mol is None:
return np.nan
if property == "mol_weight":
return True if Descriptors.ExactMolWt(mol) <= 500 else False
elif property == "clogp":
return True if Chem.Crippen.MolLogP(mol) <= 5 else False
elif property == "hbd":
return True if Descriptors.NumHDonors(mol) <= 5 else False
elif property == "hba":
return True if Descriptors.NumHAcceptors(mol) <= 10 else False
elif property == "qed_score":
return True if Chem.QED.qed(mol) >= 0.5 else False
else:
raise ValueError('Please choose property from the following options: ["mol_weight", "clogp","hbd", "hba", "qed_score"]')
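As an optional check on a single molecule (caffeine is used here purely as an illustration), the function returns one boolean per criterion:
# Optional: evaluate each druglikeness criterion for one example molecule (caffeine)
example_smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
for criterion in ["mol_weight", "clogp", "hbd", "hba", "qed_score"]:
    print(criterion, determine_druglikeness(example_smiles, criterion))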
druglikeness_criteria = ["mol_weight", "clogp", "hbd", "hba", "qed_score"]
print("Calculating druglikeness of input molecules...")
for property in druglikeness_criteria:
    vocab_compliant_df.loc[:, property] = vocab_compliant_df.loc[:, "smiles"].map(lambda smiles: determine_druglikeness(smiles, property))
print("Filtering out molecules which do not meet Lipinski and QED criteria...")
druglike_df = vocab_compliant_df.query("mol_weight & clogp & hbd & hba & qed_score")
print(f"{len(vocab_compliant_df) - len(druglike_df)} molecules removed.")
druglike_df = druglike_df[["zinc_id", "smiles"]]
Calculating druglikeness of input molecules...
Filtering out molecules which do not meet Lipinski and QED criteria...
1594 molecules removed.
3. Split dataset to create training, validation and test sets#
Finally, we split the dataset into training, validation, and test sets with a ratio of 8:1:1, respectively. These sets are then saved as CSV files in designated subdirectories (train, val and test) as required for model training. Other dataset variables, e.g. columns, headers and dataset_path, are specified in the config used for training; see the example config files in examples/conf/molecule for more information.
def train_valid_test_random_split(input_df: pd.DataFrame, test_frac: float, val_frac: float) -> dict:
"""Splits the input_df into train, validation and test sets randomly according to the specified fractions
and returns a dictionary of dataframes
"""
# Calculate sample sizes before size of dataframe changes
test_samples = max(int(test_frac * input_df.shape[0]), 1)
val_samples = max(int(val_frac * input_df.shape[0]), 1)
splits={}
splits["test"] = input_df.sample(n=test_samples, random_state=0)
# remove test data from training data
input_df = input_df.drop(splits["test"].index)
splits["val"] = input_df.sample(n=val_samples, random_state=0)
# remove validation data from training data
splits["train"] = input_df.drop(splits["val"].index)
return splits
test_frac = 0.1
val_frac = 0.1
splits = train_valid_test_random_split(druglike_df, test_frac, val_frac)
for molecule_set in splits.keys():
    output_dir = os.path.join(bionemo_home, f"data/processed/{molecule_set}")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    splits[molecule_set].to_csv(os.path.join(output_dir, f"{molecule_set}_set.csv"), index=False)
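Optionally, we can report the resulting split sizes, confirm that the splits do not overlap, and list the CSV files written under data/processed (a sanity check only, not required for training):
# Optional sanity check: report split sizes, verify the splits are disjoint, and list the files written
for name, split_df in splits.items():
    print(f"{name}: {len(split_df)} molecules")
assert not set(splits["train"].index) & set(splits["val"].index)
assert not set(splits["train"].index) & set(splits["test"].index)
assert not set(splits["val"].index) & set(splits["test"].index)
processed_dir = os.path.join(bionemo_home, "data/processed")
for root, _, files in os.walk(processed_dir):
    for f in files:
        print(os.path.join(root, f))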
4. Example of MolMIM pretraining using the prepared data#
Here we will run pretraining from scratch on our prepared data for MolMIM as an example, using the default config pretrain_xsmall_canonicalized.yaml. Other example configs can be found in the directory examples/molecule/molmim/conf. We will override some config arguments at runtime using Hydra.
The column containing the SMILES strings in the input datasets is specified with the argument model.data.data_impl_kwargs.csv_mmap.data_col. We specify the directory containing our newly created train, val and test subdirectories using the argument model.data.dataset_path, and the names of our input files using arguments such as model.data.dataset.train.
Note that index files are written to the location specified in model.data.index_mapping_dir; if index files are already present there, they will be read rather than written. We therefore clear the index files first, to avoid unintentional errors which could occur if model or dataset parameters are changed in this notebook playground. For the same reason we set exp_manager.resume_if_exists to False; otherwise, if training were interrupted and then restarted, it would resume from the last saved checkpoint, and errors could occur if model parameters in the config had been changed.
We will also reduce the number of training steps (trainer.max_steps) and, correspondingly, the step interval for checking the validation set (trainer.val_check_interval), just for the purpose of this demonstration.
Optional: add arguments for logging with Weights and Biases and log in when prompted.
%%capture --no-display --no-stderr cell_output
! rm -rf data/data_index
! cd {bionemo_home} && python examples/molecule/molmim/pretrain.py \
do_training=True \
++model.data.dataset_path="data/processed/" \
++model.data.dataset.train="train_set" \
++model.data.dataset.val="val_set" \
++model.data.dataset.test="test_set" \
++model.data.index_mapping_dir="data/data_index/" \
++model.data.data_impl_kwargs.csv_mmap.data_col=1 \
++model.dwnstr_task_validation.enabled=False \
++model.global_batch_size=null \
++trainer.devices=1 \
++trainer.accelerator='gpu' \
++trainer.max_steps=200 \
++trainer.val_check_interval=100 \
++exp_manager.create_wandb_logger=False \
++exp_manager.resume_if_exists=False