ZINC Training Dataset Setup for small molecule language models#

Preparation of a drug-like subset of the ZINC15 dataset#

ZINC is a free database of commercially-available compounds for virtual screening (J. Chem. Inf. Model. 2015, 55, 11, 2324–2337). In this notebook, we will walk through the steps to create a drug-like subset of the ZINC15 dataset, ready for use in training or inference of BioNeMo small molecule language models such as MolMIM and MegaMolBART.

These steps involve:

  • input validation

  • SMILES canonicalisation

  • filtering by maximum token length of input (per model constraint)

  • filtering based on tokenizer vocabulary

  • filtering by compliance with Lipinski's rule of 5 and a QED score cutoff

NOTE: this notebook was developed and tested for BioNeMo framework container 1.7
Tested GPUs: A1000, A6000

NOTE Some of the cells below generate long text output. We're using
%%capture --no-display --no-stderr cell_output
to suppress this output. Comment or delete this line in the cells below to restore full output.

First, import all required packages:

import os
import numpy as np
import pandas as pd
from pathlib import Path
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, QED  # Crippen and QED are used for the druglikeness filters in section 2.3

from nemo.collections.common.tokenizers.regex_tokenizer import RegExTokenizer

bionemo_home = os.environ['BIONEMO_HOME']

1. Download the data#

We will download ZINC data from https://files.docking.org/

NOTE The following cell may fail due to a certificate validation error in the wget command. You can skip the certificate check using
 wget -O {raw_data_file} --no-check-certificate https://files.docking.org/2D/AB/ABAC.txt 
This is a potential security risk; use with caution.
# We will download just 1 file for illustration purposes
raw_data_dir = f"{bionemo_home}/data/raw"
! mkdir -p {raw_data_dir}
raw_data_file = f"{raw_data_dir}/ZINC_sample_raw_data.txt"
! wget -O {raw_data_file} https://files.docking.org/2D/AB/ABAC.txt 
--2024-09-03 14:41:24--  https://files.docking.org/2D/AB/ABAC.txt
Resolving files.docking.org (files.docking.org)... 169.230.75.4
Connecting to files.docking.org (files.docking.org)|169.230.75.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 443705 (433K) [text/plain]
Saving to: ‘/workspace/bionemo/data/raw/ZINC_sample_raw_data.txt’

/workspace/bionemo/ 100%[===================>] 433.31K   516KB/s    in 0.8s    

2024-09-03 14:41:25 (516 KB/s) - ‘/workspace/bionemo/data/raw/ZINC_sample_raw_data.txt’ saved [443705/443705]

2. Process the data#

2.1 SMILES validation and canonicalisation#

We will drop input SMILES that cannot generate a valid mol object and obtain the canonical SMILES string for each input molecule.

data_df = pd.read_csv(raw_data_file, sep="\t", usecols=["zinc_id", "smiles"])

def canonicalise_smiles(smiles: str) -> str:
    """Returns the canonical SMILES string for the input SMILES string 
    or np.nan if it was not possible to generate a valid mol object from the input string
    """
    mol = Chem.MolFromSmiles(smiles)
    return np.nan if mol is None else Chem.MolToSmiles(mol)
    
print("Generating canonical SMILES strings from the provided SMILES...")
data_df["canonical_smiles"] = data_df["smiles"].map(canonicalise_smiles)
data_df.dropna(subset=["canonical_smiles"], inplace=True)
Generating canonical SMILES strings from the provided SMILES...
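
As a quick illustration of this behaviour, a valid SMILES string is returned in canonical form while a malformed string gives np.nan (the example strings below are arbitrary and not taken from the downloaded file):

# Illustrative check of canonicalise_smiles on arbitrary example strings
print(canonicalise_smiles("C1=CC=CC=C1O"))  # valid (phenol) -> canonical aromatic SMILES
print(canonicalise_smiles("C1CC"))          # unclosed ring -> nan (RDKit logs a parse error)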

2.2 Filter for vocabulary compliance#

Next, we exclude SMILES strings exceeding the model’s maximum token limit and molecules with tokens absent from the model’s vocabulary. To accomplish this, we import the model’s existing vocabulary files, using MolMIM as an illustrative example.

model_name = "molmim"

# Note: the maximum token length generated from the smiles string should be 2 less than the max_seq_length specified in the model config. 
# This is to account for the extra tokens <BOS> and <EOS>
MAX_TOKEN_LENGTH = 126

# Provide path to files containing allowed vocabulary
# Note: This file can be changed according to the user's needs
tokenizer_path = bionemo_home + "/tokenizers/molecule/{model_name}/vocab/{model_name}.{extension}"
tokenizer = RegExTokenizer().load_tokenizer(regex_file=tokenizer_path.format(model_name=model_name, extension="model"), 
                                                  vocab_file=tokenizer_path.format(model_name=model_name, extension="vocab"))
[NeMo I 2024-09-03 14:41:26 regex_tokenizer:240] Loading vocabulary from file = /workspace/bionemo/tokenizers/molecule/molmim/vocab/molmim.vocab
[NeMo I 2024-09-03 14:41:26 regex_tokenizer:254] Loading regex from file = /workspace/bionemo/tokenizers/molecule/molmim/vocab/molmim.model
def vocab_compliance_check(smiles: str, tokenizer: RegExTokenizer, max_token_length: int) -> bool:
    """Checks that the SMILES string contains only tokens present in the tokenizer's vocabulary
    and that its token length is less than or equal to `max_token_length`."""
    tokens = tokenizer.text_to_tokens(smiles)
    vocab_allowed = tokenizer.vocab.keys()
    return set(tokens).issubset(set(vocab_allowed)) and len(tokens) <= max_token_length
print(f"Filtering out molecules which are not present in the {model_name} tokenizer vocabulary or with max token length greater than {MAX_TOKEN_LENGTH}...")
data_df["vocab_compliant"] = data_df["canonical_smiles"].apply(lambda smi: vocab_compliance_check(smi, tokenizer, MAX_TOKEN_LENGTH))

# Select only molecules which are vocab compliant
# (.copy() avoids a pandas SettingWithCopyWarning when we add the druglikeness columns below)
vocab_compliant_df = data_df.loc[data_df['vocab_compliant']].copy()
print(f"{len(data_df) - len(vocab_compliant_df)} molecules removed.")
Filtering out molecules which are not present in the molmim tokenizer vocabulary or with max token length greater than 126...
5 molecules removed.
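
To see what the tokenizer is doing, we can inspect the tokens produced for a single molecule and compare the token count with MAX_TOKEN_LENGTH. The SMILES below (aspirin) is an arbitrary illustrative example, not a row from the dataset:

# Tokenize one example molecule and check it against the vocabulary and length constraints
example_smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
example_tokens = tokenizer.text_to_tokens(example_smiles)
print(example_tokens)
print(f"{len(example_tokens)} tokens (limit: {MAX_TOKEN_LENGTH})")
print("Vocab compliant:", vocab_compliance_check(example_smiles, tokenizer, MAX_TOKEN_LENGTH))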

2.3 Filter out undesirable molecules#

In this step, we filter out non-druglike molecules, where druglikeness is estimated using the following criteria:

  1. Lipinski’s rule of 5 compliance

  2. Quantitative Estimate of Druglikeness (QED score) with a cutoff of 0.5

def determine_druglikeness(smiles: str, property: str) -> bool:
    """Calculates the specified property for the input SMILES and returns a boolean indicating
    whether the molecule passes the corresponding Lipinski rule-of-5 or QED criterion."""

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return np.nan
    if property == "mol_weight":
        return Descriptors.ExactMolWt(mol) <= 500
    elif property == "clogp":
        return Crippen.MolLogP(mol) <= 5
    elif property == "hbd":
        return Descriptors.NumHDonors(mol) <= 5
    elif property == "hba":
        return Descriptors.NumHAcceptors(mol) <= 10
    elif property == "qed_score":
        return QED.qed(mol) >= 0.5
    else:
        raise ValueError('Please choose property from the following options: ["mol_weight", "clogp", "hbd", "hba", "qed_score"]')

druglikeness_criteria = ["mol_weight", "clogp", "hbd", "hba", "qed_score"]

print("Calculating druglikeness of input molecules...")

for property in druglikeness_criteria:
    vocab_compliant_df.loc[:, property] = vocab_compliant_df["smiles"].map(lambda smiles: determine_druglikeness(smiles, property))

print("Filtering out molecules which do not meet Lipinski and QED criteria...")
druglike_df = vocab_compliant_df.query("mol_weight & clogp & hbd & hba & qed_score")

print(f"{len(vocab_compliant_df) - len(druglike_df)} molecules removed.")

druglike_df = druglike_df[["zinc_id", "smiles"]]
Calculating druglikeness of input molecules...
Filtering out molecules which do not meet Lipinski and QED criteria...
1594 molecules removed.
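
If you want to inspect the underlying property values rather than just the boolean flags, the same RDKit descriptors can be computed directly. A minimal sketch for the first molecule in the filtered dataframe:

# Raw druglikeness descriptors for one molecule from the filtered set
example_mol = Chem.MolFromSmiles(druglike_df["smiles"].iloc[0])
print("Molecular weight: ", Descriptors.ExactMolWt(example_mol))
print("cLogP:            ", Crippen.MolLogP(example_mol))
print("H-bond donors:    ", Descriptors.NumHDonors(example_mol))
print("H-bond acceptors: ", Descriptors.NumHAcceptors(example_mol))
print("QED score:        ", QED.qed(example_mol))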

3. Split dataset to create training, validation and test sets#

Finally, we split the dataset into training, validation, and test sets with a ratio of 8:1:1, respectively. These sets are then saved as CSV files in designated subdirectories (train, val and test) as required for model training. Other dataset variables, e.g. columns, headers, and dataset_path, are specified in the config used for training; see the example config files in examples/conf/molecule for more information.

def train_valid_test_random_split(input_df: pd.DataFrame, test_frac: float, val_frac: float) -> dict:
    """Splits the input_df into train, validation and test sets randomly according to the specified fractions 
    and returns a dictionary of dataframes
    """
    # Calculate sample sizes before size of dataframe changes
    test_samples = max(int(test_frac * input_df.shape[0]), 1)
    val_samples = max(int(val_frac * input_df.shape[0]), 1)

    splits={}
    splits["test"] = input_df.sample(n=test_samples, random_state=0)
    # remove test data from training data
    input_df = input_df.drop(splits["test"].index)  

    splits["val"] = input_df.sample(n=val_samples, random_state=0)
    # remove validation data from training data
    splits["train"] = input_df.drop(splits["val"].index)  
    return splits


test_frac = 0.1
val_frac = 0.1

splits = train_valid_test_random_split(druglike_df, test_frac, val_frac)

for molecule_set in splits.keys():
    output_dir = os.path.join(bionemo_home, f"data/processed/{molecule_set}")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    splits[molecule_set].to_csv(os.path.join(output_dir, f"{molecule_set}_set.csv"), index=False)
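
As a quick sanity check (not required for training), we can report the split sizes and confirm that the splits are disjoint by ZINC ID:

# Report split sizes and verify there is no overlap between the splits
for name, df in splits.items():
    print(f"{name}: {len(df)} molecules")
assert not set(splits["train"]["zinc_id"]) & set(splits["val"]["zinc_id"])
assert not set(splits["train"]["zinc_id"]) & set(splits["test"]["zinc_id"])
assert not set(splits["val"]["zinc_id"]) & set(splits["test"]["zinc_id"])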

4. Example of MolMIM pretraining using the prepared data#

Here we will run pretraining from scratch on our prepared data, using MolMIM as an example with the default config pretrain_xsmall_canonicalized.yaml. Other example configs can be found in the directory examples/molecule/molmim/conf. We will override some config arguments at runtime using Hydra.

The column containing the SMILES strings in the input datasets is specified with the argument model.data.data_impl_kwargs.csv_mmap.data_col. We specify the directory containing our newly created train, val and test subdirectories with model.data.dataset_path, and the names of our input files with model.data.dataset.train etc.

Note that index files are written to the location specified by model.data.index_mapping_dir; if index files are already present there, they will be read rather than regenerated. We therefore clear the index files first, to avoid errors that could occur when changing the model or dataset parameters in this notebook playground. For the same reason we set exp_manager.resume_if_exists to False: otherwise, if training were interrupted and restarted, it would resume from the last saved checkpoint, and errors could occur if model parameters in the config had been changed.

We will also reduce the number of training steps (trainer.max_steps) and correspondingly the step interval for checking the validation set (trainer.val_check_interval) just for the purpose of this demonstration.

Optional: add arguments for logging with Weights & Biases and log in when prompted.

%%capture --no-display --no-stderr cell_output
! rm -rf data/data_index
! cd {bionemo_home} && python examples/molecule/molmim/pretrain.py \
    do_training=True \
    ++model.data.dataset_path="data/processed/" \
    ++model.data.dataset.train="train_set" \
    ++model.data.dataset.val="val_set" \
    ++model.data.dataset.test="test_set" \
    ++model.data.index_mapping_dir="data/data_index/" \
    ++model.data.data_impl_kwargs.csv_mmap.data_col=1 \
    ++model.dwnstr_task_validation.enabled=False \
    ++model.global_batch_size=null \
    ++trainer.devices=1 \
    ++trainer.accelerator='gpu' \
    ++trainer.max_steps=200 \
    ++trainer.val_check_interval=100 \
    ++exp_manager.create_wandb_logger=False \
    ++exp_manager.resume_if_exists=False
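
After training finishes, checkpoints and logs are written to the experiment directory managed by exp_manager; the exact location depends on the exp_manager settings in the config, so the search below over the workspace is only a hedged way to locate any checkpoints that were produced:

# Locate checkpoint files written during this run (output directory depends on the exp_manager config)
! find {bionemo_home} -name "*.ckpt" 2>/dev/null | head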