MegaMolBART Model Training using BioNeMo#

The purpose of this tutorial is to provide an example use case of training a BioNeMo large language model using the BioNeMo framework. At the end of this tutorial, the user will have gained experience in

  • configuring various config files and launch parameters for MegaMolBART training

  • launching single-node and multi-node, multi-GPU training runs

  • using NVIDIA’s Base Command Platform commands for LLM training

Overview - MegaMolBART model#

MegaMolBART is based on the BART architecture and trained on billions of SMILES from the ZINC15 database. MegaMolBART understands chemistry and is capable of producing embeddings that can be used for prediction of chemical properties and performing sequence translation tasks such as retrosynthesis prediction. Because MegaMolBART utilizes an autoregressive decoder, it can also generate molecules starting from a seed molecule.

Setup and Assumptions#

This tutorial assumes that the user has access to the BioNeMo framework and NVIDIA’s BCP and DGX-Cloud compute infrastructure. The user is also expected to have the required background knowledge about

  • the BioNeMo framework, as described in the Quickstart Guide, and

  • running the model training jobs on BCP

All model-training-related commands should be executed inside the BioNeMo Docker container.

Requesting compute resources#

Access to DGX compute resources via the NGC site or NGC CLI#

As a prerequisite, configure your access to the DGX compute resources and the required content either via NVIDIA’s Base Command Platform web interface or via the NGC CLI using the ngc config set command.
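The NGC CLI configuration is interactive; running the command below will prompt for your API key, CLI output format, org, team, and ACE (the exact prompts may vary slightly between NGC CLI versions):

ngc config set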

For more details on how to request the resources, visit Running BioNeMo on DGX-Cloud using BCP.

Note

The interactive job launch example shown here using the Jupyter Lab interface is intended for initial user experience and trial runs. It is strongly advised to launch model training jobs using a launch script passed to the ngc batch run command, as described in Running BioNeMo on DGX-Cloud using BCP. For MegaMolBART training, the script provided in <BioNeMo_Workspace>/examples/molecule/megamolbart/scripts/pretrain_bcp_prd11.sh should be used as a template for launching the job.
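As a rough illustration only, a batch job wrapping that launch script might be submitted along the following lines; the job name, instance type, image, workspace mount point, and result path below are placeholders and must be adapted to your own NGC org, team, and compute setup:

ngc batch run --name "megamolbart-pretrain" \
  --instance dgxa100.80g.8.norm \
  --image "<BioNeMo_Image>" \
  --workspace <BioNeMo_Workspace>:/workspace/bionemo:RW \
  --result /results \
  --commandline "bash /workspace/bionemo/examples/molecule/megamolbart/scripts/pretrain_bcp_prd11.sh"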

Data Preprocessing#

Downloading and pre-processing the dataset#

Download the data#

The ZINC-15 dataset is a free database of commercially available compounds for virtual screening.

The ZINC-15 database was used for training [Sterling and Irwin, 2015]. Approximately 1.54 billion molecules (SMILES strings) were selected from tranches meeting the following constraints: molecular weight <= 500 Daltons, LogP <= 5, reactivity level “reactive,” and purchasability “annotated.” The compounds were filtered to a maximum length of 512 characters, and the data were randomly split into train, validation, and test sets in a 99% / 0.5% / 0.5% ratio.

Using BioNeMo features to download ZINC-15#

The simplest and most reliable way to download the ZINC-15 subset is through the BioNeMo framework’s Zinc15Preprocess class, which has the following features:

  • Downloads the selected ZINC-15 tranche files

  • Splits the data into train, validation, and test samples

  • Writes the processed dataset into the appropriate directories expected by the BioNeMo Framework

For example:

from bionemo.data import Zinc15Preprocess
data = Zinc15Preprocess()
data.prepare_dataset()

In the snippet above, the ZINC-15 tranches will be downloaded and preprocessed with the default settings.
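In many BioNeMo example workflows, the same preprocessing can also be triggered from the command line by running the pretrain script with do_training=False (this mirrors the do_training=True flag used for training later in this tutorial; verify the exact behavior against the config file of your installed version):

cd ${BIONEMO_HOME}
python examples/molecule/megamolbart/pretrain.py \
    --config-path=conf \
    --config-name=pretrain_xsmall_span_aug \
    do_training=False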

Alternative datasets#

We can also download datasets that are not available in the BioNeMo Framework. This can be done in two ways:

A) Using bash and wget pointing to the dataset’s URL

mkdir -p /tmp/data/molecule/megamolbart  
wget -P /tmp/data/molecule/megamolbart <URL>

B) Transferring the data from the local machine to the container

docker cp <dataset directory and filename> container_id:/<container directory and filename>

Then, once the data is downloaded, we can start moving files and using the Data Loaders and Data Module to make sure the dataset is in a format the BioNeMo Framework can operate on. Note that the Zinc15Preprocess class is not guaranteed to handle datasets other than ZINC-15.
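As a minimal sketch only: assuming the custom dataset is a single CSV file with a header row (the file name my_dataset.csv and the target directory below are hypothetical placeholders), the 99% / 0.5% / 0.5% split described above could be approximated with standard shell tools:

# All paths and file names below are placeholders; adjust them to match your config.
DATA_DIR=/tmp/data/molecule/megamolbart/processed
mkdir -p ${DATA_DIR}/train ${DATA_DIR}/val ${DATA_DIR}/test

# Separate the header row and shuffle the remaining rows
head -n 1 my_dataset.csv > header.csv
tail -n +2 my_dataset.csv | shuf > shuffled.csv

# Compute split sizes: ~99% train, ~0.5% validation, ~0.5% test
total=$(wc -l < shuffled.csv)
val_n=$((total / 200))
test_n=$((total / 200))
train_n=$((total - val_n - test_n))

# Write each split (re-attaching the header) as x000.csv, matching the
# file naming used by the example training command later in this tutorial
cat header.csv <(head -n ${train_n} shuffled.csv) > ${DATA_DIR}/train/x000.csv
cat header.csv <(tail -n +$((train_n + 1)) shuffled.csv | head -n ${val_n}) > ${DATA_DIR}/val/x000.csv
cat header.csv <(tail -n ${test_n} shuffled.csv) > ${DATA_DIR}/test/x000.csv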

Model training#

Example dataset#

To briefly showcase the model training capabilities of the BioNeMo Framework, we will use a very small subset of the original ZINC-15 dataset that is provided as a part of the sample datasets located in ${BIONEMO_HOME}/examples/tests/test_data/molecule.

For the purpose of this test run, the folder contains /train, /val, and /test folders with SMILES strings in CSV files.
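To verify the sample data inside the container, you can simply list the directory; the train/val/test subfolders and the x000.csv files referenced in the training command below should be visible:

ls -R ${BIONEMO_HOME}/examples/tests/test_data/molecule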

Single-node or Multi-node setup#

In this test run, we will use the preconfigured parameters provided in the pretrain_xsmall_span_aug.yaml config file located in the ${BIONEMO_HOME}/examples/molecule/megamolbart/conf folder.

We will also set other parameters suitable for a quick run, such as restricting training to the very small subset of molecules contained in the x000.csv file. These parameters can be updated by editing the .yaml config file or by passing additional command-line arguments, as shown in the example below. The full dataset can be selected and other parameters adjusted, for example, as shown in the pretrain_base.yaml or pretrain_large_span_aug.yaml files. Also, for the quick test run, we will disable downstream-task validation using ++model.dwnstr_task_validation.enabled=False.

Once connected to the compute node, navigate to the BioNeMo home folder with cd ${BIONEMO_HOME} and execute the following command in the terminal.

Note

To run the model training job on a local workstation, the user can directly execute the pretrain.py script with the desired configurations. For example,

python examples/molecule/megamolbart/pretrain.py 

Users may need to update the relevant arguments in the commands according to their compute and data setup.
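For instance, a quick single-GPU test run on a workstation could look like the following; the overrides mirror those used in the bcprun command below, and a ++model.data.dataset_path override would additionally be needed if the data is not where the config expects it:

python examples/molecule/megamolbart/pretrain.py \
    --config-path=conf \
    --config-name=pretrain_xsmall_span_aug \
    do_training=True \
    ++trainer.devices=1 \
    ++trainer.num_nodes=1 \
    ++model.dwnstr_task_validation.enabled=False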

The bcprun command is analogous to the srun command in SLURM; you can find more details in the NVIDIA BCP User Guide.

bcprun --nnodes=1 --npernode=8 --cmd "python pretrain.py --config-path=conf \
    --config-name=pretrain_xsmall_span_aug do_training=True \
    ++model.data.dataset_path=/workspace_test/examples_1/tests/test_data/molecule \
    ++model.data.dataset.train=x000 ++model.data.dataset.val=x000 ++model.data.dataset.test=x000 \
    ++exp_manager.wandb_logger_kwargs.offline=False ++trainer.devices=8 ++trainer.num_nodes=1 \
    ++model.dwnstr_task_validation.enabled=False "



To run the model training on multiple nodes, the user will have to update the parameters accordingly; for example, running the model training job on four nodes would require the --nnodes=4 and ++trainer.num_nodes=4 arguments, as shown below.
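For example, the single-node command above adapted to four nodes, with all other overrides unchanged:

bcprun --nnodes=4 --npernode=8 --cmd "python pretrain.py --config-path=conf \
    --config-name=pretrain_xsmall_span_aug do_training=True \
    ++model.data.dataset_path=/workspace_test/examples_1/tests/test_data/molecule \
    ++model.data.dataset.train=x000 ++model.data.dataset.val=x000 ++model.data.dataset.test=x000 \
    ++exp_manager.wandb_logger_kwargs.offline=False ++trainer.devices=8 ++trainer.num_nodes=4 \
    ++model.dwnstr_task_validation.enabled=False"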

Logging with WandB#

If you are launching the model training job interactively from the terminal or Jupyter Lab, you can set up your Weights and Biases access via wandb login <YOUR_WANDB_API_KEY>; check out https://docs.wandb.ai/ref/cli/wandb-login for more information. Alternatively, you may also export the API key as an environment variable at the time of launching the job via the command line, as shown in ${BIONEMO_HOME}/examples/molecule/megamolbart/scripts/pretrain_bcp_prd11.sh.
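For example, either of the following can be run inside the container before launching training; WANDB_API_KEY is the environment variable read by Weights and Biases, and the key value is a placeholder:

# Option 1: interactive login
wandb login <YOUR_WANDB_API_KEY>

# Option 2: export the key so that non-interactive (batch) jobs can pick it up
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>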

Output and Results#

As the MegaMolBART model training job is launched, BioNeMo will print out some of the details related to compute resources, model training configuration, and the dataset being used for training. As the job progresses, it will also print out various details related to the test/train/validation steps and accuracy metrics at set intervals.

Upon completion of the training process, it will also print out details of the log files, model checkpoints, and so on, which will be saved in the configured results directory (usually /result).



Finally, if Weights and Biases logging was enabled (for example, ++exp_manager.create_wandb_logger=True), you can also visualize the model training progress and resulting metrics, and a summary also gets printed on the terminal at the end of the training job.
