Training an LLM for a Custom Downstream Task: Retrosynthesis Prediction Using MegaMolBART#

This tutorial gives an overview of how to train our sequence-to-sequence MegaMolBART (MMB) model on retrosynthesis prediction as a downstream task.

Overview#

A practical goal of chemistry and materials science is to design molecules and materials with specific properties. In the realm of predicting chemical reactions, there are two main directions: forward mapping, which involves predicting the product from given reactants, and backward mapping, which entails designing the appropriate reactants based on a target product. The latter mapping direction is referred to as retrosynthesis, and it involves the planning of synthesis pathways.

Retrosynthesis is harder than forward mapping because it is a one-to-many mapping: several different reaction pathways may lead to the same desired compound. In recent years, artificial intelligence (AI) has emerged as a powerful tool to aid in retrosynthesis. AI-based retrosynthesis aims to automate the process by applying machine learning algorithms to data from previous chemical reactions. By learning from a large amount of chemical reaction data, AI models can propose potential synthetic pathways for target compounds. For the retrosynthesis prediction task, the model is given the products of a reaction and asked to predict the reactants.
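To make the input/output framing concrete, here is a minimal sketch, assuming reactions are written as reaction SMILES in the reactants>agents>products form. The helper name to_retro_pair and the esterification example are illustrative only and are not part of BioNeMo or of the USPTO-50K processing code.

```python
from typing import List, Tuple


def to_retro_pair(reaction_smiles: str) -> Tuple[str, List[str]]:
    """Split a reaction SMILES into (model input, expected output).

    For retrosynthesis the model input is the product side and the
    expected output is the list of reactants. Illustrative helper only.
    """
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected 'reactants>agents>products'")
    reactants, _agents, products = parts
    # Individual molecules on one side are separated by '.'
    return products, reactants.split(".")


# Acetic acid + ethanol -> ethyl acetate (agents field left empty).
source, target = to_retro_pair("CC(=O)O.OCC>>CC(=O)OCC")
print("input (product):    ", source)
print("output (reactants): ", target)
```

The forward-prediction task would use the same pair with input and output swapped, which is why one reaction dataset can serve both directions.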

Setup and Assumptions#

This tutorial assumes that a copy of the BioNeMo framework repo exists on a workstation or server and has been mounted inside the container at /workspace/bionemo, as described in the Code Development section of the Quickstart Guide. This path is referred to by the variable BIONEMO_WORKSPACE in the tutorial.

All commands should be executed inside the BioNeMo docker container.

A user may create or place the following code and execute files from the $BIONEMO_WORKSPACE/examples/<molecule_or_protein>/<model_name>/ folder, adjusting the path according to the use case.

Dataset#

The USPTO-50K dataset, which contains approximately 50,000 reactions, can be used for the retrosynthesis prediction task.

Training#

Create the required data directory in your local bionemo repository:

mkdir -p data/uspto_50k_dataset

Next, download the pickle file provided by Samuel Genheden of AstraZeneca from this URL and place it in this directory with the name uspto_50.pickle.
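Before launching the container, it can help to confirm that the file landed where the provided config expects it. The following is a small sketch using only the Python standard library. The helper name find_dataset is hypothetical, and the check starts from the current directory on the assumption that you run it from your local bionemo repository.

```python
from pathlib import Path
from typing import Optional

# Location described in this tutorial, relative to the bionemo repository root.
DATASET_RELATIVE_PATH = Path("data") / "uspto_50k_dataset" / "uspto_50.pickle"


def find_dataset(repo_root: Path) -> Optional[Path]:
    """Return the dataset path if the file exists under repo_root, else None."""
    candidate = repo_root / DATASET_RELATIVE_PATH
    return candidate if candidate.is_file() else None


# Assumption: the current working directory is the repository root.
found = find_dataset(Path.cwd())
if found is None:
    print(f"Missing: expected {DATASET_RELATIVE_PATH} under {Path.cwd()}")
else:
    print(f"Found {found} ({found.stat().st_size} bytes)")
```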

Now that the data prerequisites are in place, launch the BioNeMo development container, which will mount your local bionemo directory into the Docker container.

bash launch.sh dev

Training a retrosynthesis model from a pretrained MegaMolBART can be done in three steps within BioNeMo:

  • Process the downloaded USPTO-50K dataset

  • Additional pre-training of MegaMolBART using the USPTO-50K dataset

  • Train the fine-tuned model using the USPTO-50K dataset

Processing the dataset#

Using the BioNeMo retrosynthesis training script, we can automatically process the downloaded data by setting do_training to False, either inside the yaml config or through command-line arguments as shown below. This command needs to be executed from the /workspace/bionemo or <BioNeMo_ROOT> directory. The provided config will look for the unprocessed data you downloaded in /workspace/bionemo/data/uspto_50k_dataset/uspto_50.pickle

python examples/molecule/megamolbart/downstream_retro.py \
--config-name=downstream_retro_uspto50k \
++trainer.devices=1 \
++exp_manager.create_wandb_logger=False \
++do_training=False

This will download the dataset to the location specified by the dataset_path parameter in the config yaml file. By default, this is set to /data/uspto_50k_dataset.
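The ++key=value arguments in the command above are Hydra overrides: the ++ prefix sets a config key if it already exists and adds it otherwise. As a rough sketch of how the same command line could be assembled programmatically, the hypothetical helper build_command below turns a dictionary of overrides into an argument list. It only builds and prints the command and does not run anything.

```python
import shlex
from typing import Dict, List


def build_command(script: str, config_name: str, overrides: Dict[str, object]) -> List[str]:
    """Assemble a Hydra-style command line (illustrative helper, not a BioNeMo API)."""
    argv = ["python", script, f"--config-name={config_name}"]
    # '++' means: override the key if present in the config, otherwise add it.
    argv += [f"++{key}={value}" for key, value in overrides.items()]
    return argv


processing_cmd = build_command(
    "examples/molecule/megamolbart/downstream_retro.py",
    "downstream_retro_uspto50k",
    {
        "trainer.devices": 1,
        "exp_manager.create_wandb_logger": False,
        "do_training": False,  # process the data only, skip training
    },
)
print(shlex.join(processing_cmd))
```

Keeping the overrides in a dictionary makes it easy to flip a single value, such as do_training, between the processing and training runs.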

Finetune pretrained MMB model for retrosynthesis#

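Once the data has been processed, run the same script with do_training set to True to fine-tune the pretrained MMB model. As with the processing step, this command needs to be executed from the /workspace/bionemo or <BioNeMo_ROOT> directory.

python examples/molecule/megamolbart/downstream_retro.py \
--config-name=downstream_retro_uspto50k \
++trainer.devices=1 \
++exp_manager.create_wandb_logger=False \
++do_training=True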

BioNeMo also provides a config yaml file for inference located at examples/molecule/megamolbart/conf/infer_retro.yaml

Inference#

To run inference with your trained model, run the command below. Note that you need to set restore_from_path to the .nemo model you created during retrosynthesis training. This command needs to be executed from the /workspace/bionemo or <BioNeMo_ROOT> directory.

python bionemo/model/infer.py \
--config-path=examples/molecule/megamolbart/conf \
--config-name=infer_retro \
++model.downstream_task.restore_from_path=<RETRO_CHECKPOINT_PATH>
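If you are unsure which file to pass as <RETRO_CHECKPOINT_PATH>, a small script can search for the most recently written .nemo file. This is only a sketch: the helper name latest_nemo_checkpoint is hypothetical, and the directory to search ("results" below) is an assumption that should be replaced with wherever your training run wrote its outputs.

```python
from pathlib import Path
from typing import Optional


def latest_nemo_checkpoint(results_dir: Path) -> Optional[Path]:
    """Return the most recently modified .nemo file under results_dir, or None."""
    if not results_dir.is_dir():
        return None
    candidates = list(results_dir.rglob("*.nemo"))
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda path: path.stat().st_mtime)


# Assumption: "results" stands in for your actual training output directory.
checkpoint = latest_nemo_checkpoint(Path("results"))
if checkpoint is None:
    print("No .nemo checkpoint found; check the output directory of your training run.")
else:
    print(f"++model.downstream_task.restore_from_path={checkpoint}")
```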