Training LLM for a Custom Downstream Task: Retrosynthesis Prediction using MegaMolBART
This tutorial gives an overview of how to train the sequence-to-sequence MegaMolBART (MMB) model on retrosynthesis prediction as a downstream task.
Overview
A practical goal of chemistry and materials science is to design molecules and materials with specific properties. In the realm of predicting chemical reactions, there are two main directions: forward mapping, which involves predicting the product from given reactants, and backward mapping, which entails designing the appropriate reactants based on a target product. The latter mapping direction is referred to as retrosynthesis, and it involves the planning of synthesis pathways.
Retrosynthesis is harder than forward mapping because it is a one-to-many mapping: several different reaction pathways may lead to the same desired compound. In recent years, artificial intelligence (AI) has emerged as a powerful tool to aid in retrosynthesis. AI-based retrosynthesis aims to automate the process with machine learning models trained on data from previously recorded chemical reactions. By learning from a large body of reaction data, such models can propose potential synthetic pathways for target compounds. For the retrosynthesis prediction task, the model is given the product of a reaction and asked to predict the reactants.
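The input/target framing described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each reaction is written as a reaction SMILES string of the form reactants>agents>products; the function name split_reaction is hypothetical and is not part of BioNeMo, and the way the BioNeMo data loader actually parses USPTO-50K may differ.

```python
def split_reaction(reaction_smiles):
    """Split 'reactants>agents>products' into (model_input, model_target).

    Hypothetical helper for illustration only; not a BioNeMo API.
    """
    parts = reaction_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got: {reaction_smiles!r}")
    reactants, _agents, products = parts
    # Retrosynthesis reverses the reaction: the product is the model input
    # and the reactants are the sequence the model must generate.
    return products, reactants


# Esterification of acetic acid with ethanol (water omitted), as a reaction SMILES.
example = "CC(=O)O.OCC>>CC(=O)OCC"
source, target = split_reaction(example)
print("input :", source)
print("target:", target.split("."))  # individual reactants are separated by '.'
```

Because the mapping is one-to-many, the same product string may appear in a dataset with more than one valid reactant set.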
Setup and Assumptions
This tutorial assumes that a copy of the BioNeMo framework repo exists on a workstation or server and has been mounted inside the container at /workspace/bionemo, as described in the Code Development section of the Quickstart Guide. This path is referred to with the variable BIONEMO_WORKSPACE in the tutorial.
All commands should be executed inside the BioNeMo docker container.
You may create or place the following code and execute files from the $BIONEMO_WORKSPACE/examples/<molecule_or_protein>/<model_name>/ folder, adjusting the path to your use case.
Dataset
The USPTO-50K dataset, which contains approximately 50,000 reactions, can be used for the retrosynthesis prediction task.
Training
Launch the BioNeMo development container:
bash launch.sh dev
Training a retrosynthesis model from the pretrained MegaMolBART model can be done in three steps within BioNeMo:
Download the USPTO-50K dataset
Perform additional pre-training of MegaMolBART on the USPTO-50K dataset
Train the fine-tuned model on the USPTO-50K dataset
Downloading dataset
Using the BioNeMo retrosynthesis training script, we can download the data automatically by setting do_training to False, either in the YAML config or through command-line arguments as shown below. This command must be executed from the /workspace/bionemo or <BioNeMo_ROOT> directory.
python examples/molecule/megamolbart/downstream_retro.py \
--config-name=downstream_retro_uspto50k \
++trainer.devices=1 \
++exp_manager.create_wandb_logger=False \
++do_training=False
This will download the dataset to the location specified by the dataset_path parameter in the YAML config file. By default, this is set to /data/uspto_50k_dataset.
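The download and training runs differ only in the value of do_training, so the command line can be assembled programmatically. The sketch below is one way to do that, assuming the script is driven by Hydra-style overrides, where a ++key=value argument overrides the key if it exists in the config and adds it otherwise. The helper name build_retro_command is hypothetical; the script path, config name, and override keys are copied from the command above.

```python
import shlex

SCRIPT = "examples/molecule/megamolbart/downstream_retro.py"
CONFIG_NAME = "downstream_retro_uspto50k"


def build_retro_command(do_training, devices=1, use_wandb=False):
    """Return the argument list for one run of the retrosynthesis script.

    Hypothetical helper for illustration only; not a BioNeMo API.
    """
    overrides = {
        "trainer.devices": devices,
        "exp_manager.create_wandb_logger": use_wandb,
        "do_training": do_training,
    }
    command = ["python", SCRIPT, f"--config-name={CONFIG_NAME}"]
    # "++key=value" overrides the key if present in the config, or adds it if not.
    command += [f"++{key}={value}" for key, value in overrides.items()]
    return command


# do_training=False reproduces the download-only invocation shown above.
download_only = build_retro_command(do_training=False)
print(shlex.join(download_only))
```

Inside the container, the resulting list could be passed to subprocess.run(download_only, check=True, cwd="/workspace/bionemo") so that the relative script path resolves from the expected directory.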
Finetune pretrained MMB model for retrosynthesis
To fine-tune the model, run the training script with do_training set to True:
python downstream_retro.py ++trainer.devices=1 ++exp_manager.create_wandb_logger=False ++do_training=True
BioNeMo also provides a YAML config file for inference, located at examples/molecule/megamolbart/conf/infer_retro.yaml.
Inference
To run inference with your trained model, run the command below. Note that restore_from_path must be set to the .nemo model created during retrosynthesis training. This command must be executed from the /workspace/bionemo or <BioNeMo_ROOT> directory.
python examples/infer.py \
--config-path=examples/molecule/megamolbart/conf \
--config-name=infer_retro \
model.downstream_task.restore_from_path=<RETRO_CHECKPOINT_PATH>
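If several training runs have been made, it can help to locate the newest .nemo file before filling in <RETRO_CHECKPOINT_PATH>. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the directory to search is an assumption: this tutorial does not state where training writes its checkpoints, so pass whatever results directory your config uses.

```python
from pathlib import Path


def latest_nemo_checkpoint(results_dir):
    """Return the most recently modified .nemo file under results_dir, or None.

    Hypothetical helper for illustration only; not a BioNeMo API.
    """
    candidates = list(Path(results_dir).rglob("*.nemo"))
    if not candidates:
        return None
    # Pick the file with the latest modification time.
    return max(candidates, key=lambda path: path.stat().st_mtime)


def restore_override(checkpoint_path):
    """Format the command-line override used by the inference command above."""
    return f"model.downstream_task.restore_from_path={checkpoint_path}"
```

For example, restore_override(latest_nemo_checkpoint(results_dir)) yields the last argument of the inference command, where results_dir is the output directory of your own training run.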