DiffDock Model Training using BioNeMo#

The purpose of this tutorial is to provide an example use case of training a BioNeMo Geometric Diffusion model using the BioNeMo framework. At the end of this tutorial, the user will get experience in

  • configuring various config files and launch parameters for DiffDock training

  • launching single and multi-node, multi-gpu training runs

  • using NVIDIA’s Base Command Platform commands for geometric diffusion model training

Overview - DiffDock model#

DiffDock in Bionemo is based on the public DiffDock model, which is a score-based geometric diffusion model. DiffDock uses score model to perform reverse diffusion step by step to find the docking sites and the ligand poses, and use a separate confidence model to rank and select the best generated poses.

This DiffDock model training example walkthrough will show how to utilize the compute resources, download and preprocess the datasets, and perform model training on single and multiple nodes.

Setup and Assumptions#

This tutorial assumes that the user has access to BioNeMo framework and NVIDIA’s BCP and DGX-Cloud compute infrastructure. The user is also expected to have required background details about

  • the BioNeMo framework, as described in the Quickstart Guide, and

  • running the model training jobs on BCP

All model training related commands should be executed inside the BioNeMo docker container.

Requesting compute resources#

Access to DGX compute resource NGC site or NGC CLI#

As a prerequisite, configure your access to the DGX compute resources and required contents either via NVIDIA’s Base Command Platform or NGC-CLI using ngc config set command.

For more details on how to request the resources, visit Running BioNeMo on DGX-Cloud using BCP

Note: The interactive job launch example shown here using the Jupyter Lab interface is intended for initial user experience/trial runs. It is strongly advised to launch the model training jobs using the launch script as a part of the ngc batch run command, as mentioned in Running BioNeMo on DGX-Cloud using BCP. For DiffDock training, the model training script should be used as a template for launching the job as provided in <BioNeMo_Workspace>/example/molecule/diffdock/scripts/train_bcp.sh.

First, let’s request the resource for running the model training in an interactive manner.

Here is one such example of a command for requesting the resources using NGC-CLI. Make sure to update the relevant arguments according to the compute setup, datasets, workspaces, instance types, and so on.

In the configuration below, update nvidia, clara and 1.4 with the correct NGC org, team name and container image tag, respectively. If there is no team name, then this can be omitted. Refer to NGC documentation for more details.

ngc batch run \
  --name "example-training-diffdock" \
  --org nvidia \
  --team clara \
  --instance INSTANCE_TYPE \            #Compute node type, such as dgxa100.80g.4.norm
  --array-type PYTORCH \
  --replicas 2 \
  --image "nvidia/clara/bionemo-framework:1.4" \     #Image path for BioNeMo
  --result /results \
  --workspace WORKSPACE_ID:/bionemo_diffdock:RW \
  --port 8888 \
  --datasetid DATASET_ID:/workspace/bionemo/data/ \       # Dataset's NGC ID
  --total-runtime 1D \
  --preempt RUNONCE \                   
  --priority MEDIUM \                   # Priority level for the jog execution [LOW, MEDIUM, HIGH]
  --order 1 \                           # Priority order for the jog execution [1-99]
  --commandline "jupyter lab --allow-root --ip=* --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --ContentsManager.allow_hidden=True --notebook-dir=/ & sleep infinity"  # This command can be replaced with the model training command: python train.py....

The bcprun command provided in the cells below can also be submitted as --commandline argument (instead of launching Jupyter-lab).

Once the resources are assigned for the job and the BioNeMo container is running, we’ll proceed ahead via ngc batch exec <JOB_ID> or using the Jupyter-Lab interface accessible at https://<COMPUTE_HEAD_NODE_URL_ADDRESS>:8888.

Data Preprocessing for Score Model#

Downloading and pre-processing the example dataset for score model#

To briefly showcase the model training capacities of BioNeMo Framework, we will use a very small sample dataset (50 complex samples) from Posebusters benchmark set that is provided as a part of the preprocessed sample dataset with dataset id: 1617183

If you want to do training with your own pdb data, refer to the preprocessing instructions here. Remove the dataset id setting, update the total runtime, and update the data file directories with linking the mounted workspace folder with adding ln -s /bionemo_diffdock/data /workspace/bionemo/data; in the --commandline.

Model training: Score Model#

Example dataset#

In this test runs, we will use preconfigured parameters provided in the train_score.yaml config file located in the /workspace/bionemo/examples/molecule/diffdock/conf folder.

We will also set other parameters suitable for a quick run, such as ++trainer.max_epoch=2. User can update these parameters by editing the .yaml config file or as additional command line arguments, as shown in the example below. User can select the full dataset and adjust other parameters.

As we are connected to the compute node, we navigate to the BioNeMo home folder using the command cd /workspace/bionemo, and execute the following command in the terminal.

The bcprun command is similar to srun command in SLURM, you can find more details at the NVIDIA BCP User Guide.

BCP Run commands#

mkdir -p /bionemo_diffdock/results  && ln -s /bionemo_diffdock/results/ /workspace/bionemo/results
bcprun --debug --nnodes=1 --npernode=2 \
    -w /workspace/bionemo \
    --cmd 'export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync; \
    python examples/molecule/diffdock/train.py trainer.devices=2 trainer.num_nodes=1 \
    trainer.max_epochs=2 name=diffdock_score_training_test_nnodes_1_ndevices_2 \
    model.val_denoising_inference_freq=1 model.micro_batch_size=4 trainer.num_sanity_val_steps=0 \
    model.max_total_size=null model.estimate_memory_usage.maximal=null model.tensor_product.type=fast_tp trainer.log_every_n_steps=1 '



To run the model training on multiple nodes, you will have to update parameters accordingly, for example, the command running the model training job on 4 nodes would look like this, but don’t test this too small sample data with 4 nodes.

bcprun --debug --nnodes=4 --npernode=8 \
    -w /workspace/bionemo \
    --cmd 'mkdir -p /bionemo_diffdock/results ; ln -s /bionemo_diffdock/results/ /workspace/bionemo/results; export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync; python examples/molecule/diffdock/train.py trainer.devices=8 trainer.num_nodes=4 ... '

Note

To run the model training job on a local workstation, user can directly execute the train.py script with desired configurations. For example,

python examples/molecule/diffdock/train.py ...

Logging with WandB#

If you are launching the model training job interactively from the terminal/Jupyter-Lab, you can set your Weights and Biases access via wandb login <YOUR_WANDB_API_KEY> or checkout https://docs.wandb.ai/ref/cli/wandb-login for more information. Alternatively, you may also export the API key as a variable at the time of launching the job via command-line, as shown in /workspace/bionemo/examples/molecule/scripts/train_bcp.sh

Output and Results#

As the DiffDock model training job is launched, BioNeMo will print out some of the details related to compute resources, model training configuration, and dataset being used for training. As the job progresses, it will also print out various details related to the test/train/validation steps and accuracy matrices at a set intervals.

diffdock_1.png

diffdock_2.png

Upon the completion of training process, it will also print out the details related to log files, model checkpoints, and so on, that will also be saved in the directory as configured (here /workspace/bionemo/results).

diffdock_3.png

Finally, if Weights and Biases logging was enabled (for example, ++exp_manager.create_wandb_logger=True ), you can also visualize the model training progress and resulting matrices, and the summary also gets printed on the termainal at the end of the training job.

diffdock_4.png

Train a small score model#

Adjust model with adding following config settings to the python train.py

model.diffusion.tr_sigma_max=34 model.ns=24 model.nv=6 model.num_conv_layers=5

Train Confidence Model#

Adjust the config file used for training with

python examples/molecule/diffdock/train.py --config-name=train_confidence ...