
BioNeMo_MegaMolBART Framework

The BioNeMo_MegaMolBART training stack is built on CUDA-X, PyTorch, PyTorch Lightning, and Apex (for training at scale), together with NVIDIA's NeMo framework for building and fine-tuning large language models.

This framework supports distributed model training on multi-GPU, multi-node compute architectures in model-parallel, pipeline-parallel, and data-parallel configurations. Setting the desired training configuration is straightforward because BioNeMo_MegaMolBART reads its settings from configurable YAML files.
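The relationship between these three forms of parallelism can be illustrated with a small sketch. It assumes the usual Megatron-style convention that the total number of GPUs is split into model-parallel groups (tensor-parallel size times pipeline-parallel size), with the remaining factor used for data parallelism. The function and parameter names here are hypothetical and are not taken from the BioNeMo configuration files.

```python
def data_parallel_size(devices_per_node, num_nodes,
                       tensor_parallel=1, pipeline_parallel=1):
    """Return how many data-parallel replicas fit on the cluster.

    Assumption: world size = devices_per_node * num_nodes, and each model
    replica occupies tensor_parallel * pipeline_parallel GPUs.
    """
    world_size = devices_per_node * num_nodes
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be divided evenly into replicas "
            f"of {gpus_per_replica} GPUs"
        )
    return world_size // gpus_per_replica


# Two nodes with 8 GPUs each, 2-way tensor and 2-way pipeline parallelism:
# 16 GPUs / (2 * 2) GPUs per replica = 4 data-parallel replicas.
print(data_parallel_size(8, 2, tensor_parallel=2, pipeline_parallel=2))
```

With the default sizes of 1 for both model-parallel dimensions, every GPU holds a full copy of the model and the job is purely data parallel.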

Additionally, this framework allows for model checkpointing while training, so users can continue training with a previously trained (or even the provided pre-trained) model.
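As a rough illustration of resuming from an earlier run, the sketch below picks the most recently written checkpoint file in a directory. The `.ckpt` extension, the directory name, and the helper name `latest_checkpoint` are assumptions made for this example; the framework's own resume mechanism may select checkpoints differently.

```python
from pathlib import Path


def latest_checkpoint(checkpoint_dir, pattern="*.ckpt"):
    """Return the newest checkpoint file in checkpoint_dir, or None.

    Assumption: checkpoints are single files matching `pattern`, and the
    most recently modified one is the one to resume from.
    """
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    candidates = list(directory.glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


# Hypothetical directory name; prints None when nothing has been saved yet.
print(latest_checkpoint("results/checkpoints"))
```

A result of `None` means training would start from scratch, or from the provided pre-trained model if one is supplied.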

The library organization has the following key components:

examples/chem: Configuration files and training scripts
data: Classes and functions for loading and augmenting datasets
models: The NeMo MegaMolBART model
tokenizer: The MegaMolBART tokenizer for processing SMILES input
vocab: The default vocabulary file and regular expression for the tokenizer
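To make the role of the tokenizer and its regular expression more concrete, here is a minimal sketch of regex-based SMILES tokenization. The pattern is one commonly used for SMILES strings and is an assumption for illustration; it is not necessarily the expression shipped in the vocab directory, and `tokenize_smiles` is a hypothetical helper name.

```python
import re

# Assumed pattern: bracket atoms, two-letter halogens, common organic-subset
# atoms, aromatic atoms, bonds, branches, and ring-closure digits.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/"
    r"|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing on unrecognized characters."""
    tokens = SMILES_PATTERN.findall(smiles)
    # If the tokens do not reassemble the input, some character was skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES string: {smiles!r}")
    return tokens


# Alanine with a stereocenter; the bracket atom stays a single token.
print(tokenize_smiles("C[C@H](N)C(=O)O"))
```

Treating bracket atoms and two-letter elements such as `Cl` as single tokens keeps chemically meaningful units intact before they are mapped to vocabulary indices.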

The model configuration file (here, megamolbart_pretrain_large_span_aug.yaml) uses a hierarchical YAML format, as shown in the image. Users can set parameters such as the number of devices, the number of nodes, and the numerical precision.
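The idea of a hierarchical configuration can be sketched with a nested dictionary and dotted-key overrides. The keys below (`trainer.devices`, `trainer.num_nodes`, `trainer.precision`) mirror PyTorch Lightning trainer argument names, but their exact placement in megamolbart_pretrain_large_span_aug.yaml is an assumption, and `apply_override` is a hypothetical helper rather than part of the framework.

```python
import copy

# Assumed layout: a nested mapping, as a YAML file would load into.
base_config = {
    "trainer": {"devices": 1, "num_nodes": 1, "precision": 16},
    "model": {"name": "example_model"},  # placeholder value
}


def apply_override(config, dotted_key, value):
    """Return a copy of config with the entry at dotted_key set to value."""
    updated = copy.deepcopy(config)
    *parents, leaf = dotted_key.split(".")
    node = updated
    for key in parents:
        node = node[key]  # raises KeyError if a section is missing
    if leaf not in node:
        raise KeyError(f"Unknown configuration key: {dotted_key}")
    node[leaf] = value
    return updated


# Scale the run to 8 GPUs on each of 4 nodes without editing the base config.
config = apply_override(base_config, "trainer.devices", 8)
config = apply_override(config, "trainer.num_nodes", 4)
print(config["trainer"])
```

Rejecting unknown keys, rather than silently creating them, helps catch typos in parameter names before a long training job is launched.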

Similarly, the pre-training run script, megamolbart_pretrain.py, is written in Python. A set of Slurm and shell scripts (for example, megamolbart_pretrain_slurm.sh) is also provided to launch training jobs in the respective environments.