BioNeMo_MegaMolBART Framework
The BioNeMo_MegaMolBART training stack is built on CUDA-X, PyTorch, PyTorch Lightning, and Apex (for training at scale), together with NVIDIA's NeMo framework for building and fine-tuning large language models.
![clara-megamolbart-09.png](https://docscontent.nvidia.com/dims4/default/fc9a619/2147483647/strip/true/crop/600x290+0+0/resize/600x290!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F00000186-155c-dad2-a9a7-5ffccd510000%2Flaunchpad%2Fai%2Fclara-megamolbart%2Flatest%2F_images%2Fclara-megamolbart-09.png)
This framework enables distributed model training on multi-GPU, multi-node compute architectures in model-parallel, pipeline-parallel, and data-parallel configurations. Setting the desired training configuration is straightforward because BioNeMo_MegaMolBART uses configurable YAML files, as shown below:
![clara-megamolbart-10.png](https://docscontent.nvidia.com/dims4/default/69ee948/2147483647/strip/true/crop/979x298+0+0/resize/979x298!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F00000186-155c-dad2-a9a7-5ffccd510000%2Flaunchpad%2Fai%2Fclara-megamolbart%2Flatest%2F_images%2Fclara-megamolbart-10.png)
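For reference, the minimal sketch below shows the kind of parallelism and batching settings such a file exposes. The key names follow general NeMo Megatron conventions and are assumptions here; the shipped MegaMolBART configuration files may name or group them differently.

```yaml
# Illustrative sketch only -- key names follow NeMo Megatron conventions
# and may differ from the actual MegaMolBART configuration files.
model:
  micro_batch_size: 32              # per-GPU batch size (data parallelism)
  tensor_model_parallel_size: 1     # model (tensor) parallelism degree
  pipeline_model_parallel_size: 1   # pipeline parallelism degree
```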
Additionally, this framework allows for model checkpointing while training, so users can continue training with a previously trained (or even the provided pre-trained) model.
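Resuming from an earlier run is typically controlled through the experiment-manager section of the same YAML file. The sketch below uses NeMo's standard `exp_manager` keys as an assumed example; check the provided configuration files for the exact settings.

```yaml
# Illustrative sketch -- assumed exp_manager keys, following NeMo conventions.
exp_manager:
  resume_if_exists: True             # pick up the latest checkpoint if one exists
  resume_ignore_no_checkpoint: True  # start fresh when no checkpoint is found
  create_checkpoint_callback: True   # save checkpoints during training
```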
The library is organized into the following key components:
- `examples/chem`: configuration files and training scripts
- `data`: classes and functions for loading and augmenting datasets
- `models`: the NeMo MegaMolBART model
- `tokenizer`: the MegaMolBART tokenizer for processing SMILES input
- `vocab`: the default vocabulary file and regular expression for the tokenizer
![clara-megamolbart-11.png](https://docscontent.nvidia.com/dims4/default/51df9eb/2147483647/strip/true/crop/742x553+0+0/resize/742x553!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F00000186-155c-dad2-a9a7-5ffccd510000%2Flaunchpad%2Fai%2Fclara-megamolbart%2Flatest%2F_images%2Fclara-megamolbart-11.png)
The model configuration file (here, `megamolbart_pretrain_large_span_aug.yaml`) is in a hierarchical YAML format, as shown in the image. Users can set parameters such as devices, nodes, and precision levels.
![clara-megamolbart-12.png](https://docscontent.nvidia.com/dims4/default/3fc221a/2147483647/strip/true/crop/979x639+0+0/resize/979x639!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F00000186-155c-dad2-a9a7-5ffccd510000%2Flaunchpad%2Fai%2Fclara-megamolbart%2Flatest%2F_images%2Fclara-megamolbart-12.png)
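As an assumed illustration of that hierarchy (the section and key names below follow NeMo conventions rather than the exact contents of this file):

```yaml
# Illustrative sketch of the hierarchical layout -- section and key names
# follow NeMo conventions and are assumptions, not the exact file contents.
trainer:
  devices: 8           # GPUs per node
  num_nodes: 4         # number of compute nodes
  precision: 16        # mixed-precision (FP16) training
  max_steps: 1000000   # total optimizer steps
model:                 # architecture, tokenizer, and dataset settings go here
  seq_length: 512
exp_manager:           # logging and checkpointing settings go here
  name: megamolbart_pretrain
```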
Similarly, the pre-training run script `megamolbart_pretrain.py` is provided in Python, and a set of Slurm and shell scripts (for example, `megamolbart_pretrain_slurm.sh`) is also provided to launch training jobs in the respective environments.
![clara-megamolbart-13.png](https://docscontent.nvidia.com/dims4/default/908611f/2147483647/strip/true/crop/979x487+0+0/resize/979x487!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F00000186-155c-dad2-a9a7-5ffccd510000%2Flaunchpad%2Fai%2Fclara-megamolbart%2Flatest%2F_images%2Fclara-megamolbart-13.png)