BioNeMo Framework v1.2#
OpenFold implementation under BioNeMo framework, derived from public OpenFold and DeepMind AlphaFold-2.
DNABERT implementation for computing embeddings for each nucleotide in the input DNA sequence.
Training recipes for DNABERT and OpenFold, including automated data processing and full configuration for training.
Example tutorials for running inference using OpenFold.
Splice Prediction downstream task example for DNABERT.
Wrapper scripts for DNABERT and OpenFold to launch jobs on BCP.
Bug fixes and Improvements#
Interface changes for ESM2 data ingestion and pre-processing: to allow use of separate datasets for training, validation, and testing, config.model.data.dataset_path has been deprecated and replaced with separate dataset path settings for each split.
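A minimal sketch of what a per-split data configuration might look like. The key names below are assumptions for illustration only, since the exact replacement options are not named here; the point is that each split carries its own path instead of the single deprecated config.model.data.dataset_path.

```python
# Hypothetical per-split data config (key names are illustrative, not the
# framework's actual schema): each of train/val/test gets its own dataset
# path rather than sharing one config.model.data.dataset_path.
config = {
    "model": {
        "data": {
            "train": {"dataset_path": "/data/esm2/train"},
            "val": {"dataset_path": "/data/esm2/val"},
            "test": {"dataset_path": "/data/esm2/test"},
        }
    }
}

def dataset_path(cfg: dict, split: str) -> str:
    """Look up the dataset path for a given split ('train', 'val', 'test')."""
    return cfg["model"]["data"][split]["dataset_path"]
```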
An Enformer model is currently a work in progress.
OpenFold training does not yet include the MLPerf speed optimizations; these will ship in a subsequent release.
The container includes a known vulnerability that is exposed when using Apache Subversion (SVN).
The container also includes five other high-risk vulnerabilities associated with the NGC CLI, which is used to download models from the NGC Registry. The BioNeMo Framework container is intended to ship a functional development environment, not to serve as a production service container. A full vulnerability report can always be found on the NGC Registry.
BioNeMo Framework v1.1#
EquiDock for protein-protein docking pose prediction
DiffDock for protein-ligand blind docking pose generation
Training recipes for EquiDock and DiffDock, including automated data processing and full configuration for training.
Accelerated inference and training for DiffDock via fast tensor-product kernels.
Example tutorials for running inference using EquiDock and DiffDock.
Recipes for running EquiDock and DiffDock on BCP and Slurm.
Pipeline parallelism supported for ESM-2nv.
Migration of inference notebooks to pytriton.
Bug fixes and Improvements#
Faster pre-processing of data on BCP.
Refactor of download_models.sh to download_models.py for easier CLI use.
Refactor of the install structure, moving from /opt/nvidia to /workspace/bionemo. The environment variable $BIONEMO_HOME now points to the repository base and must be set for tests to pass.
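The requirement above can be mirrored with a small guard, sketched here as an assumption rather than actual framework code, that fails fast when the variable is missing:

```python
import os

def resolve_bionemo_home(env=None) -> str:
    # Illustrative helper (not part of the framework): BIONEMO_HOME must
    # point to the repo base; the release notes give /workspace/bionemo
    # as the new install location.
    env = os.environ if env is None else env
    home = env.get("BIONEMO_HOME")
    if not home:
        raise RuntimeError(
            "BIONEMO_HOME is not set; export it (e.g. /workspace/bionemo) "
            "before running the test suite."
        )
    return home
```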
SchedMD Slurm in the release container is shipped with a security vulnerability, CVE-2022-29501, and therefore this version of Slurm should not be used to run a Slurm cluster (specifically, the processes
In general, the BioNeMo Framework release ships code and an environment designed to be executed on local workstations or deployed on clusters for large-scale training jobs. This container is not designed to run as a service with public-facing APIs. A full summary of security vulnerabilities can be found here.
BioNeMo Framework v1.0#
ESM-2nv for protein sequence representations; pretrained weights of ESM-2 650M and ESM-2 3B, converted from the Hugging Face checkpoints, are available.
Pre-training recipes for ESM-2nv, including automated data processing and full configuration for training
Fine-tuning of ESM-2nv with encoder frozen or trainable
Downstream task fine-tuning support for single-value classification (e.g. subcellular localization), single-value regression (e.g. meltome), and per-token classification (e.g. secondary structure)
Validation in loop to evaluate performance on downstream tasks during training
Example tutorials for pre-training, fine-tuning, and downstream tasks
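To make the three downstream task types above concrete, here is an illustrative sketch (plain Python, not the framework API) of the output shape each task's prediction head would produce for a single protein sequence:

```python
def head_output_shape(task: str, seq_len: int, num_classes: int = 1):
    # Single-value tasks (e.g. subcellular localization, meltome) reduce
    # the whole sequence to one prediction; per-token tasks (e.g. secondary
    # structure) emit one prediction per residue. Shapes are illustrative.
    if task == "single_value_classification":
        return (num_classes,)
    if task == "single_value_regression":
        return (1,)
    if task == "per_token_classification":
        return (seq_len, num_classes)
    raise ValueError(f"unknown task type: {task}")
```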
BioNeMo Framework v0.4.0#
ESM-1nv for protein sequence representations, pretrained weights available
ProtT5nv for protein sequence representation and sequence-to-sequence tasks, pretrained weights available
Pre-training for all models, including automated data processing and full configuration for training
Fine-tuning of MegaMolBART, ESM-1nv, and ProtT5nv with encoder frozen or trainable
Downstream task example applications: secondary structure prediction for ESM-1nv and ProtT5nv; physchem property prediction (lipophilicity, FreeSolv, ESOL) and retrosynthesis prediction for MegaMolBART
Validation in loop to evaluate performance on downstream tasks during training: physchem prediction (MegaMolBART) and secondary structure prediction (ESM-1nv and ProtT5nv).
Pipeline parallelism supported as a beta feature; not yet fully tested.
Example notebooks for pre-training, fine-tuning, and downstream tasks
Data preprocessing on DGX Cloud is slow; it is faster to run preprocessing on a local machine.
BioNeMoDataModule - Encapsulates dataset instantiation in BioNeMo models so that many different datasets can be used with the same model
EncoderFineTuning - Base class to facilitate implementation of downstream tasks built on embeddings from other models
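A hedged, plain-Python sketch of how these two abstractions fit together; the class and method names below are illustrative, not the actual BioNeMo API. The data module owns dataset creation, and the fine-tuning class builds a downstream prediction on top of encoder embeddings, so the model hard-codes neither.

```python
class ToyDataModule:
    """Illustrative stand-in for BioNeMoDataModule: the model asks the data
    module for splits instead of instantiating datasets itself."""

    def __init__(self, splits: dict):
        self.splits = splits  # e.g. {"train": [...], "val": [...]}

    def get_dataset(self, split: str):
        return self.splits[split]


class MeanPoolFineTuner:
    """Illustrative stand-in for an EncoderFineTuning subclass: produces a
    downstream prediction from embeddings supplied by an encoder."""

    def __init__(self, encoder):
        self.encoder = encoder  # callable: sequence -> list of floats

    def predict(self, sequence):
        embedding = self.encoder(sequence)
        # Mean-pool per-token embeddings into a single-value prediction.
        return sum(embedding) / len(embedding)
```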