Overview#
Introduction#
GenMol is a masked diffusion model trained on Sequential Attachment-based Fragment Embedding (SAFE) [1] representations for fragment-based molecule generation, and it can serve as a generalist model for various drug discovery tasks. GenMol's most important feature compared with other molecular generation models is its use of the SAFE format, which lets users design templates for a highly flexible generation schema, such as:
Specifying fixed fragment(s), which will remain unchanged in generation.
Specifying the positions to which generated fragments will attach.
Generating a partial or full fragment or generating multiple fragments.
Generating fragments at any range of lengths specified.
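As a rough illustration of the template idea above, the following Python sketch assembles a SAFE-style template from fixed fragments and masked placeholders. SAFE writes a molecule as dot-separated fragments whose attachment points are encoded as ring-closure digits. The `[*{min-max}]` mask syntax and the helper names `mask_fragment` and `build_template` are assumptions made for this example; consult the Model Card and the NIM API reference for the exact template grammar.

```python
def mask_fragment(min_len, max_len):
    """Return a placeholder for one masked fragment.

    ASSUMPTION: the "[*{min-max}]" syntax (asterisk plus a length range)
    is illustrative and not confirmed by this page.
    """
    if not 0 < min_len <= max_len:
        raise ValueError("expected 0 < min_len <= max_len")
    return f"[*{{{min_len}-{max_len}}}]"


def build_template(fixed_fragments, masked_fragments):
    """Join fixed and masked fragments with '.', as SAFE does for fragments."""
    return ".".join(list(fixed_fragments) + list(masked_fragments))


# A benzene ring with one open attachment point (the unmatched ring-closure
# digit 3), plus one masked fragment of 15 to 25 tokens to be generated.
template = build_template(["c13ccccc1"], [mask_fragment(15, 25)])
print(template)
```

Keeping the fixed fragments and the masks as separate lists makes it easy to vary only the generated part (its count or length range) while the fixed part stays unchanged.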
Inference with this model is a masking-unmasking process inspired by masked discrete diffusion [2]. The model accepts a SAFE-formatted text sequence (representing a molecule) in which some fragments are masked (represented by asterisk symbols) and converts it into tokens (including mask tokens) before passing it to the Transformer/BERT-based neural network. In each forward pass, the model predicts the tokens at all masked positions, but only the one or few predictions with the highest probability are kept in that step. The process repeats until all masked positions are filled.
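The loop described above can be sketched in a few lines of Python. This is a toy illustration only, not GenMol's implementation: `toy_predict` is a hypothetical stand-in that returns random tokens and random confidence scores where a real model would return learned probabilities, and the `[MASK]` token, vocabulary, and `tokens_per_step` parameter are all assumptions chosen for the example. The filled-in sequence is therefore not expected to be a valid molecule.

```python
import random

MASK = "[MASK]"  # hypothetical mask token used only in this sketch


def toy_predict(tokens, vocab, rng):
    """Stand-in for the network: guess a (token, confidence) pair for
    every masked position. A real model would output learned probabilities."""
    return {
        position: (rng.choice(vocab), rng.random())
        for position, token in enumerate(tokens)
        if token == MASK
    }


def iterative_unmask(tokens, vocab, tokens_per_step=1, seed=0):
    """Fill masked positions a few at a time, most confident first."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    rng = random.Random(seed)
    tokens = list(tokens)
    steps = 0
    while MASK in tokens:
        predictions = toy_predict(tokens, vocab, rng)
        # Rank all masked positions by confidence and commit only the top few.
        ranked = sorted(predictions.items(), key=lambda item: item[1][1], reverse=True)
        for position, (token, _confidence) in ranked[:tokens_per_step]:
            tokens[position] = token
        steps += 1
    return tokens, steps


# Two fixed tokens followed by five masked positions.
sequence = ["C", "C", MASK, MASK, MASK, MASK, MASK]
filled, steps = iterative_unmask(sequence, vocab=["C", "N", "O"], tokens_per_step=2)
print(filled, steps)
```

Committing more tokens per step reduces the number of forward passes, at the cost of accepting predictions that were made with less context.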
Applications#
GenMol can be applied to a variety of molecular generation scenarios:
De novo generation - random sampling of valid molecular sequences of a given length.
Conditioned generation - sequence completion from a given molecular structure, such as:
Motif extension & scaffold decoration - adding a new fragment sequence to a specified attachment point in a molecule.
Superstructure generation - adding a new fragment sequence to any of the possible attachment points (randomly) in a molecule.
Linker design - generation of a sequence linking two separate molecular fragments at specified attachment points.
Molecule optimization with oracle methods, such as:
Hit generation - screening drug candidates generated from a library of molecular fragments for effective binding affinity to the target.
Lead optimization - sampling molecules derived from a pre-identified lead compound to find ones with improved drug-like properties, such as ADMET.
NIM Features#
The GenMol model is packaged and delivered as an NVIDIA Inference Microservice (NIM) to provide high-performance, user-friendly AI inference, with the following key features:
Quick and scalable deployment - GenMol NIM can be downloaded as a Docker image and quickly deployed to supported Linux systems with NVIDIA GPUs, and it can be easily scaled up to any number of GPUs according to the requirements of the workload.
Simple interface - GenMol inference requests follow the OpenAPI standard and are sent to the HTTP(S) endpoints of the NIM, whether it is hosted in the NVIDIA Preview-API Catalog or self-deployed on an on-premises platform. This makes it straightforward to integrate GenMol into pipelines and workflows for molecular design and discovery.
Enterprise-level optimization and support - NIM products are highly optimized for the performance, reliability, and security of AI inference, and are continuously monitored and patched for CVEs to ensure enterprise-level quality.
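To make the HTTP interface concrete, here is a minimal Python sketch that prepares a JSON POST request using only the standard library. The base URL, the `/generate` path, the payload fields `smiles` and `num_molecules`, and the template string are all assumptions for illustration; the authoritative endpoint names and request schema are those in the NIM's OpenAPI specification. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_generate_request(base_url, template, num_molecules=10):
    """Prepare (but do not send) a JSON POST request.

    ASSUMPTION: the "/generate" path and both payload field names are
    hypothetical placeholders; check the NIM's OpenAPI specification.
    """
    payload = {"smiles": template, "num_molecules": num_molecules}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


request = build_generate_request(
    "http://localhost:8000",   # assumed address of a self-deployed NIM
    "c13ccccc1.[*{15-25}]",    # placeholder template with one masked fragment
    num_molecules=5,
)
print(request.get_method(), request.full_url)

# To actually send it against a running service:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Separating request construction from sending keeps the sketch usable without a running service and makes the payload easy to inspect before it goes over the wire.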

Note
A more detailed description of the model can be found in the Model Card.
References#
[1] Gotta be SAFE: A New Framework for Molecular Design, Noutahi et al., 2023.
[2] Simple and Effective Masked Diffusion Language Models, Sahoo et al., 2024.