Benchmarks
The BioNeMo GenMol benchmark measures both performance and accuracy on two fragment-completion tasks, motif-extension and scaffold-decoration, with 10 tests per task drawn from the SAFE-DRUGS dataset. The tests follow those used for SAFE-GPT. More information about these tasks can be found in this article.
For the task of motif-extension, parameters are chosen as:
mask_length = 17
temperature = 1.2
noise = 1.6
For the task of scaffold-decoration, parameters are chosen as:
mask_length = 17
temperature = 1.2
noise = 2.0
Performance
Average wall-time (in seconds) for generating 1000 molecules across 10 tests on each supported GPU.
| GPU | motif-extension (s) | scaffold-decoration (s) |
|---|---|---|
| A10G | 3.893 | 3.051 |
| A100 | 1.598 | 1.379 |
| L40S | 1.625 | 1.458 |
| RTX6000 Ada | 1.680 | 1.423 |
| RTX6000 Blackwell | 0.909 | 0.811 |
| H100 | 1.051 | 0.961 |
| H200 | 0.896 | 0.804 |
| B200 | 0.684 | 0.653 |
| B300 | 1.676 | 0.947 |
| GH200 | 0.911 | 0.901 |
| GB10 (DGX Spark) | 3.938 | 2.811 |
| GB200 | 0.941 | 0.870 |
| GB300 | 1.483 | 0.924 |
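The averaging described above can be sketched as a small timing loop. This is a minimal illustration, not the actual benchmark harness: `average_wall_time`, the `generate` callable, and the stand-in `fake_generate` are hypothetical names introduced here, and the sketch assumes each of the 10 tests is timed once.

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def average_wall_time(
    generate: Callable[[str, int], Sequence[str]],
    test_inputs: Sequence[str],
    num_molecules: int = 1000,
) -> float:
    """Return the mean wall-time in seconds of `generate` over all test inputs."""
    durations = []
    for fragment in test_inputs:
        start = time.perf_counter()
        generate(fragment, num_molecules)  # hypothetical call into the model
        durations.append(time.perf_counter() - start)
    return mean(durations)


def fake_generate(fragment: str, n: int) -> List[str]:
    """Stand-in generator (assumption) so the sketch runs without a model or GPU."""
    return [fragment] * n


test_inputs = ["fragment_%d" % i for i in range(10)]  # placeholder inputs
avg = average_wall_time(fake_generate, test_inputs)
print(f"average wall-time: {avg:.6f} s")
```

When timing real GPU work, the generation call has to block until the device has finished, since GPU kernels are typically launched asynchronously; otherwise the measured time covers only the launch.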
Accuracy
Accuracy is evaluated by generating 100 molecules and computing the following metrics:
validity: fraction of generated SMILES that are valid.
uniqueness: ratio of unique molecules from all valid molecules.
diversity: average pair-wise distances in molecular fingerprints of generated molecules.
novelty: average distances of molecular fingerprints from the input molecule to generated molecules.
quality: fraction of generated molecules with QED_score > 0.6 and SA_score < 4.
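The metric definitions above can be sketched in plain Python. This is an illustration under stated assumptions, not the benchmark's implementation: fingerprints are modeled as sets of "on" bit indices, the distance is assumed to be Tanimoto distance (the page does not name the distance), a failed SMILES parse is represented as `None`, and QED and SA scores are assumed to be computed elsewhere. All function names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Optional, Sequence, Tuple

Fingerprint = FrozenSet[int]  # assumption: set of "on" bit indices


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """1 - |a & b| / |a | b|; two empty fingerprints count as identical."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def validity(canonical: Sequence[Optional[str]]) -> float:
    """Fraction of generated strings that parsed (None marks a failed parse)."""
    return sum(c is not None for c in canonical) / len(canonical)


def uniqueness(canonical: Sequence[Optional[str]]) -> float:
    """Ratio of distinct valid molecules to all valid molecules."""
    valid = [c for c in canonical if c is not None]
    return len(set(valid)) / len(valid) if valid else 0.0


def diversity(fps: Sequence[Fingerprint]) -> float:
    """Mean pairwise distance between generated fingerprints."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


def novelty(input_fp: Fingerprint, fps: Sequence[Fingerprint]) -> float:
    """Mean distance from the input molecule to each generated molecule."""
    if not fps:
        return 0.0
    return sum(tanimoto_distance(input_fp, fp) for fp in fps) / len(fps)


def quality(scores: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (QED, SA) pairs with QED > 0.6 and SA < 4."""
    return sum(qed > 0.6 and sa < 4 for qed, sa in scores) / len(scores)


# Toy data for illustration only; not taken from the benchmark.
canonical = ["CCO", "CCO", None, "c1ccccc1"]
fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({1, 2, 3})]
input_fp = frozenset({1, 2})
scores = [(0.7, 3.0), (0.5, 2.0), (0.9, 4.5), (0.65, 3.9)]

print(validity(canonical), uniqueness(canonical))
print(diversity(fps), novelty(input_fp, fps), quality(scores))
```

In practice the canonical SMILES, fingerprints, and QED/SA scores would come from a cheminformatics toolkit; the set-based fingerprints here only keep the sketch self-contained.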
Reference values measured on H100:
| metric | motif-extension | scaffold-decoration |
|---|---|---|
| validity | 0.889 | 0.995 |
| uniqueness | 0.670 | 0.756 |
| diversity | 0.674 | 0.564 |
| novelty | 0.691 | 0.624 |
| quality | 0.188 | 0.354 |
Values are single-run measurements; standard deviations across runs are not reported. Accuracy is broadly consistent across the validated GPU matrix: validity stays within approximately ±2 percentage points and quality within approximately ±4 percentage points across all 13 supported SKUs.
v1 vs v2 reference (H100)
The following table compares v1 and v2 measurements on the same H100 reference SKU, using the same benchmark harness and methodology in both runs.
| Metric | v1 H100 | v2 H100 | Δ |
|---|---|---|---|
| Performance (wall-time, seconds) | | | |
| motif-extension | 1.824 | 1.051 | −42.4% |
| scaffold-decoration | 0.972 | 0.961 | −1.1% |
| Accuracy (motif-extension) | | | |
| validity | 0.915 | 0.889 | −0.026 |
| uniqueness | 0.671 | 0.670 | −0.001 |
| diversity | 0.604 | 0.674 | +0.070 |
| novelty | 0.684 | 0.691 | +0.007 |
| quality | 0.273 | 0.188 | −0.085 |
| Accuracy (scaffold-decoration) | | | |
| validity | 0.967 | 0.995 | +0.028 |
| uniqueness | 0.763 | 0.756 | −0.007 |
| diversity | 0.555 | 0.564 | +0.009 |
| novelty | 0.655 | 0.624 | −0.031 |
| quality | 0.346 | 0.354 | +0.008 |
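The Δ column appears to use two conventions: relative change in percent for wall-time and absolute difference for the accuracy metrics. That reading is inferred from the reported numbers rather than stated on this page. A short sketch that reproduces three of the entries, with hypothetical helper names:

```python
def relative_change_percent(v1: float, v2: float) -> float:
    """Change from v1 to v2 as a percentage of v1 (negative means faster)."""
    return (v2 - v1) / v1 * 100.0


def absolute_change(v1: float, v2: float) -> float:
    """Plain difference v2 - v1, used here for metrics on a 0-1 scale."""
    return v2 - v1


# Wall-time values (seconds) copied from the table above: (v1, v2).
wall_time = {
    "motif-extension": (1.824, 1.051),
    "scaffold-decoration": (0.972, 0.961),
}
for task, (v1, v2) in wall_time.items():
    print(f"{task}: {relative_change_percent(v1, v2):+.1f}%")

# Motif-extension validity, v1 vs v2.
print(f"motif-extension validity: {absolute_change(0.915, 0.889):+.3f}")
```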
Motif-extension quality is sensitive to hyperparameter selection (temperature,
noise, and mask_length). The values reported above use the settings listed in the
test parameters at the top of this page; customer applications can achieve different
accuracy/quality trade-offs by tuning these parameters.
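One way to explore these trade-offs is a small grid sweep over the three parameters. The sketch below is illustrative only: `sweep`, `evaluate`, and `dummy_evaluate` are hypothetical names, the evaluation function is a placeholder that returns no real measurements, and the grid values other than the documented settings (temperature 1.2, noise 1.6 and 2.0, mask_length 17) are arbitrary examples.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]
Metrics = Dict[str, float]


def sweep(
    evaluate: Callable[..., Metrics],
    temperatures: Sequence[float],
    noises: Sequence[float],
    mask_lengths: Sequence[int],
) -> List[Tuple[Params, Metrics]]:
    """Evaluate every combination of the three parameters and collect results."""
    results = []
    for t, n, m in product(temperatures, noises, mask_lengths):
        params = {"temperature": t, "noise": n, "mask_length": m}
        results.append((params, evaluate(**params)))
    return results


def dummy_evaluate(temperature: float, noise: float, mask_length: int) -> Metrics:
    """Placeholder (assumption): a real version would generate and score molecules."""
    return {"quality": 0.0}


# 1.0 is an arbitrary extra temperature; the other values are the page's settings.
grid = sweep(dummy_evaluate, [1.0, 1.2], [1.6, 2.0], [17])
for params, metrics in grid:
    print(params, metrics)
```

Because the reported values are single-run measurements, comparing grid points on real data would likely need repeated runs per setting.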