Benchmarks

BioNeMo GenMol is benchmarked for both performance and accuracy on two fragment-completion tasks, motif-extension and scaffold-decoration, with 10 tests per task drawn from the SAFE-DRUGS dataset. The tests follow those used for SAFE-GPT. More information about these tasks can be found in this article.

For the motif-extension task, the parameters are:

  • mask_length = 17

  • temperature = 1.2

  • noise = 1.6

For the scaffold-decoration task, the parameters are:

  • mask_length = 17

  • temperature = 1.2

  • noise = 2.0
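
For reference, the two parameter sets above can be captured in a small configuration object. This is only a sketch: the `TaskParams` class and the `BENCHMARK_PARAMS` mapping are hypothetical names introduced for illustration and are not part of the GenMol API. Only the parameter names and values come from the lists above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TaskParams:
    """Hypothetical container for the benchmark parameters listed above."""
    mask_length: int
    temperature: float
    noise: float


# Values copied from the parameter lists above; the mapping itself is illustrative.
BENCHMARK_PARAMS = {
    "motif-extension": TaskParams(mask_length=17, temperature=1.2, noise=1.6),
    "scaffold-decoration": TaskParams(mask_length=17, temperature=1.2, noise=2.0),
}

# Find which parameters differ between the two tasks.
motif = asdict(BENCHMARK_PARAMS["motif-extension"])
scaffold = asdict(BENCHMARK_PARAMS["scaffold-decoration"])
differing = {k: (motif[k], scaffold[k]) for k in motif if motif[k] != scaffold[k]}
print(differing)
```

The comparison shows that the two tasks share `mask_length` and `temperature` and differ only in `noise`.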

Performance

Average wall-time (in seconds) for generating 1000 molecules across 10 tests on each supported GPU.

| GPU | motif-extension | scaffold-decoration |
|---|---|---|
| A10G | 3.893 | 3.051 |
| A100 | 1.598 | 1.379 |
| L40S | 1.625 | 1.458 |
| RTX6000 Ada | 1.680 | 1.423 |
| RTX6000 Blackwell | 0.909 | 0.811 |
| H100 | 1.051 | 0.961 |
| H200 | 0.896 | 0.804 |
| B200 | 0.684 | 0.653 |
| B300 | 1.676 | 0.947 |
| GH200 | 0.911 | 0.901 |
| GB10 (DGX Spark) | 3.938 | 2.811 |
| GB200 | 0.941 | 0.870 |
| GB300 | 1.483 | 0.924 |
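
The measurement described above (wall-time to generate 1000 molecules, averaged over the 10 tests of a task) can be sketched as follows. The `generate` callable is a hypothetical stand-in for whatever client call produces molecules. The actual benchmark harness is not shown on this page, so details such as warm-up runs are not modeled here.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def average_wall_time(
    generate: Callable[[str, int], Sequence[str]],
    test_inputs: Sequence[str],
    num_molecules: int = 1000,
) -> float:
    """Return the mean wall-time in seconds of `generate` over all test inputs."""
    timings = []
    for fragment in test_inputs:
        start = time.perf_counter()
        generate(fragment, num_molecules)
        timings.append(time.perf_counter() - start)
    return mean(timings)


def fake_generate(fragment: str, n: int) -> list:
    # Placeholder so the sketch runs without a model; it just repeats the input.
    return [fragment] * n


# Hypothetical inputs: 10 tests for one task.
tests = [f"fragment-{i}" for i in range(10)]
avg_seconds = average_wall_time(fake_generate, tests)
print(f"average wall-time: {avg_seconds:.6f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock suited to measuring elapsed intervals.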

Accuracy

Accuracy is evaluated by generating 100 molecules and computing the following metrics:

  • validity: fraction of generated SMILES that are valid.

  • uniqueness: fraction of valid molecules that are unique.

  • diversity: average pairwise distance between the molecular fingerprints of the generated molecules.

  • novelty: average distance between the molecular fingerprint of the input molecule and those of the generated molecules.

  • quality: fraction of generated molecules with QED_score > 0.6 and SA_score < 4.
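
As an illustration of the definitions above, the sketch below computes the five metrics in plain Python. It assumes several details the page does not state: fingerprints are represented as sets of on-bit indices, distance is taken to be Tanimoto distance (1 minus Tanimoto similarity), diversity and novelty are computed over the valid molecules only, and validity, canonical SMILES, QED and SA scores are supplied precomputed (in practice a cheminformatics toolkit would provide them). The `Molecule` record and all function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Molecule:
    """Hypothetical record of precomputed properties for one generated molecule."""
    canonical_smiles: str        # empty string if the SMILES failed to parse
    valid: bool
    fingerprint: FrozenSet[int]  # indices of on-bits (assumed representation)
    qed: float
    sa: float


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - |intersection| / |union|; two empty fingerprints count as identical."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def _mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def evaluate(generated: List[Molecule], input_fp: FrozenSet[int]) -> dict:
    if not generated:
        raise ValueError("need at least one generated molecule")
    valid = [m for m in generated if m.valid]
    unique = {m.canonical_smiles for m in valid}
    # Assumption: only valid molecules can pass the QED/SA thresholds.
    good = [m for m in valid if m.qed > 0.6 and m.sa < 4]
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "diversity": _mean(
            tanimoto_distance(a.fingerprint, b.fingerprint)
            for a, b in combinations(valid, 2)
        ),
        "novelty": _mean(tanimoto_distance(input_fp, m.fingerprint) for m in valid),
        "quality": len(good) / len(generated),
    }


# Toy data, invented purely to exercise the functions.
mols = [
    Molecule("CCO", True, frozenset({1, 2, 3}), 0.7, 2.0),
    Molecule("CCO", True, frozenset({1, 2, 3}), 0.7, 2.0),
    Molecule("CCN", True, frozenset({3, 4}), 0.5, 3.0),
    Molecule("", False, frozenset(), 0.0, 10.0),
]
metrics = evaluate(mols, input_fp=frozenset({1, 2}))
print(metrics)
```

On the toy data, three of four molecules are valid (validity 0.75), two of the three valid molecules are distinct (uniqueness 2/3), and two of four pass both thresholds (quality 0.5).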

Reference values measured on H100:

| Metric | motif-extension | scaffold-decoration |
|---|---|---|
| validity | 0.889 | 0.995 |
| uniqueness | 0.670 | 0.756 |
| diversity | 0.674 | 0.564 |
| novelty | 0.691 | 0.624 |
| quality | 0.188 | 0.354 |

Values are single-run measurements; standard deviations across runs are not reported. Accuracy is broadly consistent across the validated GPU matrix: validity stays within approximately ±2 percentage points and quality within approximately ±4 percentage points across all 13 supported SKUs.

v1 vs v2 reference (H100)

The following table compares v1 and v2 measurements on the same H100 reference SKU, using the same benchmark harness and methodology in both runs.

| Metric | v1 H100 | v2 H100 | Δ |
|---|---|---|---|
| Performance (wall-time, seconds) | | | |
| motif-extension | 1.824 | 1.051 | −42.4% |
| scaffold-decoration | 0.972 | 0.961 | −1.1% |
| Accuracy: motif-extension | | | |
| validity | 0.915 | 0.889 | −0.026 |
| uniqueness | 0.671 | 0.670 | −0.001 |
| diversity | 0.604 | 0.674 | +0.070 |
| novelty | 0.684 | 0.691 | +0.007 |
| quality | 0.273 | 0.188 | −0.085 |
| Accuracy: scaffold-decoration | | | |
| validity | 0.967 | 0.995 | +0.028 |
| uniqueness | 0.763 | 0.756 | −0.007 |
| diversity | 0.555 | 0.564 | +0.009 |
| novelty | 0.655 | 0.624 | −0.031 |
| quality | 0.346 | 0.354 | +0.008 |
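
The Δ column uses two conventions, as the table values bear out: performance rows report the relative change in percent, (v2 − v1) / v1 × 100, while accuracy rows report the absolute difference v2 − v1. The short check below reproduces a few rows; the helper names are illustrative.

```python
def relative_change_pct(v1: float, v2: float) -> float:
    """Relative change from v1 to v2, in percent (used for wall-time rows)."""
    return (v2 - v1) / v1 * 100.0


def absolute_change(v1: float, v2: float) -> float:
    """Plain difference v2 - v1 (used for accuracy rows)."""
    return v2 - v1


# Performance rows: (v1, v2) wall-times in seconds, taken from the table.
motif_perf = round(relative_change_pct(1.824, 1.051), 1)
scaffold_perf = round(relative_change_pct(0.972, 0.961), 1)

# Accuracy rows: motif-extension quality and scaffold-decoration validity.
motif_quality = round(absolute_change(0.273, 0.188), 3)
scaffold_validity = round(absolute_change(0.967, 0.995), 3)

print(f"{motif_perf:+.1f}%  {scaffold_perf:+.1f}%")
print(f"{motif_quality:+.3f}  {scaffold_validity:+.3f}")
```

A negative percentage in the performance rows means v2 is faster, since wall-time decreased.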

Motif-extension quality is sensitive to hyperparameter selection (temperature, noise, and mask_length). The values reported above use the defaults listed in the test parameters at the top of this page; customer applications can achieve different accuracy/quality trade-offs by tuning these parameters.