Model Benchmarks

Protein Sequence Representation

Metrics and datasets are from FLIP.

Metric			Dataset
Type	Name	Definition	Split Name	Split Definition
Classification	Secondary Structure	Predict one of three secondary structure classes (helix, sheet, coil) for each amino acid in a protein sequence.	Sampled	Randomly split sequences into train/test with 95/5% probability.
Classification	Subcellular Localization (SCL)	For each protein, predict one of ten subcellular locations (cytoplasm, nucleus, cell membrane, mitochondrion, endoplasmic reticulum, lysosome/vacuole, Golgi apparatus, peroxisome, extracellular, and plastid).	Mixed Soft	The mixed soft split uses the train, validation, and test splits provided in the DeepLoc 1.0 publication.
Classification	Conservation	Predict one of nine possible conservation classes (1 = most variable to 9 = highly conserved) for each amino acid in a protein sequence.	Sampled	Randomly split sequences into train/test with 95/5% probability.
Regression	Meltome	Predict the melting temperature, the temperature at which >50% of a protein is denatured.	Mixed Split	Protein sequences were clustered by sequence identity, with 80% of clusters used for training and 20% for testing. The mixed split uses the sequences in the training clusters for training and the representative sequence of each test cluster for testing. The objective is to minimize performance overestimation caused by large clusters in the test set.
Regression	GB1 Binding Activity	The impact of amino acid substitutions at one or more of four GB1 positions (V39, D40, G41, and V54) was measured in a binding assay. Values > 1 indicate more binding than wild type, values equal to 1 indicate equivalent binding, and values < 1 indicate less binding than wild type.	Two vs Rest	The training split includes the wild-type sequence and all single and double mutants. All other variants are placed in the test set.
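
The "Sampled" and "Two vs Rest" split definitions in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the FLIP implementation: the function names, the seed handling, and the encoding of a GB1 variant as a four-letter string covering positions 39, 40, 41, and 54 are assumptions made for this example.

```python
import random

# Wild-type residues at GB1 positions 39, 40, 41 and 54 (V39, D40, G41, V54).
# Encoding a variant as these four letters is an assumption for this sketch.
GB1_WILDTYPE = "VDGV"


def sampled_split(sequences, test_fraction=0.05, seed=0):
    """Send each sequence to test with probability `test_fraction`, else to train."""
    rng = random.Random(seed)
    train, test = [], []
    for seq in sequences:
        (test if rng.random() < test_fraction else train).append(seq)
    return train, test


def count_mutations(variant, wildtype=GB1_WILDTYPE):
    """Number of positions at which `variant` differs from the wild type."""
    if len(variant) != len(wildtype):
        raise ValueError("variant and wild type must have the same length")
    return sum(a != b for a, b in zip(variant, wildtype))


def two_vs_rest_split(variants):
    """Wild type, single and double mutants go to train; everything else to test."""
    train, test = [], []
    for variant in variants:
        (train if count_mutations(variant) <= 2 else test).append(variant)
    return train, test


train, test = two_vs_rest_split(["VDGV", "ADGV", "AAGV", "AAAV", "AAAA"])
print(train)  # ['VDGV', 'ADGV', 'AAGV']
print(test)   # ['AAAV', 'AAAA']
```

Because the sampled split draws each assignment independently, the realized train/test ratio is only approximately 95/5 for any finite dataset.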

Classification Metric Values

ESM models listed below are tested as deployed in BioNeMo.

Secondary Structure		Subcellular Localization (SCL)		Conservation
Model	Accuracy	Model	Accuracy	Model	Accuracy
One Hot	0.643	One Hot	0.386	One Hot	0.202
ESM1nv	0.773	ESM1nv	0.720	ESM1nv	0.249
ProtT5nv	0.793	ProtBERT	0.740	ProtT5nv	0.256
ProtBERT	0.818	ProtT5nv	0.764	ProtBERT	0.326
ProtT5	0.854	ESM2 T33 650M UR50D	0.791	ESM2 T33 650M UR50D	0.329
ESM2 T33 650M UR50D	0.855	ESM2 T36 3B UR50D	0.812	ESM2 T36 3B UR50D	0.337
ESM2 T36 3B UR50D	0.861	ProtT5	0.820	ESM2 T48 15B UR50D	0.340
ESM2 T48 15B UR50D	0.867	ESM2 T48 15B UR50D	0.839	ProtT5	0.343
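
For reference, the accuracy reported above is the fraction of correct predictions. The sketch below shows one way to compute it for per-residue tasks such as secondary structure and conservation. Pooling residues across all proteins before averaging is an assumption here; the source does not state whether accuracy is pooled or averaged per protein, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_residue_accuracy(true_labels, pred_labels):
    """Pool the residues of all proteins, then compute a single accuracy.

    Each element is one protein's label string, e.g. "HHEC" for
    helix, helix, sheet, coil.
    """
    flat_true = [label for protein in true_labels for label in protein]
    flat_pred = [label for protein in pred_labels for label in protein]
    return accuracy(flat_true, flat_pred)


# Two toy proteins: 5 of 6 residues are predicted correctly.
print(per_residue_accuracy(["HHEC", "CC"], ["HHCC", "CC"]))
```

For a per-protein task such as subcellular localization, `accuracy` can be applied directly to the lists of true and predicted locations.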

Regression Metric Values

Meltome		GB1 Binding Activity
Model	MSE	Model	MSE
One Hot	128.21	One Hot	2.56
ESM1nv	82.85	ProtT5	1.69
ProtT5nv	77.39	ESM2 T33 650M UR50D	1.67
ProtBERT	58.87	ESM2 T36 3B UR50D	1.64
ESM2 T33 650M UR50D	53.38	ProtBERT	1.61
ESM2 T36 3B UR50D	45.78	ProtT5nv	1.60
ProtT5	44.76	ESM1nv	1.58
ESM2 T48 15B UR50D	39.49	ESM2 T48 15B UR50D	1.52
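
The regression tables report mean squared error (MSE), where lower values are better. A minimal sketch of the calculation follows; the function name and the toy values are illustrative only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute MSE on empty input")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: only the last prediction is off, by 2, so MSE = 4 / 3.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because errors are squared, MSE is expressed in squared units of the target, so the Meltome and GB1 columns are not comparable with each other.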

SMILES Representation

Metric Definitions and Dataset

Type	Metric	Metric Definition	Dataset
Physchem Properties	Lipophilicity	MSE from the best-performing SVM and Random Forest models, as determined by hyperparameter optimization with 20-fold nested cross-validation.	MoleculeNet datasets: Lipophilicity (4,200 molecules), FreeSolv (642 molecules), ESOL (1,128 molecules)
	FreeSolv
	ESOL
Bioactivities	Activity		ExCAPE database filtered on a subset of protein targets (28 genes). The set of ligands for each target comprises one dataset, with the number of ligands ranging from 1,341 to 367,067 molecules (total = 1,203,479). A model is fit for each dataset and the resulting MSE values are averaged.
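
The bioactivity metric averages one MSE per target dataset. The sketch below illustrates that aggregation under the assumption that true and predicted values are already available for each target; the function names and the gene labels in the example are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error for one dataset."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_mse_over_targets(results):
    """Average the per-target MSE values.

    `results` maps a target name to a (y_true, y_pred) pair.
    Returns the unweighted mean and the per-target MSE values.
    """
    if not results:
        raise ValueError("no targets given")
    per_target = {name: mse(t, p) for name, (t, p) in results.items()}
    return sum(per_target.values()) / len(per_target), per_target


# Hypothetical targets with different numbers of ligands.
results = {
    "GENE_A": ([0.0, 0.0], [1.0, 1.0]),  # MSE = 1.0
    "GENE_B": ([0.0], [3.0]),            # MSE = 9.0
}
average, per_target = mean_mse_over_targets(results)
print(average)  # 5.0
```

With an unweighted mean, every target contributes equally, so a dataset with 1,341 ligands counts as much as one with 367,067.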

Metric Values

Type	Metric	SVM MSE	Random Forest MSE
Physchem Properties	Lipophilicity	0.491	0.811
	FreeSolv	1.991	4.832
	ESOL	0.474	0.862
Bioactivities	Activity	0.520	0.616

SMILES Generation

Metric Definitions and Dataset

Type	Metric	Metric Definition	Dataset
Sampling	Validity	Percentage of generated molecules that are valid SMILES, as determined by RDKit.	The dataset consists of 10k molecules randomly selected from ChEMBL that are not present in the training data for MoFlow or MegaMolBART and that pass the drug-likeness filters. For each of these seed molecules, sample 512 molecules from MoFlow with a temperature of 0.25. For MegaMolBART, sample 10 molecules with a radius of 1.0. For each seed molecule, calculate the metrics or properties, as defined, on its samples. The metric value is the percentage of molecules that meet the metric definition.
	Novelty	Percentage of valid molecules that are not present in training data and don’t match the seed molecule.
	Uniqueness	Percentage of valid molecules that are unique.
	NUV	Percentage of molecules generated which meet all sampling metrics (novelty, uniqueness, validity).
Drug-Likeness	QED	Quantitative estimate of drug-likeness.
	SAS	Synthetic accessibility score.
	Pass Filters	Fraction of valid molecules which meet all of the following drug-likeness criteria: (1) SAS between 2.0 and 4.0, inclusive; (2) QED >= 0.65; (3) Maximum ring size <= 6; (4) Number of rings >= 2; (5) No rings with fewer than 5 atoms.
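
The sampling metrics and the "Pass Filters" criteria above can be expressed as follows. This is a sketch under stated assumptions, not the benchmark code: the validity check is injected as a callable (in practice it could wrap RDKit's `Chem.MolFromSmiles`, which returns `None` for unparsable input), SMILES strings are assumed to be canonicalized so that string equality means molecular identity, and SAS, QED, and ring sizes are assumed to be precomputed. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


def sampling_metrics(samples, seed_smiles, training_set, is_valid):
    """Validity, novelty, uniqueness and NUV for the samples of one seed molecule."""
    total = len(samples)
    if total == 0:
        raise ValueError("no samples given")
    valid = [s for s in samples if is_valid(s)]
    # Novel: valid, absent from the training data, and different from the seed.
    novel = [s for s in valid if s not in training_set and s != seed_smiles]
    n_valid = len(valid)
    return {
        "validity": n_valid / total,
        "novelty": len(novel) / n_valid if n_valid else 0.0,
        "uniqueness": len(set(valid)) / n_valid if n_valid else 0.0,
        # NUV: distinct molecules that are both valid and novel, over all samples.
        "nuv": len(set(novel)) / total,
    }


@dataclass
class MoleculeProperties:
    sas: float             # synthetic accessibility score
    qed: float             # quantitative estimate of drug-likeness
    ring_sizes: List[int]  # number of atoms in each ring


def passes_filters(props):
    """Apply the five drug-likeness criteria from the table."""
    return (
        2.0 <= props.sas <= 4.0             # (1) SAS in [2.0, 4.0]
        and props.qed >= 0.65               # (2) QED >= 0.65
        and len(props.ring_sizes) >= 2      # (4) at least two rings
        and max(props.ring_sizes) <= 6      # (3) largest ring has at most 6 atoms
        and min(props.ring_sizes) >= 5      # (5) no ring with fewer than 5 atoms
    )


metrics = sampling_metrics(
    samples=["CCO", "CCO", "c1ccccc1", "bad", "CCN"],
    seed_smiles="CCN",
    training_set={"c1ccccc1"},
    is_valid=lambda s: s != "bad",  # stand-in for a real SMILES parser
)
print(metrics)  # {'validity': 0.8, 'novelty': 0.5, 'uniqueness': 0.75, 'nuv': 0.2}
print(passes_filters(MoleculeProperties(sas=3.0, qed=0.7, ring_sizes=[6, 5])))  # True
```

The ring-count check runs before `max` and `min` so that molecules without rings are rejected without raising an error.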

Metric Values

Type	Metric	MegaMolBART		MoFlow
		Mean	Standard Deviation	Mean	Standard Deviation
Sampling	Validity	0.819	0.034	1.000	0.000
	Novelty	1.000	0.000	1.000	0.000
	Uniqueness	0.513	0.069	0.841	0.190
	NUV	0.395	0.037	0.841	0.190
Drug-Likeness	QED	0.746	0.007	0.583	0.009
	SAS	2.654	0.204	4.150	0.254
	Pass Filters	0.766	0.074	0.215	0.020
