Model Benchmarks
Contents
Model Benchmarks#
Protein Sequence Representation#
Metrics and datasets are from FLIP.
Metric |
Dataset |
|||
---|---|---|---|---|
Type |
Name |
Definition |
Split Name |
Split Definition |
Classification |
Secondary Structure |
Predict one of three secondary structure classes (helix, sheet, coil) for each amino acid in a protein sequence. |
Sampled |
Randomly split sequences into train/test with 95/5% probability. |
Classification |
Subcellular Localization (SCL) |
For each protein, predict one of ten subcellular locations (cytoplasm, nucleus, cell membrane, mitochondrion, endoplasmic reticulum, lysosome/vacuole, golgi apparatus, peroxisome, extracellular, and plastid). |
Mixed Soft |
The mixed soft split uses train, validation, and test splits as provided in the DeepLoc 1.0 publication. |
Classification |
Conservation |
Predict one of nine possible conservation classes (1 = most variable to 9 = highly conserved) for each amino acid in a protein sequence |
Sampled |
Randomly split sequences into train/test with 95/5% probability. |
Regression |
Meltome |
Predict melting degree, which is the temperature at which >50% of a protein is denatured. |
Mixed Split |
Protein sequences were clustered by seq identity with 80% of clusters used for training, 20% for testing. The mixed split uses sequences from clusters for training and the representative cluster sequence for testing. The objective is to minimize performance overestimation on large clusters in the test set. |
Regression |
GB1 Binding Activity |
The impact of amino acid substitutions for one or more of four GB1 positions (V39, D40, G41, and V54) was measured in a binding assay. Values > 1 indicate more binding than wildtype, equal to 1 indicate equivalent binding, and < 1 indicate less binding than wildtype. |
Two vs Rest |
The training split includes wild type sequence and all single and double mutations. Everything else is put into the test set. |
Classification Metric Values
ESM models listed below are tested as deployed in BioNeMo.
Secondary Structure |
Subcellular Localization (SCL) |
Conservation |
|||
---|---|---|---|---|---|
Model |
Accuracy |
Model |
Accuracy |
Model |
Accuracy |
One Hot |
0.643 |
One Hot |
0.386 |
One Hot |
0.202 |
ESM1nv |
0.773 |
ESM1nv |
0.720 |
ESM1nv |
0.249 |
ProtT5nv |
0.793 |
ProtBERT |
0.740 |
ProtT5nv |
0.256 |
ProtBERT |
0.818 |
ProtT5nv |
0.764 |
ProtBERT |
0.326 |
ProtT5 |
0.854 |
ESM2 T33 650M UR50D |
0.791 |
ESM2 T33 650M UR50D |
0.329 |
ESM2 T33 650M UR50D |
0.855 |
ESM2 T36 3B UR50D |
0.812 |
ESM2 T36 3B UR50D |
0.337 |
ESM2 T36 3B UR50D |
0.861 |
ProtT5 |
0.820 |
ESM2 T48 15B UR50D |
0.340 |
ESM2 T48 15B UR50D |
0.867 |
ESM2 T48 15B UR50D |
0.839 |
ProtT5 |
0.343 |
Regression Metric Values
Meltome |
GB1 Binding Activity |
||
---|---|---|---|
Model |
MSE |
Model |
MSE |
One Hot |
128.21 |
One Hot |
2.56 |
ESM1nv |
82.85 |
ProtT5 |
1.69 |
ProtT5nv |
77.39 |
ESM2 T33 650M UR50D |
1.67 |
ProtBERT |
58.87 |
ESM2 T36 3B UR50D |
1.64 |
ESM2 T33 650M UR50D |
53.38 |
ProtBERT |
1.61 |
ESM2 T36 3B UR50D |
45.78 |
ProtT5nv |
1.60 |
ProtT5 |
44.76 |
ESM1nv |
1.58 |
ESM2 T48 15B UR50D |
39.49 |
ESM2 T48 15B UR50D |
1.52 |
SMILES Representation#
Metric Definitions and Dataset
Type |
Metric |
Metric Definition |
Dataset |
---|---|---|---|
Physchem Properties |
Lipophilicity |
MSE from best performing SVM and Random Forest model, as determined by hyperparameter optimization with 20-fold nested cross-validation. |
MoleculeNet datasets: Lipophilicity: 4,200 molecules FreeSolv: 642 molecules ESOL: 1,128 molecules |
FreeSolv |
|||
ESOL |
|||
Bioactivities |
Activity |
ExCAPE database filtered on a subset of protein targets (28 genes). The set of ligands for each target comprise one dataset, with the number of ligands ranging from 1,341 to 367,067 molecules (total = 1,203,479). A model is fit for each dataset and the resulting MSE values are averaged. |
Metric Values
Type |
Metric |
SVM MSE |
Random Forest MSE |
---|---|---|---|
Physchem Properties |
Lipophilicity |
0.491 |
0.811 |
FreeSolv |
1.991 |
4.832 |
|
ESOL |
0.474 |
0.862 |
|
Bioactivities |
Activity |
0.520 |
0.616 |
SMILES Generation#
Metric Definitions and Dataset
Type |
Metric |
Metric Definition |
Dataset |
---|---|---|---|
Sampling |
Validity |
Percentage of molecules generated which are valid SMILES, as determined by RDKit. |
The dataset was 10k molecules randomly selected from ChEMBL that are not present in the training data for MoFlow or MegaMolBART and pass drug-likeness filters. For each of these seed molecules, sample 512 molecules from MoFlow with a temperature of 0.25. For MegaMolBART, sample 10 molecules with a radius of 1.0. For each seed molecule, calculate metric or properties as described on its samples. The metric value is the percentage of molecules which meet the metric definition. |
Novelty |
Percentage of valid molecules that are not present in training data and don’t match the seed molecule. |
||
Uniqueness |
Percentage of valid molecules that are unique. |
||
NUV |
Percentage of molecules generated which meet all sampling metrics (novelty, uniqueness, validity). |
||
Drug-Likeness |
QED |
Quantitative estimate of drug-likeness. |
|
SAS |
Synthetic accessibility score. |
||
Pass Filters |
Fraction of valid molecules which meet all of the following drug-likeness criteria: (1) SAS between 2.0 and 4.0, inclusive; (2) QED >= 0.65; (3) Maximum ring size <= 6; (4) Number of rings >= 2; (5) No rings with fewer than 5 atoms. |
Metric Values
Type |
Metric |
MegaMolBART |
MoFlow |
||
---|---|---|---|---|---|
Mean |
Standard Deviation |
Mean |
Standard Deviation |
||
Sampling |
Validity |
0.819 |
0.034 |
1.000 |
0.000 |
Novelty |
1.000 |
0.000 |
1.000 |
0.000 |
|
Uniqueness |
0.513 |
0.069 |
0.841 |
0.190 |
|
NUV |
0.395 |
0.037 |
0.841 |
0.190 |
|
Drug-Likeness |
QED |
0.746 |
0.007 |
0.583 |
0.009 |
SAS |
2.654 |
0.204 |
4.150 |
0.254 |
|
Pass Filters |
0.766 |
0.074 |
0.215 |
0.020 |