ZINC-15 Dataset#

The ZINC-15 is a free database of commercially-available compounds for virtual screening and was used for training. Approximately 1.54 Billion molecules (SMILES strings) were selected from tranches meeting the following constraints: molecular weight <= 500 Daltons, LogP <= 5, reactivity level was “reactive,” and purchasability was “annotated.” The compounds were filtered to ensure a maximum length of 512 characters. Train, validation, and test splits were randomly split as 99% / 0.5% / 0.5%.