Similarity Prediction

Donated on 10/28/2022

Molecular similarity assessments by expert chemists. Useful for the prediction of molecular similarity evaluations by humans.

Dataset Characteristics

Tabular, Image

Subject Area

Physical Sciences

For what purpose was the dataset created?

Molecular similarity is a broad topic with implications in several areas of chemistry. Its roots lie in the paradigm that ‘similar molecules have similar properties’. For this reason, methods for determining molecular similarity are widely applied in pharmaceutical companies, e.g., in the context of structure-activity relationships. Similarity evaluation is also used in chemical legislation, specifically in the procedure that judges whether a new molecule can obtain orphan drug status, with the consequent financial benefits. For this procedure, the European Medicines Agency relies on experts’ judgments. Because the perception of similarity depends on the observer, models that reproduce human perception are useful. Models built on the dataset can reduce or assist human effort in future evaluations.

Who funded the creation of the dataset?

The dataset was created by Enrico Gandini during his PhD at Università degli Studi di Milano.

What do the instances in this dataset represent?

Two CSV files containing the similarity assessments, the SMILES representations of the molecules, and the molecular descriptors described in the paper. Also included are the 2D and 3D pictures shown to the experts for similarity evaluation.
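As a rough illustration of how such a file could be read, the sketch below parses an in-memory CSV with Python's standard `csv` module. The column names (`id_pair`, `smiles_a`, `smiles_b`, `frac_similar`) and the two rows are hypothetical placeholders, not the actual headers or contents of the dataset files; check the real files for the correct names.

```python
import csv
import io

# Hypothetical layout: these column names and rows are invented for
# illustration and are NOT taken from the dataset files.
SAMPLE_CSV = """id_pair,smiles_a,smiles_b,frac_similar
1,CCO,CCN,0.75
2,c1ccccc1,C1CCCCC1,0.20
"""


def load_pairs(handle):
    """Read molecule pairs from a CSV file handle into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle):
        # Convert the (assumed) numeric assessment column from text to float.
        row["frac_similar"] = float(row["frac_similar"])
        rows.append(row)
    return rows


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
```

With a real file, the same function could be called on `open(path, newline="")` instead of the `io.StringIO` wrapper.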

Are there recommended data splits?

In the paper, the original dataset was used to test the models built on the new dataset, and vice versa. New models can take advantage of a combination of the two datasets.
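The train-on-one, test-on-the-other scheme can be sketched as follows. This is a minimal illustration only: the one-feature threshold classifier is a stand-in, not the models from the paper, and the `(feature, label)` rows are toy values invented for the example.

```python
from statistics import mean


def fit_threshold(rows):
    """Fit a one-feature classifier: the midpoint between the class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (mean(pos) + mean(neg)) / 2


def accuracy(threshold, rows):
    """Fraction of rows where (feature >= threshold) matches the label."""
    hits = sum((x >= threshold) == bool(y) for x, y in rows)
    return hits / len(rows)


def cross_dataset_eval(original, new):
    """Train on one part and test on the other, in both directions."""
    return {
        "train_new_test_original": accuracy(fit_threshold(new), original),
        "train_original_test_new": accuracy(fit_threshold(original), new),
    }


# Toy (feature, label) rows, invented for illustration only.
original = [(0.9, 1), (0.8, 1), (0.3, 0), (0.2, 0)]
new = [(0.7, 1), (0.6, 1), (0.4, 0), (0.1, 0)]
scores = cross_dataset_eval(original, new)
```

Keeping the two directions as separate scores makes it easy to see whether a model transfers equally well both ways before deciding to pool the two parts.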

Was there any data preprocessing performed?

Standardized SMILES representations were obtained with RDKit and MolVS. Molecular descriptors were calculated with KNIME, RDKit, and the OpenEye tools Omega and ROCS.

Has the dataset been used for any tasks already?

The dataset was used to build the models described in the paper.

Additional Information

The dataset is composed of two parts: the original dataset and the new dataset. The molecules from the original dataset were standardized and processed in the same way as the molecules in the new dataset, as described in the paper, which is why both parts are included in this dataset.

Citation Requests/Acknowledgements

Gandini, Enrico, Gilles Marcou, Fanny Bonachera, Alexandre Varnek, Stefano Pieraccini, and Maurizio Sironi. 2022. "Molecular Similarity Perception Based on Machine-Learning Models." International Journal of Molecular Sciences 23, no. 11: 6114.

Chemistry, Cheminformatics, Molecular Similarity, Small Molecule


Enrico Gandini


This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.
