[COMMODE] a large-scale database of molecular descriptors using compounds from PubChem
© Dander et al.; licensee BioMed Central Ltd. 2013
Received: 30 August 2013
Accepted: 29 October 2013
Published: 13 November 2013
Molecular descriptors have been extensively used in the field of structure-oriented drug design and structural chemistry. They have been applied in QSPR and QSAR models to predict ADME-Tox properties, which specify essential features for drugs. Molecular descriptors capture chemical and structural information, but investigating their interpretation and meaning remains very challenging.
This paper introduces a large-scale database of molecular descriptors called COMMODE, containing more than 25 million compounds originating from PubChem. About 2500 DRAGON descriptors have been calculated for all compounds and integrated into this database, which is accessible through a web interface at http://commode.i-med.ac.at.
Groups of molecular descriptors
Name of the group
Number of descriptors
Walk and path counts
Edge adjacency indices
Burden eigenvalue descriptors
Topological charge indices
Functional group counts
2D binary fingerprints
2D frequency fingerprints
The main contribution of this paper is to provide a large-scale, publicly available online database containing over 25 million chemicals downloaded from the database PubChem [10, 11]. Our database, called COMMODE (COMpilation of MOlecular DEscriptors), provides a valuable source of descriptor data, which is usually not available at a large scale. The present paper can be regarded as an application note explaining the database as well as the underlying tool in brief. It would go far beyond the scope of the paper to explain the features and the application of COMMODE in depth. COMMODE also allows researchers to examine the interpretation or meaning of molecular descriptors.
In the context of topological descriptors [1, 3, 5, 6], this relates to exploring the structural interpretation of such measures. For example, branching, cyclicity, connectivity and symmetry are plausible interpretations of such indices that have been investigated, see [5, 12–15]. Nevertheless, a general framework for tackling this problem across a large number of available descriptors is still far from being available. We believe that COMMODE will prove to be useful to resolve this challenging task successfully. Also, researchers who are active in QSPR and QSAR [16, 17] and those developing chemometrics-driven models using descriptor data might utilize this database.
The paper is organized as follows. In the section ‘Implementation’, we explain the database scheme, and details of the data generation and integration process. The section ‘Results and discussion’ outlines the usage of our tool with a standard web browser. The article finishes with ‘Conclusions’.
Molecular and structural descriptors
Molecular descriptors encode certain information about chemicals. As a result, special classes of such measures have been developed to emphasize particular aspects of chemicals, e.g., atom types, bond types or structural properties. In particular, molecular descriptors have been proven essential for designing QSPR/QSAR models efficiently [16, 17]. In this application, we calculated descriptors using DRAGON. A class of descriptors that has been investigated extensively is that of topological (or structural) descriptors [5–8]. Clearly, this class itself can be divided into different subcategories such as graph entropy [8, 19] representing information-theoretic indices, eigenvalue-based measures [20, 21], distance-based measures [7, 22] and symmetry-based descriptors. Note that DRAGON has its own categorization of molecular descriptors and, for instance, it does not consider information-theoretic descriptors representing the structural information content of a chemical structure as topological descriptors.
From a practical point of view, molecular descriptors have been used extensively to predict melting and boiling points. Other chemical properties, such as those that are important in the drug design process, have also been modeled with molecular descriptors. Crucial properties can be ADME-Tox properties (absorption, distribution, metabolism, excretion, and toxicity) influencing different essential aspects of drugs. Examples of molecular descriptors influencing ADME-Tox properties are the octanol/water partition coefficient (LogP), the aqueous solubility description (LogS) and the blood-brain barrier permeation (LogBB).
Large-scale database of molecular descriptors
This section describes the MySQL database scheme and the process of data integration with Java routines, as well as the calculation of the molecular descriptors by using DRAGON. Moreover, we explain the web page, which provides access to the large-scale database and is essential for querying it, and the implemented descriptive analysis.
COMMODE contains about 25 million chemicals downloaded from the database PubChem and about 2,500 molecular descriptors, which have been calculated for all of those compounds using DRAGON. The database can be accessed through a PHP-based web application at http://commode.i-med.ac.at, where different queries can be started, files with compounds of interest can be uploaded, and results can be exported. Furthermore, additional descriptive values can be calculated and analyzed by means of statistical measures, correlation and uniqueness.
As we used compounds from PubChem, the first step was to download the complete set provided as SDF-files (Structure Data Format). Those files contain 25,051,770 compounds, which vary largely in their size and structure. The mass of the downloaded compounds ranges from 1 to 59,750 Dalton with a median value of 383 Dalton, whereby 45 compounds are heavier than 10,000 Dalton. The number of heavy atoms (anything other than hydrogen) also varies dramatically, between 0 and 576 with a median of 27. Other constitutional information about the compounds integrated in COMMODE includes the number of atoms (min=1, max=982, median=48), the number of bonds (min=0, max=990, median=50) and the number of rings (min=0, max=19, median=3) per molecule.
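Constitutional statistics of this kind can in principle be read directly from the SDF-files. The following is a minimal sketch in Python, not the Java routine used for COMMODE. It assumes V2000 records, in which the fourth line of each record (the counts line) holds the atom count in columns 1-3 and the bond count in columns 4-6, and in which each record ends with a line containing `$$$$`. The function names and the two-record sample are hypothetical and serve illustration only.

```python
from statistics import median


def iter_counts(sdf_text):
    """Yield (n_atoms, n_bonds) for each V2000 record in an SDF string."""
    record = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":            # end-of-record delimiter
            if len(record) >= 4:
                counts_line = record[3]        # 4th line of the record
                yield int(counts_line[0:3]), int(counts_line[3:6])
            record = []
        else:
            record.append(line)


def summarize(values):
    """Return (min, max, median) of an iterable of numbers."""
    values = list(values)
    return min(values), max(values), median(values)


# Hypothetical two-record sample (not taken from PubChem).
SAMPLE_SDF = "\n".join([
    "hypothetical-record-1", "  sketch", "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.5000    0.0000    0.0000 C   0  0",
    "    2.2500    1.3000    0.0000 O   0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END", "$$$$",
    "hypothetical-record-2", "  sketch", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 Br  0  0",
    "M  END", "$$$$",
])

counts = list(iter_counts(SAMPLE_SDF))
atom_stats = summarize(n_atoms for n_atoms, _ in counts)
bond_stats = summarize(n_bonds for _, n_bonds in counts)
print(atom_stats, bond_stats)
```

Reading only the counts line keeps the pass over a very large file set cheap; quantities such as molecular mass or ring counts would require parsing the atom and bond blocks as well.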
Integration of compounds
Computation and integration of molecular descriptors
After the successful completion of the compound integration, 2,489 molecular descriptors were calculated by applying DRAGON to all SDF-files. Table 1 lists all 15 calculated groups, their names, and their corresponding numbers of descriptors. DRAGON produces one file per input file and per group, so numerous files were created holding a huge number of positive real numbers. DRAGON was not able to calculate the descriptors for 950,688 compounds. These failures have different causes: the compound may consist of just one atom (e.g. hydron H+ (id 1038), bromide Br- (id 259)), or the downloaded molecule may represent an unconnected graph, for which those molecular descriptors cannot be computed (e.g. [3-(dimethylcarbamoyloxy)phenyl]-trimethyl-ammonium; methyl sulfate (id 5824)). To mark the compounds for which DRAGON fails to calculate molecular descriptors, the relation Error with the Boolean attribute error was introduced. As 3D information of the compounds is not available in the provided SDF-files, the corresponding 3D descriptors have not been calculated. To handle the large number of result files, an additional Java routine was developed, which integrates those files into the previously generated database relations, where the first attribute for each group contains the identifier from PubChem.
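The two failure causes named above, single-atom compounds and unconnected molecular graphs, can be detected before any descriptor calculation. The sketch below is a hypothetical pre-check in Python and does not reproduce DRAGON's internal error handling. It assumes that atoms are numbered from 0 and that bonds are given as pairs of atom indices; the function names are illustrative.

```python
from collections import deque


def is_connected(n_atoms, bonds):
    """Return True if the graph on atoms 0..n_atoms-1 is connected."""
    if n_atoms == 0:
        return False
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    # Breadth-first search starting from atom 0.
    seen = {0}
    queue = deque([0])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == n_atoms


def likely_problem(n_atoms, bonds):
    """Hypothetical pre-check for the two failure causes named in the text."""
    if n_atoms <= 1:
        return "single atom"
    if not is_connected(n_atoms, bonds):
        return "disconnected graph"
    return None


print(likely_problem(1, []))                 # a lone ion such as bromide
print(likely_problem(4, [(0, 1), (2, 3)]))   # two separate fragments, e.g. a salt
print(likely_problem(3, [(0, 1), (1, 2)]))   # one connected chain
```

A result of `None` only means that neither of these two causes applies; other errors may still occur during the actual calculation.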
Results and discussion
Views on the data
After querying COMMODE, various views displaying the results are provided. The first view shows a list with general information about the resulting compounds. Each single molecule can be selected from this list and explored. For this purpose, a view was implemented showing different names and values for each molecule as well as a link to the corresponding PubChem page. This view also shows a 2D and a 3D plot of the molecule derived from PubChem. The user can further see the values of all molecular descriptors for the compound of interest or the values of a single molecular descriptor for all compounds of the given query.
When analyzing data of molecular descriptors on a large scale, a statistical analysis is crucial. For example, this relates to estimating the correlations between descriptors to examine whether they capture chemical or structural information of compounds similarly. Note that this problem has already been tackled by Basak et al. and Todeschini et al., but we would like to emphasize that their analyses used only small subsets of compounds and descriptors. COMMODE now offers the opportunity to investigate this problem on a large scale without needing access to a stand-alone application that computes molecular descriptors. This might be particularly interesting for researchers who want to analyze properties of molecular descriptors by using existing compounds.
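As an illustration of such a correlation analysis, the following Python sketch computes the Pearson correlation coefficient between two descriptor columns. Which correlation measure COMMODE itself applies is not stated here, so the choice of Pearson's coefficient is an assumption, and the descriptor values are hypothetical placeholders.

```python
from math import isnan, sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")   # a constant descriptor has no defined correlation
    return sxy / sqrt(sxx * syy)


# Hypothetical values of two descriptors over the same four compounds.
descriptor_a = [1.0, 2.0, 3.0, 4.0]
descriptor_b = [2.0, 4.0, 6.0, 8.0]
r = pearson(descriptor_a, descriptor_b)
print(r)
```

A coefficient close to 1 or -1 suggests that two descriptors carry largely redundant information on the chosen compound set, whereas a value near 0 suggests that they capture different aspects.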
The usage of results within other applications is necessary for scientists, as a lot of downstream analysis can be performed on the integrated data. Therefore, the application supports the following export formats: SDF, SMILES, CSV, MS-Excel® and XML. The connection table of the SDF-file is converted from the stored SMILES code using the Chemistry Development Kit (CDK) and opencsv in a specific Java routine. The exported files can further be used in QSPR and QSAR models.
This work introduces a large chemical database containing chemical compound data and their corresponding molecular descriptor values. These molecular descriptors can be used in QSPR and QSAR models to predict different chemical parameters using the structure of the compounds, and are utilized in drug design.
The published database, COMMODE, includes more than 25 million compounds and about 2,500 computed descriptors. COMMODE thus clearly extends MOLEdb, which contains only 1,124 molecular descriptors and 234,773 molecules [39, 40]. To use our database in QSPR or QSAR models, compounds of interest can be queried either by using different search attributes or by providing a list of PubChem identifiers. Afterwards, results for molecular descriptors can be exported in different file formats. These results can further be combined with investigated attributes or properties of the given compounds. New models can be designed using these combinations, which can further be used to predict these attributes and properties for other compounds.
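Combining exported descriptor values with separately measured properties amounts to a join on the PubChem identifier. The Python sketch below illustrates this step for a CSV export. The column names `cid` and `desc_a`, the identifiers and all numeric values are hypothetical placeholders, since the exact layout of the exported files is not described here.

```python
import csv
import io


def join_on_id(descriptor_csv, properties, id_column="cid"):
    """Attach a measured property to each exported descriptor row.

    descriptor_csv: CSV text with one row per compound (assumed layout).
    properties: dict mapping a PubChem identifier (int) to a property value.
    Rows without a known property are skipped.
    """
    merged_rows = []
    for row in csv.DictReader(io.StringIO(descriptor_csv)):
        cid = int(row[id_column])
        if cid in properties:
            merged = dict(row)
            merged["property"] = properties[cid]
            merged_rows.append(merged)
    return merged_rows


# Hypothetical export and hypothetical property values.
EXPORT = "cid,desc_a\n10,1.5\n20,2.5\n30,3.5\n"
MEASURED = {10: 0.1, 30: 0.3}

for row in join_on_id(EXPORT, MEASURED):
    print(row["cid"], row["desc_a"], row["property"])
```

The merged rows form the kind of descriptor-property table from which a QSPR or QSAR model could then be fitted.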
As not all molecular descriptors are necessary for the downstream analysis, the introduced application is able to calculate descriptive values for each molecular descriptor, representing its discrimination power or the correlation coefficient between chosen descriptors.
An additional research area supported by COMMODE is the field of chemical graph theory. COMMODE can be used to analyze the chemical meaning of molecular descriptors. For this purpose, a descriptive analysis of all descriptors can be performed for all integrated compounds as well as on a particular subset. The degeneracy [36, 42, 43] of all computed and integrated descriptors can also be analyzed on different sets of compounds.
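A descriptor is degenerate on a compound set if different compounds receive the same value. One common way to quantify this is the fraction of compounds whose descriptor value is not shared with any other compound in the set. The Python sketch below computes this fraction; whether COMMODE uses exactly this definition is an assumption, and the rounding parameter `decimals`, which decides when two real values count as equal, is hypothetical.

```python
from collections import Counter


def uniqueness(values, decimals=6):
    """Fraction of compounds whose descriptor value occurs exactly once.

    Values are rounded to `decimals` places before comparison (assumption),
    so that tiny floating-point differences do not hide degeneracy.
    """
    if not values:
        raise ValueError("need at least one descriptor value")
    rounded = [round(v, decimals) for v in values]
    occurrences = Counter(rounded)
    degenerate = sum(1 for v in rounded if occurrences[v] > 1)
    return (len(rounded) - degenerate) / len(rounded)


# Hypothetical descriptor values for four compounds; two of them coincide.
print(uniqueness([1.0, 2.0, 2.0, 3.0]))
```

A value of 1 means that the descriptor distinguishes all compounds in the set, while a value of 0 means that every compound shares its value with at least one other compound.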
Overall, this novel database provides flexible access to compounds and their related molecular descriptors, which can be used in different research areas.
Availability and requirements
Project name: COMMODE (COMpilation of MOlecular DEscriptors)
Project home page: http://commode.i-med.ac.at
Operating system(s): Platform independent
Programming language: Java, PHP
Other requirements: Web Browser
Any restrictions to use by non-academics: none
This work was supported by the FFG project Oncotyrol. Matthias Dehmer thanks the Austrian Science Funds for supporting this work (project P22029-N13). Matthias Dehmer also gratefully acknowledges funding from the Standortagentur Tirol (formerly Tiroler Zukunftsstiftung).
- Kier LB, Hall LH: Molecular Connectivity in Chemistry and Drug Research. 1976, New York, USA: Academic Press
- Mazurie A, Bonchev D, Schwikowski B, Buck GA: Phylogenetic distances are encoded in networks of interacting pathways. Bioinformatics. 2008, 24 (22): 2579-2585. 10.1093/bioinformatics/btn503.
- Basak SC, Magnuson VR: Molecular topology and narcosis. Arzeim-Forsch/Drug Design. 1983, 33 (I): 501-503.
- Varmuza K, Demuth W, Karlovits M, Scsibrany H: Binary substructure descriptors for organic compounds. Croat Chem Acta. 2005, 78: 141-149.
- Dehmer M, Varmuza K, Borgert S, Emmert-Streib F: On entropy-based molecular descriptors: statistical analysis of real and synthetic chemical structures. J Chem Inf Model. 2009, 49: 1655-1663. 10.1021/ci900060x.
- Bonchev D, Rouvray DH: Complexity in Chemistry, Biology, and Ecology. 2005, New York, NY, USA: Mathematical and Computational Chemistry, Springer
- Todeschini R, Consonni V, Mannhold R: Handbook of Molecular Descriptors. 2002, Weinheim, Germany: Wiley-VCH
- Bonchev D: Information Theoretic Indices for Characterization of Chemical Structures. 1983, Chichester: Research Studies Press
- SRL T: Talete: Dragon. [http://www.talete.mi.it/products/dragon_description.htm]. Accessed: 11/2012.
- Bolton EE, Wang Y, Thiessen PA, Bryant SH: PubChem: Integrated platform of small molecules and biological activities. Annual Reports in Computational Chemistry, Volume 4. Edited by: Cornell W, Wang W, Barker N, Simmerling C, Madura JD, Cornell W. 2008, American Chemical Society
- NLM: The PubChem project. [http://pubchem.ncbi.nlm.nih.gov]. Accessed: 11/2012.
- Basak SC, Balaban AT, Grunwald GD, Gute BD: Topological indices: their nature and mutual relatedness. J Chem Inf Comput Sci. 2000, 40: 891-898. 10.1021/ci990114y.
- Dehmer M, Mowshowitz A: A history of graph entropy measures. Inform Sci. 2011, 1: 57-78.
- Devillers J, Balaban AT: Topological Indices and Related Descriptors in QSAR and QSPR. 1999, Amsterdam, The Netherlands: Gordon and Breach Science Publishers
- Nikolić S, Trinajstić N: Complexity of molecules. J Chem Inf Comput Sci. 2000, 40: 920-926. 10.1021/ci9901183.
- Bajorath J: Chemoinformatics: Concepts, Methods, and Tools for Drug Discovery. 2004, Totowa, NJ, USA: Methods in Molecular Biology, Humana Press
- Guha R: On the interpretation and interpretability of quantitative structure-activity relationship models. J Comput Aided Mol Des. 2008, 22 (12): 857-871. 10.1007/s10822-008-9240-5.
- Varmuza K, Filzmoser P: Introduction to Multivariate Statistical Analysis in Chemometrics. 2009, Boca Raton, FL, USA: Francis & Taylor, CRC Press
- Dehmer M: Information processing in complex networks: graph entropy and information functionals. Appl Math Comput. 2008, 201: 82-94. 10.1016/j.amc.2007.12.010.
- Dehmer M, Sivakumar L, Varmuza K: Uniquely discriminating molecular structures using novel eigenvalue-based descriptors. MATCH Commun Math Comp Chem. 2012, 67: 147-172.
- Estrada E: Characterization of the folding degree of proteins. Bioinformatics. 2002, 18: 697-704. 10.1093/bioinformatics/18.5.697.
- Skorobogatov VA, Dobrynin AA: Metrical analysis of graphs. Commun Math Comp Chem. 1988, 23: 105-155.
- Wiener H: Structural determination of paraffin boiling points. J Amer Chem Soc. 1947, 69: 17-20. 10.1021/ja01193a005.
- Talevi A, Goodarzi M, Ortiz EV, Duchowicz PR, Bellera CL, Pesce G, Castro EA, Bruno-Blanch LE: Prediction of drug intestinal absorption by new linear and non-linear QSPR. Euro J Med Chem. 2011, 46: 218-228. 10.1016/j.ejmech.2010.11.005.
- Platts JA, Oldfield SP, Reif MM, Palmucci A, Gabano E, Osella D: The RP-HPLC measurement and QSPR analysis of logPo/w values of several Pt(II) complexes. J Inorgan Biochem. 2006, 100 (7): 1199-1207. 10.1016/j.jinorgbio.2006.01.035.
- Duchowicz PR, Castro EA: QSPR Studies on aqueous solubilities of drug-like compounds. Int J Mol Sci. 2009, 10 (6): 2558-2577. 10.3390/ijms10062558.
- Fan Y, Unwalla R, Denny RA, Di L, Kerns EH, Diller DJ, Humblet C: Insights for predicting blood-brain barrier penetration of CNS targeted molecules using QSPR approaches. J Chem Inform Model. 2010, 50 (6): 1123-1133. 10.1021/ci900384c.
- Rudigier T: Analytical Molecular Database Search - Eine Web-Applikation zur Analyse molekularer Deskriptoren. 2011, Austria: Bachelor Thesis, UMIT
- Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, Laufer J: Description of several chemical structure file formats used by computer programs developed at molecular design limited. J Chem Inform Comput Sci. 1992, 32 (3): 244-255. 10.1021/ci00007a012.
- Oracle: MySQL: The world’s most popular open source database. [http://www.mysql.com]. Accessed: 11/2012.
- Gasteiger J, Engel T (Eds): Chemoinformatics: A Textbook. Chap. Representation of Chemical Compounds. 2008, Weinheim, Germany: WILEY-VCH, 401-437.
- Todeschini R, Cazar R, Collina E: The chemical meaning of topological indices. Chemomet Intell Laboratory Syst. 1992, 15: 51-59. 10.1016/0169-7439(92)80026-Z.
- Hu CY, Xu L: On highly discriminating molecular topological index. J Chem Inform Comput Sci. 1996, 36: 82-90. 10.1021/ci9501150.
- Diudea MV, Ilić A, Varmuza K, Dehmer M: Network analysis using a novel highly discriminating topological index. Complexity. 2011, 16: 32-39. 10.1002/cplx.20363.
- Konstantinova EV, Vidyuk MV: Discriminating tests of information and topological indices. Animals and trees. J Chem Inf Comput Sci. 2003, 43 (6): 1860-1871. 10.1021/ci025659y.
- Konstantinova E: Information-Theoretic Methods in Chemical Graph Theory. Towards an Information Theory of Complex Networks. Edited by: Dehmer M, Emmert-Streib F, Mehler A. 2011, Boston: Birkhäuser, 97-126.
- Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The chemistry development kit (CDK): an open-source java library for chemo- and Bioinformatics. J Chem Inform Comput Sci. 2003, 43 (2): 493-500. 10.1021/ci025584y.
- Smith G: opencsv. Accessed: 11/2012.
- Ballabio D, Manganaro A, Consonni V, Mauri A, Todeschini R: Introduction to MOLE DB - on-line molecular descriptors database. MATCH Commun Math Comput Chem. 2009, 62: 199-207.
- Ballabio D: MOLE db - Molecular Descriptors Data Base. [http://michem.disat.unimib.it/mole_db]. Accessed: 11/2012.
- Todeschini R, Cazar R, Collina E: The chemical meaning of topological indices. Chemomet and Intell Laboratory Syst. 1992, 15: 51-59. 10.1016/0169-7439(92)80026-Z.
- Dehmer M, Grabner M, Varmuza K: Information indices with high discriminative power for graphs. PLoS ONE. 2012, 7: e31214. 10.1371/journal.pone.0031214.
- Hunter PR, Gaston MA: Numerical index of the discriminatory ability of typing systems: an application of Simpson’s index of diversity. J Clin Microbiol. 1988, 26 (11): 2465-2466.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.