Andreas Barth suggests that looking at research from a molecular viewpoint reveals the big picture

A scientist considering a research field might ask some of the following questions: ‘What opportunities exist? How large is the field? How much literature is available? How many compounds are known? Are there any new compounds that have not been investigated yet? How can I find potential research partners?’

Traditionally, these questions have been addressed using bibliometrics – citation and publication data. But this approach is far from perfect: citation counts are biased in several ways, and they are incomplete and error-prone. Furthermore, citations are difficult to categorise by subject and research topic.

An alternative is offered by looking instead at compounds. Just as the mention of relevant publications within a paper produces the citation data of bibliometrics, the mention of chemical compounds can be used in a similar way. 

Many fields of research, particularly in chemistry and materials, are heavily focused on chemical compounds – sometimes a single compound. The number of compounds in a particular compound class, and the associated publications, can therefore be a measure of research activity and scientific weight. Compounds are unambiguous, easy to categorise and can be accessed precisely via molecular and structural formulae or substructures. In short, instead of counting publications and citations, we count and map chemical compounds.  

Hot spots and gaps

By analysing research fields using chemistry databases and combining compound and literature information, we can obtain a richer and more useful picture of the research landscape.1,2 In such ‘chemical bibliometrics’, we can define a research field in various ways: by the elements of a compound, by chemical structure, or even by keywords. Using existing chemical databases with high-level intellectual indexing and standardisation, we can obtain a large, reliable set of chemical compounds that match the definition of the research field. The answer set can be further augmented with additional descriptors or by combining the compound set with other data, such as numeric properties from experiments or theoretical calculations.
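As a minimal sketch of the first kind of definition, a research field given by the elements of a compound, the following Python fragment selects matching compounds from a small, made-up list of molecular formulae. The function names, the formula list and the simple parser (which ignores brackets and hydrates) are illustrative assumptions, not the query interface of any real chemical database.

```python
import re

# An element symbol (capital letter, optional lowercase letter) followed by
# an optional count, e.g. "Ba2" or "O7". Brackets and hydrates are not
# handled in this simplified, hypothetical parser.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*\.?\d*)")


def elements_of(formula):
    """Return the set of element symbols found in a simple formula."""
    return {symbol for symbol, _count in _TOKEN.findall(formula)}


def in_field(formula, required, n_elements):
    """True if the formula contains every required element and exactly
    n_elements different elements in total."""
    found = elements_of(formula)
    return required <= found and len(found) == n_elements


# Made-up input list, for illustration only.
formulas = ["YBa2Cu3O7", "La2CuO4", "NaCl", "LaSrCuO4", "MgB2"]

# Research field: copper oxides with exactly four different elements.
quaternary_cuprates = [f for f in formulas if in_field(f, {"Cu", "O"}, 4)]
```

In practice the formulae would come from an indexed database rather than a hand-written list, and a structure or keyword definition would need a different kind of query.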

In our own work, we have used the compound (Registry℠) and literature (CAplus℠) databases of Chemical Abstracts Service (CAS). These are the most comprehensive chemical databases, containing records for chemical compounds identified by CAS since 1907, as well as the corresponding publications. Querying these data intelligently allows chemical bibliometric information to be extracted.

Counting compounds instead of publications and citations opens new perspectives for data-based scientific discovery and it can complement and stimulate both experimental and theoretical research. The compound map of a research field can be used to identify hot spots, where research activity is high, but also gaps or white areas, where opportunities for new research exist.

For example, we know that most high-temperature superconductors belong to the class of rare earth cuprates. Restricting ourselves to compounds with four different elements, we obtain more than 65,000 publications (including patents) that refer to more than 6000 specific compounds. Almost half of the literature deals with a single compound: YBa₂Cu₃O₇. Mapping the compounds shows that the major focus of research has been confined to a few element combinations: Ba with Y, La, Pr or Nd; and Sr with La. For most other element combinations only a few investigations have been published, and there are still gaps, for instance Be and Mg compounds.2
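The mapping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records and their counts are invented toy data, not results from the study, and the function names are hypothetical. Each record stands for one compound, reduced to its alkaline earth and rare earth element.

```python
from collections import Counter
from itertools import product

ALKALINE_EARTHS = ["Be", "Mg", "Ca", "Sr", "Ba"]
RARE_EARTHS = ["Y", "La", "Pr", "Nd"]

# Invented toy records, one per compound: (alkaline earth, rare earth).
compounds = [
    ("Ba", "Y"), ("Ba", "Y"), ("Ba", "Y"),
    ("Ba", "La"), ("Ba", "Pr"), ("Ba", "Nd"),
    ("Sr", "La"), ("Sr", "La"), ("Ca", "Y"),
]


def compound_map(records):
    """Count the compounds found for each element combination."""
    return Counter(records)


def hot_spots(counts, threshold):
    """Element combinations with at least `threshold` compounds."""
    return sorted(pair for pair, n in counts.items() if n >= threshold)


def gaps(counts, first, second):
    """Element combinations for which no compound is recorded."""
    return [pair for pair in product(first, second) if counts[pair] == 0]


counts = compound_map(compounds)
busy = hot_spots(counts, threshold=2)
white_areas = gaps(counts, ALKALINE_EARTHS, RARE_EARTHS)
```

With this toy input, every Be and Mg combination appears among the white areas, which mirrors the kind of gap described above. The threshold for a hot spot is an arbitrary choice that would have to be tuned to the size of the real data set.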

For certain applications it is also possible to use comprehensive databases that focus on a given research field, such as the Inorganic Crystal Structure Database (ICSD) or the Cambridge Structural Database (CSD). In a recent review, several applications in computational materials design were summarised based on a database search in ICSD.3 The authors combined calculations of thermodynamic and electronic properties with database search and analysis technologies, obtaining tables of materials ordered by a set of physical property descriptors. These can be used as the basis for new materials design. The chemical bibliometrics approach is less sophisticated than this, but much broader, as it can be applied to all fields of chemical and materials research.

Stepping back

The power of chemical bibliometrics lies in the unique perspective it offers. A good analogy is aerial archaeology, where the broader view from the air reveals patterns that are invisible to an observer on the ground.

New methods for analysis and visualisation of large amounts of data (‘big data’) are becoming more and more important. The volume of data being generated by researchers is increasing enormously, so it is vital to develop new methods that can manage these large datasets. Reading every publication in a given research field is no longer possible. Indeed, people have complained about the overflow of information since the beginning of science publishing. I believe that there will be a paradigm shift from accumulating knowledge by reading to creating knowledge by search and analysis. The decisive precondition is the combination of information needs with advanced search options. The new paradigm will focus on pattern recognition, and scientists are well accustomed to this approach.

Andreas Barth is head of business development at FIZ Karlsruhe, Germany