Comment

Antony Williams looks forward to a rich online resource of chemical reactions

Chemists are spoilt for choice, with hundreds of data sources online covering tens of millions of different chemical compounds. But accessing those resources has been far from straightforward. Presently, online data across the multiple public resources are contaminated, imperfect and in desperate need of validation and curation. ChemSpider, a free online database of chemicals and related information, has been active in this pursuit for over three years and has delivered an online environment in which the community can both deposit their own data and curate and annotate existing content on the database.

This crowdsourcing must continue to expand. Five years from now, most chemists will see ChemSpider as a trusted source of information after reviewing the contributions of their colleagues in the community and will be adding their own commentaries and conclusions. Errors in data inherited from other databases will have been cleared using both robotic and manual curation efforts, to the point that the only questionable data on the database will be from incorrect interpretations by scientists, and even these possibilities will have been labelled for attention.

While Wikipedia is not a platform for the public management of novel research, ChemSpider will assume exactly that role for chemists participating in the increasingly active area of Open Notebook Science, which is attracting growing attention from the funding agencies. Data will be exposed via web services and semantic protocols such as the Resource Description Framework, and will be accessible through chemistry-based ontologies.

The data will be used as the basis of model development, to act as reference data to underpin new chemistry experiments, and as feeds into a centralised data pool for the chemical and pharmaceutical industries. ChemSpider will be one of the founding bodies of a centralised approach to data validation and integration. It will be charged with elevating public domain chemistry resources from their present state to a point where those resources can be trusted and are freely accessible. The drug design process will make use of online databases of toxicogenomics data, predictive toxicity and other quantitative structure-activity relationship algorithms as well as metabonomics, protein and other databases of biological interest. All such public domain databases will be accessible via ChemSpider.

ChemSpider was acquired by the RSC in May 2009. In five years’ time, the RSC will have marked up its entire archive of articles with the latest and greatest semantic markup technology. Thousands of pairs of eyes will have searched and read the majority of the content in the course of their review and, using a simple user interface, will have validated millions of chemical entities and linked them into the ChemSpider database, making both the articles and the associated chemistry more ‘discoverable’.

Synthetic chemists will visit the online ChemSpider Synthetic Pages to search for reactions of interest as well as to deposit their own syntheses to share with the community. The majority of syntheses conducted in laboratories around the world today are unpublished, so the associated information regarding previous successful and failed reactions could help improve the success rate and yield of future syntheses. Every reaction will be linked to suppliers of the chemicals used in the synthesis, and one click will highlight availability and pricing across a list of commercial suppliers. One more click will allow alternative reactions to be triggered at a remote site, and ‘online synthesis requests’ will be feasible. While chemistry may have to catch up with the concept of ‘chemicals on demand’, the infrastructure will already exist to activate the process. Individual chemists as well as robotic systems can be part of the system responding to a Request for Synthesis (RFS) and will be able to manage their data on ChemSpider to share with customers.

In five years’ time, more scientific publishers will have joined an environment of data sharing and integration. Business models will surely change and morph as consumers of content demand enhanced delivery systems and approaches. The RSC is already leading the charge to expose more data, to integrate it more deeply into existing systems and to provide interfaces for producing mash-ups (web applications that combine data and functionality from more than one source) that deliver greater value.

The five-year plan may seem quite lofty but is very achievable. The primary challenge will only be met by engaging the community to participate further. Developing the world’s richest online resource of chemical reactions is not a platform issue; it is an issue of participation. I hope my vision of what ChemSpider will become in five years turns out to be limited. I hope we can achieve so much more.

Antony Williams is vice president of strategic development, ChemSpider