The goal of chemical informatics is to make chemical data available to everyone. Kira Weissman looks at how this rapidly growing field is developing

Imagine you work in a leading pharmaceutical company, and you need to find a small molecule ligand that will bind to and inhibit a specific nuclear receptor implicated in breast cancer. You don’t know the structure of the receptor, but you have a small collection of molecules with demonstrated activity against it, although some nasty side effects make them undesirable medicines. Instead of heading to the lab, you turn on your computer.

Using your PC, you identify the parts of these molecules that are important for their function, and then, in your company’s vast chemical database, you locate other compounds that should exhibit the same activity but with better therapeutic profiles. You are also able to design new molecules from scratch. You go to the lab and synthesise the next wonder drug, saving millions of lives and earning billions of dollars. Sound fanciful? In fact, this approach is used within the drug industry every day.

The rapidly growing field of chemical informatics is at the heart of this scenario. Chemical informatics is all about data. It is an attempt to collect, verify and distribute on the internet all of the data that exist on some 34 million known small molecules, and on any future molecules. The aim is to make these data usable for any application and by anyone, not just chemists. Data in this sense refer not only to a molecule’s structure and physical properties, but also to details of its synthesis, reactivity and biological activity.

The gold standard in informatics has been established by the biological sciences. Biology boasts a culture of free and open access to methods and materials, promoted by a dedicated club of bioinformaticians over 11 000 members strong. In the chemical community, however, a different ethos has so far prevailed, with a huge amount of money still to be made selling chemical data, particularly within the pharmaceutical industry.

Currently, if you want to know something about a particular chemical, you have a few options. You can collect the relevant papers in which various properties of your molecule are reported, or you can hope that someone has included your specific compound in a database, although you may have to pay to access the information. It is obvious that a single, free and user-friendly database containing all validated information on small molecules would not only accelerate this process, but would be an enormous boon to those who cannot afford to pay for their data.

There would also be a significant change to the way we use data. The emphasis on publishing data in journals, and the pressure on space, mean that papers only report ’good’ data, so a great deal of useful information never sees the light of day. A general repository for chemical information would enable researchers to publish the ’bad’ data along with the good, saving other scientists from repeating their mistakes. The role of supplementary information would become increasingly important, particularly if coupled with efficient data search methods.

More importantly, gathering this information would significantly improve our ability to do chemistry in silico. These advances could have enormous benefits across many research and development areas, including modelling new reactions, planning more efficient syntheses, and designing better drugs (see box).

One of the goals for chemical informatics is to build an open-access chemical grid, modelled on Tim Berners-Lee’s Semantic Web: an internet environment in which knowledge is instantly available in a form that both people and computers can use to make decisions. The chemical grid aims to link the vast computational power of many PCs all over the world, delivering the best information from the people who are producing it to the people who want it, in the shortest amount of time. In short, the grid aims to make computers an integral part of chemical science.

Chemical informatics faces numerous challenges - conceptual, logistical and technological. One significant issue is how to collect the existing data, and to keep up with it as it continues to accumulate. Although academic groups publish a large proportion of data in online journals, much of it ends up in a form that is no longer computer searchable - less than one per cent of published chemical properties are openly available in an electronic form. And much has been lost because it was deemed ’unpublishable’ in the first place.

Secrecy is also a big problem. An enormous amount of data is locked away within pharmaceutical patents and licences, so the trick will be to convince these companies that sharing data is in everyone’s best interests. Even if all the data were accessible, there needs to be a way to filter it to improve its quality and validate it so the end-user can trust its reliability.

A much more fundamental challenge is to develop common representations of data and metadata - ’data about data’ that describe the content, quality, condition and other characteristics of data - which any computer can understand. The existence of a global chemical grid depends on developing universal semantics for chemistry. Creating such a language, however, is a significant challenge. Whereas the data handled by bioinformaticians consist primarily of protein and gene sequences, and so are fairly homogeneous, chemical data come in all sorts of flavours: experimental, process and computational.

The difficulty is illustrated by the current approach to chemical nomenclature. There are three primary naming conventions, each promoted by a particular institution: the International Union of Pure and Applied Chemistry (IUPAC), Beilstein and the Chemical Abstracts Service. The result is that a single compound can have three acceptable names, or a single name can apply to more than one compound. Using these systems is also difficult, as evidenced by the existence of several commercial naming services, such as Advanced Chemistry Development.

In addition, achieving grid-based chemistry will require levels of data acquisition, remote interaction and control, computation and visualisation that are well beyond the capabilities of existing computational chemistry programs. And the rate at which data can now be generated, for example in combinatorial chemistry experiments, means that we will lose a significant amount of information unless a much greater proportion of data collection and analysis can be automated.

Meeting these challenges keeps chemical informaticians busy. In the short term, academic research groups, dedicated informatics institutes such as the Unilever Centre for Molecular Informatics at the University of Cambridge and private companies are working to create tools and standards so that existing molecular databases can be better accessed, managed and understood.

Already, large data banks, such as ChemBank, the ChemExper Chemical Directory and ChemIDplus, have evolved from collections of structural diagrams into web-based search engines, offering users multiple ways to locate their target molecules. Recently, researchers have started to define what it means for molecules to be ’similar’ (eg in their shapes or electronic properties), and this type of ’fuzzy’ searching is being incorporated into automated retrieval systems.
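As a rough illustration, the sketch below scores two molecules for similarity using the open-source RDKit toolkit (not mentioned above); the fingerprints, the example structures and the 0-1 Tanimoto score are just one common way of making ’similar’ quantitative, not the method used by any particular database.

```python
# A minimal sketch of 'fuzzy' similarity searching, assuming the open-source
# RDKit toolkit and two illustrative molecules (aspirin and salicylic acid)
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")       # aspirin
candidate = Chem.MolFromSmiles("Oc1ccccc1C(=O)O")          # salicylic acid

# Encode each structure as a circular (Morgan) fingerprint ...
fp_query = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)
fp_candidate = AllChem.GetMorganFingerprintAsBitVect(candidate, 2, nBits=2048)

# ... and score their resemblance on a 0-1 scale with the Tanimoto coefficient
print(DataStructs.TanimotoSimilarity(fp_query, fp_candidate))
```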

Collecting new data has also become easier, as informaticians have created robotic programs that are capable of reading journal articles, extracting the important bits of data and depositing the information in a searchable database. Decades-old data can now be revisited and mined for useful information. This is only possible because the molecular data in chemistry papers are presented in a way that makes them recognisable to machines. However, other types of information, such as descriptions of synthetic procedures and chemical diagrams, cannot yet be effectively analysed, and so for the moment, people will still have to read the literature.

Other researchers are developing ways to improve data accuracy, to increase the reliability of all newly published data. A team of scientists from the Unilever Centre, including Jonathan Goodman and Peter Murray-Rust, together with Richard Kidd at the RSC, has developed an experimental data checker. It extracts data from a paper and analyses it for self-consistency and acceptable ranges, to reveal subtle errors that authors often miss (for example, disagreement between the number of carbons predicted by the molecular formula and the number actually observed in the 13C NMR spectrum).
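In the same spirit (but not the actual tool), the carbon-count check could be sketched in a few lines of Python; the formula and chemical shifts below are purely illustrative.

```python
# Illustrative check: do the reported 13C signals make sense for the formula?
import re

def carbon_count(formula: str) -> int:
    """Count carbon atoms in a simple molecular formula such as 'C9H8O4'."""
    match = re.search(r"C(?![a-z])(\d*)", formula)   # 'C' but not 'Cl', 'Ca'...
    return (int(match.group(1)) if match.group(1) else 1) if match else 0

formula = "C9H8O4"                                    # aspirin, as an example
reported_13c_shifts = [170.3, 169.7, 151.3, 134.8, 126.2,
                       124.1, 122.3, 121.8, 20.9]     # ppm, as listed in a paper

n_carbons = carbon_count(formula)
n_signals = len(reported_13c_shifts)

# Symmetry can merge signals, so fewer peaks than carbons is acceptable,
# but more peaks than carbons points to an error somewhere in the paper
if n_signals > n_carbons:
    print(f"Inconsistent: {n_signals} signals but only {n_carbons} carbons")
else:
    print(f"Plausible: {n_signals} signals for {n_carbons} carbons")
```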

Scientists are also tackling the much more basic problem of how to represent the different types of data in a consistent way. Chemists typically display molecules as two-dimensional pictures using programs such as ChemDraw or ChemSketch, but these simple representations do not convey a significant amount of information about the structures - atomic coordinates, spectral properties, conformational flexibility, electronic properties... the list is long. Where drawing programs fail, a new language, CML, or chemical mark-up language, could succeed.

CML is a chemistry derivative of XML, the eXtensible Mark-up Language. XML provides a mechanism to precisely identify information (’structure’) in many different data formats, including text and mathematical equations, and describe its use (’semantics’) and meaning (’ontology’) (ie the metadata). It also requires different types of data to conform to standard formats and to include estimates of uncertainty, so it can be used to validate information. XML was created so that highly structured documents could be accessed over the web, without losing any content.

To enable XML to convey the complexities of chemical data, collaborators Murray-Rust at the Unilever Centre and Henry Rzepa at Imperial College London, UK, have been developing CML within the XML format. CML should allow any type of chemical information - connectivity, reactivity, and spectral and structural data - to be transported across the internet. CML’s sister language, computational chemistry mark-up language (CCML), should do the same for the input, control and output of chemical calculations.
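To give a flavour of what such mark-up looks like, the sketch below builds a much-simplified, CML-style description of methanol with Python’s standard ElementTree library; the real CML schema is richer (namespaces, coordinates, spectra and more), so treat the element names here as an approximation rather than the definitive format.

```python
# A simplified, CML-style markup of methanol (approximate, not the full schema)
import xml.etree.ElementTree as ET

molecule = ET.Element("molecule", id="methanol")

# List every atom with an identifier and its element type
atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "O"), ("a3", "H"),
                         ("a4", "H"), ("a5", "H"), ("a6", "H")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

# Describe the connectivity as pairs of atom references
bond_array = ET.SubElement(molecule, "bondArray")
for refs in ["a1 a2", "a1 a3", "a1 a4", "a1 a5", "a2 a6"]:
    ET.SubElement(bond_array, "bond", atomRefs2=refs, order="1")

# Serialise to text that both people and programs can parse
print(ET.tostring(molecule, encoding="unicode"))
```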

The researchers have also developed a Java-based browser called Jumbo. This translates chemically marked-up molecules into more user-friendly forms, such as chemical structures displayed on screen. CML version 1.0 was developed in 2001, and is being used by a wide range of organisations, including patent agencies, publishers, government agencies, and software manufacturers.

Murray-Rust and Rzepa have also brought the goal of an open-source chemical database significantly closer, by creating the World Wide Molecular Matrix (WWMM). The WWMM is a peer-to-peer data repository that contains and manages molecular information and metadata entirely in XML and CML. Scientists - or, better still, robots - can now upload their data directly into the matrix, instantly sharing their information with other researchers.

To solve the problem of how to name molecules in the WWMM, each chemical substance is given a unique identifier by a computer program called INChI (the ’IUPAC/NIST Chemical Identifier’; NIST is the US National Institute of Standards and Technology) that is under development at NIST. The molecular input is read by INChI, which then determines the connections between the individual atoms and summarises them in a ’connection table’. It then assigns a unique numbering to the atoms, which becomes the name for the molecule in CML.

The INChI identifier not only codifies a molecule’s formula, but can also store information about possible tautomers, charges on fragments, and stereochemistry around double bonds or chiral centres.
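As a small illustration of what a unique identifier buys you, the sketch below uses the open-source RDKit toolkit (not the NIST program itself) to show that two different ways of writing the same molecule collapse to one identifier string; the modern InChI generated by RDKit is the descendant of the project described here.

```python
# Two different SMILES drawings of aspirin give the same canonical identifier
from rdkit import Chem

a = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
b = Chem.MolFromSmiles("OC(=O)c1ccccc1OC(C)=O")

inchi_a = Chem.MolToInchi(a)   # encodes formula, connectivity, protonation...
inchi_b = Chem.MolToInchi(b)

print(inchi_a)
print(inchi_a == inchi_b)      # True: one molecule, one name
```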

The equivalent project for spectral data is called JCAMP, and protocols are available for many types of spectra. INChI and JCAMP should become widely used in the chemical community if they are approved by organisations such as IUPAC, the RSC and the American Chemical Society.

The WWMM could be a substantial step towards the chemical grid, if people use it in the way it is intended. Murray-Rust, Rzepa and their colleagues believe the WWMM should be the first, not the last, place that new data are published. The data, both primary and supplementary, also need to be in a machine-readable form (eg XML) so the information can automatically be reused by other people.

Grid chemistry will also ultimately depend on molecular scientists offering free computational services to the community, which rarely happens at present. The technology exists, so it is now up to the chemical community to further this vision.

The potential power of the grid is illustrated by the Screensaver Lifesaver project, a joint effort between Oxford University’s Centre for Computational Drug Discovery, UK, the US-based company United Devices, and Intel. It is the world’s largest computational project and is currently using the screensaver time of over 2.6 million computers worldwide (320 000 years of central processing unit power) to evaluate 3.5bn different molecules for their cancer-fighting potential.

One of these computers - maybe even yours - might discover the next cure for cancer.

Kira Weissman is a Dorothy Hodgkin research fellow in the department of biochemistry, University of Cambridge, UK.

Further Reading

  • S Aldridge, Chem. Commun., 2002, 2745
  • S E Adams et al, Org. Biomol. Chem., 2004, 2, 3067
  • P Murray-Rust et al, Org. Biomol. Chem., 2004, 2, 3192
  • J A Townsend et al, Org. Biomol. Chem., 2004, 2, 3294

Computers in drug discovery

For every drug that reaches the market, some 8000 other candidates are synthesised, tested to varying extents, and found to be unsuitable. The pharmaceutical industry is therefore very interested in computer-assisted, or ’rational’, drug design, as it has the potential to significantly streamline the drug discovery process. The idea is that enormous libraries of compounds could be screened in silico instead of in the laboratory, to identify the most promising candidates for a particular protein receptor. Synthetic efforts could then be focused on preparing and optimising only these molecules.

Four different situations can arise in rational drug design. Ideally, high-resolution structures are known both for the receptor and for small molecules that bind to it (ligands).

In this situation, computational chemists can use molecular modelling to develop quantitative structure-activity relationships (QSARs) for the known ligands, to identify other compounds in a chemical database that are also likely to bind and show activity.
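A toy version of the QSAR step might look like the sketch below; the two descriptors, the activity values and the linear model are all invented for illustration, and real QSAR models are usually far more sophisticated.

```python
# Toy QSAR: fit activity to two simple descriptors, then score a new compound
import numpy as np

# Descriptor matrix for four known ligands: [logP, molecular weight]
X = np.array([[1.2, 180.0],
              [2.5, 210.0],
              [3.1, 250.0],
              [0.8, 165.0]])
activity = np.array([5.1, 6.4, 7.0, 4.8])           # eg measured pIC50 values

# Least-squares fit of activity = w0 + w1*logP + w2*MW
A = np.hstack([np.ones((X.shape[0], 1)), X])
coeffs, *_ = np.linalg.lstsq(A, activity, rcond=None)

# Predict the activity of an untested database compound from its descriptors
new_compound = np.array([1.0, 2.0, 200.0])           # [1, logP, MW]
print("Predicted activity:", new_compound @ coeffs)
```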

These new molecules can then be compared to each other by computationally docking them into the protein’s binding site in a way that allows for conformational changes in both the ligand and the protein, to yield an overall ’fitness score’.

There are numerous docking programs available, each with its particular features and advantages, with Gold, the product of a collaboration between the University of Sheffield, GlaxoSmithKline, and the Cambridge Crystallographic Data Centre, being particularly well regarded.

Alternatively, the receptor’s structure has been solved, but no active ligands have been identified. In this situation, ligands can be designed from scratch (so-called de novo, or structure-based, drug design).

Three-dimensional search techniques are used to screen large databases to identify small molecule fragments that should interact with specific receptor sites, fragments which are of the correct size and geometry to fit into the receptor, or chemical scaffolds that can display functional groups (pharmacophores) at the right positions to make favourable interactions.

These features can then be combined to yield new molecules that should be complementary to the active site.

The most common scenario is that a collection of ligands has been identified which interacts with a receptor of unknown structure. In this case, researchers can use the known ligand structures to develop QSARs, or try to identify which pieces of the structure together constitute the minimal pharmacophore.

These models can then be used to search chemical databases for attractive drug candidates, or to design novel compounds incorporating functional groups and features believed to be responsible for activity.
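As a much-simplified stand-in for such a database search (true pharmacophore searching works in three dimensions), the sketch below filters an invented list of compounds for two required functional-group features using RDKit substructure queries.

```python
# Crude 2D pharmacophore-style filter: keep compounds showing both required features
from rdkit import Chem

required_features = [Chem.MolFromSmarts("C(=O)[OH]"),   # carboxylic acid group
                     Chem.MolFromSmarts("a1aaaaa1")]     # six-membered aromatic ring

database = ["CC(=O)Oc1ccccc1C(=O)O",   # aspirin
            "CCO",                      # ethanol
            "c1ccccc1C(=O)O"]           # benzoic acid

hits = [smiles for smiles in database
        if all(Chem.MolFromSmiles(smiles).HasSubstructMatch(q)
               for q in required_features)]
print(hits)   # only the compounds displaying both features survive
```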

The least desirable situation is that neither the receptor’s structure nor any ligands are known. In this case, large libraries of compounds often need to be created using combinatorial chemistry and evaluated for activity using high-throughput methods. But here, computer-based methods can be used to identify structurally dissimilar compounds, to increase the diversity of synthetic libraries.
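A rough sketch of that diversity step is shown below, again assuming RDKit and an invented handful of SMILES strings: compounds are picked greedily so that each new choice is as dissimilar as possible (by Tanimoto distance on fingerprints) from everything already chosen.

```python
# Greedy 'maximum dissimilarity' selection from a small illustrative library
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["c1ccccc1", "c1ccccc1O", "CCO", "CCCCCCCC", "CC(=O)O"]
fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=1024)
       for s in smiles]

def distance(i, j):
    return 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])

# Start from the first compound, then repeatedly add whichever remaining
# compound is furthest from everything selected so far
chosen = [0]
while len(chosen) < 3:
    remaining = [i for i in range(len(fps)) if i not in chosen]
    chosen.append(max(remaining, key=lambda i: min(distance(i, j) for j in chosen)))

print("Diverse subset:", [smiles[i] for i in chosen])
```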

The first drug to be produced by rational design was Biota’s Relenza, an anti-viral that is used to treat influenza, but certainly the most famous result of these efforts is Pfizer’s anti-impotence drug Viagra.