Fully exploring the ocean of possible compounds – even computationally – is impossible, finds Philip Ball

How big is chemistry? I don’t mean how important is it, or how many people do it, but rather, how many molecules are there that we could make? The answer is, to a first approximation: don’t ask. The number of possible molecular permutations of all the elements doesn’t seem to have been estimated and might not really be a meaningful concept, but even if we limit ourselves to ‘small’ organic molecules (below about 70 atoms) that might exhibit drug activity, the figure is something like 10⁶⁰. We haven’t the faintest hope of making any more than a minuscule fraction of them – the largest current public database of molecules so far synthesised, PubChem, contains around 50 orders of magnitude fewer (~70 million). You do the maths.
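For anyone who does want to do the maths, here is a back-of-envelope check using only the two rough figures quoted above (about 10^60 candidate molecules and about 70 million PubChem entries). The variable names are purely illustrative.

```python
import math

# Figures quoted in the text; both are rough estimates.
drug_like_space = 10**60      # small organic molecules below ~70 atoms
pubchem_entries = 70_000_000  # compounds catalogued in PubChem, approximately

# Orders of magnitude separating what could exist from what has been made.
gap = math.log10(drug_like_space) - math.log10(pubchem_entries)
print(f"gap: about {gap:.0f} orders of magnitude")
```

The result comes out at a little over 52, which is the ‘around 50 orders of magnitude’ of the text.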

What we’re looking at here is chemical space, and to all intents and purposes it is as vast as the universe. We’ll never get to explore more than a few little corners. How discouraged should we be? Might there be untold riches here that we’ll never find? Or might we already be rooting around in the most fertile territory?

Drug discovery is the most prominent field for which such questions are pressing, but it’s not alone. Who knows what extraordinary materials might lurk in chemical space – after all, it’s been barely a decade that we’ve been getting to know the potential of one of the simplest materials conceivable, the pure form of crystalline carbon called graphene. Dyes, batteries, catalysts, solvents, food additives, you name it – there’s not a single area of chemistry that might not be enriched by our stumbling into a new region of chemical space. Even if our journeys are doomed forever to remain selective and parochial, can we at least find means of navigation that guide us into the more useful domains? Or are blind, random searches the best we can do?

Many researchers are convinced that they are not, and they are designing computational and experimental tools to help us do better. While these may bring practical rewards, they also allow us to broach a wider question: can we start to descry the contours of chemical space itself? Do we know what kind of space it is – a uniform plain, say, or a rugged mountainous terrain? And how much of it is actually worth exploring?

Discovering new drugs

That the pharmaceutical industry should be the first to grapple with chemical space is no surprise. For one thing, we need new drugs urgently: more and more are failing before approval, and some fear that the pipeline is running dry. Some researchers feel that we’ll only find new ones if we broaden the search substantially.

But as yet, combinatorial libraries of small molecules – made by the random assembly of selected components – haven’t proved very fruitful as sources of new lead molecules for drug development. Perhaps that’s no surprise, for even here the range of options is too huge to manage. Peter Ertl, a chemoinformatics specialist at Novartis in Switzerland, has estimated that there are around 10²⁰–10²⁴ ‘drug-like’ organic molecules containing up to 30 atoms, which is still far too many for a computational search, let alone for making and screening using combinatorial synthesis. ‘All the synthetic effort so far is just scratching the surface,’ Ertl says.


Tina Zellmer/Debut Art

How far, then, can a brute-force search get us? Jean-Louis Reymond of the University of Berne in Switzerland has been one of the most assiduous cartographers of chemical space. He and his coworkers have developed computational methods for enumerating all possible molecules of a certain size,1 and have so far collected a database of around 166 billion molecules of up to 17 atoms containing carbon, nitrogen, oxygen, sulfur and halogens, which Reymond calls GDB-17.

Reymond and colleagues use graph theory to enumerate all the possible saturated hydrocarbon networks of this size. They then mutate and functionalise these skeletons by adding double and triple bonds or the heteroatoms nitrogen, oxygen, sulfur and halogens, observing the chemical rules of valency. All along the way the algorithm employs some chemical nous to weed out molecules that are too strained or reactive to be stably synthesised. In fact, the rules used to generate GDB-17 are more restrictive than those chemists actually employ – just 57% of all the molecules of this size and composition in PubChem are included.
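The first step, enumerating skeletons as graphs, can be illustrated with a toy version. The sketch below is not the GDB generator: it handles only acyclic skeletons (no rings), applies none of the strain or reactivity filters, and the function names are invented for illustration. It grows trees one carbon at a time, respects a maximum valency of four, and discards duplicates using a canonical string for each tree.

```python
def canonical_form(adj):
    """Canonical string for an unrooted tree given as adjacency lists."""
    n = len(adj)
    if n == 1:
        return "()"
    # Locate the centre (one or two atoms) by repeatedly stripping leaves.
    degree = [len(nb) for nb in adj]
    leaves = [v for v in range(n) if degree[v] == 1]
    remaining = n
    while remaining > 2:
        remaining -= len(leaves)
        new_leaves = []
        for leaf in leaves:
            for nb in adj[leaf]:
                degree[nb] -= 1
                if degree[nb] == 1:
                    new_leaves.append(nb)
        leaves = new_leaves

    def encode(v, parent):
        # Sorted nested brackets: identical for any relabelling of the atoms.
        kids = sorted(encode(c, v) for c in adj[v] if c != parent)
        return "(" + "".join(kids) + ")"

    return min(encode(centre, -1) for centre in leaves)


def alkane_skeletons(n_carbons, max_valence=4):
    """All distinct acyclic carbon skeletons with n_carbons atoms."""
    trees = {"()": [[]]}  # canonical form -> adjacency lists
    for _ in range(2, n_carbons + 1):
        grown = {}
        for adj in trees.values():
            for v in range(len(adj)):
                if len(adj[v]) >= max_valence:
                    continue  # carbon v already has four neighbours
                new_adj = [list(nb) for nb in adj] + [[v]]
                new_adj[v].append(len(adj))
                grown.setdefault(canonical_form(new_adj), new_adj)
        trees = grown
    return list(trees.values())


counts = [len(alkane_skeletons(n)) for n in range(1, 9)]
print(counts)
```

For one to eight carbons this reproduces the familiar alkane isomer counts (1, 1, 1, 2, 3, 5, 9, 18). Even in this stripped-down setting the numbers climb quickly, which hints at why adding rings, multiple bonds and heteroatoms pushes the total into the billions.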

How many of these molecules might be relevant to drug discovery? To answer that, you need to screen the candidates by computer, since making them all is out of the question. But reliable simulation methods that can predict, say, the affinity of a small molecule for a drug target such as an enzyme are still hard to devise. So Reymond’s searches start from what we know: they browse around tried and tested places. ‘In our approach we always start with an existing molecule that has a particular set of desirable properties: physical, biological, or whatever,’ he says. ‘The idea then is to search chemical space for other molecules with similar properties.’

To do that in a systematic way, it would be good to have some way of organising this huge, multidimensional space so that molecules with similar properties can be grouped together, much as the periodic table organises the elements. Mindful of that analogy, Reymond has proposed a classification scheme that uses ‘molecular quantum numbers’ (MQN). ‘These count things in molecules that are easy to see, such as atoms, bonds, cycles and so on,’ says Reymond. ‘Because we count things that give integer values, I thought it is elegant and also basically correct to call these molecular quantum numbers’. Despite its simplicity, the MQN system has turned out to be excellent not only for classification and visualisation but also for virtual screening, in particular searching for analogues of compounds in terms of shape and pharmacological activity. Reymond’s group has already found some promising new drug candidates in their chemical space. 
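The flavour of such a counting scheme can be conveyed in a few lines. What follows is a simplified sketch, not the published MQN definition: it counts only a handful of features, assumes each molecule is a single connected heavy-atom graph (so the ring count is bonds minus atoms plus one), and uses city-block distance as one simple way to rank neighbours. The molecule encodings and function names are assumptions made for the example.

```python
from collections import Counter


def count_descriptor(atoms, bonds):
    """Integer feature vector from element symbols and (i, j, order) bonds."""
    elements = Counter(atoms)
    orders = Counter(order for _, _, order in bonds)
    # Independent cycles in a connected graph: bonds - atoms + 1.
    rings = len(bonds) - len(atoms) + 1
    return (elements["C"], elements["N"], elements["O"],
            orders[1], orders[2], orders[3], rings)


def city_block(a, b):
    """Sum of absolute differences between two integer vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hand-written heavy-atom graphs (hydrogens omitted).
molecules = {
    "ethanol": (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]),
    "acetaldehyde": (["C", "C", "O"], [(0, 1, 1), (1, 2, 2)]),
    "cyclopropane": (["C", "C", "C"], [(0, 1, 1), (1, 2, 1), (0, 2, 1)]),
}
descriptors = {name: count_descriptor(*mol) for name, mol in molecules.items()}

# Rank every molecule by its distance from a chosen starting point.
query = descriptors["ethanol"]
ranked = sorted(descriptors, key=lambda name: city_block(query, descriptors[name]))
print(ranked)
```

Starting from ethanol, acetaldehyde lands closer than cyclopropane, which is the kind of ‘search near what you know’ behaviour the approach relies on.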

Guided by nature

There is more diversity of frameworks in GDB-17 than in the PubChem database. But does that diversity of chemical space matter, if only certain kinds of molecules are useful? In an attempt to find out if that’s so, Ertl and coworkers at Novartis set out to examine a small subset of molecules: those that contain simple aromatic ring systems. Rings are very common scaffolds for known drugs – Ertl’s analysis showed that more than 96% of a typical sample of currently known bioactive molecules contain rings, with 75% of them being aromatic. (In contrast, most natural product ring systems are aliphatic – only about 38% are aromatic.) Surprisingly, though, only a very small number of different scaffolds are found in currently known bioactive molecules. Of the simple aromatic ring structures – those with one to three fused five- or six-membered rings – only 780 types appear in about 150,000 bioactive structures, and only 10 are present in more than 1% of them. 

To explore why this might be, and see how this bioactivity is distributed in this part of chemical space, Ertl and colleagues mapped out all the simple aromatic ring systems using an algorithm informed by basic chemical rules to produce all chemically plausible structures and exclude those deemed ‘too exotic’. Their method identified around 580,000 molecules, of which only 2000 or so have been made so far.2 They then used a neural network algorithm trained to seek bioactivity among the possible structures. It found only a few small islands within the full space, restricted to a few skeletons such as quinazoline and purine.

In other words, Ertl says, ‘known bioactivity is located in a few small islands in the huge chemical space’. That being so, he says, the best way to find new useful compounds is not to enumerate the whole space but to use algorithms trained to evolve molecules that are related to sites of already known bioactivity. But he cautions that this apparent sparseness of bioactive chemical space might be partly a result of the fact that we’ve only looked in a few places: there might be other bioactive islands that the neural network doesn’t pick up simply because it has no grounds for expecting them there.

That limitation on our experience may change, Ertl says, thanks to an emerging technology that may transform the way we explore chemical space, called DNA-encoded libraries.3 Here each molecular fragment is barcoded with a unique DNA sequence, so that the molecules made from their combinatorial shuffling can be quickly identified and screened. ‘This technology allows one to create and subsequently screen huge libraries (hundreds of millions, even billions) of molecules,’ he says – and the whole library is contained in a single tube. ‘In few years we will see whether this new technology will hold its promise,’ he adds.
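The bookkeeping behind the barcode idea is simple enough to sketch. The example below is a toy model only: the fragment names, the four-letter tags and the function names are all invented, and real libraries involve chemistry (ligation, amplification, sequencing) that is not modelled here. It assumes one tag per fragment per synthesis cycle, with a product's barcode being the tags in cycle order.

```python
from itertools import product


def assign_tags(fragments, tag_length=4):
    """Give each fragment in one cycle a distinct fixed-length DNA tag."""
    if len(fragments) > 4 ** tag_length:
        raise ValueError("not enough distinct tags of this length")
    tags = ("".join(p) for p in product("ACGT", repeat=tag_length))
    return dict(zip(fragments, tags))


def build_library(cycles, tag_length=4):
    """Map every concatenated barcode to the fragment combination it encodes."""
    tag_maps = [assign_tags(cycle, tag_length) for cycle in cycles]
    library = {}
    for combo in product(*cycles):
        barcode = "".join(tags[f] for tags, f in zip(tag_maps, combo))
        library[barcode] = combo
    return library


# Hypothetical building blocks for three synthesis cycles.
cycles = [["amine_1", "amine_2", "amine_3"],
          ["acid_1", "acid_2"],
          ["aldehyde_1", "aldehyde_2"]]
library = build_library(cycles)

print(len(library))                # 3 x 2 x 2 combinations
print(library["AAACAAAAAAAC"])     # read a barcode back into its fragments
```

The library size is the product of the number of fragments in each cycle, so a few hundred building blocks over three cycles already reaches the millions.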

Building scaffolds

Herbert Waldmann of the Max Planck Institute for Molecular Physiology in Dortmund, Germany, agrees that to explore chemical space usefully for drug discovery, we need a guide. And like Ertl (the two collaborate) he suggests that we should begin with molecules that already show that activity and look for ones that are structurally related. But rather than conducting the search empirically using machine-learning methods, he thinks it can be more systematic: we should look for a molecular ‘core’ that defines the chemical behaviour and base our search on that. That’s how natural products seem to operate: they tend to share a limited number of basic scaffolds. ‘Nature invented a solution once and then varied it to fulfill different functions,’ says Waldmann. ‘It didn’t reinvent all the time’ – and we shouldn’t have to either.

Waldmann and his colleagues have attempted to classify these natural-product cores using a set of rules to arrange the many small organic molecules found in nature on a tree of structure types, with each level of the hierarchy being an elaboration of a few fundamental cores. This so-called structural classification of natural products (SCONP) tree funnels down to just three basic ‘trunks’: cycles made just of carbon, or heterocycles containing nitrogen or oxygen.4



There’s no reason why our drugs must be confined to the structures nature uses, however. Waldmann and his collaborators have developed an approach called Bios (biologically oriented synthesis) that expands the SCONP tree to include known synthetic molecules that show related biological activity, using a set of rules for relating a given molecule to its core scaffold along with published information on a molecule’s biological and pharmaceutical effects.5 The rules used to construct the Bios tree are to some extent arbitrary, so different rules will give different trees – but any given set of rules, Waldmann says, may be chosen to trace out particular types of related biological activity.

This approach is useful for showing what kinds of core structures molecules should have in order to display particular biological properties. But the really clever part is that the researchers have created an algorithm, called Scaffold Hunter, that can map out ‘missing’ parts of the Bios tree: molecules and families of molecules that also belong on particular branches but which simply haven’t been made or tested yet.6 ‘Such virtual compounds could then be made – or even better, purchased – and they should share the expected activity,’ says Waldmann.
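The logic of spotting ‘missing’ branches can be shown with a toy tree. This is not Scaffold Hunter's algorithm or rule set, which the text does not spell out. Here a scaffold is just a tuple of ring names, the pruning rule (drop the last ring) is an arbitrary stand-in for chemistry-aware rules, and all names are illustrative. Any parent scaffold generated by pruning that has no known compound behind it is flagged as a gap.

```python
def parent(scaffold):
    """Toy pruning rule: remove the last ring; single rings have no parent."""
    return scaffold[:-1] if len(scaffold) > 1 else None


def build_tree(known_scaffolds):
    """Return (children map, set of virtual scaffolds absent from the data)."""
    known = set(known_scaffolds)
    children, virtual = {}, set()
    for scaffold in known:
        node, up = scaffold, parent(scaffold)
        while up is not None:
            children.setdefault(up, set()).add(node)
            if up not in known:
                virtual.add(up)  # implied by the tree, never observed
            node, up = up, parent(up)
    return children, virtual


# Hypothetical scaffolds extracted from known active compounds.
known = [
    ("benzene",),
    ("benzene", "pyrrole"),
    ("benzene", "pyrrole", "piperidine", "cyclohexane"),
]
children, gaps = build_tree(known)
print(gaps)
```

In this example the three-ring scaffold sits between two known relatives but has no compound of its own, making it the sort of candidate one might choose to make or buy.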

And they do. In one example, Waldmann and colleagues used data from PubChem to construct a Bios tree for activators and inhibitors of pyruvate kinase, an enzyme involved in glycolysis. This contained over 35,000 different scaffolds, a quarter of which had not been synthesised. By focusing on those that were near-neighbours of molecules known to have this activity, the researchers narrowed their options down to just four novel scaffolds, from which they made or bought 107 derivatives and screened them to find nine activators and inhibitors with good activity. All of these fulfilled other criteria (such as solubility) needed for good drugs. 

In contrast to Ertl’s suggestion that a specific type of bioactivity is confined to small islands in the space, Waldmann says that chemical space is ‘very rich in bioactive compounds’, and these are highly distributed through the space – they aren’t a rather rare, clustered subset of the possible atomic permutations. There are compounds with very different structural cores that can show similar binding affinity to a single target and induce the same biological response. This sounds like good news for drug hunters. But the trick will be to find how to locate and pursue the right network of structures through a dense web of other networks. 

Space beyond drugs

Although drug discovery dominates the exploration of chemical space, it’s not the only game in town. Another arena in which molecular properties dictate the useful behaviour of a substance is homogeneous catalysis. Many of these catalysts are organometallic complexes, and devising or optimising a catalyst for a particular function is typically a matter of finding the right ligands to coordinate with a metal centre. The effects that the ligands have on catalytic activity may depend on several features, such as their size, shape and electronic properties: how much they withdraw electrons from or donate them to the metal centre, say. 


Natalie Fey, currently a visiting fellow at the University of Bristol in the UK, and coworkers have found that simple linear combinations of features like this act as predictors of catalytic activity in a way that collapses the multidimensional chemical space of phosphorus donor ligands onto low-dimensional maps. These combinations are found by a statistical method called principal-component analysis. The principal components that represent the axes of the maps don’t have a simple intuitive interpretation in themselves, but they provide an easily visualised way of classifying ligand properties for around 350 known ligand types, and thereby streamlining a computational search for potential new catalysts.7
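For readers unfamiliar with principal-component analysis, the sketch below shows the core idea on made-up numbers. The descriptor values are invented and do not come from any ligand database, and power iteration is used only because it keeps the example dependency-free (a numerical library would normally do this job). The code finds the single linear combination of descriptors along which the data vary most, then projects each ligand onto it.

```python
import math


def first_principal_component(rows, iterations=200):
    """Leading principal axis and per-row scores, via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the descriptors.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Coordinate of each row along the principal axis.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centred]
    return vec, scores


# Invented descriptors for four hypothetical ligands: the second column
# tracks the first exactly and the third never changes.
ligands = [(1.0, 2.0, 0.0), (2.0, 4.0, 0.0), (3.0, 6.0, 0.0), (4.0, 8.0, 0.0)]
axis, scores = first_principal_component(ligands)
print(axis)
print(scores)
```

Because two of the three columns move together and the third is constant, one axis captures all the variation here: three descriptors collapse onto a single coordinate per ligand. Real ligand maps keep two or three such axes.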

The modular character of such organometallics – metal centre plus ligands – permits a rational approach to building up meaningful libraries computationally, as Vidar Jensen at the University of Bergen and his coworkers have found.8 They have developed an algorithm that decomposes known compounds into their basic fragments using a set of ‘bond-cutting’ rules that take into account ‘retrosynthetic reasoning’ – how molecules are actually put together. Then they use these fragments to build up a library of new compounds that are not only more realistic and stable than ones that simply observe the laws of valency but are also more accessible to synthesis. 

Fey says that in organometallic catalysis, and perhaps in inorganic chemistry more generally, one of the big motivators is to find alternative compounds that will work as well as known variants without drawbacks such as cost or toxicity. That’s to say, you want a molecule that’s similar but without the problems. ‘Once we know a catalytic transformation is feasible, changing the ligands on the metal centre is the most common approach to fine-tuning catalyst activity or selectivity,’ she says. 

But the options are vast, and so computational screening of chemical space might be essential for focusing experiments in the right direction. All the same, Fey admits, ‘We are not yet at the point where computation routinely can come before experiment.’

Given the progress so far, however, that goal looks not only feasible but inevitable. It’s not sheer fantasy to imagine dialling in the properties you want from a molecule – biological activity, catalytic specificity, optical behaviour, you name it – and having the computer scan chemical space to give you a small set of candidates. Then you might turn to a facility like the Dial-a-Molecule scheme being developed by the UK Engineering and Physical Sciences Research Council and have them made for you by automated synthesis and delivered by mail order for you to test.

Will this automation of chemical discovery take the mystery and art out of chemistry? Or might it, like automated space missions to the planets, open our eyes to the unexpected wonders that are out there?

Philip Ball is a science writer based in London, UK