Artificial intelligence works out the grammar of chemical reactions

Source: © Alice Mollon/Ikon Images

Given a set of symbols, a machine-learning algorithm can learn the grammar of chemistry

Like many clichés, the comparison of chemistry to language is so often used because it works so well. We can put a finite set of ‘letters’ (atoms) together in innumerable permutations to make ‘words’ (molecules), each with its own meaning. And with words we can tell stories.

With some loosening of rigour, we can take this analogy further. The rearrangements of letters that take place in chemical reactions might be regarded as a kind of translation: English to German, say. Those transformations are governed by grammatical rules – for example, verbs that follow their subject in English are shunted to the end of the sentence in German. Students usually learn these rules by formal instruction – but an alternative approach is full immersion, where a student has to figure out how the translation works by intuition. Young children do this with astonishing facility; for those of us of a more mature vintage, it can be bewildering.

A team of researchers at the IBM laboratories in Rüschlikon, Switzerland, and in Cambridge, US, along with the University of Bern, Switzerland, has developed a machine-learning algorithm capable of performing this feat for chemical reactions.1 Their system is a neural network called RXNMapper. After exposure to millions of known organic reactions, it was able to work out the ‘grammar’ that allows atoms on the left-hand side of the equation to be mapped to those on the right-hand side. These rules allow it to describe how bonds are formed and broken in a set of 49,000 further reactions, with more than 99% accuracy. It’s not the first attempt at such atom-mapping, but previous versions have been guided by explicit labelling of the atoms in the equation, and used known rules of thumb rather than working from scratch. The researchers say that RXNMapper performs better than commercially available atom-mapping tools.

Smile for the algorithm

The algorithm first has to represent molecules in a text-like way. For that it uses a notation scheme called Smiles, in which, for example, aliphatic and aromatic carbon atoms have distinct symbols (upper-case C and lower-case c), and ring systems are given a linear notation: acetic acid becomes CC(=O)O, and benzene is c1ccccc1. RXNMapper incorporates solvents, catalysts and so forth with no distinction from other reagents, and follows the fate of each atom – what, say, was the source of the O atom that appears on carbon atom C5 in this ring system?
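
To make the notation concrete, here is a minimal Python sketch of how a Smiles string might be split into symbols before being handed to a language model. The pattern and the helper names (SMILES_TOKEN, tokenize, describe) are illustrative assumptions that cover only a small part of the notation; this is not RXNMapper's own tokenizer.

```python
import re

# Assumed, simplified token pattern (not RXNMapper's own tokenizer):
# bracket atoms, two-letter halogens, upper-case (aliphatic) and
# lower-case (aromatic) atoms, bond and branch symbols, ring-closure digits.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[=#\-+\\/:.()]|%\d{2}|\d"
)


def tokenize(smiles):
    """Split a Smiles string into tokens, rejecting unrecognised characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


def is_aromatic(atom_token):
    """An atom is written as aromatic if its element symbol is lower-case."""
    symbol = atom_token.lstrip("[0123456789") or " "
    return symbol[0].islower()


def describe(smiles):
    """Count atom tokens and how many of them are written as aromatic."""
    tokens = tokenize(smiles)
    atoms = [t for t in tokens if t[0].isalpha() or t.startswith("[")]
    aromatic = [t for t in atoms if is_aromatic(t)]
    return {"tokens": tokens, "atoms": len(atoms), "aromatic": len(aromatic)}


for name, smiles in [("acetic acid", "CC(=O)O"), ("benzene", "c1ccccc1")]:
    print(name, describe(smiles))
```

Acetic acid comes out as four upper-case atom symbols plus bond and branch marks, while benzene is six lower-case atoms bracketed by a matching pair of ring-closure digits.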

The ultimate aim is not to predict outcomes for a given set of reagents, but to provide the system with the entire reaction – reactants and products – and see if it can learn the rules that govern this translation. The goal is then not to do the translation itself but to understand how it works: what the ‘meaning’ of a given word is in the context in which it occurs. Is this particular water molecule, say, acting as a solvent or as a nucleophile?
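
Once every atom carries a map number, the solvent-or-nucleophile question can be read off mechanically. The Python sketch below assumes a hand-written, atom-mapped reaction (the hydrolysis of methyl acetate, with tetrahydrofuran present as solvent) and uses the common convention in which a number after a colon inside square brackets labels the same atom on both sides of the equation. The mapping is an illustration written by hand, not output from RXNMapper, and the function names are hypothetical.

```python
import re

# A map number is written after a colon inside a bracket atom, e.g. [OH2:6].
MAP_NUMBER = re.compile(r":(\d+)\]")

# Hand-written illustrative mapping (an assumption, not RXNMapper output):
# methyl acetate + water, in tetrahydrofuran >> acetic acid + methanol.
MAPPED_RXN = (
    "[CH3:1][C:2](=[O:3])[O:4][CH3:5].[OH2:6].C1CCOC1"
    ">>[CH3:1][C:2](=[O:3])[OH:6].[CH3:5][OH:4]"
)


def map_numbers(smiles):
    """Return the set of atom-map numbers that appear in a Smiles string."""
    return {int(n) for n in MAP_NUMBER.findall(smiles)}


def classify_precursors(mapped_rxn):
    """Label each left-hand molecule by whether its atoms reach a product."""
    precursors, products = mapped_rxn.split(">>")
    product_atoms = map_numbers(products)
    roles = {}
    for molecule in precursors.split("."):
        shared = map_numbers(molecule) & product_atoms
        roles[molecule] = "contributes atoms" if shared else "spectator"
    return roles


def fate_of_atom(mapped_rxn, number):
    """Return the product molecule that contains the given mapped atom."""
    products = mapped_rxn.split(">>")[1]
    for molecule in products.split("."):
        if number in map_numbers(molecule):
            return molecule
    return None


for molecule, role in classify_precursors(MAPPED_RXN).items():
    print(f"{molecule:40s} {role}")
print("water oxygen ends up in:", fate_of_atom(MAPPED_RXN, 6))
```

In this example the water's oxygen (atom 6) ends up in the acetic acid, so the water is acting as a reagent, while the unmapped tetrahydrofuran contributes nothing to the products.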

Crucially, the researchers don’t leave this as a ‘black-box’ process – they open up the box to deduce what is going on inside. It’s like unravelling the deep-learning process of Google Translate to figure out how it is doing its job – and specifically, whether it is applying the same kinds of principles that a human translator would use. In this way, what might have been a handy but rather opaque automated tool instead becomes a genuine source of knowledge, from which general rules can be extracted: the underlying grammar of chemical reactions.

Team member Philippe Schwaller says there is no reason in principle why the scheme could not be adapted for inorganic reactions, provided that a suitable representation of the molecular systems can be devised – perhaps, for example, describing complex extended structures such as metal-organic frameworks.

What and why

The IBM team has already used a similar model to plan syntheses by conducting a step-by-step retrosynthesis, in effect ‘translating’ a product back to a set of simple reagents.2 That gives a glimpse of the potential value of such automation, and indeed the RXN for Chemistry system has been made freely available to chemists globally for synthesis planning.

The RXN system exemplifies a goal for next-generation AI more generally: to provide transparency and explainability. Just as a doctor using an AI system for diagnosis needs to be able to tell patients not just a ‘what’ but a ‘why’, so too a chemist is unlikely to feel comfortable adopting a reaction scheme recommended by an AI without also having a qualitative rationale for it.

But that could be challenging. Because RXNMapper can take a much more finely grained view of the reaction repertoire, it can abstract many more ‘reaction rules’ than humans can assimilate. The IBM team recently found that around 37% of a large database of organic syntheses could be described with 1000 rules; to encompass more than four-fifths of the set, 100,000 rules were needed.3 Organic textbooks were never that complicated! So here, as elsewhere, it’s likely that humans will need to forge an alliance with AI to make the best use of what each can offer.