Machine learning tool predicts products of organic reactions by treating chemistry like language

IBM researchers have developed a program that can predict the products of organic chemistry reactions.1 Modelled on the latest language translation systems – like Google’s artificial neural network – the AI picked the right product 80% of the time despite not having been taught any organic chemistry rules.

‘What this tool is trying to do is imitate a top pro chemist in more or less the entire domain of organic chemistry,’ says Teodoro Laino, one of the researchers involved in the study at IBM in Zurich, Switzerland. His ambitious goal is shared by other chemists who have been attempting to create a functioning AI chemist since the 1970s, when organic chemist E J Corey kick-started the field by creating a chemical knowledge database.

However, making a tool based on chemistry knowledge can be time-consuming; Bartosz Grzybowski’s team took 10 years to encode their Chematica retrosynthesis program with 20,000 chemical rules. Moreover, a knowledge-based AI has difficulty tackling reactions that lie outside of its rule set. ‘There’s a way to learn organic chemistry that’s not memorising chemical rules, by just trying to find out the underlying patterns in reactions and trying to rationalise them,’ Laino says, explaining the approach that his team took.

Instead of teaching their program rules, the team gave it more than 50,000 patented reactions to train on. ‘From the reactant plus the reagents, it tries to guess the most likely product,’ explains Philippe Schwaller of the IBM team. ‘By showing it the same training set again and again, it slowly learns how to construct a valid product.’

Chemical structures are first converted into a string of letters and numbers (Smiles, simplified molecular-input line-entry system). The program then treats the reaction like a translation problem, using the robust algorithms originally developed for language processing.
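To make the translation analogy concrete, here is a minimal sketch of how a Smiles string might be split into 'words' before being handed to a language model. The token pattern and the `tokenize` helper are illustrative assumptions, not the IBM team's actual preprocessing, and the esterification example was chosen only for simplicity.

```python
import re

# Illustrative token pattern (an assumption, not taken from the IBM study):
# bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-bond digits, and bond/branch/separator symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-\\/.:@>~*$]"
)


def tokenize(smiles: str) -> list[str]:
    """Split a Smiles string into tokens, the 'words' of a chemical sentence."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


# Acetic acid plus ethanol giving ethyl acetate (hypothetical example pair).
source = tokenize("CC(=O)O.OCC")  # the 'sentence' to translate from
target = tokenize("CC(=O)OCC")    # the 'sentence' to translate to

print(" ".join(source))
print(" ".join(target))
```

A sequence-to-sequence model would then be trained to map the source token sequence to the target one, much as a translation system maps a sentence in one language to a sentence in another.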

[Figure: Found in translation: the program can correctly predict the outcomes of reactions in four out of five instances. Source: IBM]

After 24 hours of learning, the program was presented with a new set of patented reactions it hadn’t encountered before. It managed to give the right product 80.3% of the time. The IBM team says this means its AI outperforms a comparable prediction program, created at the Massachusetts Institute of Technology (MIT), US,2 by a margin of 6.3%.
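A score like this can be read as a top-1 accuracy: the fraction of held-out reactions for which the program's first guess matches the recorded product. The sketch below assumes a simple exact string match and uses made-up toy data; `top1_accuracy` is a hypothetical helper, and a real evaluation would probably need to canonicalise the structures first so that different spellings of the same molecule compare as equal.

```python
def top1_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of reactions whose predicted product matches the recorded one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one reaction to score")
    # Assumption: exact string equality stands in for 'same molecule'.
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


# Toy data invented for illustration; not from the IBM or MIT test sets.
expected = ["CC(=O)OCC", "CCO", "c1ccccc1Br", "CCN", "CC(C)O"]
predicted = ["CC(=O)OCC", "CCO", "c1ccccc1Br", "CCN", "CCCO"]

print(f"{top1_accuracy(predicted, expected):.0%}")  # four of the five guesses match
```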

‘[The IBM team] showed a marginal improvement in accuracy and showed that this framework is applicable to this problem,’ comments Connor Coley, a graduate student and member of the MIT team. However, ‘these kinds of models that […] don’t give you an understanding of what actually happens to the chemistry may have a challenge in terms of convincing the chemistry community to accept these “black box” type models’, adds Klavs Jensen, who recently created an AI chemist that combines rule-free learning with some chemical expertise.3

Others have also taken this combined approach,4 but Coley says it’s important to keep in mind that any AI can only ever be as good as the data it is fed. The IBM program doesn’t include any reaction parameters like temperature or solvent, as these finer details are often not available in a format that a machine could digest.

So far, there has been a lack of experimental tests that could verify how prediction programs fare in practice. But Jensen says: ‘I think in a couple of years it’s realistic to expect there will be tools available that people can access and test.’