A metric more commonly used by search engines to analyse language can now power organic chemistry retrosyntheses


This 'structure cloud' shows organic chemistry 'keywords' with their size corresponding to how often they occur

How organic chemists devise synthetic routes to target molecules could soon be revolutionised by a new approach that treats molecules as sentences, and their fragments as words. That’s thanks to a language statistic used by search engines that researchers in the US and Poland have shown can be successfully applied to retrosynthetic analysis.

‘Modern computational linguistics is pattern recognition,’ explains Bartosz Grzybowski from Northwestern University in the US. ‘What I was taught in organic chemistry for 10 years – it’s exactly the same.’ Grzybowski’s team has integrated the approach into Chematica, a synthetic pathway discovery tool scheduled for a November or December launch.

While developing Chematica, Grzybowski learned, through his interest in philosophy, of the advances computational linguistics was making in pattern recognition. Over the following two years his team grappled with translating those ideas from language to chemical structures. ‘I don’t think before this paper there was a chemist in the world who did linguistics,’ laughs Grzybowski.

Linguists create dictionaries of maximum common substrings, series of letters and/or words shared by different sentences, and rank them by how often they occur. To emulate that, the chemists compiled a ‘dictionary’ of fragments common to different molecules. When their analysis focused specifically on functional groups, like amines or hydroxyls, the distribution of dictionary content was very different to English, making linguistic rules harder to apply. ‘The analogy is in linguistics, what distinguishes language is not the alphabet, but certain repeat patterns of words,’ observes Grzybowski. 
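To give a feel for the dictionary-building step on the language side, here is a minimal Python sketch. It is an illustration only, not the team's method: it works on plain character strings rather than molecular graphs, it takes just the single longest shared substring from each pair of sentences, and the function names and the `min_length` cutoff are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def longest_common_substring(a, b):
    """Return the longest run of characters that appears in both a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def build_dictionary(sentences, min_length=2):
    """Collect substrings shared by pairs of sentences, ranked by frequency.

    Frequency here means the number of sentences that contain the substring
    (an assumption made for this sketch).
    """
    shared = set()
    for a, b in combinations(sentences, 2):
        substring = longest_common_substring(a, b)
        if len(substring) >= min_length:
            shared.add(substring)
    counts = Counter({s: sum(s in t for t in sentences) for s in shared})
    return counts.most_common()


sentences = ["the cat sat", "the cat ran", "a dog sat"]
ranked = build_dictionary(sentences)
print(ranked[0])  # the most widely shared substring and its count
```

A chemical version would need real substructure matching between molecules, which simple string comparison cannot provide.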

Language of chemistry

By contrast, examining all possible structural fragments gave Grzybowski’s team a dictionary distributed very similarly to English. They then applied a statistic known as term frequency–inverse document frequency (TF-IDF) to find a starting point for synthesis planning. TF-IDF can relate how often words occur in a sentence to how often those words occur in language more generally, identifying which words contain most information. The Northwestern team proposed that bonds with high TF-IDF scores would be most important to make, and therefore first to disconnect in retrosynthetic analysis.
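The statistic itself is simple to compute. The sketch below uses one common textbook variant of TF-IDF (term count divided by document length, multiplied by the logarithm of the inverse document frequency); the article does not say which variant the team used. The fragment labels and the three toy 'molecules' are invented for illustration, and the sketch stops at scoring fragments: how scores are mapped onto individual bonds is not described here.

```python
import math
from collections import Counter


def tf_idf(document, corpus):
    """Score each term in `document` against `corpus`.

    Assumes `document` is itself one of the entries in `corpus`, so every
    term occurs in at least one corpus document.
    """
    counts = Counter(document)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / len(document)                      # term frequency
        df = sum(1 for doc in corpus if term in doc)    # document frequency
        scores[term] = tf * math.log(n_docs / df)       # TF x IDF
    return scores


# Hypothetical molecules written as lists of fragment labels.
corpus = [
    ["methyl", "methyl", "lactam"],
    ["methyl", "phenyl"],
    ["methyl", "phenyl", "hydroxyl"],
]
scores = tf_idf(corpus[0], corpus)
for fragment, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{fragment}: {score:.3f}")
```

In this toy corpus the ubiquitous methyl fragment scores zero, while the lactam, which appears in only one molecule, scores highest: the pattern the article describes, in which rare fragments carry the most information.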

To test this, they asked Janusz Jurczak and his Polish Academy of Sciences team to manually analyse linguistically disconnected structures. Around 97% of the time at least one chemist selected one of the computer’s top three bond choices. ‘In the vast majority of cases the bonds that you should be cutting have the highest information content,’ Grzybowski tells Chemistry World.

‘“Google-type” search engines that focus on common repeat patterns to analyse and disconnect organic molecules would be game-changing,’ says Varinder Aggarwal at the University of Bristol, UK. ‘It would bring complex synthetic chemistry to a much broader community. This paper looks like a first step in this direction.’ Yet Aggarwal warns that it’s difficult to judge how successful the approach will be. ‘The proof will be provided when it’s tested against complex molecules.’

However, Phil Baran from the Scripps Research Institute in La Jolla, US, is not certain how useful this would be ‘for anyone skilled in the art of synthesis’. ‘I’d have the same problem if I tried to computationally understand what makes one painting more beautiful than another by analysing patterns and shapes,’ he says. ‘You might get the right answer some of the time but it’s unlikely you’ll be able to do anything creative since by definition you’re cataloguing what has been done before – you’ll always be stuck “inside the box”.’

Grzybowski highlights that the new approach will be a part of Chematica’s retrosynthetic disconnection suite. ‘It’s an independent measure. Measures that agree, linguistic and chemical approaches, should tell you, “this is how and where I want to make a cut”.’