Strategic compounds within a huge chemical network pinpointed by machine learning

Having trawled more than half a million compounds, an algorithm has identified the 569 molecules that could drive a circular economy, where chemical waste becomes feedstock for new materials. Called strategic molecules, they are key players in pathways that lead from biowaste – such as terpene mixtures produced by the paper industry – to valuable compounds such as drugs.

‘At the moment, if you have a waste stream and you want to produce a high value end product, there is a lot of uncertainty over what to make and by what routes,’ explains Jana Marie Weber from the University of Cambridge, UK. ‘People normally focus on very specific waste feedstocks and then see what they can make from it. Or they go from the product, step by step, and relate high value end product to the biowaste that they have.’ However, both approaches demand an extraordinary amount of chemical expertise and computational power.

Image: The algorithm identified key strategic molecules that could help drive a circular, waste-free economy (Source: © 2019 Elsevier Limited)

With the help of an algorithm, Weber, Alexei Lapkin and Pietro Lió have now identified 569 molecules that serve as key connecting points between waste and value. ‘We direct our search towards strategic molecules, and then from strategic molecules, a couple of reaction steps to some desired end product,’ Weber says. ‘By doing that, we can reduce computational time by two orders of magnitude.’

The team mined more than half a million compounds and almost a million reactions from the Reaxys database and assembled them into a gigantic chemical network. They then let an isolation forest algorithm find those molecules with the most connections and the most central positions in multistep reactions.
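To make the idea concrete, here is a minimal sketch of how a handful of reactions could be assembled into a network and screened for unusually well-connected molecules. It is not the team's code: the six reactions are illustrative textbook examples rather than Reaxys records, the two features (incoming and outgoing connections) are an assumption, and the tiny isolation forest is a simplified pure-Python version that skips the subsampling a production implementation would use.

```python
import math
import random
from collections import defaultdict

# Illustrative reactions only (reactants, products); not taken from Reaxys.
REACTIONS = [
    (("methanol", "carbon monoxide"), ("acetic acid",)),
    (("methanol", "benzoic acid"), ("methyl benzoate", "water")),
    (("methanol", "acetic acid"), ("methyl acetate", "water")),
    (("methanol", "oxygen"), ("formaldehyde", "water")),
    (("limonene", "hydrogen"), ("p-menthane",)),
    (("phenol", "acetone"), ("bisphenol A", "water")),
]

# Assemble a directed network: one edge from every reactant to every product.
successors, predecessors = defaultdict(set), defaultdict(set)
for reactants, products in REACTIONS:
    for r in reactants:
        for p in products:
            successors[r].add(p)
            predecessors[p].add(r)

molecules = sorted(set(successors) | set(predecessors))
# Assumed features per molecule: (incoming connections, outgoing connections).
features = {m: (len(predecessors[m]), len(successors[m])) for m in molecules}
DIMS = 2


def average_path(n):
    """Expected path length of an unsuccessful binary-tree search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def grow(points, depth, limit, rng):
    """Build one isolation tree; a leaf is stored as the number of points in it."""
    varying = [d for d in range(DIMS) if len({p[d] for p in points}) > 1]
    if depth >= limit or not varying:
        return len(points)
    dim = rng.choice(varying)
    cut = rng.uniform(min(p[dim] for p in points), max(p[dim] for p in points))
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return (dim, cut, grow(left, depth + 1, limit, rng), grow(right, depth + 1, limit, rng))


def path_length(tree, point):
    """Number of splits needed to reach the point's leaf, plus a leaf-size correction."""
    depth = 0
    while isinstance(tree, tuple):
        dim, cut, left, right = tree
        tree = left if point[dim] < cut else right
        depth += 1
    return depth + average_path(tree)


def anomaly_scores(features, trees=200, seed=0):
    """Score in (0, 1]; higher means the molecule is isolated after fewer splits."""
    rng = random.Random(seed)
    points = list(features.values())
    limit = math.ceil(math.log2(len(points)))
    forest = [grow(points, 0, limit, rng) for _ in range(trees)]
    norm = average_path(len(points))
    return {
        m: 2 ** (-sum(path_length(t, x) for t in forest) / (trees * norm))
        for m, x in features.items()
    }


scores = anomaly_scores(features)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2])  # the two molecules the forest finds easiest to isolate
```

In this toy network the outliers are simply the molecules with far more connections than the rest, which is the intuition behind treating highly connected compounds as candidates for strategic molecules.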

Among the strategic molecules are many common intermediates such as water, carbon dioxide, methanol, acetic acid and phenol. But there are also compounds important to specific industry branches, such as benzoyl peroxide – a radical starter for polymerisation – the pharmaceutical precursor piceol and the supramolecular building block tetraphenylethylene.

Matching human expertise

The algorithm proved that it can almost match human expertise – despite the fact that it doesn’t know any chemistry. By simply ranking items in the same way Google ranks search results, it found half of the compounds that had been named as important building blocks in a National Renewable Energy Laboratory report.
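The comparison with Google suggests a PageRank-style ranking, in which a node is important if important nodes point to it. The sketch below assumes exactly that and nothing more: the node names are hypothetical placeholders, the damping factor of 0.85 is the conventional default rather than a value from the study, and the real ranking may differ in its details.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is shared out evenly.
        dangling = sum(rank[n] for n in nodes if not out[n])
        new = {n: (1.0 - damping + damping * dangling) / len(nodes) for n in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank


# Hypothetical network: three waste compounds feed one hub, which feeds two products.
EDGES = [
    ("waste_A", "hub"),
    ("waste_B", "hub"),
    ("waste_C", "hub"),
    ("waste_A", "side_product"),
    ("hub", "product_1"),
    ("hub", "product_2"),
]

ranks = pagerank(EDGES)
top = max(ranks, key=ranks.get)
print(top, round(ranks[top], 3))
```

No chemistry enters the calculation: the hub rises to the top purely because of where it sits in the network.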

‘The conventional approach, how a human would identify [important compounds], usually requires significant amount of knowledge and experience, which may not be easy to acquire,’ says Dongda Zhang from the University of Manchester, UK, where he works on bioprocess systems engineering and machine learning. ‘This work provides a more systematic, smarter and more automatic approach to identify potentially important compounds at an early research stage, which is innovative and worth investigating.’

While the method can find synthesis routes from a waste component to strategic molecules and further to value-added chemicals, it doesn’t yet evaluate whether these pathways are chemically viable. ‘The key output of our work is that we can focus on early stage process development in the assembly of all possible routes,’ Weber says.
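The routing idea, from waste to a strategic molecule and on to a product, can be illustrated with two short breadth-first searches joined at the strategic molecule. This is a sketch under stated assumptions, not the published method: the graph and its node names are hypothetical, and, as in the article, nothing here checks whether a route is chemically viable.

```python
from collections import deque


def shortest_route(graph, start, goal):
    """Breadth-first search; returns the shortest list of nodes or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def route_via(graph, waste, strategic, product):
    """Join a waste-to-strategic route with a strategic-to-product route."""
    first = shortest_route(graph, waste, strategic)
    second = shortest_route(graph, strategic, product)
    if first is None or second is None:
        return None
    return first + second[1:]  # drop the duplicated strategic molecule


# Hypothetical reaction network: each key can be converted into the listed nodes.
GRAPH = {
    "waste": ["intermediate_1", "dead_end"],
    "intermediate_1": ["strategic"],
    "strategic": ["intermediate_2"],
    "intermediate_2": ["product"],
}

route = route_via(GRAPH, "waste", "strategic", "product")
print(" -> ".join(route))
```

Splitting the search at a known waypoint keeps each half short, which is one plausible reason a search directed through strategic molecules needs less computation than an unconstrained one.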