Scientists have identified human biases in datasets used to train machine learning models for computer-aided syntheses.1 They found that models trained on a small randomised sample of reactions outperformed those trained on larger human-selected datasets. The results underline the importance of including experimental results that people might dismiss as unimportant when developing computational tools for chemists.

Machine learning models are a valuable tool in chemical synthesis, but they’re trained on data from the literature where positive results are favoured, whereas the dark reactions – the experiments that were tried but didn’t work – are usually left out. ‘Including these failures is essential for generating predictive machine learning models,’ says Joshua Schrier of Fordham University, US, who was part of a team that studied hydrothermal syntheses of amine-templated metal oxides and found that biases were introduced into the literature by people’s choices of the reaction parameters.

‘We considered extra dark reactions – a class of reactions that humans don’t even attempt, not because of scientific or practical reasons, but simply because it’s humans who make the decisions,’ Schrier says. ‘We found that chemists tend to be stuck in a rut when planning new experiments, and this gets reinforced by social cues. There’s a tendency to follow the crowd, as defined by precedent in the literature.’ This results in systematic overrepresentation of some reagents and reaction conditions in experimental datasets, he says. ‘We found evidence for this both in crystallographic databases and in our collection of digitised dark reactions from laboratory notebooks.’

The researchers evaluated over 5000 amine-templated metal oxide structures deposited in the Cambridge Structural Database and found that 17% of the known amine reactants (70 ‘popular’ molecules) occur in 79% of the reported structures, while the remaining 83% (345 ‘unpopular’ molecules) are present in just 21% of the structures. They also analysed unpublished experimental records for hydrothermal vanadium borate reactions from their Dark Reactions Project and found similar biases in the pH and amine quantities used.
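The skew the researchers report can be illustrated with a toy simulation. Only the headline numbers (415 distinct amines, 70 of them 'popular', roughly 5000 structures) come from the study; the 20-fold preference weight below is an assumption chosen purely to reproduce a similar coverage imbalance:

```python
import random
from collections import Counter

random.seed(0)

# Toy illustration of popularity bias: a small 'popular' subset of amines
# is drawn far more often than the rest, so a minority of reactants ends
# up accounting for a majority of the reported structures.
n_amines = 415      # distinct amines, as in the study
n_popular = 70      # the 'popular' subset
n_structures = 5000

amines = list(range(n_amines))
# Assumed weighting: popular amines are 20x more likely to be chosen.
weights = [20 if a < n_popular else 1 for a in amines]
structures = random.choices(amines, weights=weights, k=n_structures)

counts = Counter(structures)
popular_hits = sum(counts[a] for a in range(n_popular))
print(f"{n_popular / n_amines:.0%} of amines account for "
      f"{popular_hits / n_structures:.0%} of structures")
```

With this weighting, roughly a sixth of the amines covers around four fifths of the simulated database, mirroring the 17%/79% split found in the Cambridge Structural Database.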

‘We removed this bias by intentionally rejecting the standard approach to these exploratory reactions,’ says Alexander Norquist of Haverford College, US, who was also involved in the study. He points out that there was no difference in the reaction performance when the ‘unpopular’ amines were used. ‘We created two machine learning models. One used the biased data and the other used random experiments. The model from the random experiments was stronger and better. In a laboratory test with unseen reagents, it was able to predict new reactions more successfully and discover new compounds that would be totally missed by a model trained on the anthropogenic biased data.’
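The comparison Norquist describes can be sketched in miniature. Everything here is hypothetical: a one-dimensional 'reaction parameter', an assumed ground truth in which reactions succeed only in the unfamiliar region, and a nearest-neighbour predictor standing in for the trained models. The point is only that a sample confined to the 'popular' region cannot reveal chemistry outside it:

```python
import random

random.seed(1)

def outcome(x):
    # Assumed ground truth for this toy: reactions succeed only for
    # 'unpopular' large values of the parameter.
    return x > 0.6

def predict(train, x):
    # 1-nearest-neighbour stand-in for a trained model: return the
    # outcome of the closest training example.
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

def make_train(biased, n=50):
    # Biased sampling mimics human habit: only the familiar low region
    # of parameter space is ever attempted.
    xs = [random.random() * (0.55 if biased else 1.0) for _ in range(n)]
    return [(x, outcome(x)) for x in xs]

grid = [i / 200 for i in range(200)]   # uniform test grid

def accuracy(train):
    return sum(predict(train, x) == outcome(x) for x in grid) / len(grid)

acc_biased = accuracy(make_train(biased=True))
acc_random = accuracy(make_train(biased=False))
print(f"biased-data model: {acc_biased:.1%}; random-data model: {acc_random:.1%}")
```

Because the biased sample never contains a successful reaction, the model built from it predicts failure everywhere and misses the entire productive region, while the randomly sampled model locates the boundary. This is a cartoon of the laboratory result, not a reconstruction of the authors' actual models.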

Lee Cronin of the University of Glasgow, UK, says that the results are interesting but not that surprising. ‘All programs have bias, the key is declaring it,’ he says. Connor Coley at the Massachusetts Institute of Technology, US, agrees. ‘One could argue that the authors have traded one form of bias for another. Even in the “random” experiments, they had to specify the probability distribution over the parameter space they were interested in building models for,’ he says. Cronin adds that the study considered the hydrothermal formation of crystals, which is very broad and continuous. ‘It does not follow how this might apply to discontinuous things like new types of reactions,’ he says.

Sorelle Friedler of Haverford College, who also participated in the research, says that techniques are being developed to encourage fairness in machine learning. She believes that such methods would be an interesting extension to this work. But Coley thinks that the bias problem may still remain. ‘Selecting experiments algorithmically is a great way to reduce the impact of human bias, but selecting the algorithm and defining its goal are still very much subjective tasks,’ he says.