Organic chemists should embrace machine learning even though its workings cannot be fully known

Classically, a black box is a system whose inputs are controlled or known, and whose outputs can be harvested, but whose internal workings remain a mystery. Take Google search – we may know roughly how it works, but details of the search algorithm are kept secret from the public. But when organic chemistry meets computing, we sometimes feel we want to know everything – black boxes can be seen as frustrating and untrustworthy tools.

It’s fair to say that sometimes, comprehensive understanding lets us control all variables to avoid problems. As a student, I expressed concern over the results of a computational exercise, only to be dismissed with ‘but the computer says this is what you have to do’. Three months of hard work later, we synthetic chemists were vindicated when it was found that, thanks to a computing error in a system we couldn’t access directly, we had indeed been working on the wrong compounds for all that time. I have had a healthy dose of scepticism for methods out of our control ever since!

Although caution is very well advised, we should also remember undergraduate thermodynamics, where we learn to deliberately treat chemical systems as black boxes, their complexity reduced to just a few fundamental parameters – otherwise we would be unable to compute their properties. For the most complex systems, a clear understanding of the system’s workings needs to be a sufficient substitute for knowing the exact pathways by which it reaches our answer. I am particularly thinking of machine learning methods: systems whose contents are for practical purposes unknowable, and whose reasoning may not make sense to human users. However, making a leap of faith is highly uncomfortable for organic chemists, who are used to having authority and reasoning even over atomic structure. Although we will never hold every detail of an individual neural net in our mind’s eye, we can learn how it works, how it was created, and which parameters it has been allowed to exploit. An elementary understanding of the tools and some trust in expert collaborators allow us to reduce our concerns about abstracted methods.

The art of abstraction

On some level, humans abstract almost everything we use. Every time you use an LC–MS as a synthetic chemist, you don’t need to mentally run through a back-to-basics understanding of the relative polarities, UV absorption and ionisability of your substrates. Simply referring to your compound as a tertiary aniline provides sufficient information for an experienced user to expect a certain outcome. These abstractions might even be directly hard-coded; for example, you may have polar and apolar generic methods set up on the instrument. There are countless popular examples of more readily recognising a concept when we give a name to it – perhaps name reactions are one case – as well as negative examples, such as seeing someone who looks like a ‘yob’ and falsely making a mental connection to troublemaking. The audience’s capability for abstraction is also a useful tool when presenting complex results: data storytelling allows a presenter to build individual bricks of data into conceptual structures, helping the audience to feel they have fewer individual concepts to wrap their heads around.

Abstractions become uninterpretable to humans when they involve computer-speed calculations or too many variables. Luckily, computers excel at these, but it can be a shock when the methods no longer fit inside a human brain. I visualise these superhuman helpers as being another layer on top of the brain, much like a laptop farming out calculations to a supercomputer cluster and then retrieving the results.

And this is the advantage: some systems we don’t understand really are better than us at what they do. Although machine learning is still an emerging tool when applied to organic chemistry, particularly due to our relatively small datasets, its power is clear from our everyday use of it, from facial recognition on our devices to voice recognition on our home assistants. (That said, machine learning is subject to the same biases as its training data: it can overuse a go-to catalyst, or more alarmingly, struggle more with darker skin tones in images of people.) Something that may not be clear to those outside large companies is how frequently machine learning is found useful within chemistry, too. At the end of the day, what matters is whether the results verifiably work, rather than how we arrived there, although that may be hard to swallow. We have to make a leap of faith and remember that we organic chemists are not actually capable of understanding everything.

The famously not-so-humble world of organic chemistry is being damaged by our egos and our unwillingness to submit to the higher power of abstraction. We could make our field stronger and more useful, and as any computational chemist will tell you, black box processes need not come at the cost of overall insights.