Have AlphaFold and other machine learning techniques essentially solved the formerly fiendish problem, or is there still more to be done? Clare Sansom reports
How must it feel to find that an apparently intractable scientific problem that you have devoted years of your research career to working on has, suddenly, been largely solved? Something like that happened to some biochemists in the autumn and early winter of 2020. The problem concerned was that of predicting the structure of a protein – essentially, the overall shape of the molecule, and the position of each of its atoms in space – given only the sequence of its constituent amino acids. This breakthrough is all the more important because the gold standard of experimental structure determination is still expensive and relatively slow.
Many techniques for predicting protein structure have been developed over the decades. Since 1994, these have been pitted against each other in a regular community-wide experiment known as Critical Analysis of Techniques for Structural Prediction, or Casp. Every two years, experimental structural biologists post the sequences of proteins they are working on for modellers to predict blind; once the actual structures are solved, the models are assessed, and the results announced together at a conference. Before 2018, each round simply showed a gradual improvement in accuracy from the previous one. That year’s competition revealed that several algorithms had taken a step forward in accuracy and precision. However, it was Casp14 in 2020 that was truly remarkable; the results from one group, known as Group 427, and its AlphaFold2 program, clearly stood head and shoulders above all others.
Astonishingly, the Casp14 assessors were able to rate many of this group’s models as ‘roughly equivalent to the experimental structure’ of the proteins concerned. John Moult, an emeritus professor at the Institute for Bioscience and Biotechnology Research (IBBR) at the University of Maryland, US, and one of the co-founders of Casp, simply described the result as ‘a big deal’. And it caused quite a stir on social media with Martin Steinegger, a bioinformatician at the Seoul National University in Korea, writing on Twitter: ‘The #CASP14 results are out, and #AlphaFold2 won… Protein structure prediction might be solved.’
Group 427 had not come from a specialist research lab or a well-resourced pharma company, but from a generalist software company called DeepMind, founded in 2010 to develop and apply problem-solving artificial intelligence (AI) techniques. Its first successes were in gaming, with its AlphaGo program becoming the first computer program to beat a professional Go player. It was acquired by Google in 2014 and now focuses its considerable resources on applications that ‘advance science and benefit humanity’ by, for example, diagnosing eye conditions and researching ways to save energy in cooling data centres, as well as predicting the structures of proteins.
An imprecise history
The story of HIV protease and its inhibitors provides a classic example of how knowing protein structures benefits humanity. This protein is one of only three enzymes in the virus’s tiny genome, so it was rapidly recognised as a key target for drug development. It took 10 years from the identification of the enzyme to the licensing of the first protease inhibitor, saquinavir, in 1995: this might seem a long time, but no other proteins were ‘drugged’ so soon after characterisation until the last few years. Even then, it was not technical improvements as much as the research community’s determination to overcome Covid-19 that accelerated drug development. The 3D structure of HIV’s protease, solved independently by three crystallography groups at almost the same time, was one of the most important steps in developing the successful protease drugs for Aids, and this was recognised as the first notable success for structure-based drug discovery.
The first secondary structure prediction programs were notoriously imprecise
Attempts to predict the structure of proteins, however, go back to the very first known structures. Max Perutz, John Kendrew and their groups, whose globin structures were published in 1958, were essentially working in the dark: they had no sequences for their proteins, and no knowledge even of what a protein structure might look like. It is reliably reported that they were disappointed when the structures of myoglobin and haemoglobin were revealed: the apparently irregular bundles of coils were far removed from the elegant double-helical structure of DNA, published five years earlier, and gave no immediate clue to the oxygen transport mechanism. What these structures did prove, however, was the existence of tightly coiled regions. This protein geometry (the alpha helix) and an extended structure (the beta strand) had been proposed a few years earlier by Linus Pauling and Robert Corey. Perhaps these hypotheses should be described as the first successful structure predictions.
Every stable protein structure known contains either alpha helices, beta strands or both, and the first serious attempts to predict any aspect of a structure from its sequence focused just on predicting these. Once the first few dozen structures had been solved it was clear that some amino acids occurred much more often in helices, strands or both than others, and these different probabilities formed the basis for algorithms to predict their location. Sometimes these preferences can be deduced from the amino acids’ chemical properties. Proline is by far the clearest example of this. It differs from all other amino acids in that it is a secondary amine, with its side chain bending back to form a covalent bond with the main chain nitrogen. That atom, therefore, is not available to form one of the hydrogen bonds that give beta strands and particularly alpha helices their stability, so a proline-rich sequence cannot form a helix and is unlikely to form a strand.
The first secondary structure prediction programs were notoriously imprecise. The technique has improved over the decades, and it can give interesting and useful information, but even a 100% accurate secondary structure prediction will tell you nothing about the protein’s 3D shape. The first homology model – the structure of a protein modelled on the scaffold of a known, related one – was built by hand in 1969, swapping the side chains from the known structure of lysozyme to match the sequence of alpha-lactalbumin. This crude exercise suggested, correctly, that the substrate-binding cleft on the surface of lysozyme would be shorter in lactalbumin. Much more complex, automated versions of this process – homology modelling programs – have produced some accurate results over the decades, but, crucially, only when precise structures of evolutionarily related proteins have been available. Until the late 2010s, the only programs that stood a chance of predicting a structure from scratch, without a clear template, were computationally intensive and gave unpredictable results.
DeepMind’s first version of AlphaFold was one of a handful of programs that produced notable – but not wholly exceptional – predictions as assessed at Casp13 in 2018, but it was the revised version released as AlphaFold2 that proved the real game changer. This, like its predecessor, is an artificial intelligence deep learning program based on neural networks, which learn from patterns in the data that has built up over the years in a way that mimics the processes of the human brain.
The AlphaFold2 database is one of the most important datasets since the mapping of the Human Genome
Any neural network can only be as good as the information and data that it is trained on, and it would have been impossible to produce a program like AlphaFold2 without the freely accessible databases UniProt for protein sequences and the Protein Data Bank (PDB) for experimentally determined structures. When the PDB was set up in 1971 it was the first open-access molecular biology database; it took over 40 years to grow from seven to 100,000 structures but has added the next 100,000 in only eight.
But the apparently vast structural resource of the PDB has been put into the shade by the new AlphaFold2 database (AlphaFold DB), with – currently – the 3D coordinates of more than 200 million predicted proteins. This number is particularly extraordinary given that the database is less than two years old. It was launched in July 2021 as a collaboration between EMBL and DeepMind at the EMBL European Bioinformatics Institute (EMBL-EBI) near Cambridge, UK, with 350,000 predictions; even then, Ewan Birney, joint director of the EMBL-EBI, described it as ‘one of the most important datasets since the mapping of the Human Genome’. One of the next datasets to be added to that initial 350,000 contained structures of the proteomes of over 30 pathogens regarded as global health priorities. AlphaFold DB does have one important limitation, however: it only holds predictions of single protein chains. If you look up haemoglobin there you will only find models of its single-chain constituents, not of the tetramer that is the active molecule nor of the co-factor, haem, which binds iron and oxygen. In that sense, this prediction is less biologically realistic than Perutz’ 65-year-old original structure.
No prediction can be useful without some information about how likely it is to be accurate. Each AlphaFold2 result, and the corresponding entry in the database, includes two important measures of confidence: one local – assessing how likely the structure of each residue is to be correct – and one more global, assessing how well parts of the protein are modelled compared to each other. These show that the method is much better at predicting structures of the compact protein regions known as domains than at suggesting how these domains fold together. Furthermore, many proteins include stretches where confidence scores are so poor that no structure can be recorded. This is not necessarily a problem, however, as Sameer Velankar of the EMBL-EBI explains: ‘Often parts of a protein are predicted with very low confidence and appear in the 3D views of the prediction as “spaghetti-like” tangles; these are best considered as predictions of disorder rather than actual structure predictions.’ This might be considered a negative result, but it is still important, as disorder is widespread: for example, some 30–40% of the human proteome was predicted to be intrinsically disordered. This has until now been poorly predicted, and the functions of many proteins are known to depend on the presence of disordered regions.
While AlphaFold2 is undoubtedly a game-changing advance on previous methods, it is nowhere near perfect, and its structures cannot be considered as answers in themselves. Tom Terwilliger, a structural biologist at Los Alamos National Laboratory, New Mexico, USA, has described its results as ‘always good and useful, but not always precise… it is best to think of it as a method of generating new hypotheses for testing, often with experimental data’.
Predicting the structures of protein complexes is an even harder problem
This combination of AlphaFold2 with experimental structural biology, and, indeed, the method’s relevance to global health research, can be illustrated by the work of Matthew Higgins, a professor of biochemistry at the University of Oxford, UK, in vaccine design for malaria. He is using a combination of AlphaFold2 and cryo-electron microscopy – by far the best experimental method for visualising the structures of large complexes – to build the detailed structure of a complex protein on the surface of the malaria gamete known as Pfs48/45, which has been identified as an ideal vaccine target. This prediction, with a detailed model of each part of the protein built into an electron micrograph of the complex, has alone moved the vaccine design project into the preclinical development stage. Higgins clearly believes that the technique can be of particular importance in infectious disease research: a recent review of his in the Journal of Molecular Biology was intriguingly titled ‘Can we AlphaFold our way out of the next pandemic?’
The conference that ended the most recent two-year Casp cycle, and the first to assess AlphaFold2 in regular use, was held in Antalya, Turkey, in December 2022. Jianlin (Jack) Cheng, professor of artificial intelligence and bioinformatics at the University of Missouri, US, submitted predictions, principally using AI-based methods. He reported an almost complete dominance of this approach. ‘Almost all of the 100 or so groups that submitted predictions used AlphaFold2, although they generally tweaked or extended their predictions with other software.’ He was able to conclude that predicting the structures of single, compact protein domains – even those that are embedded in cell membranes, such as G-protein coupled receptors – ‘can now be considered essentially solved’.
But trickier challenges remain. One of the most important is modelling the way that protein chains fold together to form a functioning complex, often combined with nucleic acids and sometimes with other molecules. ‘The AlphaFold2 software is actually much better than other tools at predicting the structures of complexes, but this is an even harder problem, and its accuracy there is still not very high,’ adds Cheng.
AlphaFold2 may be the first AI-based method to solve an important part of the structure prediction problem, but it is no longer the only one. In 2021, David Baker’s team at the Institute for Protein Design at the University of Washington in Seattle, US, published RoseTTAFold, which uses similar deep learning principles to DeepMind’s program and has now produced similarly impressive results.
With AI, we can learn to read the fundamental language of biology
And a different AI-based approach to structure prediction is now also working well. This method, developed by scientists in the Meta group (which includes Facebook) uses a language model trained with billions of parameters to fill in the blanks of protein sequences with amino acids, much as predictive text does with characters. It does not need to generate an alignment of evolutionarily related sequences and missing out this slow step makes the program much faster to run than the other, multiple sequence alignment based, deep learning methods. Its exceptional speed allows the team to characterise the structures of all proteins found in the mixed genetic material known as a metagenome that can be obtained directly from the environment or from clinical samples. ‘These metagenomic proteins are the least understood proteins on earth,’ explains Tom Sercu, research engineering manager at Meta. ‘Many are unusual bacterial proteins with few evolutionary relatives. Knowing their structures could have environmental as well as medical benefits; we might discover or design proteins that can degrade plastics or sequester carbon.’ Meta’s ESM Metagenomic Atlas now contains representative predicted structures from another resource held at the EMBL-EBI: a microbiome database MGnify.
So, with the structures of all protein domains now ‘essentially solved’, perhaps even using multiple methods, what is the future of protein structure prediction? Will there even be a Casp16 in 2024? ‘Yes, certainly, but it will be asking different questions,’ answers Cheng. One of the most important of these questions, certainly for drug discovery, is predicting the structures of proteins complexed with small-molecule ligands. ‘Ideally we need a system where we can take a protein structure from the PDB or AlphaFold DB, and the Smiles string of a potential ligand, and output a precise 3D structure of the complex,’ he says. ‘We are still a long way short of this, but methods that use machine learning – the very broad type of artificial intelligence used in AlphaFold2 – are beginning to produce promising results.’
Sercu is looking further into the future of artificial intelligence in the life sciences. ‘Concise mathematical equations haven’t been a good language to accurately describe the complexity of biology. But with AI, we can learn to read the fundamental language of biology, to describe observations and make predictions,’ he concludes.
Clare Sansom is a science writer based in Cambridge, UK