Around half of all human proteins are a mystery. What do they look like, asks Phil Ball

Science is full of darkness. When you start to appreciate how much we don’t know, it’s amazing that what we do know works so well. Despite much head-scratching and many elaborate experiments, we’re still little the wiser about the dark matter that we believe outweighs visible matter in the universe by a factor of five, let alone the dark energy that comprises more than two-thirds of the energy density of the cosmos. Then there’s the ‘dark matter of the genome’: the 99% of our DNA that we don’t understand.

Now a light has been shone into another of science’s dark recesses – and it’s just bright enough to show how huge a hole it is. According to a study published last year by Sean O’Donoghue, a bioinformatics expert at the Garvan Institute of Medical Research in Sydney, Australia, and his collaborators, around half of the protein sequences known in ‘advanced’ eukaryotic organisms like us – including about one in every six entire protein molecules – are a mystery. This is the dark proteome.

‘A dark protein is one that we know nothing about,’ says Michael Levitt of Stanford University in California, a 2013 chemistry Nobel laureate for his work on protein structure. This, he explains, doesn’t just mean that we don’t know what the protein looks like or what it does, but that ‘we do not recognise it at all as being similar to a protein that we do know something about’. O’Donoghue’s team has tried to evaluate what, nevertheless, we can say about these dark proteins. The first thing is simply this: there are an awful lot of them.

‘Our work doesn’t solve questions, but instead it opens up a new field,’ says O’Donoghue. ‘Every time I discuss it with others, people ask “How does that affect my work?”’ This murky part of our biochemistry, he says, is ‘the proverbial Pandora’s Box’.

Opening Pandora’s Box

Biological darkness is nothing new. It’s been long known that much of our genome has an unknown function. Only one percent of our DNA – containing about 20,000 genes – codes for proteins. About the rest we knew very little until recently, and it was often dismissed as ‘junk’: the accumulated detritus of billions of years of evolution, such as old genes that fell into disuse, parasitic DNA inserted by viruses and other pathogens, and stretches of base pairs that do nothing but replicate and multiply for the sake of their own survival.

But some of the ‘non-coding’ DNA is clearly vital. It encodes biologically relevant information, although that doesn’t get translated into proteins. Instead it is transcribed only into RNA molecules, which regulate the activity of protein-coding genes. Until recently, the common view was that while this regulatory RNA is an active and important product of the genome, it comes from only a small fraction of the non-coding DNA, and that the rest plays no significant role in the cell. It is simply copied and passed on ‘mindlessly’ when chromosomes are replicated.

Image: DNA autoradiogram and human figure, conceptual artwork. Source: © Mike Miller / Science Photo Library

So it was a shock when, in 2012, an international project called Encode found that the majority – as much as 80% – of our DNA gets transcribed into RNA. Why would cells expend the energy needed to do that, unless there was some good reason? The Encode team proposed that perhaps pretty much all of this transcribed genetic material is functional, even if we don’t know what purpose it serves.

The claim caused controversy, some of it bitter. Some biologists said that the Encode team had provided no meaningful definition of ‘function’, and that perhaps it was just simpler for the cellular machinery to transcribe DNA indiscriminately rather than to have elaborate ways of indicating what needed to be turned into RNA and what did not.

Regardless of the answer, the discovery alerted researchers to the fact that there was more going on than they’d anticipated, and that it might be wise to regard much of the genome not as useless junk but simply as ‘dark’, with unknown properties or purpose.

The discovery added to a growing perception that the traditional model of genetics needs some reconsideration. In that picture, the body’s chemistry is orchestrated by proteins encoded in genes and translated via RNA in a one-way flow of information. But while we’ve known since the 1960s that genes are regulated by other genes in a dynamic network of interactions, the discovery via the Human Genome Project that we possess only 20,000 or so protein-coding genes – far fewer than had been expected – suggested that a great deal of genetic function and activity must derive from a complex network of gene interactions rather than from simple manufacture of proteins.

All the same, proteins themselves seemed to be the relatively easy part to understand. They are chains of amino acids joined in specific gene-encoded sequences, sometimes crosslinked by covalent bonds, which fold up into specific shapes owing to the interactions between amino-acid residues and with the surrounding solvent. That shape then determines the biological function, which is typically to act as a highly selective catalyst for some biochemical reaction.

So while DNA and RNA might be involved in a complicated and still rather mysterious dance, at least we know about proteins, right?

Not really. For one thing, it turns out that there is typically no simple correspondence between a coding gene and the proteins produced from it. ‘Each gene is made into many protein variants,’ says O’Donoghue.

Partly this happens because DNA coding sequences aren’t uninterrupted strings of code. The sequences that get translated into protein, called exons, are interspersed with sequences called introns, which get transcribed into the corresponding RNA molecules but are then snipped out before the RNA is translated into protein by the ribosome, the cell’s protein-making machinery. The exons are spliced back together, but not always in the same way.
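
As a rough illustration of how one gene can yield several transcripts, the sketch below enumerates variants produced by simple exon skipping: the first and last exons are always kept, and any subset of the exons in between may be dropped. This is only one mode of alternative splicing and a deliberate simplification; the function name `splice_variants` and the exon sequences are hypothetical.

```python
from itertools import combinations


def splice_variants(exons):
    """Enumerate transcripts that keep the first and last exon and any
    order-preserving subset of the exons in between (simple exon skipping)."""
    first, *middle, last = exons
    variants = []
    for r in range(len(middle) + 1):
        for kept in combinations(middle, r):
            variants.append("".join([first, *kept, last]))
    return variants


# Hypothetical exon sequences, for illustration only.
exons = ["AUG", "GCU", "CCA", "UAA"]
variants = splice_variants(exons)
```

With n skippable internal exons this toy model gives 2^n variants, so even a handful of exons is enough to produce many transcripts from a single gene.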

‘The mean number of splice variants of a protein is around 20–40,’ O’Donoghue explains. ‘Why all that redundancy? We don’t really know how functional all those variants are.’ What’s more, many proteins are modified in some way after being translated, for example by the addition of non-peptide chemical groups. So cells make many more proteins than they have coding genes.

And we’ve not by any means figured out what all those proteins look like. Just 4% of all human proteins have actually been crystallised so that their structures can be deduced by diffraction experiments. Partly this is because not all proteins can be crystallised. Making protein crystals is itself something of a ‘dark art’, and many refuse to play ball. Some proteins are insoluble and so are not amenable to crystal-growing techniques, often because they perform their biological roles embedded in membranes and so have water-repelling surfaces that make them compatible with the fatty interior of those membranes.

Other proteins won’t pack together neatly in crystals because their intrinsic molecular structure is itself disorderly. The common idea that proteins are either folded into compact and rather fixed ‘globular’ shapes (enzymes) or strung out into fibrous forms (structural proteins such as keratin and silk) has been challenged in recent years by the discovery that many are intrinsically disordered, having rather loose and mutable shapes. You can no more crystallise them than you can stack together rubber bands.

‘Intrinsically disordered proteins have refined and expanded the view of how protein structure and function are related,’ says bioinformaticist Mark Gerstein of Yale University, a principal researcher on the Encode project. ‘These proteins often assume a defined structure only upon binding various interaction partners.’

Crystallography isn’t the only way to work out protein structures. Some have had their shapes deduced by other techniques, such as electron microscopy and NMR spectroscopy. The structures of many others can be deduced indirectly by computer modelling, comparing their amino-acid sequences with those of proteins with known shapes. Thanks to such modelling, a little over half of all human proteins have structures that are more or less known.

But that leaves almost half that are not: the dark proteins. What are they like?

Exploring the dark

How do you spot something that you can’t see? By eliminating the things that you can. O’Donoghue and colleagues have recently developed a web-based public-access tool called Aquaria that is able to take any protein sequence and look for homologies – matches or near-matches – with sequences of proteins in the Protein Data Bank (PDB), the global repository for protein structures. They used it to scan all 546,000 or so entries in a protein-sequence database called Swiss-Prot, which is manually curated to ensure a high likelihood that its sequences correspond to real proteins. If Aquaria failed to find such a match with the PDB, the Swiss-Prot sequence was classified as dark.
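
The logic of classifying by elimination can be sketched in a few lines. The example below is not Aquaria’s method: it swaps in a crude shared-fragment comparison for a real homology search, and the names `has_match` and `classify`, the threshold and the sequences are all hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all length-k fragments of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def has_match(query, known_seqs, k=3, min_shared=0.5):
    """Toy stand-in for a homology search: True if the query shares at least
    min_shared of its fragments with any sequence of known structure."""
    q = kmers(query, k)
    if not q:
        return False
    return any(len(q & kmers(s, k)) / len(q) >= min_shared for s in known_seqs)


def classify(sequences, known_seqs):
    """Label each named sequence 'dark' if nothing in known_seqs matches it."""
    return {name: ("matched" if has_match(seq, known_seqs) else "dark")
            for name, seq in sequences.items()}


# Hypothetical toy data, not real Swiss-Prot or PDB entries.
pdb_like = ["MKTAYIAKQRQISFVK", "GSHMLEDPVAGTW"]
swissprot_like = {
    "P_known": "MKTAYIAKQRQISFVR",  # near-match to the first entry
    "P_novel": "WWCCHHNNPPDDEEYY",  # resembles nothing in pdb_like
}
labels = classify(swissprot_like, pdb_like)
```

The point of the sketch is that ‘dark’ is defined negatively: a sequence earns the label only because every comparison against the known structures fails.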

Image: DNA autoradiogram. Source: © Tek Image / Science Photo Library

By this definition, ‘darkness’ in a protein can occur at any scale. We can be ignorant about the structure of small parts of it, or most, or all of it. Many proteins have a small bit of darkness in their sequences, but about 15% of the Swiss-Prot database corresponds to wholly dark proteins. The distribution isn’t smooth: darkness in proteins tends to come in small patches or to envelop the whole molecule, with rather few having intermediate levels of darkness.

For eukaryotes, about 44% of the Swiss-Prot database corresponds to dark regions, and a further 52% is ‘grey’, having only approximate matches to the PDB. Only 4% is ‘light’, giving an exact match. The dark fraction is much smaller for bacteria and archaea, at only 14% or so, but for viruses it is comparable to that of eukaryotes.
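
To make the bookkeeping concrete, here is a minimal sketch assuming each residue in a protein has already been labelled dark, grey or light. It computes a per-protein darkness score and the pooled share of each category across a tiny made-up database. The label strings, protein names and function names are hypothetical, and the numbers have nothing to do with the real Swiss-Prot figures.

```python
from collections import Counter


def darkness(labels):
    """Fraction of residues labelled 'D' (dark) in one protein."""
    return labels.count("D") / len(labels)


def database_shares(proteins):
    """Share of all residues in each category, pooled over every protein."""
    counts = Counter("".join(proteins.values()))
    total = sum(counts.values())
    return {cat: counts[cat] / total for cat in "DGL"}


# Hypothetical per-residue labels: D = dark, G = grey, L = light.
proteins = {
    "mostly_known": "GGGGGGGGDD",  # a small dark patch
    "wholly_dark": "DDDDDDDDDD",   # no match anywhere
    "exact_match": "LLLLLGGGGG",   # no dark residues at all
}
scores = {name: darkness(lab) for name, lab in proteins.items()}
wholly_dark = [name for name, s in scores.items() if s == 1.0]
shares = database_shares(proteins)
```

Here `scores` gives the darkness of each protein, `wholly_dark` picks out the entirely dark ones, and `shares` is the database-wide split into dark, grey and light residues.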

Small patches of unknown structure in a protein are no big deal. It’s the entirely dark proteins that are the biggest puzzle. Why haven’t their structures been deduced yet – and what might they be like?

Perhaps structure determination has been hampered by disorder? Indeed, the researchers find that about a third of all the dark proteins are disordered to some degree. This already implies that we have a rather biased view of what proteins are from the selection effect involved in structure determination: we have tended to study those with a high degree of structure purely because those are the ones we can study. The popular image of protein enzymes as intricately sculpted molecular machines may, then, be misleading. Some are like this, but others are floppy, soft machines – if they’re even ‘machines’ at all.

Perhaps there are more membrane proteins than we’d thought, which are always hard to study? Well, not so much. Only about 10% of the dark proteome has the characteristic features of membrane proteins.

Intrinsically disordered proteins and hard-to-crystallise membrane proteins are part of the ‘known unknowns’ of the proteome. But what about the rest – the unknown unknowns, which constitute the majority of dark proteins? O’Donoghue and colleagues deduce that, contrary to common belief, they are orderly and globular. ‘As most “light” [known] proteins are structured, it has been supposed that dark proteins have not been recognised because they are disordered,’ says Levitt. But on the contrary, he says, the work of O’Donoghue and colleagues shows that ‘dark proteins have sequences that are just like normal light proteins’.

Despite being globular, they don’t look like any other proteins we know. That in itself is not surprising: if they did, then we’d probably be able to work out the structure already, so they wouldn’t be dark. All the same, it suggests that we’ve so far seen only a rather limited sample of the kinds of structures proteins can adopt. All those textbook examples of alpha-helix barrels, beta-sheets and so forth may not be what proteins look like so much as ‘what proteins we’ve looked at look like’. ‘There could be a systematic repertoire of protein structures that we just don’t know about,’ says bioinformatics specialist Andrea Schafferhans of the Technical University of Munich in Germany, who collaborated with O’Donoghue on the work.

What, though, do dark proteins do? Some clues can be gleaned by looking at where these proteins reside. They’re not evenly distributed throughout the different parts of the cell, but are more common in some places than others. There are relatively few in the cell fluid or cytoplasm, where many globular enzymes do their business. But a high proportion seems destined for life outside the cell. Many are associated with secretion glands or with the extracellular spaces in tissues, suggesting that some might be defensive agents against external threats such as bacteria. Many of these dark proteins are also chemically modified in some way after they have been translated.

Gerstein thinks that these features might be precisely what make this group of dark proteins hard to study. ‘Perhaps the post-translational modifications, and the likelihood that many of these proteins are optimised for extracellular environments, makes them recalcitrant to the protein expression systems or crystallisation techniques necessary for protein structure determination,’ he says.

On the other hand, the dark proteins’ role outside the cell might simply have made them less attractive for structural studies. ‘I could imagine that, because the excreted proteins interact more with organisms outside the cell, they might have been less a focus of attention for structure determination,’ says Schafferhans – there could be a tendency to look at the cell’s ‘housekeeping’ proteins first. But any explanation of where they reside is just speculation so far, she cautions.

Light in the evolutionary tunnel

All this is particularly intriguing from an evolutionary point of view. Proteins, just like organisms, are shaped by evolution. They tend to be good solutions to evolutionary problems: to have shapes into which the amino-acid chains can fold dependably, efficiently and stably, and which (in the case of enzymes) can bind and transform their targets. Proteins, like organisms, evolve from others with different properties and structures – often, a protein with one function might evolve via small, random mutations into one with a quite different function but a similar structure. Proteins, like organisms, can be arranged on evolutionary trees.

But dark proteins don’t seem to be part of the family. ‘Many are evolutionary orphans,’ says O’Donoghue – they seem not to have any obvious relationships with other proteins. This is bound to be true by definition: the proteins are ‘dark’ precisely because they can’t be matched to known homologues. If you go fishing for stuff you don’t recognise, of course what you hook will look different. One of the challenges, then, is to figure out how much the dark proteome as currently conceived is biased by its mode of construction.

All the same, it seems clear that many dark proteins are not descended from some ancient lineage, but are newcomers. This suggests that they are evolutionary experiments. ‘The dark proteome could be an evolutionary playground for trying out new folds,’ says O’Donoghue.

That might demand a shift in thinking about what evolution does at the cellular and molecular levels. Perhaps it is far more dynamic and inventive than we give it credit for, actively churning out new molecular variants rather than relying on the gradual drift of passive genetic mutation. Remember that the dark proteome is not, on the whole, a result of unknown genes: these proteins are made by a reshuffling of the information in genetic sequences. You could say that there is more going on in evolutionary terms than you’d imagine from an inspection of the genome. ‘Protein reshuffling seems to be going on at a higher rate than we thought,’ says O’Donoghue.

Ultimately one would expect particularly useful variations to get fixed at the genetic level. But it needn’t be where that variation begins. What’s more, organisms needn’t be quite so dependent for their molecular repertoire on their evolutionary heritage. O’Donoghue thinks that all organisms probably have a significant fraction of proteins unique just to them.

‘The fact that the dark matter of the proteome has less evolutionary constraint than the other bits of proteome may suggest that it’s under less selection,’ says Gerstein. ‘This is perhaps because it’s more flexible structurally, but also in a sense more flexible in terms of accommodating various amino-acid changes compared to the structurally inflexible and fixed parts of the crystallised proteome.’ This adds momentum to the picture of genomics as a rather more fluid affair than is suggested by the old picture of identical proteins being mass-produced from a fixed genetic template.

Gerstein feels that studying the dark proteome opens up a host of interesting questions. For example, although known bacteria have a smaller dark proteome than eukaryotes, there’s a huge ‘dark microbiome’ of unculturable bacteria. Might that be more full of dark proteins – perhaps useful ones?

And what about us? ‘How does the human dark proteome compare to that of eukaryotes as a whole?’ Gerstein wonders. How well, really, do we know ourselves?

Philip Ball is a science writer based in London, UK