A linguistic approach could revolutionise the analysis and annotation of complex proteome data, an Italian protein expert has argued.

The need for methods to make sense of the data flooding in from genomes and related protein sequences is more urgent than ever, says Mario Gimona, a molecular biologist at the Consorzio Mario Negri Sud. Linguistic analysis could replace the ambiguous protein annotation systems currently on offer.

The analogy between proteins and language starts out simply enough: both protein sequences and sentences are strings of letters, which represent the 20 amino acids in the first case and various sounds in the second. Linguistic analysis of genes has been performed for decades, with similar considerations for proteins following later [1].

The challenge is to work out how to move up to the next levels of complexity - to read the words, phrases, and sentences of the protein language.

‘This problem appears closely related to the as yet unresolved "structure prediction problem" of protein folding,’ said folding expert Kevin Plaxco from the University of California at Santa Barbara, US. ‘Ideally, we would want to be able to read any string of amino acids and understand its "meaning" as easily as we read an English sentence.’

For this to succeed, one would need to understand the ‘grammar’ of protein structure and folding.

Gimona suggests that protein modules hold a key position for the understanding of protein grammar [2]. These modules are the independently folding domains that can serve as intermediate-scale design elements in larger proteins, and which have often been duplicated and re-used by evolution.

‘Domains and modules represent the syntactic and semantic units in a protein,’ he says. He is optimistic about the application of linguistic methods in protein science: ‘When the smoke has cleared, we all might have become molecular linguists!’

Michael Gross