As computational chemistry's footprint expands, Clare Sansom considers the technical challenges that remain

The discipline of molecular modelling or simulation, now an integral part of organic and pharmaceutical chemistry, is just under half a century old. The first computer programs for making detailed calculations of molecular structures were written in the early 1960s. The first graphical programs - capable of manipulating molecular structure on the screen - appeared a decade or so after that and were painfully slow. 

Computer modelling can both support and replace experimental work

While completing a PhD in biophysics in the late 1980s, I can remember having time to make coffee between setting up a display of a complex molecule and viewing the results! The highly complex simulations of molecular structure and movement that are possible today would have seemed just as far off to the graduate student of a generation ago as the vast, connected network of computers that is used to generate them. Yet even today’s longest calculations, taking weeks or months of computer processing time on powerful computers, cannot simulate many biological processes accurately, and enormous challenges still remain, particularly in data interpretation. 

An approximate approximation  

Many molecular simulations are still based on Newton’s laws of motion. In this molecular mechanics approach - which is so approximate that it is surprising that it produces good results at all - each atom is modelled as a simple point mass, and the covalent bonds between atoms as springs of a given length and flexibility. The approach depends on the availability of accurate descriptions of the geometry of ’real’ molecules, which are used to derive average values for parameters such as the sizes of the masses and the lengths and stiffnesses of the springs. A consistent set of parameters for selected atom types is termed a force field.
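
In code, the heart of this ball-and-spring picture is a single harmonic energy term per bond. The Python sketch below shows an Amber-style bond-stretch term, E = k(r - r0)²; the parameter values are illustrative stand-ins, not taken from any published force field.

```python
import math

# Illustrative molecular-mechanics bond term: a covalent bond modelled as a
# spring with equilibrium length r0 and stiffness k. Values below are
# illustrative, roughly in the range used for a C-C single bond.
K_BOND = 310.0   # kcal/mol/A^2, illustrative spring stiffness
R0 = 1.526       # A, illustrative equilibrium bond length

def bond_energy(xyz_a, xyz_b, k=K_BOND, r0=R0):
    """Harmonic bond-stretch energy E = k * (r - r0)**2 (Amber-style form)."""
    r = math.dist(xyz_a, xyz_b)  # distance between the two point masses
    return k * (r - r0) ** 2

# Stretching the bond 0.1 A beyond equilibrium costs k * 0.01 energy units.
print(bond_energy((0.0, 0.0, 0.0), (1.626, 0.0, 0.0)))
```

Summing terms like this over every bond, angle, torsion and non-bonded pair gives the total force-field energy of a molecule.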

Source: © PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES USA

Kollman’s microsecond snapshots of a folding villin headpiece

Much simulation work sits within the disciplines of biosciences and materials sciences, and is focused mainly on large molecules such as polymers or proteins. However, by far the most precise and accurate experimentally observed molecular geometries are obtained through the x-ray crystallography of small organic or organometallic molecules. The principal repository for these structures is the Cambridge Structural Database (CSD), maintained by the Cambridge Crystallographic Data Centre (CCDC) in Cambridge, UK. 
This database has just reached an important milestone: the deposition of its 500 000th structure. ’The CSD contains every published atomic structure of a carbon-containing compound, apart from macromolecules and peptides containing more than 24 amino acids,’ says the CCDC’s executive director, Colin Groom. ’It has taken the database just over 40 years to grow to half a million structures, but its structure records go back much longer than that: the first atomic coordinates date from 1936.’ The measured bond lengths, bond angles and torsion angles of this enormous array of molecules are used to derive and test the parameters for force fields used in simulation experiments.  

Touching up crystal structures 

Parameters derived from small molecule crystal structures are also used in the computer programs used to refine crystal structures of macromolecules that have been measured experimentally. These programs have been optimised over many years for the types of atom and bond found in proteins and DNA, so these molecules are always modelled very precisely.  

However, the programs are less well optimised for non-protein molecules such as enzyme inhibitors, and as a result there are many errors in the structures of non-peptidic ’small molecule’ ligands stored as parts of protein complexes in the Protein Data Bank (the principal repository for structural information on large biological molecules). There is now a concerted effort by both the Protein Data Bank and the CCDC to correct these. ’When you see a conformation of a ligand bound to a protein in the Protein Data Bank that doesn’t fit with the conformation of that structure in isolation, as observed in the CSD, then either there is a mechanistic reason for it, or there is an error in the structure,’ says Jason Cole, development group manager at the CCDC.

Two software suites with associated force fields, Amber and Charmm, were first produced in the 1980s but are still being developed and are very widely used today. Amber is an acronym for Assisted Model Building with Energy Refinement, and amusingly lends itself readily to puns such as ’the bugs in Amber have magically disappeared’. The program started out in the lab of the late Peter Kollman, a pioneer in molecular dynamics and computational chemistry based at the University of California, San Francisco, US. ’The developers deliberately kept Amber quite difficult to use, encouraging users to do some of their own coding. They wanted, and still want, to discourage users treating the program as a black box,’ says Ross Walker, an Amber developer at San Diego Supercomputer Center in the US. 

Jiggling and wiggling

The distinguished US quantum physicist Richard Feynman once said that ’everything that living things do can be understood in terms of the jigglings and wigglings of atoms.’ The name given to the computer simulation of molecular motion - Feynman’s jigglings and wigglings - is molecular dynamics. This type of calculation can reveal far more about the mechanism of a molecular system, even the simplest, than mere minimum energy calculations, but it requires a lot of computer power.
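
At its core, a molecular dynamics code simply integrates Newton’s equations forward in tiny time steps. The sketch below uses the common velocity Verlet scheme on a single particle attached to a spring; the units, parameters and force function are illustrative, not drawn from any real simulation package.

```python
# Illustrative molecular dynamics loop: velocity Verlet integration of
# Newton's equations for one particle on a harmonic spring (all units
# dimensionless and illustrative).

def force(x, k=1.0):
    return -k * x  # Hooke's-law restoring force

def velocity_verlet(x, v, dt, m=1.0, steps=1000):
    """Advance position and velocity one small time step at a time."""
    f = force(x)
    for _ in range(steps):
        x += v * dt + 0.5 * (f / m) * dt * dt   # update position
        f_new = force(x)                         # force at the new position
        v += 0.5 * (f + f_new) / m * dt          # update velocity
        f = f_new
    return x, v

# Total energy should stay (very nearly) constant along the trajectory.
x0, v0 = 1.0, 0.0
x, v = velocity_verlet(x0, v0, dt=0.01)
e0 = 0.5 * v0**2 + 0.5 * x0**2
e = 0.5 * v**2 + 0.5 * x**2
print(abs(e - e0))  # small drift, a basic sanity check for any MD integrator
```

Because every step depends on the one before, the loop itself cannot be run out of order - a point that matters later when parallel computing enters the story.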

The first published molecular dynamics simulations of proteins followed the motion of small proteins (or large peptides) for typically less than 10 picoseconds (10⁻¹¹ s), which is too short a timescale for any biologically significant processes to be observed. Peter Kollman’s publication in 1998 of a microsecond-scale simulation of the head-like part of the actin-binding protein villin represented a significant breakthrough. ’This work was groundbreaking in its time, and used quite a substantial portion of the US’s available academic supercomputing time,’ says Jonathan Hirst, an expert in molecular dynamics at the University of Nottingham, UK, who follows the discipline closely as editor of the Journal of Molecular Graphics and Modelling. ’Now, microsecond simulations are almost routine. But many molecular motions and mechanisms are too slow for any realistic results to arise from simulations at even this timescale. Faster number crunching will not be enough on its own; we need to develop more sophisticated simulation strategies.’

Reach for the sky 

Researchers are becoming more ambitious in designing simulation experiments, so, even with more sophisticated simulation strategies available, they are demanding more and more computer power. With processor clock speeds reaching an apparent ceiling at about 3 GHz, the best way to supply it is to add more cores and more processors - that is, to parallelise. It is now feasible to build a cluster of over 10 000 processors, at least if you have a few million dollars to spend. Few researchers can access such resources, however. ’One of the biggest spenders in the field outside the pharmaceutical giants is David Shaw, a retired hedge fund billionaire who set up a research company based in New York, and reportedly spends millions of dollars a year on developing custom hardware for molecular dynamics,’ says Walker. ’This is far beyond the budget of any university.’

Finance is not the only constraint on the growth of parallel computing. In molecular dynamics, in particular, it is impossible to increase the timescale simply by parallelising, as each time point has to be simulated in order. Coding becomes more complex as well. ’As supercomputers get more and more parallel, it gets harder and harder to develop code to run on them. Each new generation of parallel machines needs an order of magnitude more money spent on software development than the previous one,’ says Walker. 
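
Amdahl’s law gives a rough feel for this limit: if some fraction of the work in each time step is irreducibly serial, adding processors yields rapidly diminishing returns. The calculation below is a back-of-envelope illustration; the 5% serial fraction is an assumption, not a figure measured for any real molecular dynamics code.

```python
# Amdahl's law: if a fraction s of the work cannot be parallelised, the
# speedup on p processors is capped at 1 / (s + (1 - s) / p), no matter
# how large p grows.

def amdahl_speedup(serial_fraction, processors):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# Even with 10 000 processors, an assumed 5% serial fraction caps the
# speedup below 20x.
for p in (10, 100, 10_000):
    print(p, round(amdahl_speedup(0.05, p), 2))
```

For molecular dynamics the serial fraction includes the time-ordering constraint itself, which is why faster hardware alone cannot stretch simulations to biological timescales.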

Cinderella carbohydrates  

Proteins and nucleic acids are by far the most widely simulated macromolecules. Carbohydrates and glycoconjugates - carbohydrates bound to other large molecules, such as proteins or lipids - remain to some extent neglected, Cinderella molecules, yet they are key to many biological processes and simulating them can lead to important biological insight. Understanding protein-carbohydrate interactions is necessary, for example, to understand the mechanism of action of the only small-molecule drugs of any real use against influenza, Relenza (zanamivir) and Tamiflu (oseltamivir). These are inhibitors of influenza virus neuraminidase, the enzyme that cleaves a simple sugar (sialic acid) from the surface of glycoproteins, releasing budding influenza viruses from host cells.

Source: © ROCHE

Studying the mechanism of action of Tamiflu

Very recently, Dong Xu, Wilfred Li and co-workers at the University of California, San Diego, US, used molecular dynamics to show that changes in glycan patterns help to drive the changes in receptor binding of another influenza virus protein, haemagglutinin, that accompany shifts in host species specificity. ’This work can be extended to complement glycan microarray studies and play an important role in the surveillance and prevention of future cross-species influenza pandemics,’ says Li.

Carbohydrates present some particular difficulties for the modelling and simulation community. ’They are very flexible and there are hydroxyl groups everywhere, interacting through multiple hydrogen bonds,’ says Goran Widmalm, a carbohydrate chemist from Stockholm University, Sweden. Specialist force fields such as the Carbohydrate Solution Force Field have been developed to model these multiple hydrophilic interactions more closely. Adrian Mulholland, a computational chemist at the University of Bristol, UK, contrasts the highly branched carbohydrates with linear proteins. ’Carbohydrates can be linked together in many different ways, and even a single monosaccharide can exist in a variety of chair and boat forms; any carbohydrate chain has many more possible conformations than a protein chain of the same length. There is still some debate about how a simple sugar’s conformation can change when it binds to an enzyme active site.’
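
A back-of-envelope count shows why. Each glycosidic bond between two hexose sugars can differ in anomeric configuration (alpha or beta) and in attachment position, so the number of distinct linkage isomers multiplies with chain length. The figures below are illustrative only, ignoring branching and ring puckering entirely:

```python
# Illustrative combinatorics of glycosidic linkages (not a rigorous count).
ANOMERS = 2      # alpha or beta configuration at the anomeric carbon
POSITIONS = 4    # common attachment positions on a hexopyranose (2, 3, 4, 6)

def linear_linkage_isomers(n_residues):
    """Linkage isomers of an unbranched chain of n identical hexose units."""
    n_bonds = n_residues - 1
    return (ANOMERS * POSITIONS) ** n_bonds

# By contrast, the peptide bond has a single chemistry, so a protein chain
# of any length has just one backbone connectivity.
for n in (2, 4, 6):
    print(n, linear_linkage_isomers(n))
```

Even before conformational flexibility is considered, a hexasaccharide already has tens of thousands of possible connectivities where a hexapeptide has one.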

With carbohydrates and glycoconjugates showing so much flexibility and structural variability, obtaining precise, accurate crystal structures of them is both necessary and difficult. Max Crispin, a crystallographer from the University of Oxford, UK, explains that in glycoproteins the covalently bound, flexible carbohydrates lie away from the protein surface and are not always visible in x-ray crystal structures. ’The carbohydrate components of glycoprotein models in the Protein Data Bank have generally received less attention than the protein. This may have led to structural errors when carbohydrates were built into the model, sometimes as severe as building the wrong monosaccharide or including a carbohydrate structure that is never found in nature.’ Now, however, because of improvements in expression and crystallisation technology it is often possible to see accurate conformations of quite large glycans of maybe 8-15 residues bound to the protein surface, and there is some structural consistency between different instances of the same glycan. ’Overall, glycans attached to proteins appear to be less floppy than we had thought,’ explains Crispin. 

Correcting the textbook 

Mulholland has applied a combination of quantum mechanics and molecular dynamics to the interaction between lysozyme and its oligosaccharide substrate, and suggests that the classic mechanism of action of this enzyme via an oxocarbenium ion intermediate, as presented in detail in textbooks, is incorrect. ’It seems that the lysozyme reaction proceeds via a covalent intermediate after all. But these calculations can be very demanding. For some of our studies, modelling larger enzymes such as influenza virus neuraminidase, we use the University of Bristol’s new supercomputer BlueCrystal, which is capable of 37 trillion operations per second.’ 

Source: © UNIVERSITY OF BRISTOL

BlueCrystal - the University of Bristol’s supercomputer - is capable of 37 trillion operations per second

Molecular dynamics simulations of glycans and the enzymes that process them can have practical applications in biotechnology and medicine. Walker, in collaboration with the US’s National Renewable Energy Laboratory, is using the technique to study the enzyme cellulase, which breaks down the major plant polysaccharide, cellulose. This is the most abundant organic compound on Earth, and harnessing the enzyme that digests it could be an important step towards an abundant source of biofuel. ’Cellulase is one of the slowest enzymes on the planet, and we need to understand its mechanism of action fully before we can derive mutants that are commercially viable,’ he says.  

Pamela Greenwell, from the University of Westminster, London, UK, is also using docking and molecular dynamics to investigate enzyme mechanisms of action. Her studies of glycosidase enzymes in protozoa may have applications to the design of drugs against some of the world’s most neglected diseases. 

Slow interpretation 

In common with many other theoretical techniques, however, molecular dynamics has a steep learning curve. Hans Heindl, a physician and part-time theoretical chemist who works with Greenwell, believes that computers are now so fast that the main bottleneck in molecular dynamics is in understanding what the results mean biologically: ’You wait a year for your simulations to finish, and then you need twice as long to analyse the results.’  

Undoubtedly, the most useful results are gained from a combination of experimental and theoretical approaches. Few theoreticians share the multidisciplinary background of Heindl, or of Widmalm, who leads a group with expertise in organic synthesis, NMR spectroscopy and computational chemistry. There are many wet-lab biologists, including glycobiologists, who could benefit from access to high performance, massively parallel computers and to collaboration with theoreticians. The computational grid system, set up to ease remote access to high performance machines, may provide one way forward.  

Li’s group runs an annual summer school on molecular simulation for biologists, and in 2009 Greenwell organised a course in modelling protein-carbohydrate recognition at Westminster using the university’s grid. This was over-subscribed: many of the participants, who came from all over Europe, were bench biologists with little prior exposure to theoretical techniques.  

The Westminster group is now developing a more user-friendly front end to their grid-based simulation system, which will further encourage the use of this valuable technique by those who think of themselves as biologists or chemists, biotechnologists or clinicians, rather than geeks.

Clare Sansom is a freelance science writer based in London and Cambridge, UK