In October last year, a team of natural product chemists discovered a glitch in a widely used piece of NMR software. Buried deep inside the code was a simple file-sorting issue that, on certain operating systems, led to incorrect predicted values for chemical shifts. The finding cast uncertainty over results published in more than 150 scientific papers over a five-year period.

Ten years is a long time in this field in terms of architecture developments, compiler developments, all sorts of developments

Lynn Kamerlin, Uppsala University

This is not the first time that an error in software code has cast a shadow over computational research – such issues are surprisingly common. In one famous case, a coding error was at the heart of a seven-year dispute between some of the world’s top theoretical chemists, who were trying to model the phases of supercooled water. And recently, an algorithm used in older versions of the popular molecular dynamics software Gromacs was found to introduce order-of-magnitude errors during simulations.

Ideally, code will be well documented and publicly available, allowing researchers to scrutinise scripts and locate problems. But this isn’t always the case – traditional publishing practices, as well as concerns around intellectual property, often mean that code is difficult or even impossible to access.

Even when source code is open for all to see, other factors can complicate matters. Computer programs tend to rely on an array of other pieces of software, which are continually updated and superseded by new versions. This makes it surprisingly difficult to recreate the exact conditions under which a computational study was originally performed. These problems have become so widespread that a ‘reproducibility crisis’ is now a major concern among computational scientists.

The answer to what?

‘One of my old Python scripts depends on 200 software packages directly or indirectly, and all these changed over time,’ says Konrad Hinsen, who develops molecular dynamics software at the French National Centre for Scientific Research in Orléans. ‘In the end it becomes very difficult to run the stuff – and even if you can run it, it doesn’t mean you get the same numbers out.’ Hinsen explains that even if older scientific programs still run and produce a result, it’s not always clear what they have actually calculated. ‘It’s like the famous thing in the Hitchhiker’s Guide to the Galaxy: the answer is 42 – but it’s the answer to which question?’ he says.
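To make Hinsen’s point concrete, one common precaution – offered here as a minimal sketch rather than anything he uses himself – is to record the exact version of every installed package alongside a calculation’s output, so that the software environment can at least be reconstructed later. The snippet below assumes Python 3.8 or newer for the importlib.metadata module; the output file name is arbitrary.

import json
from importlib import metadata

def snapshot_environment(path='environment_snapshot.json'):
    """Write a name -> version map of every installed package to disk."""
    versions = {dist.metadata['Name']: dist.version
                for dist in metadata.distributions()}
    with open(path, 'w') as handle:
        json.dump(versions, handle, indent=2, sort_keys=True)
    return versions

if __name__ == '__main__':
    snapshot_environment()

Such a snapshot does not guarantee identical numbers – compilers, operating systems and hardware still differ – but it does capture one of the moving parts that make decade-old scripts so hard to rerun.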

Hinsen is particularly interested in computational science methodology and has serious concerns about reproducibility in the field. A few years ago, he co-founded the journal ReScience C to create a space where people tasked with revisiting old code could share their results. ‘We wanted to improve the problem where you have many computational prescriptions in papers which are incomplete and code which is lacking and then at some point nobody reads up what is being done,’ says Hinsen. ‘Even within the lab, one previous student does something and the following one can’t pick up the work because nowhere is it properly documented.’

[Image: the ReScience C ten years challenge poster. Source: Courtesy of Nicolas P Rougier]

ReScience C recently launched a challenge asking authors to go back and see if they could recreate the results of studies they published at least 10 years ago. Hinsen explains that the challenge is all about making computational science ‘more understandable, more transparent and more durable’. To take part himself, Hinsen re-examined two old programs of his own – one written 10 years ago and another from the mid-nineties. Curiously, the 25-year-old code still runs perfectly, whereas the more sophisticated 10-year-old code doesn’t. The failure of the newer code turned out to be due to intentional changes made in 2014 to the Python libraries on which it relied.

Computational biochemist Lynn Kamerlin explains that legacy issues tend to be more problematic when code has become dormant. With ‘living’ code – methods that are actively used by a research community – bugs tend to be spotted and fixed quickly. When mothballed methods are revisited, the situation is much harder. Kamerlin describes the problems she encountered when reviving old code that had been forgotten. ‘This was something [a colleague] developed 15–20 years earlier and never thought about, and they put it on tape gathering dust somewhere in their office,’ she recalls. The tape was eventually located, but in a form that was incompatible with modern hardware. ‘We couldn’t find a way to actually read the tape – we ended up having to re-implement it from scratch,’ says Kamerlin.

While this is an extreme example, it illustrates the problems that can arise in the fast-moving world of computing. ‘Ten years is a long time in this field in terms of architecture developments, compiler developments, all sorts of developments,’ says Kamerlin. ‘And so if you haven’t looked at [a program] for 10 years, there’s no guarantee you can actually run the code – so you can get basically the software equivalent of my tape reader problem.’

Black box problem

With the growing use of machine learning models to solve chemistry problems, the issue of reproducibility in AI studies is particularly worrisome. ‘The obvious problem is that you need a huge amount of training data and you should, in theory, keep a copy of that and make it publicly available so people can redo these things later,’ says Hinsen. ‘And this is often difficult, simply because of the size of the data – you may not be able to store it or publish it, it easily gets lost, also it often gets updated quickly and then you don’t know which version you used.’

Basically teamwork, openness, transparency – I think this is really the only way forward to ensure the safety of code

Lynn Kamerlin, Uppsala University

One issue is that many of the people who write computational chemistry code are not formally trained software developers – they tend to be chemists trying to solve a problem for which no software is readily available. As a result, programming practices often lag behind what would be considered best practice within the computer science community.

‘Today you can publish a paper on machine learning in chemistry where you test [a model] on one or two benchmarks and only compare with selected baselines, which may not necessarily be state-of-the-art,’ says Massachusetts Institute of Technology’s Regina Barzilay, who develops deep learning methods for drug discovery. ‘This is a serious problem that makes it hard to see if a new method is really an advance.’ Barzilay explains that such an approach would be unacceptable in her core discipline of computer science, where new models must be assessed against as many public datasets as possible, to ensure reproducibility. ‘Unfortunately, this level of testing is still not a common practice in AI and chemistry. I hope it will change,’ she adds.

So what can be done to increase the lifetime of computational methods?

Hinsen recommends that all students starting out in computational work should have access to basic training in good programming practice, to help ensure that they can keep track of projects and avoid accidental loss of data. He has taught courses for the Software Carpentry network, which offers workshops and training in essential lab skills for research computing, and has also organised a massive open online course (MOOC) covering crucial techniques such as file management, version control and backing up data. While it is less comprehensive than in-person training, Hinsen points out that a MOOC can reach thousands of people in a single session.
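One habit such training instils – sketched below purely as an illustration, not as material from Software Carpentry or Hinsen’s course – is to stamp every result with the exact version of the scripts that produced it, for example by recording the current git commit. The snippet assumes the analysis code lives in a git repository and that git is available on the command line.

import subprocess

def current_commit():
    """Return the hash of the git commit the analysis scripts were run from."""
    result = subprocess.run(['git', 'rev-parse', 'HEAD'],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def stamp_results(results_path='results.txt'):
    """Append the commit hash to an output file so it can be traced later."""
    with open(results_path, 'a') as handle:
        handle.write(f'# produced by commit {current_commit()}\n')

Linking numbers back to a specific commit makes it possible, years later, to check out exactly the code that generated them.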

A prisoner of software

Kamerlin stresses the importance of documenting everything that goes into a computational study – the need to publish not just scripts, but other details such as the compiler used and the software’s overall architecture. She points to free online repositories such as GitHub and Zenodo, where researchers can store all of the code and additional data used. Kamerlin explains that opening up code to the community can help ensure it is used and kept active, rather than being discarded as soon as it has helped solve a problem. ‘Basically teamwork, openness, transparency – I think this is really the only way forward to ensure the safety of code,’ she says.
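As a minimal sketch of the kind of record Kamerlin describes – hypothetical, not her group’s actual workflow – a few lines of Python can capture the operating system, hardware architecture and interpreter details of a run in a small metadata file that can be deposited on GitHub or Zenodo alongside the code itself.

import json
import platform
from datetime import datetime, timezone

def describe_run(path='run_metadata.json'):
    """Record basic details of the machine and interpreter used for a run."""
    info = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'operating_system': platform.platform(),
        'architecture': platform.machine(),
        'python_version': platform.python_version(),
        # compiler used to build the Python interpreter itself
        'python_compiler': platform.python_compiler(),
    }
    with open(path, 'w') as handle:
        json.dump(info, handle, indent=2)
    return info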

The reproducibility issue – to address it demands more than just the mere moral imperative of being open

Alexandre Hocquet, University of Lorraine

Hinsen agrees that documenting every aspect of a computational method is essential, but believes that more is needed than just making code and data files publicly available. Source code for scientific software is a complicated mixture of calculations, approximations and technical computing mechanisms required for memory management, processing data sets and optimising performance. As a result, code can often be almost indecipherable to anyone other than the original developer. According to Hinsen, this has led to a situation where the complex models underpinning many computational studies have become ‘imprisoned’ by scientific software – often the only place where these models actually exist.

To illustrate this point, Hinsen describes the biomolecular simulation of a protein, which would typically be defined by a function comprising thousands of coordinates. ‘In theory you should put it in the paper – but you can’t because no publisher wants to have a 50-page equation that describes the function of 5000 variables and nobody wants to read it and nobody could re-compute it anyway,’ he says. ‘So what happens is that somewhere in the software there is an implementation to run it, but nobody knows exactly what it does.’
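A deliberately toy example – far simpler than any real force field, and purely illustrative – gives a flavour of the kind of function Hinsen means. Even this stripped-down harmonic bond term exists only as code; a realistic biomolecular potential sums thousands of such terms, with parameters that never appear in print.

def harmonic_bond_energy(bonds, coords, k=1000.0, r0=0.15):
    """Sum k*(r - r0)**2 over bonded atom pairs.

    bonds is a list of (i, j) index pairs and coords a list of (x, y, z) tuples;
    k and r0 are illustrative constants, not taken from any real force field."""
    total = 0.0
    for i, j in bonds:
        dx = coords[i][0] - coords[j][0]
        dy = coords[i][1] - coords[j][1]
        dz = coords[i][2] - coords[j][2]
        r = (dx * dx + dy * dy + dz * dz) ** 0.5
        total += k * (r - r0) ** 2
    return total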

Hinsen has called for a new digital scientific notation to help scientists regain control of their codes. This would create a formal language that would enable coding information to be published and scrutinised in a way that is suitable for the digital age. Such a language would be readable for both humans and machines, enabling peer review of the scientific models and software verification of the computational methods. Hinsen hopes that this will prevent scientific software packages being used in a ‘black box’ fashion, as is often the case at the moment.

Open source

The problems surrounding the transparency of code highlight a fundamental dilemma for computational science – should all code be open for all to see? If not, then how can the methods be reproduced and built upon by peers? But if so, do computational scientists risk giving up the fruits of their labour for free? Would it even be practical – for developers or users? These questions have been hotly debated among computational chemists for decades.

For Alexandre Hocquet, a former computational chemist who is now a science historian at the University of Lorraine, France, the questions shine a light on the complex ways that business and law are embedded into scientific behaviour. Hocquet points out that while the drive for open software makes sense on many levels, the makers of commercial software would argue that their model provides the necessary resources for software to be maintained. Without the revenue generated by proprietary licences, how can you support a workforce that will update programs and keep them alive? As evidence of this, Hocquet highlights the success of one of computational chemistry’s best-known software packages. Gaussian’s strict licensing terms have long attracted the ire of many scientists, yet the program still leads the field today, more than 50 years after it was first developed.

‘The reproducibility issue – to address it demands more than just the mere moral imperative of being open,’ says Hocquet. ‘There are a lot of governance issues, a lot of licensing issues. Not every free software licence is the same – they have politics embedded into them.’

Hocquet points out that many scientific instruments are developed and maintained by companies, without regular users questioning their inner workings. ‘When you buy a Bruker NMR spectrometer, you can’t expect to know exactly what is going on inside. You rely on the standardisation of the scientific instrument of the corporate entity, which you trust has been developed, maintained, standardised, calibrated,’ says Hocquet. ‘There’s a parallel here between “should [software] be open or not?”, and “what is trust in a scientific instrument?” – the moral imperative to be open is far less developed when we talk about NMR spectroscopy, for example.’

Perhaps then, one positive to come out of the reproducibility crisis is that it has opened up a conversation where fundamental scientific philosophies can take centre stage. ‘The computational chemistry domain is actually a scientific field where the kind of issues, the two visions of what is trust in science are actually debated,’ says Hocquet. ‘In other fields, you don’t see those debates.’