A bit of hindsight goes a long way in measuring scientific quality, says Mark Peplow

As an assessment of merit, the Nobel prize is about as high-profile as it gets. So it has become an October tradition to chew over the results with the vigour of disgruntled sports fans dissecting their team’s performance. Why did they give it to them? Why did they leave these others out? Surely more than three people deserve this?

Yet for all the debate that inevitably follows the announcement from Stockholm, we rarely hear anyone disputing that the science itself was worthy. In the decades that often separate discovery and award, the ‘test of time’ tends to create a consensus about the work’s import, giving the Nobel committee’s hindsight a superior acuity. 

Far more difficult, and controversial, are attempts to immediately assess scientific merit, for an individual paper, say. Numerical metrics – the number of citations a paper receives, and the impact factor of the journal it appears in – are increasingly used to approximate merit, and can determine the fate of job applications and grant requests. That is anathema to many scientists, who argue that the fruits of their labours cannot simply be reduced to a handful of numbers.

The San Francisco Declaration on Research Assessment has become a lightning rod for this cause, with almost 10,000 signatories demanding that funders and institutions stop using journal-level metrics as a basis for such decisions, and instead focus on the scientific content of papers. 

Merits of merit

But it appears that even judging that content is not straightforward. Adam Eyre-Walker, a biologist at the University of Sussex, Brighton, and his colleague Nina Stoletzki recently compared three different ways of assessing merit: post-publication peer review, citations and journal impact factor.1

They reasoned that if scientists were good at picking out valuable research – however that is defined – there should be some agreement between them about which papers were good. But after reanalysing large data sets gathered by other assessment surveys, they found that the level of agreement in post-publication peer review was barely better than chance, even when experts were looking at papers within their own narrow field. The study also found evidence that assessors were using a journal’s impact factor, perhaps unconsciously, as an indicator of merit.

The assessors were even unable to spot which papers would go on to garner the most citations – something that another study this month showed could be predicted by a pretty simple algorithm, based on the first few years of citations.2 Overall, Eyre-Walker and Stoletzki conclude that none of the methods they analysed are much good for measuring the merit of a paper.

Coincidentally, a group of chemists has just carried out a similar, informal survey over at the science policy blog ScienceGeist. Volunteers looked at a particular past issue of the Journal of the American Chemical Society and selected their three most ‘significant’ papers. They also looked at different measures of merit by guessing which papers had garnered the most citations, which were most likely to interest the general public and so on.

To my eyes, the early results suggest similar conclusions to Eyre-Walker’s study: you might as well cover a wall with papers, arm a blindfolded child with some darts and let nature take its course.

Ask the REF

This all has important implications for exercises like the UK’s Research Excellence Framework (REF), the latest national assessment of research quality, which concludes next year. The REF relies heavily on merit assessments by expert panels, and will be used to determine a hefty chunk of university funding allocations. Eyre-Walker reckons that the REF’s assessments are likely to be very prone to error – which is worrying for researchers, and irksome for taxpayers (the last such exercise in 2008 cost roughly £60 million to prepare and run).

Clearly, we cannot abandon merit assessment altogether. But we should strive to make it more reliable – perhaps by using a wider range of techniques, including different article-level metrics in combination with crowd-sourced judgements from a community of experts (the latter approach is used by the Faculty of 1000 web service, for example). 

And merit assessments should be used with extreme caution when considering a department’s funding, or someone’s research career. Merit is a capricious concept that can encompass the usefulness of a method, the broader implications for a field, societal impact and a host of other factors. With such diffuse definitions of merit and imperfect tools for measuring it, rushing to judge research based on a handful of metrics looks rather rash.

The shortcomings of merit assessment are a potential stumbling block in the drive to assess research quality on shorter timescales. Policymakers and funding bodies may not always have the luxurious hindsight that guides the Nobel prize committees – but they should not forget that it offers a truer test of science.