Studies that provide access to underlying data are cited 25% more often than those that don’t

Research papers that make their underlying data openly available are significantly more likely to be cited in future work, according to a preprint analysis led by researchers at the Alan Turing Institute in London. The study, which is currently under peer review, examined nearly 532,000 articles in over 350 open access journals published by Public Library of Science (PLoS) and BioMed Central (BMC) between 1997 and 2018, and found that those linking directly to source datasets received 25% more citations on average.

‘We found half a million papers were published by these open access journals over the study period, and one-third included data availability statements, and those papers were then examined to see if there was a citation benefit,’ explains Iain Hrynaszkiewicz, head of data publishing at the publisher Springer Nature. The results clearly point to a citation advantage of up to 25.36% for articles that link to a repository via a URL or other permanent identifier. This is consistent with the results of previous smaller studies that focussed specifically on gene expression microarray or oceanographic data.

This new evidence can better justify the increased costs associated with introducing stronger research data policies, Hrynaszkiewicz and colleagues say. The team controlled for several factors known to affect citations, such as the number of authors and references, as well as author reputation.

‘By making both the research papers and the underlying data publicly available, the authors are increasing their visibility, and that leads to data reuse and then more citations,’ says Hrynaszkiewicz. He also points out that more successful, visible research groups might have more resources at their disposal to share underlying data and code.

New incentives for open data

Peter Suber, who directs Harvard University library’s office for scholarly communication and was not involved with the study, says the conclusions are significant because they could prompt journals to create new incentives for authors to open their datasets and link to them from within articles.

‘Many journals have open data policies, but some have trouble getting authors to comply,’ Suber says. ‘The trick is to get the data open a little before publication so that the link can be included in the text. Journals might now be motivated to increase the pressure on authors to make their data open on a specific timetable.’

Peter Murray-Rust, a chemist at Cambridge University in the UK who champions open access publishing, calls the preprint study ‘well done’ and ‘a good piece of work’. However, he says it is important to determine whether the data links the researchers identified actually lead to real, useful files. ‘A responsible scientific publisher would say you should have InChIs and MOL files, but we often have PDFs or JPEGs – these files are largely a graveyard of destroyed information,’ Murray-Rust explains. He is currently writing software to turn PDFs back into spectra to make them more usable.

He also argues that citations are of limited use in assessing whether research is of high quality or seminal. ‘What we should be measuring is not the citations, but reuse of the data,’ he states. This can only happen, Murray-Rust notes, if researchers put their data in a repository, creating a public record that allows citations, views and downloads to be measured and tracked.