Why you should care about Fair data

The benefits of findable, accessible, interoperable and re-usable information

Data in journal publishing is currently big news. In 2016, an influential manifesto coined the term ‘Fair’ (findable, accessible, interoperable and re-usable, to which I would also add provenance) to describe the properties digital data should have to benefit the academic community, and therefore support discovery and innovation.¹ The manifesto has already prompted many librarians to reinvent themselves as research data managers. But Fair data is arguably better illustrated by working examples than by aspirations – and by the benefits it has given to areas such as computational chemistry, crystallography and spectroscopy over the past two decades.

Sharing knowledge

The first examples of Fair data principles in chemistry publishing date back to the early days of electronic journals.² We made the case that these journals should be ‘a new form of scientific instrument, in allowing the delivery to the user of manipulable 3D molecular images, instrumental data, symbolic algorithms capable of evaluation locally, and other semantically intact molecular data for reuse locally by the reader’. By 1999, publishers had started to adopt the DOI to denote online articles, although it took a few more years for data itself to start acquiring its own persistent identifiers (PIDs).

At first, data tables were presented in the HTML versions of online articles, and then assigned a PID-based hyperlink to a repository storing the original data;³ the PID links described the attributes of the data, such as a licence declaration indicating its re-usability and provenance. By 2008, these tables (known as web-enhanced objects) were embedded within the article, with a Java-based interactive molecule viewer used to present data to the reader in an accessible, visual and (via the extensive built-in tools) interoperable form.

Unfortunately, the weak link for this method was Java. Many modern web browsers no longer support it in this manner, and those that do require the reader to configure their browser. The original repository-held data and the links to it, however, continue to function as intended.

Changing the script

Today, JavaScript has almost entirely superseded Java, and older tables are being replaced (with the publisher’s agreement) with components based on JavaScript (JSmol),³ which is likely to be supported by browsers for at least the next decade. Original handle-based data PIDs can also be augmented with DataCite-issued DOIs, resulting in globally-aggregated, searchable metadata.
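To make the idea of searchable metadata concrete, here is a minimal sketch of the kind of record that might accompany a dataset DOI. The field names are loosely modelled on the DataCite metadata schema, but the record is illustrative only: the DOIs, title, creator and helper function are hypothetical, and a real deposit would carry many more fields.

```python
import json


def build_dataset_record(doi, title, creator, year, licence_url, article_doi):
    """Return a small DataCite-style metadata record as a plain dict."""
    return {
        "doi": doi,
        "titles": [{"title": title}],
        "creators": [{"name": creator}],
        "publicationYear": year,
        "types": {"resourceTypeGeneral": "Dataset"},
        # Re-usability: a machine-readable licence declaration.
        "rightsList": [{"rightsUri": licence_url}],
        # Provenance: a link to the article that the data supports.
        "relatedIdentifiers": [
            {
                "relatedIdentifier": article_doi,
                "relatedIdentifierType": "DOI",
                "relationType": "IsSupplementTo",
            }
        ],
    }


# All values below are hypothetical placeholders, not real deposits.
record = build_dataset_record(
    doi="10.1234/example-dataset",
    title="Computed geometries for an example reaction",
    creator="Example, A.",
    year=2018,
    licence_url="https://creativecommons.org/publicdomain/zero/1.0/",
    article_doi="10.1234/example-article",
)

print(json.dumps(record, indent=2))
```

Each entry maps onto one of the Fair properties: the DOI makes the data findable, the licence declaration states its re-usability, and the related identifier records where the data came from.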

The latest evolution is to host an article’s tables in a data repository separate from the publisher’s website, and assign them a separate DOI. These tables no longer use local copies of data, but retrieve and display the original data on demand.⁴
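Retrieval on demand can be sketched in a few lines. The example below assumes the public resolver at doi.org and a metadata format (CSL JSON) requested through HTTP content negotiation; it is not the mechanism used by the tables described above, only an illustration of resolving a DOI at the moment the data is needed rather than keeping a local copy. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

DOI_RESOLVER = "https://doi.org/"
# Assumption: the registration agency behind the DOI serves CSL JSON
# through content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def doi_request(doi, accept=CSL_JSON):
    """Build (but do not send) an HTTP request for a DOI's metadata."""
    url = DOI_RESOLVER + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_metadata(doi):
    """Resolve the DOI at call time and return the decoded JSON metadata."""
    with urllib.request.urlopen(doi_request(doi), timeout=30) as response:
        return json.load(response)


# Building the request needs no network access; fetch_metadata() does.
request = doi_request("10.14469/hpc/3657")
print(request.full_url)  # https://doi.org/10.14469/hpc/3657
```

Because nothing is cached, whatever the repository currently serves for that DOI is what the reader sees.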

A similar approach has emerged in crystallography, where the ECrystals project ensured that new structures were deposited in repositories to be enhanced as Fair data. By 2014, most individual entries in the Cambridge Structural Database, a repository with almost 1 million entries, were also assigned their own DOIs. The associated metadata points both to the original article where the data is cited and to other repositories if more complete (image) datasets are available.⁵ Such bi-directional links between articles and their data are becoming more common.
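A bi-directional link can be checked mechanically once both records are to hand. The sketch below assumes, for simplicity, that the article and the dataset both expose DataCite-style related identifiers (article metadata is often held in a different format in practice); the DOIs and function names are hypothetical.

```python
# Relation types that point from an article to its data, and back again.
ARTICLE_TO_DATA = {"IsSupplementedBy", "References", "Cites"}
DATA_TO_ARTICLE = {"IsSupplementTo", "IsReferencedBy", "IsCitedBy"}


def related_dois(record, relation_types):
    """Collect the DOIs a record links to with any of the given relation types."""
    return {
        link["relatedIdentifier"].lower()  # DOIs are case-insensitive
        for link in record.get("relatedIdentifiers", [])
        if link.get("relatedIdentifierType") == "DOI"
        and link.get("relationType") in relation_types
    }


def is_bidirectional(article, dataset):
    """True only if the article cites the data and the data cites the article."""
    forward = dataset["doi"].lower() in related_dois(article, ARTICLE_TO_DATA)
    backward = article["doi"].lower() in related_dois(dataset, DATA_TO_ARTICLE)
    return forward and backward


# Hypothetical records, not real deposits.
article = {
    "doi": "10.1234/example-article",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.1234/EXAMPLE-DATASET",
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementedBy",
        }
    ],
}
dataset = {
    "doi": "10.1234/example-dataset",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.1234/example-article",
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementTo",
        }
    ],
}

print(is_bidirectional(article, dataset))  # True

# A dataset that does not point back to the article fails the check.
orphan = dict(dataset, relatedIdentifiers=[])
print(is_bidirectional(article, orphan))  # False
```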

Spectroscopy has also embraced Fair principles. When a cryptographic licence file is included, free induction decay NMR data can be analysed in MestReNova without the need for a full licence. In the future, we could see complete data from many more instrument types, coupled with software made available by their vendors.

Some of the benefits of Fair data can be seen in a recent multi-institutional project investigating the mechanisms of boron-catalysed amidations.⁶ All collaborators had immediate and easy access to full versions of all the data and hence the ¹¹B NMR spectra could be freely re-analysed. This resulted in a spin-off project to compare computed and measured ¹¹B chemical shifts – which now has its own collection of Fair data.

A Fair future

The original vision for Fair imagined the scientific journal evolving to absorb these data principles. After two decades of experience in creating and publishing such data, I believe the future is more likely to see journals and data repositories increasingly co-existing but not necessarily merging.

To achieve this aim, the culture of research publishing must come to see Fair data as a valuable output for which recognition is given. Local repository infrastructures and standards must also evolve and, where possible, be adapted to suit the needs of the chemical community. The recent focus on adopting electronic laboratory notebooks should also include an investment in Fair data publishing.

If we continue to make these changes, data will be treated as a first-class citizen of the publishing process, acquire a findability and purpose of its own and, in the process, help to reinforce the reproducibility of scientific research and allow others to discover insights in the data.

Editor: Our house style is to only capitalise the first letter of an acronym. In all accompanying references, Fair is rendered as FAIR.

References

1. M D Wilkinson et al., Scientific Data, 2016, 3, 160018 (DOI: 10.1038/sdata.2016.18)
2. D James et al., New. Rev. Information Networking, 1995, 61 (DOI: 10.1080/13614579509516846)
3. H S Rzepa, Imperial College Research Data Services repository, 2018 (DOI: 10.14469/hpc/3657)
4. M J Harvey et al., J. Cheminformatics, 2015, 7, 37 (DOI: 10.1186/s13321-015-0081-7)
5. H S Rzepa, Imperial College Research Data Services repository, 2018 (DOI: 10.14469/hpc/3738)
6. S Arkhipenko et al., Chem. Sci., 2018 (DOI: 10.1039/C7SC03595K)