A machine learning algorithm that can flag papers that may have come from paper mills could help publishers fight fake scientific studies.

Paper mills may be the biggest organised fraud ever perpetrated on scientific journals. While there have been instances of individual researchers manipulating images or simply inventing data, paper mills serve up professional fakery on an industrial scale. Buyers can purchase a paper on any topic, or authorship of one, with results that are entirely made up.

Biochemical and biomedical journals have been hit particularly hard, flooded with hundreds of fake research manuscripts. In early 2020, two independent groups of image detectives discovered more than 700 professionally manipulated papers – though there could be 10 times as many, according to science integrity consultant Elisabeth Bik.

A new software tool called Papermill Alarm could help publishers detect potentially fabricated papers as they are submitted. Created by Adam Day from the UK data services company Clear Skies, it analyses a manuscript’s title and abstract, checking if it shares stylistic features with any of the paper mill articles the deep-learning algorithm was trained on.
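The idea of comparing a submission's title and abstract with known paper mill text can be illustrated with a deliberately simple sketch. This is not how Papermill Alarm works: the tool is described as a deep-learning model, whereas the code below swaps in plain word-count cosine similarity. The function names, the 0.6 threshold and the list of known texts are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_for_review(title, abstract, known_mill_texts, threshold=0.6):
    """Return (flagged, best_score) for one submission.

    Assumption: `known_mill_texts` is a list of title-plus-abstract
    strings from papers already identified as paper mill products.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    candidate = Counter(tokenize(title + " " + abstract))
    best = max(
        (cosine_similarity(candidate, Counter(tokenize(text)))
         for text in known_mill_texts),
        default=0.0,
    )
    # A flag only marks the manuscript for human investigation;
    # it says nothing definite about whether the paper is fabricated.
    return best >= threshold, best
```

A score near 1 means the submission uses almost the same vocabulary as a known paper mill article, while a score of 0 means no words are shared. A real system would need far richer stylistic features than word counts.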

The tool can’t tell whether an article is definitely fabricated, but flagging those that may need to be investigated could stop bogus science going through the publication process unchallenged. While some publishers already use automated methods to help detect fraudulent activity, those methods aren’t based on textual analysis. Six publishers have already expressed interest in using Papermill Alarm, according to an article in Nature.

Day told Nature that he found that 1% of all studies listed on the citation database PubMed share similarities with paper mill articles. An analysis of more than 53,000 papers by the UK-based Committee on Publication Ethics shows that, at most journals, about 2% of submitted manuscripts are suspect. However, journals that accept paper mill studies often see a sharp increase in fake papers, which can come to make up as much as 46% of submissions.

While paper mill studies are unlikely to become highly cited, they do have the potential to slow down legitimate research if the results are taken as the basis for other studies. Moreover, letting fraudulent work become part of the scientific record undermines researchers’ trust in each other’s work, and damages public trust in published data.