Matt Lightfoot on navigating a career around the Cambridge Structural Database
Matt Lightfoot is a chemist who hasn’t worked in a lab for almost 20 years. ‘But doing this job, you’re surrounded by chemistry,’ he says.
As principal scientific editor at the Cambridge Crystallographic Data Centre (CCDC), Lightfoot helps to look after the Cambridge Structural Database (CSD), a repository of organic and organometallic crystal structures that has been collecting entries since the 1960s. Earlier this year the CSD celebrated an important milestone – its one millionth structure.
When Lightfoot started his career at CCDC in 2001, the CSD was around a fifth of its current size with 200,000 structures. He was already familiar with the database, having come straight from finishing a PhD in alkali metal coordination chemistry at the University of Manchester, during which he was a regular user of the CSD. Lightfoot joined as one of a small team of editors working to convert crystal structures submitted by researchers into CSD entries.
‘During my PhD I was in the lab quite a lot, so it was quite a difference, especially early on,’ he remembers about starting at the CCDC. ‘You could be sat at the computer for long periods of time, because [making database entries] was quite a slow, manual process.’
Back then, only a small proportion of the structures were submitted to the CSD electronically. Many had to be typed up from printed journal articles. And even when structures arrived as a crystallographic information file (CIF, the standard format for representing crystallographic information), the process of entering them into the database was laborious. ‘In the file you’ve just got coordinates, and it doesn’t say what’s bonded to what, or what the chemistry is,’ says Lightfoot. ‘We had to work all that out.’
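To give a flavour of what 'working out' the bonding from bare coordinates involves, here is a minimal Python sketch of one common textbook heuristic: treat two atoms as bonded when the distance between them is no more than the sum of their covalent radii plus a small tolerance. This is an illustration only, not the CCDC's actual method. The function name `infer_bonds`, the 0.4 Å tolerance and the short radii table are assumptions made for the example, and the coordinates are assumed to be Cartesian and in ångströms already.

```python
from itertools import combinations
from math import dist

# Approximate covalent radii in angstroms (illustrative subset only).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

# Assumed slack added to the summed radii; a common but arbitrary choice.
TOLERANCE = 0.4


def infer_bonds(atoms):
    """Return index pairs (i, j) of atoms judged to be bonded.

    `atoms` is a list of (element_symbol, (x, y, z)) tuples with
    Cartesian coordinates in angstroms.
    """
    bonds = []
    for (i, (el_a, xyz_a)), (j, (el_b, xyz_b)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[el_a] + COVALENT_RADII[el_b] + TOLERANCE
        if dist(xyz_a, xyz_b) <= cutoff:
            bonds.append((i, j))
    return bonds


# A single water molecule: one oxygen and two hydrogens.
water = [
    ("O", (0.0, 0.0, 0.0)),
    ("H", (0.757, 0.586, 0.0)),
    ("H", (-0.757, 0.586, 0.0)),
]

print(infer_bonds(water))  # [(0, 1), (0, 2)]
```

For water, the rule finds the two O–H bonds and rejects the H···H pair, which is too far apart. Real crystal structures are much harder: disorder, metal coordination, symmetry-generated atoms and bond orders all defeat a simple distance cutoff, which is part of why the editors' chemical judgement mattered.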
Of course, a lot has changed since then. Now, 99% of submissions to the CSD are electronic and specialist software helps to automate much of the process. The CCDC also has agreements with most of the major publishers, who directly submit crystallographic information from accepted publications.
Serious efforts to make the CSD more efficient and automated began around nine years ago, Lightfoot says. By that time, he was managing the group of database editors. ‘We were processing about 25 to 30 structures a day, and that wasn’t sustainable when we were getting 50–60,000 a year,’ he says. This prompted a review of the process, and Lightfoot was appointed to lead a three-year project to overhaul the in-house systems so that the team could work faster and more effectively.
‘I became the product owner – the kind of internal user requirement person, helping a team of developers understand what the requirements are,’ he says. The job involved working closely with the developers to rewrite the CSD’s ageing software from scratch. Lightfoot’s earlier years of experience as an editor proved invaluable – he knew how the database worked and understood the needs of the researchers who would be using it.
The new system was launched in 2013. Now, the editors can process about 100 structures a day.
Following on from this, Lightfoot has remained a product owner, looking after internal and external projects. ‘I probably spend half my time now working with developers and half in the database,’ he says. ‘It’s quite varied, and quite busy.’
One of the projects he is currently involved in is a collaboration with other databases that hold structural data on inorganic compounds, helping to improve access to their data through a deposition portal that the CCDC developed. ‘We’ve had lots of good feedback on how it’s useful for the community, as the boundary between inorganic and organic is becoming less important,’ says Lightfoot.
The CSD itself continues to grow and develop, and the rate at which new structures are submitted continues to increase. Lightfoot says the one-million-structure milestone was ‘quite an achievement’ for the crystallographic community, and that many exciting opportunities lie ahead with the emergence of machine learning technology that can process large amounts of data.
‘You don’t have such a good resource in a lot of disciplines,’ he says. ‘When I started there were just over 200,000 [structures], and now there’s a million – that’s quite a lot of data.’ Machine learning has already helped the CCDC to improve how it automatically curates new structures, and Lightfoot is keen to use similar approaches to learn more about the structural data held in the database. Importantly, the new technology also opens up new opportunities to the wider scientific community: ‘I am excited that our high data quality will enable others to use AI and machine learning to gain new insights from the CSD.’