Paul Groth explains why linked data is starting to revolutionise medicinal chemistry
Since its introduction about 20 years ago, the T-shape has become the moniker to describe individuals and organisations that combine deep domain knowledge with the ability to span disciplines. Personally, I’ve met PhD chemists who can easily move from discussing the impact of chemistry in agriculture to the minutiae of open source software licenses. Likewise, it’s likely that you, like me, are part of an interdisciplinary team. For example, at the US Food and Drug Administration the review team for drugs going into clinical research includes a chemist, microbiologist, statistician, pharmacologist and a medical officer. For these sorts of individuals and teams, it’s not enough to have deep data on one subject; it’s necessary to have T-shaped data – deep data that connects across disciplines.
Is this sort of T-shaped data available? In medicinal chemistry, it’s getting there. PubChem has a plethora of structured data, including 89m compounds, 229m bioactivities and thousands of gene and protein targets. Wikipedia provides deep entries on drugs connecting to well-known chemical databases such as ChemSpider. But what’s striking about these resources – and many other databases – is not the size of the data available: it’s the links.
I can easily click from Wikipedia to PubChem to DrugBank to UniProt. By doing so, I’ve traversed both depth and discipline, navigating from a general knowledge space to deep medicinal chemistry information to focused commercial drug information and out to another deep set of information about proteins. Even better, most of these sites are backed by structured databases that can be downloaded or accessed via application programming interfaces (APIs) that make the information available on other sites (similar to how many websites allow you to purchase products via Amazon). However, manual traversal is not scalable, especially for broad-based or automated data analyses.
We need to build upon the combination of links and structured data to enable machines to collate data for us. This is exactly the intent of linked data, which allows for the collation of data across domains. For example, using the linked data for PubChem, one can write a database-style query that covers the resources mentioned above and does so for many compounds at once.
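To make the idea concrete, here is a minimal sketch of what such a query does: it follows the same chain of links for several compounds in one pass. It is a toy in-memory illustration, not PubChem's actual interface; the identifiers, the `sameAs` and `hasTarget` link names, and the `follow` helper are all hypothetical placeholders.

```python
from collections import defaultdict

# Toy linked data as (subject, predicate, object) triples.
# Every identifier and predicate here is a made-up placeholder,
# not a real record from Wikipedia, PubChem, DrugBank or UniProt.
TRIPLES = [
    ("wiki:CompoundA", "sameAs", "pubchem:CID_A"),
    ("wiki:CompoundB", "sameAs", "pubchem:CID_B"),
    ("pubchem:CID_A", "sameAs", "drugdb:DRUG_A"),
    ("pubchem:CID_B", "sameAs", "drugdb:DRUG_B"),
    ("drugdb:DRUG_A", "hasTarget", "protein:P_1"),
    ("drugdb:DRUG_B", "hasTarget", "protein:P_1"),
    ("drugdb:DRUG_B", "hasTarget", "protein:P_2"),
]


def build_index(triples):
    """Index triples by (subject, predicate) so links can be followed quickly."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].append(obj)
    return index


def follow(index, starts, path):
    """For each start node, follow the predicates in `path` in order and
    return the sorted list of nodes reached at the end of the chain."""
    results = {}
    for start in starts:
        frontier = {start}
        for predicate in path:
            frontier = {
                obj
                for node in frontier
                for obj in index.get((node, predicate), [])
            }
        results[start] = sorted(frontier)
    return results


INDEX = build_index(TRIPLES)
# The manual click-through (encyclopedia -> compound -> drug -> protein)
# written once as a path and applied to several compounds together.
PATH = ["sameAs", "sameAs", "hasTarget"]
TARGETS = follow(INDEX, ["wiki:CompoundA", "wiki:CompoundB"], PATH)

for compound, proteins in TARGETS.items():
    print(compound, "->", proteins)
```

In a real setting the triples would live in the linked data published by each resource, and the path would be expressed in a query language rather than a Python list. The principle is the same: the links are what let one question span several databases.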
At a recent event, I saw how linked data could be used to help perform target validation in drug discovery.1 Edgar Jacoby and his team developed computational protocols using the Open PHACTS linked data platform to enable them to provide input into phenotypic screening that covered not only target and chemical space, but also disease and medical pathways. This was achieved because the underlying data was linked using common identifiers such as those from the Gene Ontology and Chemical Entities of Biological Interest (ChEBI) databases. In this case, linked data provided a horizontal view across the disciplines with the ability to dive deep into information about diseases or bioactivity.
Linked data also allows for other interdisciplinary uses, such as identifying business opportunities related to potential compounds; presenting scientific information in multiple languages; or connecting consumer information to deep chemical knowledge.
A key resource for this sort of use will be Wikidata. Wikidata is a curated database hosted by the Wikimedia Foundation. It provides both unambiguous identifiers and definitions across domains, including chemistry (Wikidata Chemistry). Since it sits under the same umbrella as Wikipedia, and is encyclopedic in nature, it provides an excellent first port of call for both users and providers of linked data. A second resource is identifiers.org, which maps existing identifiers to one another, creating links between entities in different data sources. We are also beginning to see commercial services such as Reaxys Medicinal Chemistry provide interlinked data.
By interconnecting data between domains, linked data provides a strong foundation for T-shaped working. However, data is only one part of the story. Mashing up this data using both domain-specific tools (like RDKit and Bioclipse) and generic ones (like KNIME or scikit-learn) is becoming essential for addressing domain-spanning problems. Even if you’re not hands-on with data and tools, it is still vital to understand broadly what they do, so that you know how you or your team members can use them to attack these broad-based problems.
Major challenges, from breakthroughs in precision medicine through to developing low-carbon systems, require the work of T-shaped people and teams. Access to interlinked, multi-domain data, and the ability to work with it, will be the foundation for us to succeed.
1 Jacoby et al, Linking Life Science Data, Open PHACTS, 2016