Scientists have developed the first open-source tool to translate chemical structures into their Iupac names, using a machine learning architecture designed by Google.
Since its foundation in 1919, Iupac, the International Union of Pure and Applied Chemistry, has been maintaining a system of naming chemical compounds. However, other systems to identify chemical structures that are more convenient for computer processing have emerged in the last few decades. The simplified molecular input line entry system (Smiles) describes chemical structures using line notation – for example, butan-2-ol is written as CCC(C)O.
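To illustrate why line notation suits computer processing, the following minimal Python sketch splits a Smiles string into tokens and counts the atoms written in it. The pattern and the function names `tokenize_smiles` and `count_atoms` are illustrative assumptions; they cover only a simplified subset of Smiles and are not necessarily how the researchers' program tokenises its input.

```python
import re

# Simplified, assumed token pattern: bracket atoms such as [NH4+],
# the two-letter halogens Br and Cl, two-digit ring closures such as %12,
# and otherwise any single character (atoms, bonds, brackets, digits).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a Smiles string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def count_atoms(tokens):
    """Count tokens that denote an atom (bracket atoms or element letters)."""
    return sum(1 for tok in tokens if tok.startswith("[") or tok[0].isalpha())


tokens = tokenize_smiles("CCC(C)O")  # butan-2-ol
print(tokens)
print(count_atoms(tokens))
```

For butan-2-ol the sketch finds five atom tokens, matching the four carbons and one oxygen; the hydrogens are implicit in the notation.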
Iupac nomenclature is not going anywhere, though: it is the system most easily understood by humans, so it remains prevalent in teaching, chemical journals and patents. Until now, however, there has been no open-source tool to convert between Smiles notation and Iupac names. Programs such as ChemDraw already include structure-to-name algorithms, but these are not free to access and can’t use Smiles as input.
Google recently developed an artificial neural network architecture, called the Transformer, to improve the translation of natural languages. Scientists in Russia built on this to produce a program that translates Smiles strings and structure drawings to their Iupac names and vice versa.
PubChem has nearly 100 million different molecular structures, which the group used to train and test the program. Then, 100,000 of these molecules were randomly selected to validate the algorithm.
The software recognised when one molecule could have multiple Iupac names, which is often the case in large and highly functionalised structures. However, it did struggle with very small molecules, namely methane, and sometimes missed parts of very large compounds. Overall, it was 98.9% accurate when converting Smiles strings to Iupac names.
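One way such an accuracy figure could be computed when a molecule has several valid names is sketched below: a prediction counts as correct if it matches any accepted name. This is an assumed scoring scheme for illustration, not the evaluation procedure reported in the paper, and the names and predictions are toy examples rather than data from the study.

```python
def normalise(name):
    """Lower-case and trim a name so trivial differences do not count as errors."""
    return name.strip().lower()


def accuracy(predictions, accepted_names):
    """Fraction of predictions that match any accepted name for their molecule."""
    correct = sum(
        normalise(pred) in {normalise(name) for name in accepted}
        for pred, accepted in zip(predictions, accepted_names)
    )
    return correct / len(predictions)


# Toy data (hypothetical): each molecule has a set of acceptable names.
accepted_names = [
    {"butan-2-ol", "2-butanol"},
    {"propan-2-ol", "2-propanol"},
    {"ethanol"},
    {"methane"},
]
predictions = ["2-butanol", "propan-2-ol", "ethanol", "methanol"]

score = accuracy(predictions, accepted_names)
print(f"{score:.1%}")
```

In this toy set, three of the four predictions match an accepted name, and the one miss is on methane.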
L Krasnov et al, Sci. Rep., 2021, 11, 14798 (DOI: 10.1038/s41598-021-94082-y)