Researchers in Russia have put together the world’s largest dataset to date for training deep neural network models. The dataset contains around six million conformations of about one million drug-like molecules.

From a computational point of view, one must know details such as conformation energies and the Hamiltonian matrix parameters to forecast the biological activity of a potential drug long before it is synthesised in a lab. Density functional theory (DFT) can be used to predict such parameters, but quantum chemical calculations tend to be time-consuming and computationally expensive. Machine learning, however, can be used to lower the computational cost of obtaining DFT-quality results.

Frustrated by a lack of datasets for training machine learning models, the team set out to fill this gap and ultimately reduce the computational costs surrounding medicinal chemistry. They began with a training set of 100,000 molecules with 436,581 conformations and calculated their conformation energies and Hamiltonian coefficients using DFT. This training set was significantly larger than the datasets used for publicly available deep neural network models. The researchers then compared the performance of the original DFT-based models on test sets containing different molecules. The team noted that these models performed much better after being trained with larger datasets.
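
How the team built its test sets is not detailed here. As a minimal illustration of what testing on different molecules can involve, the Python sketch below splits a toy list of conformations by molecule, so that every conformation of a given molecule lands on the same side of the split. The function name `split_by_molecule`, the record layout and the toy data are assumptions made for illustration and are not taken from the team's code.

```python
import random
from collections import defaultdict


def split_by_molecule(records, test_fraction=0.1, seed=0):
    """Split (molecule_id, conformation) pairs so no molecule is in both sets."""
    # Group conformations under the molecule they belong to.
    by_molecule = defaultdict(list)
    for molecule_id, conformation in records:
        by_molecule[molecule_id].append(conformation)

    # Shuffle molecule ids (not individual conformations) reproducibly.
    molecule_ids = sorted(by_molecule)
    random.Random(seed).shuffle(molecule_ids)

    # Reserve a fraction of the molecules, at least one, for testing.
    n_test = max(1, int(len(molecule_ids) * test_fraction))
    test_ids = set(molecule_ids[:n_test])

    train = [(m, c) for m in molecule_ids if m not in test_ids
             for c in by_molecule[m]]
    test = [(m, c) for m in molecule_ids if m in test_ids
            for c in by_molecule[m]]
    return train, test


# Hypothetical toy data: 10 molecules with 1 to 4 conformations each.
records = [(f"mol{i}", f"conf{j}") for i in range(10) for j in range(i % 4 + 1)]
train, test = split_by_molecule(records, test_fraction=0.2)
```

Splitting at the molecule level rather than the conformation level means a model is scored on molecules it has never seen, instead of on new conformations of molecules it was trained on.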

The team made the code publicly available to encourage other researchers to use and develop the dataset, which they hope will aid future quantum chemistry studies.