Small molecule machine learning: All models are wrong, some may not even be useful

bioRxiv (2024)

Abstract
Small molecule machine learning tries to predict chemical, biochemical or biological properties from the structure of a molecule. Applications include prediction of toxicity, ligand binding or retention time. A recent trend is to develop end-to-end models that avoid the explicit integration of domain knowledge via inductive bias. A central assumption in doing so is that there is no coverage bias in the training and evaluation data, meaning that these data are a representative subset of the true distribution we want to learn. Usually, the domain of applicability is neither considered nor analyzed for such large-scale end-to-end models. Here, we investigate how well certain large-scale datasets from the field cover the space of all known biomolecular structures. Investigation of coverage requires a sensible distance measure between molecular structures. We use a well-known distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which agrees well with the chemical intuition of similarity between compounds. Unfortunately, this computational problem is provably hard, severely restricting the use of the corresponding distance measure in large-scale studies. We introduce an exact approach that combines Integer Linear Programming and intricate heuristic bounds to ensure efficient computations and dependable results. We find that several large-scale datasets frequently used in this domain of machine learning are far from covering known biomolecular structures uniformly. This severely confines the predictive power of models trained on these data. Next, we propose two further approaches to check whether a training dataset differs substantially from the distribution of known biomolecular structures. On the positive side, our methods may allow creators of large-scale datasets to identify regions in molecular structure space where it is advisable to provide additional training data.
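To make the idea of an MCES-based distance concrete, the following is a minimal sketch and not the authors' method: it does not use Integer Linear Programming or heuristic bounds, but enumerates all partial atom mappings by brute force, so it is only feasible for very small graphs. Several choices here are assumptions made for illustration: molecules are encoded as a list of atom labels plus a list of index pairs for bonds, bond orders are ignored, mapped atoms must carry the same label, and the distance is taken as the number of edges in both graphs that fall outside the common subgraph. The function names `mces_size` and `mces_distance` are hypothetical.

```python
def mces_size(labels1, edges1, labels2, edges2):
    """Size of a maximum common edge subgraph, by exhaustive search.

    Assumed encoding: labels are lists of atom symbols, edges are lists
    of (i, j) index pairs describing simple undirected graphs.
    Runtime is exponential, so use this on tiny graphs only.
    """
    edge_set2 = {frozenset(e) for e in edges2}
    n1 = len(labels1)
    best = 0

    def extend(i, mapping, used):
        nonlocal best
        if i == n1:
            # Count edges of graph 1 whose endpoints map onto an edge of graph 2.
            kept = sum(
                1
                for u, v in edges1
                if mapping[u] is not None
                and mapping[v] is not None
                and frozenset((mapping[u], mapping[v])) in edge_set2
            )
            best = max(best, kept)
            return
        # Option 1: leave atom i of graph 1 unmapped.
        mapping.append(None)
        extend(i + 1, mapping, used)
        mapping.pop()
        # Option 2: map atom i to any unused atom of graph 2 with the same label.
        for j, label in enumerate(labels2):
            if j not in used and label == labels1[i]:
                mapping.append(j)
                used.add(j)
                extend(i + 1, mapping, used)
                used.remove(j)
                mapping.pop()

    extend(0, [], set())
    return best


def mces_distance(labels1, edges1, labels2, edges2):
    """Edges not shared by the two graphs (an assumed, unweighted distance)."""
    common = mces_size(labels1, edges1, labels2, edges2)
    return len(edges1) + len(edges2) - 2 * common


# Heavy-atom skeletons of ethanol (C-C-O) and propane (C-C-C).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propane = (["C", "C", "C"], [(0, 1), (1, 2)])

print(mces_distance(*ethanol, *propane))  # 2: only one C-C bond is shared
```

The exhaustive search illustrates why the underlying problem restricts large-scale use: the number of candidate mappings grows exponentially with molecule size, which is the cost that an exact approach with bounds is meant to avoid in practice.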
### Competing Interest Statement

The authors have declared no competing interest.
Keywords
small molecule machine learning,small molecule,machine learning,models