Overproduce and select, or Determine Optimal Molecular Descriptor Subset via Configuration Space Optimization? Application to the Prediction of Ecotoxicological Endpoints.

Molecular Informatics (2023)

Abstract
Predicting the likely biological activity (or property) of compounds is a fundamental and challenging task in the drug discovery process. Current computational methodologies aim to improve their predictive accuracy by using deep learning (DL) approaches. However, shallow learning-based methodologies have been shown to be the most suitable for small- and medium-sized chemical datasets. These methodologies start with a universe of molecular descriptors (MDs), then apply different feature selection algorithms, and finally construct a predictive model for the intended learning task. We demonstrate here that this approach may miss relevant information because it assumes that the initial universe of MDs codifies all aspects relevant to the learning task, when it does not. We argue that this limitation arises mainly from the constrained intervals of the parameters used in the algorithms that compute MDs; these parameters define the Descriptor Configuration Space (DCS). We propose to relax these constraints in an open DCS approach, so that a larger universe of MDs can initially be considered. We model the generation of MDs as a multicriteria optimization problem and tackle it with a variant of the standard genetic algorithm. As a novel component, the fitness of each individual is computed by aggregating four criteria via the Choquet integral with respect to a fuzzy (non-additive) measure. Experimental results on benchmark chemical datasets show that the proposed approach generates a meaningful DCS, improving on state-of-the-art approaches on most of the datasets.
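The aggregation step named in the abstract can be illustrated with a minimal sketch of the discrete Choquet integral. The abstract does not specify the four criteria or the fuzzy measure, so everything here is an assumption for illustration: the criterion names c1 to c4 are placeholders, and cardinality_measure is a hypothetical, simple non-additive measure, not the one used by the authors.

```python
from itertools import combinations


def cardinality_measure(criteria, power=2.0):
    """Hypothetical fuzzy measure: mu(A) = (|A| / n) ** power.

    It is monotone, with mu(empty set) = 0 and mu(all criteria) = 1.
    power = 1 gives an additive measure; any other power is non-additive.
    """
    criteria = list(criteria)
    n = len(criteria)
    measure = {}
    for size in range(n + 1):
        for subset in combinations(criteria, size):
            measure[frozenset(subset)] = (size / n) ** power
    return measure


def choquet_integral(scores, measure):
    """Discrete Choquet integral of `scores` with respect to `measure`.

    scores:  dict mapping criterion name -> non-negative value
    measure: dict mapping frozenset of criterion names -> weight
    """
    # Sort criteria by ascending score: x_(1) <= x_(2) <= ... <= x_(n)
    ordered = sorted(scores, key=scores.get)
    total = 0.0
    previous = 0.0
    for i, name in enumerate(ordered):
        # Coalition of all criteria whose score is at least x_(i)
        coalition = frozenset(ordered[i:])
        total += (scores[name] - previous) * measure[coalition]
        previous = scores[name]
    return total


if __name__ == "__main__":
    # Placeholder criterion values for one candidate individual
    scores = {"c1": 0.2, "c2": 0.5, "c3": 0.9, "c4": 0.4}
    mu = cardinality_measure(scores, power=2.0)
    print(choquet_integral(scores, mu))
```

With an additive measure (power=1.0 in this sketch) the Choquet integral reduces to the arithmetic mean of the criteria; a non-additive measure lets the aggregate reward or penalize particular combinations of criteria, which is presumably the motivation for using it as a fitness function.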
Keywords
ecotoxicological endpoints,genetic algorithm,molecular descriptors