Molecular sequence classification using efficient kernel based embedding

Sarwan Ali, Tamkanat E. Ali, Taslim Murad, Haris Mansoor, Murray Patterson

Information Sciences (2024)

Abstract
The alarming spread of diseases across the globe has become a major concern for global healthcare agencies, and the research community is actively developing better and more efficient ways of detecting and treating diseases. The abundance of molecular sequence data has eased the path for researchers to develop Machine Learning (ML) based solutions. The performance of ML models used to classify molecular sequences depends heavily on the embedding used to obtain a numerical representation of those sequences. Many embedding approaches have been introduced for molecular sequence analysis in recent years. However, there is still room for improvement in the efficiency of these methods (i.e., their ability to capture pairwise relationships and patterns effectively, which could affect classification performance). To address this problem, we propose an efficient kernel-based technique for generating embeddings from molecular sequences. It computes a kernel matrix using the Sinkhorn-Knopp algorithm and the normalized pairwise distances between k-mers, in a manner that satisfies the constraints of a probability distribution. Kernel principal component analysis (kernel PCA) is then applied, and the top principal components are used as the final embedding. In our experiments, the proposed method obtained an ROC-AUC score of 0.657, which is higher than the scores obtained using the baselines. This shows that the low-dimensional embedding obtained through the proposed approach provides an efficient and effective solution for molecular sequence analysis.
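The pipeline outlined in the abstract (k-mers, normalized pairwise distances, Sinkhorn-Knopp scaling, kernel PCA) can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not give the details, so several choices here are assumptions: each sequence is represented by a k-mer count vector, distances are Euclidean and divided by their maximum, the affinity is `exp(-d)`, and the function names (`kmer_counts`, `sinkhorn_knopp`, `kernel_pca`, `embed`), the nucleotide alphabet, and the default parameters are hypothetical.

```python
from itertools import product

import numpy as np


def kmer_counts(seq, k, alphabet):
    """Count vector over all len(alphabet)**k possible k-mers (assumed representation)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with unknown symbols are skipped
        if j is not None:
            vec[j] += 1
    return vec


def sinkhorn_knopp(A, n_iter=200, tol=1e-9):
    """Alternately rescale rows and columns of a positive matrix until both sum to 1."""
    A = A.copy()
    for _ in range(n_iter):
        A /= A.sum(axis=1, keepdims=True)  # rows sum to 1
        A /= A.sum(axis=0, keepdims=True)  # columns sum to 1
        if np.allclose(A.sum(axis=1), 1.0, rtol=0.0, atol=tol):
            break
    return A


def kernel_pca(K, n_components):
    """Kernel PCA on a precomputed kernel: double-center, then eigendecompose."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(H @ K @ H)
    order = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))


def embed(seqs, k=3, alphabet="ACGT", n_components=2):
    X = np.array([kmer_counts(s, k, alphabet) for s in seqs])
    # Pairwise Euclidean distances between k-mer vectors, scaled to [0, 1].
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if D.max() > 0:
        D = D / D.max()
    A = np.exp(-D)            # assumed affinity; strictly positive
    K = sinkhorn_knopp(A)     # rows and columns each form a probability distribution
    K = (K + K.T) / 2         # remove residual asymmetry from finite iterations
    return K, kernel_pca(K, n_components)


seqs = ["ACGTACGTAC", "ACGTTCGTAC", "GGGGCCCCGG", "TTTTAAAATT"]
K, Z = embed(seqs)
print(Z.shape)
```

For protein sequences, the `alphabet` argument would be replaced by the 20 amino-acid letters. The rows of `Z` could then be passed to any standard classifier.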
Keywords
Molecular sequence classification, k-mers, Kernel matrix, Protein subcellular localization