Multimodal biological analysis using NLP and expression profile

2018 New York Scientific Data Summit (NYSDS) (2018)

Abstract
The goal of this project is to gather biological data from different sources and evaluate it jointly through computational analysis. Two data sources were used: microarray gene expression data for Arabidopsis thaliana, and gene co-occurrences in the scientific literature, extracted from bioRxiv using natural language processing (NLP). The microarray data were normalized, reduced in dimensionality using principal component analysis (PCA), and grouped into varying numbers of clusters using K-means clustering. These expression clusters were then compared with the co-occurrence pairs in the NLP data to evaluate the quality of the NLP extractions. The evaluation used entropy analysis on the combined data, compared against the maximum entropy of the clustering alone. The evaluation shows that the NLP results correspond to the clusters from the microarray data and may be used for further analysis.
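The abstract does not spell out the entropy measure, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each gene already has a cluster label (for example from K-means on the PCA-reduced expression data) and measures how predictable a co-occurring partner's cluster is, given the gene's own cluster. That conditional entropy is compared with log2(k), the maximum entropy of a k-cluster assignment. The function names, gene identifiers, and the choice of conditional entropy are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def conditional_partner_entropy(pairs, cluster_of):
    """H(partner cluster | gene cluster) over co-occurrence pairs.

    pairs: iterable of (gene_a, gene_b) co-occurrences (hypothetical NLP output).
    cluster_of: dict mapping gene -> cluster label (hypothetical K-means output).
    Pairs with a gene missing from cluster_of are skipped.
    """
    joint = Counter()
    for a, b in pairs:
        if a in cluster_of and b in cluster_of:
            ca, cb = cluster_of[a], cluster_of[b]
            # Co-occurrence is unordered, so count both directions.
            joint[(ca, cb)] += 1
            joint[(cb, ca)] += 1
    marginal = Counter()
    for (ca, _), n in joint.items():
        marginal[ca] += n
    # Chain rule: H(B | A) = H(A, B) - H(A)
    return shannon_entropy(joint.values()) - shannon_entropy(marginal.values())


# Hypothetical toy data: four genes in k = 2 expression clusters.
cluster_of = {"geneA": 0, "geneB": 0, "geneC": 1, "geneD": 1}
k = len(set(cluster_of.values()))
max_entropy = math.log2(k)  # maximum entropy of the clustering alone

# Co-occurrences that follow the clusters versus ones that ignore them.
consistent_pairs = [("geneA", "geneB"), ("geneC", "geneD")]
mixed_pairs = [("geneA", "geneC"), ("geneA", "geneB"),
               ("geneC", "geneD"), ("geneB", "geneD")]

h_consistent = conditional_partner_entropy(consistent_pairs, cluster_of)
h_mixed = conditional_partner_entropy(mixed_pairs, cluster_of)
print(h_consistent, h_mixed, max_entropy)
```

Under this reading, a conditional entropy well below log2(k) suggests that the literature co-occurrences line up with the expression clusters, while a value near log2(k) suggests they carry little information about cluster membership.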
Keywords
bioinformatics,genomics,clustering,named-entity recognition