Straightforward Feature Selection for Scalable Latent Semantic Indexing

SDM (2009)

Citations 27 | Views 67
Abstract
Latent Semantic Indexing (LSI) has been shown to be effective on many small-scale text collections. However, little evidence exists for its effectiveness on unsampled large-scale text corpora, owing to its high computational complexity. In this paper, we propose a straightforward feature selection strategy, named Feature Selection for Latent Semantic Indexing (FSLSI), as a preprocessing step so that LSI can be efficiently approximated on large-scale text corpora. We formulate LSI as a continuous optimization problem and propose to optimize its objective function via discrete optimization, which leads to the FSLSI algorithm. We show that the closed-form solution of this optimization is as simple as scoring each feature by its Frobenius norm and filtering out those with small scores. Theoretical analysis guarantees that the loss incurred by the features FSLSI filters out is minimized for approximating LSI. We thus offer a general way to study and apply LSI on large-scale corpora. A large-scale study on more than 1 million TREC documents shows the effectiveness of FSLSI in Information Retrieval (IR) tasks.
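The filtering step described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes the Frobenius-norm score of a feature reduces to the L2 norm of that feature's row in the term-document matrix, keeps the top-k features, and leaves the subsequent truncated SVD (the LSI step) to run on the much smaller matrix. The function name `fslsi_select` and the exact normalization are assumptions for illustration.

```python
import numpy as np

def fslsi_select(X, k):
    """Sketch of the FSLSI filtering step (an assumption based on the
    abstract, not the paper's exact algorithm): score each feature
    (row of the term-document matrix X) by the Frobenius/L2 norm of
    its row, and keep the k highest-scoring features."""
    scores = np.linalg.norm(X, axis=1)        # per-feature row norm
    keep = np.sort(np.argsort(scores)[::-1][:k])  # top-k, in row order
    return keep, X[keep]

# Toy term-document matrix: 5 terms (features) x 4 documents
X = np.array([[3., 0., 1., 0.],
              [0., 0., 0., 1.],
              [2., 2., 2., 2.],
              [0., 1., 0., 0.],
              [4., 3., 0., 5.]])

idx, X_reduced = fslsi_select(X, 3)
# LSI (truncated SVD) would then be applied to X_reduced, which has
# only the selected rows, making the decomposition far cheaper.
```

On this toy matrix the three highest-norm rows are 0, 2, and 4, so `X_reduced` has shape (3, 4); on a real corpus with millions of features the same scan-and-filter pass is what makes the subsequent SVD tractable.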
Keywords
computational complexity, feature selection, closed-form solution, continuous optimization, discrete optimization, latent semantic indexing, information retrieval, objective function