Exploiting query features in language modeling approach for information retrieval (2004)

Abstract
Recent advances in Information Retrieval (IR) are based on statistical language models. Most retrieval experiments demonstrating the language modeling approach use smoothed unigram language models that exploit only term occurrence statistics in probability estimation. Experiments with additional features such as bigrams have met with limited success. However, language models incorporating n-gram, word-trigger, topic-of-discourse, syntactic, and semantic features have shown significant improvements in speech recognition. The main thrust of this dissertation is to identify the need to design language models for IR that satisfy its specific modeling requirements, and to demonstrate this by designing language models that (1) incorporate IR-specific features (the biterm language model), (2) correspond to better document and query representations (the concept language model), and (3) combine evidence from different information sources (language features) toward modeling the relevance of a document to a given query (maximum entropy language models for IR).

Illustrating the difference between the language modeling requirements of speech recognition and information retrieval, the dissertation proposes the biterm language model, which identifies term co-occurrence, rather than order of term occurrence, as an important feature for IR. Biterm language models handle local variation in the surface form of the words that express a concept of interest. It is, however, these concepts that need to be modeled in queries to improve retrieval performance. The concept language models proposed here model the user's information need as a sequence of concepts and the query as an expression of those concepts of interest. Empirical results demonstrate significant improvements in retrieval performance.

While mixture models, which combine statistical evidence from different information sources to estimate a probability distribution, are easy to implement, they seem to make suboptimal use of their components. A natural method of combining information sources based on the Maximum Entropy Principle, which has been shown to be effective in speech recognition, is proposed here as a solution to this information retrieval problem. In the context of document likelihood models, the maximum entropy language model for information retrieval provides a better mechanism for incorporating external knowledge and additional syntactic and semantic features of the language into language models for IR.
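For concreteness, a minimal sketch of the baseline the abstract contrasts against: in the smoothed unigram query-likelihood approach, a document D is scored against a query Q = q_1 ... q_n by interpolating document and collection statistics (Jelinek-Mercer interpolation is assumed here as one common smoothing choice; the dissertation's experiments may use a different smoother):

    P(Q \mid D) = \prod_{i=1}^{n} \left[ (1 - \lambda)\, P(q_i \mid D) + \lambda\, P(q_i \mid C) \right]

A biterm model replaces the order-sensitive bigram estimate P(q_i \mid q_{i-1}, D) with an order-free co-occurrence estimate; in one simplified form (illustrative, not necessarily the dissertation's exact estimator),

    P_{\mathrm{BT}}(q_i \mid q_{i-1}, D) \approx \frac{c_D(q_{i-1}\, q_i) + c_D(q_i\, q_{i-1})}{c_D(q_{i-1}) + c_D(q_i)}

where c_D(u\, v) counts occurrences of the ordered word pair u v in D, so that "information retrieval" and "retrieval of information" contribute to the same statistic.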
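As a runnable illustration of the smoothed unigram baseline (a sketch under the Jelinek-Mercer assumption above; the function and variable names are illustrative, not the dissertation's):

import math
from collections import Counter

def jm_unigram_logprob(query_terms, doc_terms, coll_terms, lam=0.5):
    # Log query likelihood under a Jelinek-Mercer smoothed unigram LM.
    # lam weights the collection model; 0.5 is an illustrative default,
    # and effective values are typically tuned on held-out queries.
    doc_counts, coll_counts = Counter(doc_terms), Counter(coll_terms)
    dlen, clen = max(len(doc_terms), 1), max(len(coll_terms), 1)
    logp = 0.0
    for q in query_terms:
        p_doc = doc_counts[q] / dlen      # maximum-likelihood document estimate
        p_coll = coll_counts[q] / clen    # collection estimate (the smoother)
        p = (1 - lam) * p_doc + lam * p_coll
        # A term absent from both document and collection would zero the
        # query likelihood; return -inf rather than raising on log(0).
        logp += math.log(p) if p > 0 else float("-inf")
    return logp

coll = "language models for information retrieval and speech recognition".split()
doc_a = "statistical language models for retrieval".split()
doc_b = "speech recognition with language models".split()
query = "information retrieval".split()
print(jm_unigram_logprob(query, doc_a, coll))  # higher: matches "retrieval"
print(jm_unigram_logprob(query, doc_b, coll))  # lower, but finite thanks to smoothing

The example shows the point of smoothing: doc_a lacks "information" and doc_b lacks both query terms, yet neither score collapses to zero because the collection model supplies nonzero mass.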
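The maximum entropy combination the abstract argues for takes the standard log-linear form of maximum entropy models (the form is standard in the maxent literature; the specific features named here are illustrative assumptions):

    P(Q \mid D) = \frac{1}{Z_{\lambda}(D)} \exp\left( \sum_{k} \lambda_k f_k(Q, D) \right)

where each f_k(Q, D) is a feature function (for example a unigram match, a biterm co-occurrence, or a syntactic or semantic feature), the weights \lambda_k are fit subject to constraints that the model's expected feature values match those observed in training data, and Z_{\lambda}(D) normalizes over queries. Unlike a linear mixture, which fixes one interpolation weight per component model, the constrained optimization sets each feature's weight jointly with all the others.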
Keywords
speech recognition, language modeling approach, maximum entropy language model, information retrieval, language model, biterm language model, language modeling requirement, language feature, query feature, concept language model, unigram language model, statistical language model