Chrome Extension
WeChat Mini Program
Use on ChatGLM

The hypergeometric test performs comparably to a common TF-IDF variant on standard information retrieval tasks

CoRR(2023)

Cited 0|Views4
No score
Abstract
Term frequency-inverse document frequency, or tf-idf for short, and its many variants form a class of term weighting functions the members of which are widely used in information retrieval applications. While tf-idf was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence from randomness paradigm have been advanced. In this work, we present an empirical study showing that the hypergeometric test of statistical significance corresponds very nearly with a common tf-idf variant on selected real-data document retrieval and summarization tasks. These findings suggest that a fundamental mathematical connection between the tf-idf variant and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We offer the empirical case study herein as a first step toward explaining the long-standing effectiveness of tf-idf from a statistical significance testing foundation.
More
Translated text
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined