A diversity measure leveraging domain specific auxiliary information.

CIKM '11: International Conference on Information and Knowledge Management, Glasgow, Scotland, UK, October 2011

Abstract
This article deals with the notion that uncertainty is reduced when the probability mass is distributed over similar values rather than dissimilar values. Shannon's entropy is a frequently used information-theoretic measure of the uncertainty associated with random variables, but it depends solely on the set of values the probability mass function assumes, and does not take into consideration whether the mass is distributed among extreme values or not. A similarity structure, possibly obtained through domain knowledge, on the values assumed by the random variable may reduce the associated uncertainty: the greater the similarity, the lower the uncertainty. A novel measure named Similarity Adjusted Entropy (or Sim-adjusted Entropy for short), which generalizes Shannon's entropy, is then proposed to capture the effects of this similarity structure. Sim-adjusted entropy provides a mechanism for incorporating domain expertise into an entropy-based framework for solving various data mining tasks. Applications highlighted in this manuscript include clustering of categorical data and measuring audience diversity. Experiments performed on a Yahoo! Answers data set demonstrate the ability of the proposed method to obtain more cohesive clusters. Another set of experiments confirms the utility of the proposed measure for measuring audience diversity.
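
The abstract does not state the formula for Sim-adjusted entropy, so the following is only an illustrative sketch of the general idea, not the authors' definition. It assumes a stand-in similarity-weighted entropy of the form H_S(p) = -sum_i p_i log2(sum_j S[i][j] p_j), where S is a hypothetical similarity matrix with entries in [0, 1] and ones on the diagonal. Under that assumption the measure reduces to Shannon's entropy when S is the identity matrix (no two distinct values are similar) and shrinks as similarity grows. The function names and the example matrices are invented for illustration.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits of a probability mass function p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def similarity_weighted_entropy(p, sim):
    """Stand-in similarity-weighted entropy (an assumption, not the paper's formula).

    For each value i, the probability p[i] inside the logarithm is replaced by
    the expected similarity of value i to a value drawn from p.
    """
    total = 0.0
    for i, pi in enumerate(p):
        if pi > 0:
            expected_sim = sum(sim[i][j] * pj for j, pj in enumerate(p))
            total -= pi * math.log2(expected_sim)
    return total


# Two equally likely values under three hypothetical similarity structures.
p = [0.5, 0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]   # values completely dissimilar
partial = [[1.0, 0.5], [0.5, 1.0]]    # values somewhat similar
identical = [[1.0, 1.0], [1.0, 1.0]]  # values treated as the same

h_shannon = shannon_entropy(p)
h_identity = similarity_weighted_entropy(p, identity)
h_partial = similarity_weighted_entropy(p, partial)
h_identical = similarity_weighted_entropy(p, identical)
```

In this sketch the identity matrix recovers the Shannon value, the partially similar case gives a smaller value, and fully similar values give zero, which matches the qualitative behaviour the abstract describes.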