The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling

ICML (2010)

Abstract
The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model: each data point is modeled with a collection of components of different proportions. Though powerful, the HDP assumes that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This might be an undesirable assumption. For example, in topic modeling, a topic (component) might be rare throughout the corpus but dominant within those documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point's "focus", is determined independently from the amount that each of its components contributes. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model.
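The decoupling described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's inference algorithm or its exact construction. It assumes a finite truncation to `num_topics` components, the finite Beta-Bernoulli approximation to the IBP, and Gamma-distributed relative masses. The function name `sample_focused_proportions` and the parameters `alpha` and `gamma` are hypothetical names chosen for illustration.

```python
import random


def sample_focused_proportions(num_docs, num_topics, alpha=3.0, gamma=1.0, seed=0):
    """Sample per-document topic proportions from a truncated, ICD-style prior.

    Assumptions of this sketch (not stated in the abstract): a finite number
    of topics, a Beta(alpha / K, 1) prior on each topic's prevalence, and
    Gamma(gamma, 1) relative masses shared by all documents.
    """
    rng = random.Random(seed)
    # Across-data prevalence: probability that a topic appears in a document.
    prevalence = [rng.betavariate(alpha / num_topics, 1.0) for _ in range(num_topics)]
    # Relative mass of each topic, drawn independently of its prevalence.
    # The tiny floor keeps the gamma shape parameter strictly positive.
    mass = [max(rng.gammavariate(gamma, 1.0), 1e-12) for _ in range(num_topics)]

    docs = []
    for _ in range(num_docs):
        # The document's "focus": a binary selection over the shared topics.
        focus = [rng.random() < p for p in prevalence]
        if not any(focus):
            # Guard for the sketch: make sure every document has one topic.
            focus[rng.randrange(num_topics)] = True
        # Dirichlet draw restricted to the focus, via normalized gamma draws.
        weights = [rng.gammavariate(m, 1.0) if on else 0.0
                   for m, on in zip(mass, focus)]
        total = sum(weights)
        if total == 0.0:
            # Numerical guard: fall back to uniform weights over the focus.
            weights = [1.0 if on else 0.0 for on in focus]
            total = sum(weights)
        docs.append([w / total for w in weights])
    return docs


if __name__ == "__main__":
    for row in sample_focused_proportions(num_docs=3, num_topics=8):
        print([round(w, 3) for w in row])
```

Because `prevalence` and `mass` are drawn independently, a topic can have low prevalence (it appears in few documents) yet a large mass (it dominates the documents where it does appear), which is the behavior the abstract contrasts with the HDP.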
Keywords
hierarchical Dirichlet process, mixture model