Human in-the-Loop Estimation of Cluster Count in Datasets via Similarity-Driven Nested Importance Sampling
CoRR (2023)
Abstract
Identifying the number of clusters serves as a preliminary goal for many data
analysis tasks. A common approach to this problem is to vary the number of
clusters in a clustering algorithm (e.g., $k$ in $k$-means) and pick the value
that best explains the data. However, the resulting count estimates can be unreliable,
especially when the image similarity measure is poor. Human feedback on the pairwise
similarity can be used to improve the clustering, but existing approaches do
not guarantee accurate count estimates. We propose an approach to produce
estimates of the cluster counts in a large dataset given an approximate
pairwise similarity. Our framework samples edges guided by the pairwise
similarity, and we collect human feedback to construct a statistical estimate
of the cluster count. On the technical front, we develop a nested
importance sampling approach that yields (asymptotically) unbiased estimates of
the cluster count, with confidence intervals that can guide human effort.
Compared to naive sampling, our similarity-driven sampling produces more
accurate estimates of counts and tighter confidence intervals. We evaluate our
method on a benchmark of six fine-grained image classification datasets,
achieving low error rates on the estimated number of clusters with
significantly less human labeling effort compared to baselines and alternative
active clustering approaches.
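
To make the sampling idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard identity that the number of clusters equals the sum over items of 1/|C(v)|, where |C(v)| is the size of the cluster containing item v. The outer loop samples items uniformly; the inner loop samples candidate pairs in proportion to the approximate similarity and reweights the yes/no feedback to estimate |C(v)|. The names `estimate_cluster_count`, `similarity`, and `oracle` are hypothetical, and `oracle` stands in for the human annotator.

```python
import random


def estimate_cluster_count(n, similarity, oracle, n_vertices, n_edges, seed=0):
    """Sketch of a nested importance-sampling estimate of the cluster count.

    n          -- number of items in the dataset
    similarity -- similarity(v, u) > 0, the approximate pairwise similarity
    oracle     -- oracle(v, u) -> 1 if v and u share a cluster, else 0
                  (hypothetical stand-in for human feedback)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_vertices):
        # Outer level: sample an item uniformly at random.
        v = rng.randrange(n)
        others = [u for u in range(n) if u != v]
        weights = [similarity(v, u) for u in others]
        z = sum(weights)

        # Inner level: sample pairs (v, u) with probability q(u) = weight / z
        # and importance-weight each answer by 1 / q(u). The average is an
        # unbiased estimate of the number of other items in v's cluster.
        sampled = rng.choices(others, weights=weights, k=n_edges)
        same_cluster = 0.0
        for u in sampled:
            q = similarity(v, u) / z
            same_cluster += oracle(v, u) / q
        same_cluster /= n_edges

        # 1 / (estimated cluster size) is biased for finite n_edges,
        # but the bias vanishes as n_edges grows.
        total += 1.0 / (1.0 + same_cluster)

    # Scale the average of 1/|C(v)| over sampled items up to the whole dataset.
    return n * total / n_vertices
```

With an informative similarity, most sampled pairs fall inside the item's own cluster, so the inner estimate has low variance; with a uniform similarity the same code reduces to naive sampling and typically needs many more queries for comparable accuracy.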