On the distribution of cosine similarity with application to biology

arXiv (Cornell University)(2023)

引用 0|浏览10
暂无评分
摘要
Cosine similarity is an established similarity metric for computing associations on vectors, and it is commonly used to identify related samples from biological perturbational data. The distribution of cosine similarity changes with the covariance of the data, and this in turn affects the statistical power to identify related signals. The relationship between the mean and covariance of the distribution of the data and the distribution of cosine similarity is poorly understood. In this work, we derive the asymptotic moments of cosine similarity as a function of the data and identify the criteria of the data covariance matrix that minimize the variance of cosine similarity. We find that the variance of cosine similarity is minimized when the eigenvalues of the covariance matrix are equal for centered data. One immediate application of this work is characterizing the null distribution of cosine similarity over a dataset with non-zero covariance structure. Furthermore, this result can be used to optimize over a set of transformations or representations on a dataset to maximize power, recall, or other discriminative metrics, with direct application to noisy biological data. While we consider the specific biological domain of perturbational data analysis, our result has potential application for any use of cosine similarity or Pearson's correlation on data with covariance structure.
更多
查看译文
关键词
cosine similarity,biology
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要