MultiContrievers: Analysis of Dense Retrieval Representations
CoRR (2024)
Abstract
Dense retrievers compress source documents into (possibly lossy) vector
representations, yet there is little analysis of what information is lost
versus preserved, and how it affects downstream tasks. We conduct the first
analysis of the information captured by dense retrievers compared to the
language models they are based on (e.g., BERT versus Contriever). We use 25
MultiBert checkpoints as randomized initialisations to train MultiContrievers,
a set of 25 contriever models. We test whether specific pieces of information
– such as gender and occupation – can be extracted from contriever vectors of
Wikipedia-like documents. We measure this extractability via
information-theoretic probing. We then examine the relationship of extractability to
performance and gender bias, as well as the sensitivity of these results to
many random initialisations and data shuffles. We find that (1) contriever
models have significantly increased extractability, but extractability usually
correlates poorly with benchmark performance; (2) gender bias is present, but is
not caused by the contriever representations; and (3) there is high sensitivity to
both random initialisation and data shuffle, suggesting that future
retrieval research should test across a wider spread of both.
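
As a rough illustration of what measuring extractability by information-theoretic probing can look like, the sketch below computes an online (prequential) codelength for a linear probe and reports compression relative to a uniform code. The abstract does not say which probing variant, probe architecture, or data splits the authors use, so the online-coding scheme, the softmax probe, the block fractions, the synthetic vectors, and every function name here are assumptions made for illustration only.

```python
import numpy as np


def fit_softmax_probe(X, y, num_classes, steps=200, lr=0.5, l2=1e-2):
    """Fit a linear softmax probe with plain gradient descent (hypothetical helper)."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G + l2 * W)
        b -= lr * G.sum(axis=0)
    return W, b


def codelength_bits(W, b, X, y):
    """Bits needed to encode labels y given X under the probe's predictions."""
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_p[np.arange(len(y)), y].sum() / np.log(2))


def online_codelength(X, y, num_classes, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """Prequential codelength: each block is encoded by a probe trained on all earlier data."""
    n = len(y)
    cuts = [max(1, int(f * n)) for f in fractions]
    total = cuts[0] * np.log2(num_classes)  # first block uses a uniform code
    for start, end in zip(cuts[:-1], cuts[1:]):
        W, b = fit_softmax_probe(X[:start], y[:start], num_classes)
        total += codelength_bits(W, b, X[start:end], y[start:end])
    return float(total)


def compression(X, y, num_classes):
    """Uniform codelength divided by online codelength; higher means more extractable."""
    uniform = len(y) * np.log2(num_classes)
    return uniform / online_codelength(X, y, num_classes)


# Synthetic stand-ins for retriever vectors (assumption: not real Contriever outputs).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
y_info = (X[:, 0] > 0).astype(int)      # label linearly encoded in the vectors
y_rand = rng.integers(0, 2, size=500)   # label unrelated to the vectors

c_info = compression(X, y_info, 2)
c_rand = compression(X, y_rand, 2)
print(f"compression (encoded label):   {c_info:.2f}")
print(f"compression (unrelated label): {c_rand:.2f}")
```

In this toy setting, a label that is linearly encoded in the vectors compresses better than a label unrelated to them, which is the sense in which a higher compression indicates higher extractability. Repeating such a measurement across several seeds, as the abstract recommends for retrieval experiments in general, would show how stable the estimate is.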