Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval
CoRR (2024)
Abstract
This study investigates the existence of positional biases in
Transformer-based models for text representation learning, particularly in the
context of web document retrieval. We build on previous research that
demonstrated loss of information in the middle of input sequences for causal
language models, extending it to the domain of representation learning. We
examine positional biases at various stages of training for an encoder-decoder
model, including language model pre-training, contrastive pre-training, and
contrastive fine-tuning. Experiments with the MS-MARCO document collection
reveal that, after contrastive pre-training, the model already generates
embeddings that better capture the early contents of the input, with
fine-tuning further aggravating this effect.
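
As an illustration of the kind of probe the abstract describes (not the authors' exact protocol), positional bias in a dense retriever can be measured by inserting a query-relevant passage at different offsets within otherwise neutral filler text and comparing query-document embedding similarity. The model name, query, and filler text below are assumptions for this sketch.

```python
# Hypothetical probe for positional bias in a dense retriever; this is a
# sketch, not the paper's experimental setup.
# Requires: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in embedding model; ideally use one whose context window covers the
# full document, otherwise truncation alone will penalize late positions.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "effects of caffeine on sleep quality"
relevant = "Caffeine consumed late in the day measurably reduces sleep quality."
filler_sents = ["This sentence is neutral padding unrelated to the query."] * 20

def embed(text: str) -> np.ndarray:
    # Encode and L2-normalize so the dot product equals cosine similarity.
    v = model.encode(text)
    return v / np.linalg.norm(v)

q = embed(query)
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    # Insert the relevant passage at a relative position in the document.
    k = int(frac * len(filler_sents))
    doc = " ".join(filler_sents[:k] + [relevant] + filler_sents[k:])
    print(f"relevant passage at {frac:.0%} of doc: sim = {float(q @ embed(doc)):.3f}")
```

Under the bias the abstract reports, similarity would be highest when the relevant passage sits near the beginning of the document and decay as it moves toward the end.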