ReAGent: A Model-agnostic Feature Attribution Method for Generative Language Models
CoRR (2024)
Abstract
Feature attribution methods (FAs), such as gradients and attention, are
widely used to derive the importance of each input feature to a model's
prediction. Existing work in natural language processing has mostly
focused on developing and testing FAs for encoder-only language models (LMs) in
classification tasks. However, given the inherent differences in model
architecture and task setting, it is unclear whether these FAs remain faithful
when applied to decoder-only models for text generation. Moreover, previous
work has demonstrated that there is no 'one-wins-all' FA across models and
tasks. This makes selecting an FA computationally expensive for large LMs,
since deriving input importance often requires multiple forward and backward
passes, including gradient computations that can be prohibitive even with
access to substantial compute. To address these issues, we present a model-agnostic
FA for generative LMs called Recursive Attribution Generator (ReAGent). Our
method updates the token importance distribution recursively. At each update,
we compute the difference between the model's probability distribution over
the vocabulary for the next token given the original input and the
distribution given a modified input in which some of the input tokens are
replaced with RoBERTa predictions. Our intuition is that replacing an
important context token should cause a larger change in the model's confidence
in predicting the target token than replacing an unimportant one. Our method can be universally
applied to any generative LM without accessing internal model weights and
without additional training or fine-tuning, unlike most other FAs. We extensively
compare the faithfulness of ReAGent with seven popular FAs across six
decoder-only LMs of various sizes. The results show that our method
consistently provides more faithful token importance distributions.
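The recursive update described above can be illustrated with a minimal, hypothetical Python sketch. It is not the authors' implementation: the abstract does not give the exact update rule, so this sketch assumes (1) an L1 distance between next-token distributions as the measure of change, (2) uniformly random selection of the positions to replace at each step, and (3) crediting the observed change equally to the replaced positions and then renormalizing. The names `reagent_sketch`, `toy_lm`, and `toy_replace` are illustrative only; `toy_replace` stands in for RoBERTa predictions by inserting a fixed word, and `toy_lm` stands in for any generative LM that exposes only next-token probabilities, which is what makes the approach model-agnostic.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical interfaces: an LM is any function that maps a context to a
# next-token probability distribution; a replacer rewrites chosen positions.
NextTokenDist = Callable[[Sequence[str]], Dict[str, float]]
Replacer = Callable[[List[str], List[int]], List[str]]


def prob_shift(p: Dict[str, float], q: Dict[str, float]) -> float:
    """L1 distance between two next-token distributions (assumed measure)."""
    vocab = set(p) | set(q)
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in vocab)


def reagent_sketch(tokens: List[str], lm: NextTokenDist, replace: Replacer,
                   n_steps: int = 50, replace_ratio: float = 0.3,
                   seed: int = 0) -> List[float]:
    """Simplified recursive importance update (assumption, not the paper's rule)."""
    rng = random.Random(seed)
    n = len(tokens)
    importance = [1.0 / n] * n           # start from a uniform distribution
    p_orig = lm(tokens)                  # distribution for the original input
    k = max(1, round(replace_ratio * n))
    for _ in range(n_steps):
        positions = rng.sample(range(n), k)       # part of the input to replace
        delta = prob_shift(p_orig, lm(replace(tokens, positions)))
        for i in positions:                       # credit the replaced tokens
            importance[i] += delta / k
        total = sum(importance)
        importance = [s / total for s in importance]  # keep it a distribution
    return importance


def toy_lm(context: Sequence[str]) -> Dict[str, float]:
    # Hypothetical toy model: only "France" shifts the prediction to "Paris".
    if "France" in context:
        return {"Paris": 0.9, "London": 0.05, "Rome": 0.05}
    return {"Paris": 0.34, "London": 0.33, "Rome": 0.33}


def toy_replace(tokens: List[str], positions: List[int]) -> List[str]:
    # Hypothetical stand-in for RoBERTa fill-in: use a fixed neutral word.
    out = list(tokens)
    for i in positions:
        out[i] = "thing"
    return out


tokens = ["The", "capital", "of", "France", "is"]
scores = reagent_sketch(tokens, toy_lm, toy_replace)
for tok, s in zip(tokens, scores):
    print(f"{tok:>8s} {s:.3f}")
```

In this toy setup only replacing "France" changes the next-token distribution, so every nonzero update credits "France", and it ends up with the largest score. A real application would swap in an actual LM's next-token probabilities and a masked-LM replacer; the distance measure and crediting scheme above are design choices of this sketch.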