Predicting Attention Sparsity in Transformers

Proceedings of the Sixth Workshop on Structured Prediction for NLP (SPNLP 2022)

Abstract
Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact sparse attention; however, this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder-only). Our work provides a new angle to study model efficiency through an extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This allows for a detailed comparison between different models along their Pareto curves, which is important for guiding future benchmarks for sparse attention models.
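To make the bucketing idea concrete, below is a minimal NumPy sketch of how a quantization-style sparsity predictor can be scored for sparsity and recall against a ground-truth sparse attention graph. This is an illustration under stated assumptions, not the paper's implementation: sparsemax stands in for alpha-entmax as the ground-truth sparse attention, and the function names (`quantize_buckets`, `recall_and_sparsity`), the shared random projection, and the bucket count are hypothetical choices made for the example.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (Martins & Astudillo, 2016): a sparse alternative to softmax,
    used here as a simple stand-in for alpha-entmax ground-truth attention."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum          # positions kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z        # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

def quantize_buckets(Q, K, num_buckets, seed=0):
    """Quantization-style predictor (illustrative): project queries and keys
    onto a shared random direction and bin them with quantile edges computed
    on the pooled projections, so both sides share bucket boundaries."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Q.shape[1])
    pq, pk = Q @ w, K @ w
    qs = np.linspace(0, 1, num_buckets + 1)[1:-1]
    edges = np.quantile(np.concatenate([pq, pk]), qs)
    return np.digitize(pq, edges), np.digitize(pk, edges)

def recall_and_sparsity(true_probs, mask):
    """Recall: fraction of ground-truth nonzero attention edges kept by the
    predicted mask. Sparsity: fraction of query-key pairs pruned."""
    true_edges = true_probs > 0
    recall = (true_edges & mask).sum() / max(true_edges.sum(), 1)
    return recall, 1.0 - mask.mean()

# Toy evaluation: random queries/keys, ground truth from row-wise sparsemax.
rng = np.random.default_rng(1)
n, d = 64, 16
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
scores = Q @ K.T / np.sqrt(d)
true_probs = np.stack([sparsemax(row) for row in scores])

q_buckets, k_buckets = quantize_buckets(Q, K, num_buckets=8)
mask = q_buckets[:, None] == k_buckets[None, :]  # attend only within a bucket
recall, sparsity = recall_and_sparsity(true_probs, mask)
print(f"recall={recall:.2f}  sparsity={sparsity:.2f}")
```

Query-key pairs that fall in different buckets are pruned before any attention score is computed; sweeping the number of buckets trades sparsity against recall, tracing a curve analogous to the Pareto analysis described in the abstract.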
Keywords
attention sparsity