PF-BERxiT: Early exiting for BERT with parameter-efficient fine-tuning and flexible early exiting strategy

Neurocomputing (2023)

Abstract
The industrial use of large pre-trained language models such as BERT and ALBERT is limited by the prohibitive computational cost of the fine-tuning process and the overthinking problem in the inference process. PF-BERxiT is proposed to optimize pre-trained language models with a novel parameter-efficient fine-tuning method and a flexible early exiting strategy. Specifically, the parameter-efficient fine-tuning method integrates a bottleneck adapter architecture in parallel with the transformer architecture, and only the adapters' parameters are adjusted. In addition, we integrate an extra sub-learning module to learn the samples' characteristics, improving accuracy and efficiency simultaneously. The flexible exiting strategy allows the model to exit early once the similarity score of adjacent layers falls below a threshold a pre-defined number of times. It is more flexible than previous early exiting methods because both the similarity-score threshold and the patience parameter can be adjusted according to the request traffic. Extensive experiments on the GLUE benchmark demonstrate that: (1) PF-BERxiT outperforms conventional training and parameter-efficient strategies while fine-tuning only a few parameters. (2) PF-BERxiT strikes a better balance between model performance and speedup ratio than previous state-of-the-art (SOTA) early exiting methods such as PABEE and BERxiT. (3) Ablation studies of the fine-tuning process show that the best bottleneck dimension r of the adapters is 32, and that adapters placed in parallel with the feed-forward module are more efficient. (4) Ablation studies of the inference process show that, among variants of PF-BERxiT with different similarity scores, PF-BERxiT-kl and PF-BERxiT-bikl attain better speedup-accuracy trade-offs than PF-BERxiT-rekl. PF-BERxiT thus attains a better trade-off between performance and efficiency, providing a reference for the efficient application of neural computing.
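The abstract does not give implementation details for the adapter, so the following PyTorch sketch shows one plausible reading: a bottleneck adapter (down-projection to r = 32, nonlinearity, up-projection) attached in parallel with a frozen feed-forward module, with its output added to the sub-layer's. The class and parameter names here are hypothetical illustrations, not the authors' code.

```python
import torch
import torch.nn as nn

class ParallelAdapterFFN(nn.Module):
    """A frozen feed-forward sub-layer with a trainable bottleneck adapter
    attached in parallel; only the adapter parameters receive gradients.
    The bottleneck dimension r = 32 follows the paper's reported best
    setting; everything else is an illustrative assumption."""

    def __init__(self, ffn: nn.Module, hidden_size: int = 768, r: int = 32):
        super().__init__()
        self.ffn = ffn                        # pre-trained feed-forward module
        for p in self.ffn.parameters():       # freeze the backbone weights
            p.requires_grad = False
        self.down = nn.Linear(hidden_size, r)  # down-project to bottleneck r
        self.act = nn.GELU()
        self.up = nn.Linear(r, hidden_size)    # project back to hidden size
        nn.init.zeros_(self.up.weight)         # adapter starts as a no-op,
        nn.init.zeros_(self.up.bias)           # preserving pre-trained behavior

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parallel composition: frozen FFN output plus the adapter branch.
        return self.ffn(x) + self.up(self.act(self.down(x)))

if __name__ == "__main__":
    ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
    layer = ParallelAdapterFFN(ffn)
    out = layer(torch.randn(2, 16, 768))  # (batch, seq_len, hidden)
    print(out.shape)
```

Zero-initializing the up-projection is a common design choice for parallel adapters: the modified layer initially reproduces the frozen backbone exactly, so fine-tuning starts from the pre-trained behavior.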
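The exiting rule combines a divergence threshold with a patience counter. Below is a minimal sketch of that control flow, using the KL-based score suggested by the PF-BERxiT-kl variant and assuming per-layer classifier logits arrive lazily (so later layers are never computed once the model exits). Function names, default values, and the consecutive-counting interpretation of "pre-defined times" are assumptions, not taken from the paper.

```python
from typing import Iterable, Tuple
import torch
import torch.nn.functional as F

def kl_score(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between the class distributions predicted by two
    adjacent internal classifiers; smaller means more similar."""
    p = F.log_softmax(p_logits, dim=-1)
    q = F.log_softmax(q_logits, dim=-1)
    return F.kl_div(q, p, log_target=True, reduction="batchmean")

def flexible_early_exit(
    layer_logits: Iterable[torch.Tensor],
    threshold: float = 0.05,   # similarity-score threshold (tunable at serving time)
    patience: int = 2,         # how many similar adjacent pairs before exiting
) -> Tuple[torch.Tensor, int]:
    """Consume per-layer classifier logits one layer at a time and exit
    once the divergence between adjacent layers' predictions stays below
    `threshold` for `patience` consecutive layers. Both knobs can be
    adjusted to match request traffic, per the abstract."""
    counter = 0
    prev = None
    depth = 0
    for depth, logits in enumerate(layer_logits, start=1):
        if prev is not None:
            if kl_score(prev, logits).item() < threshold:
                counter += 1       # adjacent predictions agree
            else:
                counter = 0        # reset on a dissimilar pair (assumed)
            if counter >= patience:
                return logits, depth  # prediction has stabilized: exit early
        prev = logits
    return prev, depth             # fell through: use the final layer
```

Passing a generator that runs one transformer layer (plus its internal classifier) per step makes the speedup real: layers after the exit point are simply never executed.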
Key words
Parameter-efficient, Flexible early exiting, Fine-tuning, BERxiT