SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

CoRR (2023)

Abstract
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters outperform Byte-Pair Encoding (BPE); however, they are inefficient, requiring parallel corpora, days to train, and hours to decode. This article introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train and decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability, and generates the segmentation with the maximum posterior probability, which is calculated using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word-frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate various segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle-, and high-resource scenarios, comparing the performance of different segmentation methods. The experimental results demonstrate that, on the low-resource ALT dataset, our method achieves more than a 1.2 BLEU score improvement over BPE and SentencePiece, and a 1.1 score improvement over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT), on average. The regularization method achieves approximately a 4.3 BLEU score improvement over BPE and a 1.2 BLEU score improvement over BPE-dropout, the regularized version of BPE. We also observe significant improvements on the IWSLT15 Vi -> En, WMT16 Ro -> En, and WMT15 Fi -> En datasets and competitive results on the WMT14 De -> En and WMT14 Fr -> En datasets. Furthermore, our method is 17.8x faster during training and up to 36.8x faster during decoding in a high-resource scenario compared with DPE. We provide extensive analysis, including why monolingual word-level data is sufficient to train SelfSeg.
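The maximum-posterior decoding step described in the abstract can be illustrated with a short sketch. The Python snippet below is a minimal Viterbi-style dynamic program that picks the segmentation of a word into subword pieces maximizing the sum of per-piece log-probabilities; the scoring function subword_logprob, the max_len cap, and the toy score table are hypothetical stand-ins for the probabilities produced by the trained SelfSeg model, not the paper's actual implementation.

import math

def max_prob_segmentation(word, subword_logprob, max_len=8):
    """Find the segmentation of `word` into subword pieces that
    maximizes the sum of per-piece log-probabilities.
    `subword_logprob(piece)` is a hypothetical stand-in for the
    neural segmenter's score: log P(piece), or -inf if unknown.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)           # back[i]: start index of last piece
    best[0] = 0.0
    for i in range(1, n + 1):
        # Only consider pieces up to max_len characters long.
        for j in range(max(0, i - max_len), i):
            score = best[j] + subword_logprob(word[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    # Recover the argmax segmentation by following back-pointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return list(reversed(pieces)), best[n]

# Toy usage with a hypothetical unigram log-probability table.
table = {"un": -1.0, "break": -2.0, "able": -1.5, "u": -5.0,
         "n": -5.0, "b": -5.0, "r": -5.0, "e": -5.0, "a": -5.0,
         "k": -5.0, "l": -5.0}
print(max_prob_segmentation(
    "unbreakable", lambda p: table.get(p, -math.inf)))
# -> (['un', 'break', 'able'], -4.5)

Capping the piece length keeps decoding at O(n * max_len) per word, which is one reason a DP decoder of this shape can be far faster than autoregressive segmentation; the actual speedups reported above are for SelfSeg versus DPE, not for this sketch.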
Keywords
Subword segmentation, self-supervised learning, machine translation, efficient NLP, subword regularization