Unsupervised Morphological Tree Tokenizer
arXiv (2024)
Abstract
As a cornerstone in language modeling, tokenization involves segmenting text
inputs into pre-defined atomic units. Conventional statistical tokenizers often
disrupt constituent boundaries within words, thereby corrupting semantic
information. To address this drawback, we introduce morphological structure
guidance to tokenization and propose a deep model to induce character-level
structures of words. Specifically, the deep model jointly encodes internal
structures and representations of words with a mechanism named
MorphOverriding to ensure the indecomposability of morphemes. By
training the model with self-supervised objectives, our method is capable of
inducing character-level structures that align with morphological rules without
annotated training data. Based on the induced structures, our algorithm
tokenizes words through vocabulary matching in a top-down manner. Empirical
results indicate that the proposed method effectively retains complete
morphemes and outperforms widely adopted methods such as BPE and WordPiece on
both morphological segmentation tasks and language modeling tasks. The code
will be released later.
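
To make the top-down vocabulary matching step concrete, here is a minimal sketch in Python. It is not the authors' released code: the `Node` class, the `chain` and `merge` helpers, and the hand-built tree for "windsurf" are hypothetical stand-ins for the structures that the paper's model would induce. It also assumes that a span is emitted as a token as soon as its string is found in the vocabulary, and that single characters are always emitted as a fallback.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Node:
    """Node of a binary character-level tree (hypothetical structure).

    A leaf holds a single character; an internal node holds the
    concatenation of its children's text.
    """
    text: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def merge(left: Node, right: Node) -> Node:
    """Combine two adjacent subtrees into one parent node."""
    return Node(left.text + right.text, left, right)


def chain(s: str) -> Node:
    """Build a left-branching tree over the characters of s.

    Stand-in for an induced structure; a real tree would come from the model.
    """
    node = Node(s[0])
    for ch in s[1:]:
        node = merge(node, Node(ch))
    return node


def tokenize_top_down(node: Node, vocab: Set[str]) -> List[str]:
    """Emit the largest in-vocabulary spans, visiting the tree from the root.

    Assumption: a span is emitted whole if it is in the vocabulary or if it
    is a leaf; otherwise the search recurses into both children.
    """
    if node.text in vocab or node.left is None or node.right is None:
        return [node.text]
    return tokenize_top_down(node.left, vocab) + tokenize_top_down(node.right, vocab)


# Hand-built tree that splits "windsurf" at the constituent boundary wind|surf.
tree = merge(chain("wind"), chain("surf"))

print(tokenize_top_down(tree, {"wind", "surf"}))  # ['wind', 'surf']
print(tokenize_top_down(tree, {"wind", "sur"}))   # ['wind', 'sur', 'f']
```

Because matching follows the tree rather than raw character statistics, a token never straddles the boundary between the two top-level constituents unless the whole word itself is in the vocabulary.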