Token Alignment via Character Matching for Subword Completion
CoRR (2024)
Abstract
Generative models, widely utilized in various applications, can often
struggle with prompts corresponding to partial tokens. This struggle stems from
tokenization, where partial tokens fall out of distribution during inference,
leading to incorrect or nonsensical outputs. This paper examines a technique to
alleviate tokenization artifacts in text completion with generative models,
maintaining performance even in regular, non-subword cases. The method, termed
token alignment, involves backtracking to the last complete tokens and ensuring
the model's generation aligns with the prompt. This approach showcases marked
improvement across many partial token scenarios, including nuanced cases like
space-prefix and partial indentation, with only a minor time increase. The
technique and analysis detailed in this paper contribute to the continuous
advancement of generative models in handling partial inputs, bearing relevance
for applications like code completion and text autocompletion.
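The core idea — backtrack to the last complete token boundary, then constrain generation to tokens whose characters match the remaining partial text — can be sketched as follows. This is a minimal illustration with a toy vocabulary and a greedy tokenizer; all names and the vocabulary are illustrative assumptions, not the paper's implementation.

```python
# Toy vocabulary for illustration (not from the paper).
VOCAB = ["def", " main", " ma", "in", "()", " m", "a", "i", "n", " ", "d", "e", "f"]

def tokenize(text):
    """Greedy longest-match tokenization over VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"untokenizable: {text[i:]!r}")
    return tokens

def backtrack(prompt, n_back=1):
    """Drop the last n_back tokens of the prompt; return the kept
    token prefix (as text) and the leftover partial text."""
    tokens = tokenize(prompt)
    kept = tokens[:-n_back] if n_back else tokens
    prefix = "".join(kept)
    return prefix, prompt[len(prefix):]

def aligned_candidates(remainder, vocab=VOCAB):
    """Character matching: keep only vocabulary tokens consistent with
    the partial text -- either the token starts with the remainder
    (completing it) or the remainder starts with the token (consuming
    part of it)."""
    return [t for t in vocab
            if t.startswith(remainder) or remainder.startswith(t)]

# The prompt "def ma" ends mid-token; the model conditions on "def"
# and generation is constrained to tokens matching " ma".
prefix, rest = backtrack("def ma")
candidates = aligned_candidates(rest)
```

In a real decoder this character-matching constraint would be applied as a mask over the model's next-token logits at each step until the partial text is fully consumed, after which generation proceeds unconstrained.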