Getting the most out of your tokenizer for pre-training and domain adaptation
CoRR (2024)
Abstract
Tokenization is an understudied and often neglected component of modern LLMs.
Most published works use a single tokenizer for all experiments, often borrowed
from another model, without performing ablations or analysis to optimize
tokenization. Moreover, the tokenizer is generally kept unchanged when
fine-tuning a base model. In this paper, we show that the size,
pre-tokenization regular expression, and training data of a tokenizer can
significantly impact the model's generation speed, effective context size,
memory usage, and downstream performance. We train specialized Byte-Pair
Encoding code tokenizers, and conduct extensive ablations on the impact of
tokenizer design on the performance of LLMs for code generation tasks such as
HumanEval and MBPP, and provide recommendations for tokenizer hyper-parameter
selection and for switching the tokenizer in a pre-trained LLM. We perform our
experiments on models trained from scratch and on pre-trained models,
verifying the applicability of our findings to a wide range of use cases. We find that when
fine-tuning on more than 50 billion tokens, we can specialize the tokenizer of
a pre-trained LLM to obtain large gains in generation speed and effective
context size.
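To make the abstract's mention of a "pre-tokenization regular expression" concrete: before Byte-Pair Encoding merges are learned or applied, the input is typically split into chunks by a regex, and merges never cross chunk boundaries. The sketch below is a minimal illustration using a simplified pattern loosely modeled on GPT-2-style pre-tokenization; the actual regexes studied in the paper are not reproduced here.

```python
import re

# Simplified pre-tokenization pattern (an illustrative assumption, not the
# paper's actual regex): optional leading space followed by a run of letters,
# digits, or punctuation, or bare whitespace.
PAT = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

def pretokenize(text: str) -> list[str]:
    """Split text into chunks that BPE merges may not cross."""
    return PAT.findall(text)

print(pretokenize("def add(a, b): return a + b"))
```

Changing this pattern changes which substrings BPE can merge into single tokens (e.g. whether whitespace runs or identifier-plus-punctuation sequences can fuse), which in turn affects the tokens-per-character ratio that drives generation speed and effective context size.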