Language Models as Science Tutors
CoRR(2024)
摘要
NLP has recently made exciting progress toward training language models (LMs)
with strong scientific problem-solving skills. However, model development has
not focused on real-life use-cases of LMs for science, including applications
in education that require processing long scientific documents. To address
this, we introduce TutorEval and TutorChat. TutorEval is a diverse
question-answering benchmark consisting of questions about long chapters from
STEM textbooks, written by experts. TutorEval helps measure real-life usability
of LMs as scientific assistants, and it is the first benchmark combining long
contexts, free-form generation, and multi-disciplinary scientific knowledge.
Moreover, we show that fine-tuning base models with existing dialogue datasets
leads to poor performance on TutorEval. Therefore, we create TutorChat, a
dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to
fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized
in math have a 32K-token context window, and they excel at TutorEval while
performing strongly on GSM8K and MATH. Our datasets build on open-source
materials, and we release our models, data, and evaluations.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要