Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring
arxiv(2024)
Abstract
Generating rationales that justify scoring decisions has been a promising way
to facilitate explainability in automated scoring systems. However, existing
methods do not match the accuracy of classifier-based methods. Plus, the
generated rationales often contain hallucinated information. To address these
issues, we propose a novel framework capable of generating more faithful
rationales and, more importantly, matching performance with classifier-based
black-box scoring systems. We first mimic the human assessment process by
querying Large Language Models (LLMs) to generate a thought tree. We then
summarise intermediate assessment decisions from each thought tree path for
creating synthetic rationale data and rationale preference data. Finally, we
utilise the generated synthetic data to calibrate LLMs through a two-step
training process: supervised fine-tuning and preference optimization. Extensive
experimental results demonstrate that our framework achieves a 38
performance improvement in the QWK score compared to prior work while producing
higher-quality rationales, as recognised by human evaluators and LLMs. Our work
sheds light on the effectiveness of performing preference optimization using
synthetic preference data obtained from thought tree paths.
MoreTranslated text
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined