Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI
CoRR(2024)
摘要
In the dynamic landscape of generative NLP, traditional text processing
pipelines limit research flexibility and reproducibility, as they are tailored
to specific dataset, task, and model combinations. The escalating complexity,
involving system prompts, model-specific formats, instructions, and more, calls
for a shift to a structured, modular, and customizable solution. Addressing
this need, we present Unitxt, an innovative library for customizable textual
data preparation and evaluation tailored to generative language models. Unitxt
natively integrates with common libraries like HuggingFace and LM-eval-harness
and deconstructs processing flows into modular components, enabling easy
customization and sharing between practitioners. These components encompass
model-specific formats, task prompts, and many other comprehensive dataset
processing definitions. The Unitxt-Catalog centralizes these components,
fostering collaboration and exploration in modern textual data workflows.
Beyond being a tool, Unitxt is a community-driven platform, empowering users to
build, share, and advance their pipelines collaboratively. Join the Unitxt
community at https://github.com/IBM/unitxt!
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要