TextMachina: Seamless Generation of Machine-Generated Text Datasets
CoRR (2024)
Abstract
Recent advancements in Large Language Models (LLMs) have led to high-quality
Machine-Generated Text (MGT), giving rise to countless new use cases and
applications. However, easy access to LLMs also poses new challenges, as they can be misused. To address malicious usage, researchers have released datasets to
effectively train models on MGT-related tasks. Similar strategies are used to
compile these datasets, but no tool currently unifies them. In this scenario,
we introduce TextMachina, a modular and extensible Python framework, designed
to aid in the creation of high-quality, unbiased datasets to build robust
models for MGT-related tasks such as detection, attribution, or boundary
detection. It provides a user-friendly pipeline that abstracts away the
inherent intricacies of building MGT datasets, such as LLM integrations, prompt
templating, and bias mitigation. The quality of the datasets generated by
TextMachina has been assessed in previous works, including shared tasks where
more than one hundred teams trained robust MGT detectors.
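As a rough illustration of what such a dataset-building pipeline does, the sketch below pairs human texts with machine completions, labels each example, and shuffles to mitigate ordering bias. All names here are hypothetical stand-ins for illustration only; this is not TextMachina's actual API, and `fake_llm` substitutes for a real LLM call.

```python
import random

def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM completion call (assumption for this sketch).
    return "machine continuation of: " + prompt

def build_detection_dataset(human_texts, seed=0):
    """Build a labeled human-vs-generated dataset from human source texts."""
    rng = random.Random(seed)
    dataset = []
    for text in human_texts:
        dataset.append({"text": text, "label": "human"})
        # Derive a prompt from the leading words of the human text,
        # then generate a machine-written counterpart.
        prompt = " ".join(text.split()[:5])
        dataset.append({"text": fake_llm(prompt), "label": "generated"})
    rng.shuffle(dataset)  # shuffle so label order carries no signal
    return dataset

data = build_detection_dataset(
    ["The quick brown fox jumps over the lazy dog."]
)
```

A real pipeline would additionally handle prompt templating per task (detection, attribution, boundary detection) and deeper bias controls such as length matching between classes.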