MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages
CoRR (2024)
Abstract
Text detoxification is a textual style transfer (TST) task in which a text is
paraphrased from a toxic surface form, e.g. one featuring rude words, into a
neutral register. Recently, text detoxification methods have found
applications in various tasks, such as detoxification of Large Language Models
(LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and combating
toxic speech in social networks (Deng et al., 2023; Mun et al., 2023;
Agarwal et al., 2023). All these applications are extremely important for
ensuring safe communication in the modern digital world. However, the previous
approaches to collecting parallel text detoxification corpora – ParaDetox
(Logacheva et al., 2022) and APPDIA (Atwell et al., 2022) – were explored only
in a monolingual setup. In this work, we extend the ParaDetox pipeline to
multiple languages, presenting MultiParaDetox to automate parallel
detoxification corpus collection for potentially any language. We then
experiment with different text detoxification models – from unsupervised
baselines to LLMs and models fine-tuned on the presented parallel corpora –
showing the clear benefit of a parallel corpus for obtaining
state-of-the-art text detoxification models for any language.
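To make the "unsupervised baselines" mentioned above concrete, a common trivial baseline is lexicon-based deletion: simply drop tokens that appear in a list of toxic words. The sketch below is illustrative only; the lexicon, function name, and example sentence are placeholders and not taken from the paper's data or code.

```python
# Minimal sketch of a delete-style unsupervised detoxification baseline:
# remove any token found in a toxic-word lexicon. The lexicon here is a
# hypothetical placeholder, not the paper's resource.
TOXIC_LEXICON = {"idiot", "stupid", "damn"}

def delete_baseline(text: str) -> str:
    """Detoxify a sentence by dropping tokens found in the toxic lexicon."""
    kept = [
        tok for tok in text.split()
        if tok.lower().strip(".,!?") not in TOXIC_LEXICON
    ]
    return " ".join(kept)

print(delete_baseline("Stop being so stupid about this"))
# Such deletion preserves no fluency guarantees, which is why parallel
# corpora and fine-tuned paraphrasing models outperform it.
```

Baselines like this are cheap but often produce disfluent output, motivating the parallel-data approach the abstract advocates.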