Can Watermarks Survive Translation? On the Cross-lingual Consistency of Text Watermark for Large Language Models
CoRR (2024)
Abstract
Text watermarking technology aims to tag and identify content produced by
large language models (LLMs) to prevent misuse. In this study, we introduce the
concept of "cross-lingual consistency" in text watermarking, which assesses
the ability of text watermarks to maintain their effectiveness after being
translated into other languages. Preliminary empirical results from two LLMs
and three watermarking methods reveal that current text watermarking
technologies lack consistency when texts are translated into various languages.
Based on this observation, we propose a Cross-lingual Watermark Removal Attack
(CWRA) to bypass watermarking by first obtaining a response from an LLM in a
pivot language, which is then translated into the target language. CWRA can
effectively remove watermarks by reducing the Area Under the Curve (AUC) from
0.95 to 0.67 without performance loss. Furthermore, we analyze two key factors
that contribute to the cross-lingual consistency in text watermarking and
propose a defense method that increases the AUC from 0.67 to 0.88 under CWRA.
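The CWRA pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the `llm` and `translate` callables, the choice of Chinese as the pivot language, and all names are assumptions for the sake of the example.

```python
# Hypothetical sketch of the Cross-lingual Watermark Removal Attack (CWRA):
# query the watermarked LLM in a pivot language, then translate the response
# into the target language, disrupting token-level watermark signals.

def cwra(prompt: str, llm, translate) -> str:
    """Illustrative CWRA pipeline (helpers are stand-ins, not the paper's API)."""
    pivot_prompt = translate(prompt, src="en", tgt="zh")    # 1. prompt -> pivot language
    pivot_response = llm(pivot_prompt)                      # 2. watermarked response in pivot language
    return translate(pivot_response, src="zh", tgt="en")    # 3. translate back to target language

# Toy stand-ins so the sketch runs end to end.
fake_translate = lambda text, src, tgt: f"[{src}->{tgt}] {text}"
fake_llm = lambda text: f"response to ({text})"

print(cwra("Summarize this article.", fake_llm, fake_translate))
```

The key point is that the watermark is embedded in the pivot-language token sequence (step 2), so the final translation (step 3) produces text the watermark detector never sampled.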