Discourse Component to Sentence (DC2S) - An Efficient Human-Aided Construction of Paraphrase and Sentence Similarity Dataset.

LREC(2020)

Abstract
Assessing sentence similarity and detecting paraphrases are essential tasks in both theory and practice, but building a reliable dataset is resource-intensive. In this paper, we propose a discourse component-based paraphrase generation scheme for directive utterances that is efficient in terms of human-aided construction and content preservation. All discourse components are expressed as natural language phrases, and the phrases are created with both speech act and topic in mind, which enables controlled construction of the sentence similarity dataset. We investigate the validity of our scheme using Korean, a language with diverse paraphrasing due to frequent subject drop and scrambling. From 1,000 intent argument phrases and the 10,000 utterances generated from them, we build a sentence similarity dataset of practically sufficient size. It contains five sentence pair types, including paraphrase, and has a total volume of about 550K. To demonstrate the utility of the scheme and dataset, we measure similarity matching performance with conventional natural language inference models and suggest that the scheme can be extended to other languages.
Keywords
discourse component, paraphrasing, generation, sentence similarity test
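
As a rough illustration of how utterances grouped by intent argument phrase could be turned into sentence pairs, the sketch below enumerates within-group pairs. It rests on two assumptions that the abstract does not state: that utterances generated from the same phrase count as paraphrases of each other, and that the 10,000 utterances are spread evenly over the 1,000 phrases (10 each). The function names and the English toy data are hypothetical stand-ins, not the authors' code or data.

```python
from itertools import combinations
from math import comb


def within_phrase_pairs(utterances_by_phrase):
    """Yield (phrase, a, b) for every unordered pair of utterances
    that were generated from the same intent argument phrase."""
    for phrase, utterances in utterances_by_phrase.items():
        for a, b in combinations(utterances, 2):
            yield phrase, a, b


def within_phrase_pair_count(n_phrases, utterances_per_phrase):
    """Number of unordered within-phrase pairs, assuming every phrase
    has the same number of utterances (an assumption, see lead-in)."""
    return n_phrases * comb(utterances_per_phrase, 2)


# Hypothetical English toy data; the dataset itself is Korean.
toy = {
    "the weather tomorrow": [
        "What is the weather tomorrow?",
        "Tell me the weather tomorrow.",
        "Do you know the weather tomorrow?",
    ],
    "turning off the light": [
        "Turn off the light.",
        "Could you turn off the light?",
        "Don't leave the light on.",
    ],
}

pairs = list(within_phrase_pairs(toy))
print(len(pairs))
print(within_phrase_pair_count(1000, 10))
```

Under these assumptions, within-phrase pairing alone accounts for only part of the reported volume of about 550K; the abstract does not say how the total is divided among the five pair types.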