MCSCSet: A Specialist-annotated Dataset for Medical-domain Chinese Spelling Correction

Wangjie Jiang, Zhihao Ye, Zijing Ou, Ruihui Zhao, Jianguang Zheng, Yi Liu, Bang Liu, Siheng Li, Yujiu Yang, Yefeng Zheng

Conference on Information and Knowledge Management (2022)

Abstract
Chinese Spelling Correction (CSC) has gained increasing attention in recent years. Despite its extensive use in applications such as search engines and optical character recognition systems, it has been little explored in medical scenarios, where complex and uncommon medical entities are easily misspelled. Correcting misspellings of medical entities is arguably more difficult than correcting those in the open domain because it requires specific domain knowledge. In this work, we define the task of Medical-domain Chinese Spelling Correction (MCSC) and propose MCSCSet, a large-scale specialist-annotated dataset that contains about 200k samples. In contrast to existing open-domain CSC datasets, MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, and ii) corresponding misspelled sentences manually annotated by medical specialists. Our work further offers a medical-domain confusion set consisting of the common error-prone characters in medicine and their corresponding misspellings. Extensive empirical studies show significant gaps between open-domain and medical-domain spelling correction, highlighting the need to develop high-quality datasets that allow for CSC in specific domains. Moreover, our work benchmarks several representative methods, establishing baselines for future work.
Key words
dataset, medical domain, Chinese spelling correction