Chinese Word Segmentation with Many Rare Terms in Low-Resource Scenarios

Jianhua Xu, Binbin Zhang, Jianyu Li, Yongjia Zhang

DSDE '22: 2022 the 5th International Conference on Data Storage and Data Engineering (2022)

Abstract
When constructing domain-specific knowledge graphs, the texts accumulated in a domain can serve as data sources for analysis. In many domains, however, these texts contain many rare terms that make generic corpora inapplicable, while no domain-specific corpus is available. Existing Chinese word segmentation (CWS) corpora and methods cannot segment this type of text effectively. For such texts without an applicable corpus, this paper proposes a domain-dictionary-based Chinese word segmentation method built on the BiLSTM-CNN-CRF model. We first manually label a portion of the samples, then combine randomly selected words from the dictionary into the manually labeled sentences to generate pseudo-labeled data, and merge the two to obtain a composite training set. We then preprocess the texts, replacing the rare terms with non-segmentable strings to further improve segmentation accuracy. The experimental results show that our approach achieves higher accuracy, recall, and F1 score when segmenting texts with many rare terms in low-resource scenarios. The approach can be applied to Chinese word segmentation in specific domains that contain rare terms.
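
The two data-side steps described in the abstract (pseudo-labeled data generation and rare-term replacement) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the BMES tag scheme, insertion at random word boundaries, the private-use placeholder character, longest-match lookup, and the hypothetical helper names `to_bmes`, `make_pseudo_sample`, and `mask_rare_terms`. The BiLSTM-CNN-CRF tagger itself is not shown.

```python
import random


def to_bmes(words):
    """Convert a list of segmented words into per-character BMES tags (assumed scheme)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def make_pseudo_sample(words, dictionary, rng, n_insert=1):
    """Insert n_insert randomly chosen dictionary terms at random word boundaries.

    `words` is a manually segmented sentence; the result is a new segmented
    sentence whose labels follow directly from the known word boundaries.
    """
    out = list(words)
    for _ in range(n_insert):
        term = rng.choice(dictionary)
        pos = rng.randint(0, len(out))  # randint is inclusive, so appending is possible
        out.insert(pos, term)
    return out


def mask_rare_terms(text, dictionary, placeholder="\uE000"):
    """Replace each dictionary term in `text` with one placeholder character.

    Longest terms are tried first. Returns the masked text and the list of
    replaced terms in order, so they can be restored after segmentation.
    """
    terms = sorted((t for t in dictionary if t), key=len, reverse=True)
    masked, found = [], []
    i = 0
    while i < len(text):
        for t in terms:
            if text.startswith(t, i):
                masked.append(placeholder)
                found.append(t)
                i += len(t)
                break
        else:
            masked.append(text[i])
            i += 1
    return "".join(masked), found


if __name__ == "__main__":
    rng = random.Random(0)
    labeled = ["患者", "服用", "后", "好转"]        # one manually labeled sentence
    domain_dict = ["阿司匹林", "阿司匹林肠溶片"]    # illustrative domain dictionary
    pseudo = make_pseudo_sample(labeled, domain_dict, rng)
    print(pseudo, to_bmes(pseudo))
    print(mask_rare_terms("患者服用阿司匹林肠溶片后好转", domain_dict))
```

Mapping each rare term to a single character is one way to make it "non-segmentable" for a character-level tagger; the abstract does not specify the replacement string, so other encodings are equally plausible.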