Performance Comparison of Similarity Measure Algorithm as Data Preprocessing Stage: Text Normalization in Bahasa

Achmad Yohni Wahyu Finansyah, FNU Afiahayati, Vincent Michael Sutanto

Scientific Journal of Informatics (2022)

Abstract
Purpose: As technology develops, more and more data are stored as text, which makes text processing harder. It also exposes a weakness in text preprocessing algorithms: two texts that mean the same thing can be treated as distinct. Text therefore needs to be normalized to the standard form of each word in the target language. Spelling correction is often used for this, but little research exists on spelling correction algorithms for Bahasa Indonesia. A comparison is needed to identify which spelling correction algorithm makes normalization most effective.

Methods: In this study, we compared three algorithms: Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman. The algorithms were evaluated on questionnaire data and tweet data, both of which are in Bahasa Indonesia.

Result: Jaro-Winkler achieved the fastest normalization time, averaging 31.01 seconds for the questionnaire data and 59.27 seconds for the tweet data. Levenshtein Distance achieved the best accuracy, with 44.90% for the questionnaire data and 60.04% for the tweet data.

Novelty: The novelty of this research is its comparison of similarity measure algorithms on Bahasa Indonesia text, which identifies the most suitable similarity measure algorithm for the language.
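To make the normalization step concrete, the following is a minimal sketch of dictionary-based normalization using Levenshtein Distance, one of the three algorithms compared. It is not the authors' implementation: the function names, the tiny vocabulary, and the `max_distance` threshold are all assumptions chosen for illustration, and the abstract does not describe how the study selected candidates or thresholds.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalize(word: str, vocabulary: list[str], max_distance: int = 2) -> str:
    """Map a word to its nearest vocabulary entry (hypothetical helper).

    The word is returned unchanged when no entry lies within max_distance,
    an assumed cutoff that is not taken from the paper.
    """
    best = min(vocabulary, key=lambda entry: levenshtein(word, entry))
    return best if levenshtein(word, best) <= max_distance else word


# Illustrative vocabulary of standard Indonesian words (assumed, not from the study).
VOCABULARY = ["tidak", "sudah", "terima", "kasih"]

if __name__ == "__main__":
    for token in ["tdk", "sdh", "xyzxyz"]:
        print(token, "->", normalize(token, VOCABULARY))
```

Swapping `levenshtein` for a Jaro-Winkler or Smith-Waterman scoring function would give the other two variants in the comparison, although those are similarity scores rather than distances, so the selection would use `max` and a lower bound instead.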
Keywords
text normalization, similarity measure algorithm, data preprocessing stage