A time-series analysis of vocabulary in Japanese texts: Non-characteristic words and topic words

Quantitative Approaches to Universality and Individuality in Language(2022)

Abstract
In this study, I analyzed the distribution of words within texts from a time-series perspective. The data comprised 635 texts from the Balanced Corpus of Contemporary Written Japanese, each containing between 1,950 and 2,050 word tokens. Each text was divided into 10 segments containing an equal number of words, and the distribution of words among the segments was investigated. The relationship between how often a word appeared and the characteristics of that word was also analyzed. The following conclusions were drawn from the results. (1) The distribution of words among the segments follows a decreasing, Zipf-like curve but rises again near the end of the curve. (2) At the token level, as the word appearance ratio increases, the ratio of particles increases and the ratio of nouns decreases. The ratio of auxiliary verbs also becomes slightly higher, and the ratio of verbs shows no considerable change. (3) At the type level, by contrast, the proportions of the parts of speech remain almost unchanged. (4) On average, about 12 words per text appeared in all segments, and there was no significant difference between registers. (5) In total, 470 different words appeared in all segments. From the point of view of discourse structure, they were divided into topic words, scene words, function words, and non-characteristic words, and were classified according to the number of texts in which they appeared.
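The segmentation step described in the abstract can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes each text is already tokenized into a list of words (Japanese text would first need a morphological analyzer), and the function names `segment_counts` and `distribution` are hypothetical. How the study handled texts whose length is not a multiple of 10 is not stated; here the leftover tokens at the end are simply dropped.

```python
from collections import Counter


def segment_counts(tokens, n_segments=10):
    """Map each word type to the number of segments in which it occurs.

    Assumption: trailing tokens that do not fill a whole segment are
    dropped so that all segments contain the same number of words.
    """
    seg_len = len(tokens) // n_segments
    counts = Counter()
    for i in range(n_segments):
        segment = tokens[i * seg_len:(i + 1) * seg_len]
        # set() ensures a word is counted at most once per segment
        counts.update(set(segment))
    return counts


def distribution(counts, n_segments=10):
    """Return a list whose entry k-1 is the number of word types
    that occur in exactly k segments (k = 1 .. n_segments)."""
    hist = Counter(counts.values())
    return [hist.get(k, 0) for k in range(1, n_segments + 1)]


def words_in_all_segments(counts, n_segments=10):
    """Word types that occur in every segment of the text."""
    return sorted(w for w, c in counts.items() if c == n_segments)
```

Applying `distribution` to every text and averaging the resulting lists would give a curve of the kind discussed in conclusion (1), and `words_in_all_segments` yields the per-text word sets counted in conclusions (4) and (5).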
Keywords
Japanese texts, vocabulary, time-series, non-characteristic