Associative Root-Pattern Data And Distribution In Arabic Morphology

DATA(2018)

引用 5|浏览3
暂无评分
摘要
This paper intends to present a large-scale dataset for Arabic morphology from a cognitive point of view considering the uniqueness of the root-pattern phenomenon. The center of attention is focused on studying this singularity in terms of estimating associative relationships between roots as a higher level of abstraction for words meaning, and all their potential occurrences with multiple morpho-phonetic patterns. A major advantage of this approach resides in providing a novel balanced large-scale language resource, which can be viewed as an instantiated global root-pattern network consisting of roots, patterns, stems, and particles, estimated statistically for studying the morpho-phonetic level of cognition of Arabic. In this context, this paper asserts that balanced root-distribution is an additional significant key criterion for evaluating topic coverage in an Arabic corpus. Furthermore, some additional novel probabilistic morpho-phonetic measures and their distribution have been estimated in the form of root and pattern entropies besides bi-directional conditional probabilities of bi-grams of stems, roots, and particles. Around 29.2 million webpages of ClueWeb were extracted, filtered from non-Arabic texts, and converted into a large textual dataset containing around 11.5 billion word forms and 9.3 million associative relationships. As this dataset is predominantly considering the root-pattern phenomenon in Semitic languages, the acquired data might be significant support for researchers interested in studying phenomena of Arabic such as visual word cognition, morpho-phonetic perception, morphological analysis, and cognitively motivated query expansion, spell-checking, and information retrieval. Furthermore, based on data distribution and frequencies, constructing balanced corpora will be easier.
更多
查看译文
关键词
cognitive linguistics, Arabic morphology, corpus linguistics, associative root-pattern network, bi-gram analysis
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要