Not All Character N-grams Are Created Equal: A Study in Authorship Attribution

North American Chapter of the Association for Computational Linguistics (2015)

Abstract
Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate the predictiveness of each of these groups in two AA settings: a single-domain setting and a cross-domain setting where multiple topics are present. We demonstrate that character n-grams that capture information about affixes and punctuation account for almost all of the power of character n-grams as features. Our study contributes new insights into the use of n-grams for future AA work and other classification tasks.
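
To make the idea of n-gram subgroups concrete, the following is a minimal sketch of how character n-grams might be tagged by their position in a word or their use of punctuation. The category names (`prefix`, `suffix`, `whole-word`, `mid-word`, `space`, `punct`) and the functions `categorize` and `typed_char_ngrams` are hypothetical simplifications chosen for illustration; the abstract does not specify the exact taxonomy, so this should not be read as the authors' implementation.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)


def categorize(text: str, start: int, n: int) -> str:
    """Assign one coarse, hypothetical category to the n-gram text[start:start+n]."""
    gram = text[start:start + n]
    # Any punctuation character makes this a punctuation n-gram.
    if any(ch in PUNCT for ch in gram):
        return "punct"
    # N-grams containing a space straddle a word boundary.
    if " " in gram:
        return "space"
    # Look at the neighbouring characters; text boundaries count as spaces.
    before = text[start - 1] if start > 0 else " "
    after = text[start + n] if start + n < len(text) else " "
    starts_word = not before.isalnum()
    ends_word = not after.isalnum()
    if starts_word and ends_word:
        return "whole-word"
    if starts_word:
        return "prefix"
    if ends_word:
        return "suffix"
    return "mid-word"


def typed_char_ngrams(text: str, n: int = 3) -> Counter:
    """Count (category, n-gram) pairs so the same string in different roles stays distinct."""
    counts = Counter()
    for i in range(len(text) - n + 1):
        counts[(categorize(text, i, n), text[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    for (category, gram), count in sorted(typed_char_ngrams("The actor acted.").items()):
        print(f"{category:10s} {gram!r} {count}")
```

Keying the counts on the (category, n-gram) pair means the same three characters are counted separately when they occur as a prefix and when they occur mid-word, which is what allows each subgroup to be evaluated as its own feature set.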