Authorship attribution of comments in Portuguese extracted from Reddit

Vinicius Alves Matias, Luciano Antonio Digiampietri

REVISTA BRASILEIRA DE COMPUTACAO APLICADA (2023)

Abstract
Online interaction environments such as social networks carry large volumes of text that implicitly reflect each user's writing style. Given the constant, intense flow of information through such systems, techniques are needed that can attribute a text to one of two candidate authors, for example to prevent banned users from returning to a platform. This paper presents and evaluates different approaches to authorship attribution using natural language processing and machine learning, based on Portuguese-language comments extracted from the social network Reddit. It aims to update the authorship attribution literature for Portuguese, given the scarcity of recent work in this language. The results of several viable methods for binary authorship attribution are reported and assessed for statistical significance. Two independent models within the same confidence interval reached an F1-score of 0.88 and an AUC of 0.94, using textual features extracted from BERTimbau embeddings and from word-level TF-IDF.
Keywords
Portuguese
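
To make the word-level TF-IDF idea from the abstract concrete, here is a minimal sketch of binary authorship attribution in plain Python. It is not the paper's pipeline: the toy comments, the author labels, the helper names (`fit_idf`, `tfidf`, `attribute`), and the nearest-centroid cosine rule are all assumptions chosen for illustration, and the abstract does not say which classifier was used.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption, not the paper's).
    return text.lower().split()

def fit_idf(docs):
    # Smoothed inverse document frequency over the training comments.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(text, idf):
    # L2-normalised TF-IDF vector; words unseen in training are dropped.
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def centroid(vectors):
    # Mean TF-IDF vector of one author's comments.
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: s / len(vectors) for t, s in total.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute(text, idf, centroids):
    # Pick whichever of the two candidate authors is closer in TF-IDF space.
    vec = tfidf(text, idf)
    return max(centroids, key=lambda author: cosine(vec, centroids[author]))

# Hypothetical toy data: two candidate authors, two comments each.
train = {
    "author_a": ["acho que o jogo ficou muito bom mesmo",
                 "o jogo novo ficou muito bom"],
    "author_b": ["nao concordo com essa politica economica",
                 "essa politica economica nao funciona"],
}
idf = fit_idf([c for comments in train.values() for c in comments])
centroids = {a: centroid([tfidf(c, idf) for c in comments])
             for a, comments in train.items()}

guess_a = attribute("esse jogo ficou bom", idf, centroids)
guess_b = attribute("politica economica ruim", idf, centroids)
```

A real evaluation would use held-out comments and report F1-score and AUC as the abstract does; the sketch only shows how TF-IDF features can separate two candidate authors.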