Discovering Named Entities at Scale of a Data Mining Perspective on Large Text Corpora

2024 IEEE 3rd International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA) (2024)

Abstract
The need to protect sensitive data is growing and becoming increasingly critical, in part because of the regulations and directives enforced by the European Union. While efforts to create automated systems are ongoing, they are usually supported by manual or semi-automated processes. In this study, we created a component that identifies and extracts sensitive information from unstructured European Portuguese text. The aim was to create a solution that helps companies understand their data and comply with security and legal obligations. We examined a hybrid approach to the Named Entity Recognition problem in Portuguese. We propose a novel security framework, an entity recognition model called RDF-RCF, that relies on regular expressions, known-entity dictionaries, randomised conditional fields (RCF), and four feature templates. To further enhance recognition performance, the RCF-based extractor makes use of the entities found by the rule-based and dictionary-based extractors. The known-entity dictionary retrieves both specific and common security entities, while the rule-based extractor matches security entities with good accuracy in simpler scenarios. The SIGARRA News Corpus, the DataSense NER Corpus, and the HAREM Golden Collection were used for training and testing. The experimental results show that the model can outperform state-of-the-art techniques.
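The abstract states that the RCF-based extractor uses the entities found by the rule-based and dictionary-based extractors. One way this could work is to expose those findings as per-token features for a sequence labeller. The following is a minimal sketch under that assumption, not the authors' implementation: the patterns, the dictionary entries, the feature names, and the helper functions are all hypothetical, and the sequence model itself is omitted.

```python
import re

# Hypothetical patterns; the paper's actual regular expressions are not given.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "POSTAL_CODE": re.compile(r"\b\d{4}-\d{3}\b"),  # Portuguese postal code shape
}

# Hypothetical known-entity dictionary (lowercased surface form -> label).
KNOWN_ENTITIES = {"porto": "LOCATION", "lisboa": "LOCATION"}

TOKEN_RE = re.compile(r"\S+")


def tokenize(text):
    """Split on whitespace, keeping character offsets for each token."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def rule_spans(text):
    """Return (start, end, label) for every regular-expression match."""
    return [(m.start(), m.end(), label)
            for label, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]


def token_features(text):
    """Build one feature dict per token for a downstream sequence labeller."""
    spans = rule_spans(text)
    features = []
    for token, start, end in tokenize(text):
        word = token.strip(".,;:!?()")
        # Label of the first rule match overlapping this token, else "O".
        rule_label = next(
            (lab for s, e, lab in spans if s < end and start < e), "O")
        features.append({
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "rule_label": rule_label,                             # rule-based extractor
            "dict_label": KNOWN_ENTITIES.get(word.lower(), "O"),  # dictionary extractor
        })
    return features
```

Calling `token_features("Contacte ana@example.pt no Porto, 4200-465.")` yields one feature dictionary per token, which a sequence model could consume alongside its other feature templates.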
Keywords
European Union, Randomised conditional fields (RCF), Portuguese language, Named Entity Recognition