Exploring the Design Space of Unsupervised Blocking with Pre-trained Language Models in Entity Resolution

Advanced Data Mining and Applications: 19th International Conference, ADMA 2023, Shenyang, China, August 21–23, 2023, Proceedings, Part I (2023)

Abstract
Entity resolution (ER) finds records that refer to the same real-world entities. Blocking is an important task in ER that filters out unnecessary comparisons and speeds up ER. Blocking is usually an unsupervised task. In this paper, we develop an unsupervised blocking framework based on pre-trained language models (B-PLM). B-PLM exploits the powerful linguistic expressiveness of pre-trained language models. The design space of B-PLM contains two steps: (1) the Record Embedding step generates record embeddings with pre-trained language models such as BERT and Sentence-BERT; (2) the Block Generation step generates blocks with clustering algorithms and similarity search methods. We explore multiple combinations along these two dimensions of B-PLM. We evaluate B-PLM on six datasets (structured, dirty, and textual). B-PLM is superior to previous deep learning methods on textual and dirty datasets. We perform extensive experiments to compare and analyze different combinations of record embedding and block generation, and finally recommend several good combinations within B-PLM.
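The two-step design space described above can be illustrated with a minimal sketch. The code below is not the authors' implementation; it assumes Sentence-BERT's all-MiniLM-L6-v2 model for the Record Embedding step and a k-nearest-neighbor search over cosine distance for the Block Generation step, one of several embedding/blocking combinations the paper explores.

```python
# Minimal sketch of a B-PLM-style pipeline (assumptions noted above, not the paper's exact method).
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors


def serialize(record):
    """Flatten a record's attribute values into a single string for the language model."""
    return " ".join(str(v) for v in record.values() if v)


def build_blocks(records, k=5):
    # Step 1: Record Embedding -- encode each serialized record with a pre-trained model.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(
        [serialize(r) for r in records], normalize_embeddings=True
    )

    # Step 2: Block Generation -- group each record with its k nearest neighbors
    # in embedding space (similarity search; clustering is another option).
    nn = NearestNeighbors(n_neighbors=min(k, len(records)), metric="cosine")
    nn.fit(embeddings)
    _, neighbor_ids = nn.kneighbors(embeddings)
    return [set(ids.tolist()) for ids in neighbor_ids]


if __name__ == "__main__":
    records = [
        {"title": "iPhone 13 Pro 128GB", "brand": "Apple"},
        {"title": "Apple iPhone 13 Pro (128 GB)", "brand": ""},
        {"title": "Galaxy S21 Ultra", "brand": "Samsung"},
    ]
    print(build_blocks(records, k=2))
```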
Keywords
unsupervised blocking, entity resolution, language models, pre-trained