The Case for Cost-Sensitive and Easy-To-Interpret Models in Industrial Record Linkage

mag(2013)

Abstract
Record Linkage (RL) is the task of identifying two or more records that refer to the same entity (e.g., a person or a company). RL systems have traditionally handled all input record types in the same way. In an industrial setting, however, business imperatives (such as privacy constraints and government regulation) often force RL systems to operate under extremely strict limits on false positive or false negative error rates. For instance, false positive errors can be life threatening when identifying medical records, while false negative errors on criminal records can lead to serious legal issues. In this paper we introduce RL models based on Cost-Sensitive Alternating Decision Trees (ADTree), an algorithm that combines boosting and decision tree algorithms to create shorter and easier-to-interpret linking rules. These models present a two-fold advantage over traditional RL approaches. First, they can be naturally trained to operate at industrial precision/recall operating points. Second, the shorter output rules are clear enough that the model's decisions can be explained to non-technical users via score aggregation or visualization. Experiments show that the proposed models significantly outperformed other baselines at the desired industrial operating points, and that the improved understanding of the model's decisions led to faster debugging and feature development cycles. We then describe how we deployed the model in a commercial RL system that takes several billion personal records covering nearly the entire U.S. population as input, obtaining a 6:1 ratio of input records to output profiles with an estimated 99.6%/86.2% precision/recall trade-off. This system was then deployed in a commercial e-commerce website, as well as in the subdomain of linking criminal records, where it obtained a 99.7%/82.9% precision/recall trade-off overall.
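To make the "score aggregation" idea concrete, here is a minimal Python sketch of how an alternating decision tree could score one candidate record pair and report which rules contributed. It is an illustration under stated assumptions, not the paper's implementation: the feature names (`name_sim`, `same_birth_year`, `same_zip`), the node values, and the helper names (`PredictionNode`, `SplitterNode`, `score`, `is_link`) are all hypothetical, and the tree is written by hand rather than learned by boosting. The adjustable threshold in `is_link` is only a simplified stand-in for the cost-sensitive training the abstract refers to.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]          # comparison features for one record pair
Trace = List[Tuple[str, bool, float]]  # (rule name, outcome, contribution)


@dataclass
class PredictionNode:
    """Holds a real-valued contribution and any rules nested under it."""
    value: float
    splitters: List["SplitterNode"] = field(default_factory=list)


@dataclass
class SplitterNode:
    """A yes/no test leading to one of two prediction nodes."""
    name: str
    test: Callable[[Features], bool]
    if_true: PredictionNode
    if_false: PredictionNode


def score(node: PredictionNode, x: Features, trace: Trace) -> float:
    """Sum the values of every prediction node reachable for x.

    Unlike an ordinary decision tree, all splitters under a reached
    prediction node are evaluated, so several paths contribute.
    """
    total = node.value
    for splitter in node.splitters:
        outcome = splitter.test(x)
        child = splitter.if_true if outcome else splitter.if_false
        trace.append((splitter.name, outcome, child.value))
        total += score(child, x, trace)
    return total


def is_link(x: Features, tree: PredictionNode, threshold: float = 0.0) -> bool:
    """Declare a match when the aggregated score exceeds the threshold.

    Assumption: a higher threshold trades recall for precision.
    """
    return score(tree, x, []) > threshold


# Hand-written toy tree (hypothetical values, not learned from data).
birth_year_rule = SplitterNode(
    "same_birth_year",
    lambda x: x["same_birth_year"] == 1.0,
    PredictionNode(0.75),
    PredictionNode(-0.5),
)
toy_tree = PredictionNode(
    -0.5,
    [
        SplitterNode(
            "name_sim > 0.9",
            lambda x: x["name_sim"] > 0.9,
            PredictionNode(1.5, [birth_year_rule]),  # nested rule
            PredictionNode(-1.0),
        ),
        SplitterNode(
            "same_zip",
            lambda x: x["same_zip"] == 1.0,
            PredictionNode(0.5),
            PredictionNode(-0.25),
        ),
    ],
)

if __name__ == "__main__":
    pair = {"name_sim": 0.95, "same_birth_year": 1.0, "same_zip": 1.0}
    trace: Trace = []
    total = score(toy_tree, pair, trace)
    print("root prior:", toy_tree.value)
    for name, outcome, contribution in trace:
        print(f"{name} -> {outcome}: {contribution:+.2f}")
    print("total score:", total, "| link:", is_link(pair, toy_tree))
```

Because the final score is a plain sum of a handful of signed contributions, the trace can be shown to a non-technical reviewer as a short list of reasons for or against a link, which is the kind of explanation the abstract describes.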