Large-scale matching algorithm for linking biomedical data warehouse records with the national mortality database in France (Preprint)

crossref(2022)

Abstract
BACKGROUND Vital status after discharge is often missing or uncertain in biomedical data warehouses (BDWs), yet it is central to their value for medical research. The French national mortality database (FNMD) offers openly available nominative records of every death. Matching large-scale BDW records with the FNMD combines several challenges: the absence of a unique identifier common to the two databases, names that change over a lifetime, clerical errors, and the rapid growth in the number of comparisons to compute. OBJECTIVE We aimed to develop a new algorithm for matching BDW records to the FNMD and to evaluate its performance. METHODS We developed a deterministic algorithm based (i) on advanced data cleaning and knowledge of the naming system and (ii) on the Damerau-Levenshtein distance (DLD). The algorithm's performance was assessed independently on BDW data from three university hospitals: Lille, Nantes, and Rennes. Specificity was evaluated on subjects known to be alive on 1 January 2016, i.e. subjects with at least one hospital encounter before and after this date. Sensitivity was evaluated on subjects recorded as deceased between 1 January 2001 and 31 December 2020. The DLD-based algorithm was compared with a reference direct matching algorithm that used minimal data cleaning. RESULTS Across all centers, sensitivity was about 11 percentage points higher for the DLD-based algorithm (93.3%, 95% confidence interval [92.8-93.9]) than for the direct algorithm (82.7% [81.8-83.6], P<.001). Sensitivity was higher for men in two centers (Nantes: 87% [85.1-89] vs 83.6% [81.4-85.8], P=.006) and for subjects born in France in all centers (Nantes: 85.8% [84.3-87.3] vs 74.6% [72.8-76.4], P<.001). The sensitivity of the DLD-based algorithm differed significantly between centers (85.3% for Nantes vs 97.3% for Lille and Rennes, P<.001). Specificity was higher than 98% in all subgroups.
Our algorithm was able to match tens of millions of death records from BDW, with parallel computing capabilities and low RAM requirements. The open source R script is available at https://gitlab.com/ricdc/insee-deces. CONCLUSIONS Overall, sensitivity (recall) was about 11 percentage points higher with the DLD-based algorithm than with the direct algorithm. This shows the importance of advanced data cleaning and of knowledge of the naming system, combined with the use of the DLD. Statistically significant differences in sensitivity were found between groups and must be considered when performing an analysis, to avoid differential bias. Our algorithm, originally conceived for linking a BDW with the FNMD, can be used to match any large-scale databases. Because matching operations that use names are considered sensitive computational operations, it matters that the Inseehop package released here is easy to run on premises, which facilitates compliance with local cybersecurity frameworks. The use of an advanced deterministic matching algorithm such as the DLD-based algorithm illustrates how combining BDWs with open external data sources can improve their usage value.
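
The two ingredients named in METHODS, name cleaning and a Damerau-Levenshtein comparison, can be illustrated with a minimal Python sketch. This is not the authors' R implementation: the cleaning rules, the record field names (`id`, `last_name`, `first_name`, `birth_date`), the blocking on birth date, and the `max_distance` threshold are all assumptions made for illustration. The distance used is the optimal string alignment variant, a common restricted form of the Damerau-Levenshtein distance.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Uppercase, strip accents, and turn punctuation into single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    letters = "".join(c if c.isalpha() else " " for c in no_accents.upper())
    return " ".join(letters.split())


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or exact match
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def match_records(bdw_records, death_records, max_distance=1):
    """Return (bdw_id, death_id) pairs whose cleaned names are close.

    Hypothetical blocking rule: only records that share a birth date are
    compared, which avoids computing every pairwise distance.
    """
    by_birth_date = defaultdict(list)
    for rec in death_records:
        by_birth_date[rec["birth_date"]].append(rec)

    matches = []
    for rec in bdw_records:
        last = normalize_name(rec["last_name"])
        first = normalize_name(rec["first_name"])
        for cand in by_birth_date.get(rec["birth_date"], []):
            total = osa_distance(last, normalize_name(cand["last_name"]))
            total += osa_distance(first, normalize_name(cand["first_name"]))
            if total <= max_distance:
                matches.append((rec["id"], cand["id"]))
    return matches
```

Grouping death records by birth date before comparing names is one simple way to keep the number of distance computations manageable; a real linkage would likely need additional fields and rules for maiden names, compound first names, and missing dates.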