An Algorithm for Matching Heterogeneous Financial Databases: A Case Study for COMPUSTAT/CRSP and I/B/E/S Databases

Applied Economics and Finance(2014)

Abstract
Rigorous linking of financial databases is a necessary step in testing trading strategies that incorporate multimodal sources of information. This paper proposes a machine learning solution for matching companies across heterogeneous financial databases. Our method, named Financial Attribute Selection Distance (FASD), has two stages, each corresponding to one of the two interrelated tasks commonly involved in heterogeneous database matching problems: schema matching and entity matching. FASD's schema matching procedure is based on the Kullback-Leibler divergence of string and numeric attributes. FASD's entity matching solution relies on learning a company distance that is flexible enough to handle the numeric and string attribute links found by the schema matching algorithm and to incorporate different string matching approaches, such as edit-based and token-based metrics. The parameters of the distance are optimized using the F-score as the cost function. FASD is able to match the joint COMPUSTAT/CRSP and Institutional Brokers' Estimate System (I/B/E/S) databases with an F-score over 0.94 using only a hundred manually labeled company links.
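To make the schema matching idea concrete, the following is a minimal sketch of pairing string attributes across two tables by Kullback-Leibler divergence. It is not the FASD procedure itself, whose details are not given in the abstract. The character-frequency representation, the add-one smoothing, the symmetrized divergence, the function names, and the toy column names and values are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumed alphabet for the toy example: uppercase letters, digits, and space.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "


def char_distribution(values, alphabet=ALPHABET):
    """Character-frequency distribution of a column, with add-one smoothing
    so that every symbol has non-zero probability."""
    counts = Counter(ch for v in values for ch in v.upper() if ch in alphabet)
    total = sum(counts.values()) + len(alphabet)
    return {ch: (counts[ch] + 1) / total for ch in alphabet}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for distributions on the same support."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


def symmetric_kl(p, q):
    """Symmetrized divergence (an assumption; plain KL is not symmetric)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def match_schema(table_a, table_b):
    """Map each column of table_a to the column of table_b whose character
    distribution is closest under the symmetrized divergence."""
    dists_a = {name: char_distribution(vals) for name, vals in table_a.items()}
    dists_b = {name: char_distribution(vals) for name, vals in table_b.items()}
    return {
        a: min(dists_b, key=lambda b: symmetric_kl(dists_a[a], dists_b[b]))
        for a in dists_a
    }


# Hypothetical toy tables: column names and values are made up.
table_a = {
    "name": ["APPLE INC", "MICROSOFT CORP", "INTL BUSINESS MACHINES"],
    "code": ["037833100", "594918104", "459200101"],
}
table_b = {
    "company_name": ["APPLE", "MICROSOFT", "IBM CORP"],
    "security_code": ["03783310", "59491810", "45920010"],
}

mapping = match_schema(table_a, table_b)
print(mapping)
```

In this sketch, columns holding company names are dominated by letters and columns holding identifiers by digits, so their distributions lie far apart and each column pairs with its counterpart. Numeric attributes would need a different representation, such as histograms over value ranges, which is omitted here.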
Keywords
Kullback-Leibler divergence