Effectiveness results for popular e-discovery algorithms.

ICAIL (2017)

Abstract
E-Discovery applications rely on binary text categorization to determine the relevance of documents to a particular case. Although many such categorization algorithms exist, vendors at present often deploy tools that include only a single text categorization approach. Unlike previous studies, which vary many evaluation parameters simultaneously, omit common current algorithms, weighting schemes, or features, or use small document collections that are no longer meaningful, we systematically evaluate binary text categorization algorithms using modern benchmark e-Discovery queries (topics) on a benchmark e-Discovery data set. We demonstrate the wide variance in performance across different parameter combinations, which motivates this evaluation. Specifically, we compare five text categorization algorithms, three term weighting techniques, and two feature types on a large standard dataset, and we evaluate the results of this test suite (30 variations) using the metrics of greatest interest to the e-Discovery community. Our findings systematically demonstrate that an e-Discovery project is better served by a suite of algorithms than by any single one, since performance varies greatly by topic and no approach is uniformly superior across the range of conditions and topics. To that end, we developed an open source project, FreeDiscovery, that provides e-Discovery projects with simplified access to a suite of algorithms.
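As a rough illustration of the kind of algorithm/weighting grid the abstract describes, here is a minimal sketch using scikit-learn. This is an assumption for illustration only: the paper does not publish its implementation here, this is not FreeDiscovery's code, and the corpus, classifier choices, and training-set scoring are toy stand-ins.

```python
# Illustrative sketch only (assumed scikit-learn-based setup, not the
# paper's actual test suite): compare a few binary text categorization
# algorithms under two term-weighting schemes.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny stand-in corpus; a real evaluation would use a benchmark
# e-Discovery collection and its relevance judgments per topic.
docs = [
    "merger agreement draft for attorney review",
    "privileged attorney client memo on the merger",
    "litigation hold notice covering the merger documents",
    "fantasy football picks for the weekend",
    "lunch menu and office party planning",
    "weekend travel plans and hotel booking",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = relevant to the case, 0 = not relevant

weightings = [("tf", CountVectorizer()), ("tf-idf", TfidfVectorizer())]
classifiers = [
    ("logreg", LogisticRegression()),
    ("naive-bayes", MultinomialNB()),
    ("linear-svm", LinearSVC()),
]

results = {}
for w_name, vec in weightings:
    for c_name, clf in classifiers:
        pipe = make_pipeline(vec, clf)  # fit() re-trains both stages
        pipe.fit(docs, labels)
        # Training accuracy only; a real study would score held-out
        # documents with e-Discovery metrics such as recall at depth.
        results[(w_name, c_name)] = pipe.score(docs, labels)

for combo, acc in sorted(results.items()):
    print(combo, round(acc, 3))
```

Running all combinations and comparing per-topic scores, rather than committing to one classifier up front, mirrors the paper's argument for deploying a suite of algorithms.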