Maximizing Information Gain in Privacy-Aware Active Learning of Email Anomalies
CoRR (2024)
Abstract
Redacted emails satisfy most privacy requirements, but they make it more
difficult to detect anomalous emails that may indicate data
exfiltration. In this paper we develop an enhanced method of Active Learning
using an information gain maximizing heuristic, and we evaluate its
effectiveness in a real-world setting where only redacted versions of emails
could be labeled by human analysts due to privacy concerns. In the first case
study we examined how Active Learning should be carried out. We found that
model performance was best when a single highly skilled (in terms of the
labeling task) analyst provided the labels. In the second case study we used
confidence ratings to estimate the labeling uncertainty of analysts and then
prioritized instances for labeling based on the expected information gain (the
difference between model uncertainty and analyst uncertainty) that would be
provided by labeling each instance. We found that the information gain
maximizing heuristic improved model performance over existing sampling methods for
Active Learning. Based on the results obtained, we recommend that analysts
be screened, and possibly trained, prior to implementation of Active
Learning in cybersecurity applications. We also recommend that the information
gain maximizing sampling method (based on expert confidence) be used in
early stages of Active Learning, provided that well-calibrated confidence can
be obtained. We also note that the expertise of analysts should be assessed
prior to Active Learning, as we found that analysts with lower labeling skill
had poorly calibrated confidence (overconfidence) in their labels.
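
The prioritization rule described in the abstract (expected information gain as the difference between model uncertainty and analyst uncertainty) can be illustrated with a minimal sketch. The abstract does not state how either uncertainty is measured, so everything specific here is an assumption made for illustration: both uncertainties are taken to be binary Shannon entropy, the analyst's confidence rating is treated as a probability that the label would be correct, and the names `binary_entropy`, `expected_information_gain`, and `rank_for_labeling` are hypothetical rather than taken from the paper.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a Bernoulli(p) variable (assumed uncertainty measure)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def expected_information_gain(model_prob: float, analyst_confidence: float) -> float:
    """Model uncertainty minus analyst uncertainty, as worded in the abstract.

    model_prob: model's predicted probability that the email is anomalous.
    analyst_confidence: assumed probability that the analyst's label is correct.
    """
    return binary_entropy(model_prob) - binary_entropy(analyst_confidence)


def rank_for_labeling(candidates):
    """Sort (email_id, model_prob, analyst_confidence) tuples by descending gain."""
    scored = [
        (email_id, expected_information_gain(p, c))
        for email_id, p, c in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical pool: the model is unsure about "a" and "b", but the
    # analyst is only expected to be confident about "a".
    pool = [("a", 0.5, 0.9), ("b", 0.5, 0.5), ("c", 0.9, 0.9)]
    for email_id, gain in rank_for_labeling(pool):
        print(email_id, round(gain, 3))
```

Under these assumptions, an instance ranks highly only when the model is uncertain and the analyst is expected to be confident. This is also why the heuristic depends on well-calibrated confidence: an overconfident analyst would inflate the estimated gain.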