Real-Time Self-Correcting Email Classifier

An email-classification system working in a corporate environment might try to cluster inbound emails sent by the same person or entity into groups (henceforth “campaigns”). Deciding which emails belong together is a non-trivial problem, however. In phishing, for example, emails may be sent from addresses that change subtly over the course of a campaign. Given the degree of automation in phishing production, a single actor can easily change the email address or subject line with each email in the campaign. Incidentally, this also poses a problem for systems that rely on maintaining lists of known abusive email addresses.

We created a system that tracks a wide range of indices based on the From header, subject line, URLs, and attachments. When an email arrives, the system checks these indices for similarities to other recent emails. If the email is similar to a member of an existing cluster, it is added to that cluster; if it is similar to an email that does not already belong to a cluster, the two form a new cluster. Because we could not assume that any single index would remain fixed during a campaign, the system also tolerates a degree of fuzziness when matching.
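The arrival logic described above can be sketched roughly as follows. This is a minimal illustration, not the system's implementation: the `Email` fields, the `Clusterer` class, the use of `difflib.SequenceMatcher` as the fuzzy measure, and the 0.85 cut-off are all assumptions made for the example, and attachments are omitted for brevity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional

FUZZY_THRESHOLD = 0.85  # hypothetical cut-off, not taken from the real system


@dataclass
class Email:
    sender: str
    subject: str
    urls: frozenset = frozenset()
    cluster_id: Optional[int] = None  # None until the email joins a cluster


def fuzzy(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_similar(a: Email, b: Email) -> bool:
    """Treat two emails as similar if any single index matches closely enough."""
    return (
        fuzzy(a.sender, b.sender) >= FUZZY_THRESHOLD
        or fuzzy(a.subject, b.subject) >= FUZZY_THRESHOLD
        or bool(a.urls & b.urls)  # at least one shared URL
    )


class Clusterer:
    """Hypothetical in-memory store of recent emails and their clusters."""

    def __init__(self) -> None:
        self.recent: List[Email] = []
        self._next_id = 0

    def add(self, email: Email) -> Optional[int]:
        """Record an arriving email and return its cluster id, if it has one."""
        for other in self.recent:
            if is_similar(email, other):
                if other.cluster_id is None:
                    # Two similar, unclustered emails form a new cluster.
                    other.cluster_id = self._next_id
                    self._next_id += 1
                email.cluster_id = other.cluster_id
                break
        self.recent.append(email)
        return email.cluster_id
```

Because a match on any one index is enough, a sender address that changes by a single character between emails still lands in the same cluster, which is the tolerance the paragraph above describes.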

An additional problem is that some emails are similar only by coincidence. If a coincidental similarity is established between one email and several others, and further coincidental similarities are then established between these emails and new arrivals, a runaway effect can quickly cluster a large number of emails that do not constitute a real campaign.
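A toy example may make the runaway effect concrete. The sketch below is deliberately naive and is not the system's logic: each email is reduced to a hypothetical set of feature strings, and an email joins any cluster containing a member with which it shares at least one feature.

```python
from typing import FrozenSet, List


def cluster_by_shared_feature(emails: List[FrozenSet[str]]) -> List[List[int]]:
    """Group email indices; one shared feature with any member is enough to join."""
    clusters: List[List[int]] = []
    for i, features in enumerate(emails):
        # Every existing cluster that has a member sharing a feature with email i.
        hits = [c for c in clusters if any(features & emails[j] for j in c)]
        merged = [i]
        for c in hits:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return clusters


# Each email overlaps only with its neighbour, yet all four end up together.
emails = [
    frozenset({"invoice", "acme.example"}),
    frozenset({"acme.example", "newsletter"}),
    frozenset({"newsletter", "short.example/x"}),
    frozenset({"short.example/x", "holiday"}),
]
print(cluster_by_shared_feature(emails))
```

The first and last emails have no feature in common, but the chain of pairwise overlaps pulls them into one cluster. This is the kind of false campaign that a correction mechanism needs to catch.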

A mechanism for self-correction

Rather than tracking ever finer criteria, we added a mechanism for self-correction. It addresses coincidental similarity by detecting fluctuation in sequences of scores that are computed for a different purpose elsewhere in the system. We posited that emails sent by the same actor with the same intent would be marked by the Darktrace anomaly-detection system in a predictable way, and we hypothesized that the chronological sequence of these anomaly scores would be smooth if the emails are indeed from the same actor. By contrast, if the system clusters emails that are in fact unrelated, the real-time sequence of anomaly scores would be far more likely to fluctuate unpredictably. Our data shows this hypothesis to be correct, and so we built a similarity classifier that computes a measure of fluctuation in real time. If the measure exceeds a certain threshold, the classifier stops clustering and removes the cluster’s status as a campaign.
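The abstract does not say which measure of fluctuation is used. As one possible illustration, the sketch below takes the mean absolute difference between consecutive anomaly scores over a sliding window; that choice, the `FluctuationMonitor` name, the threshold of 0.25, and the window of 20 scores are all assumptions made for the example.

```python
from collections import deque
from typing import Deque


class FluctuationMonitor:
    """Tracks one cluster's anomaly scores in arrival order (hypothetical helper)."""

    def __init__(self, threshold: float = 0.25, window: int = 20) -> None:
        self.threshold = threshold  # assumed cut-off for "too erratic"
        self.scores: Deque[float] = deque(maxlen=window)
        self.is_campaign = True

    def fluctuation(self) -> float:
        """Mean absolute change between consecutive scores in the window."""
        s = list(self.scores)
        if len(s) < 2:
            return 0.0
        return sum(abs(b - a) for a, b in zip(s, s[1:])) / (len(s) - 1)

    def observe(self, score: float) -> bool:
        """Record the next score; return whether the cluster is still a campaign."""
        if self.is_campaign:
            self.scores.append(score)
            if self.fluctuation() > self.threshold:
                # Scores are too erratic: stop clustering and drop campaign status.
                self.is_campaign = False
        return self.is_campaign


smooth = FluctuationMonitor()
for score in [0.80, 0.82, 0.79, 0.81, 0.83]:
    smooth.observe(score)

erratic = FluctuationMonitor()
for score in [0.80, 0.10, 0.90, 0.20]:
    erratic.observe(score)

print(smooth.is_campaign, erratic.is_campaign)
```

Each update touches only the scores in the window, so the check can run as every email arrives. Once a cluster loses its campaign status in this sketch, later scores are ignored, which mirrors the classifier stopping clustering.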

Researcher

Dr. Steven Haworth
