Analysis of email structure to detect malicious intent

Using AI to determine whether an email is trying to induce the reader to do something that would lead to compromise, such as clicking on a link or opening a file.

Malicious emails usually try to induce the recipient to take a particular action. For example, extortion emails may try to force the recipient to make a cryptocurrency payment, and phishing emails may tempt the recipient to click a link that points to a malicious payload or a fake login screen.

Looking for such emails by watching for specific content, addresses, or domains is a poor strategy because these features change over time. The content might change to capitalize on a contemporary topic, such as the COVID-19 pandemic, and attackers send from new addresses to circumvent blocklists.

We have created a classifier that analyzes incoming email and assigns a score in four categories of inducement: extortion, solicitation, phishing, and other spam.

The classifier does this by analyzing email structure and non-specific content. Structure variables include, among other things, the number of sentences, average sentence length, average paragraph length, and the number of characters that come before the first link. Content variables include the number and density of hyperlinks, references to currency, HTML tags, and non-standard punctuation.
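To make these variables concrete, here is a minimal sketch of how a few of them could be computed from a plain-text email body. It is an illustration only, not the classifier's actual feature code: the function name, the regular expressions, and the choice of currency terms are all assumptions, and HTML tag counting is left out.

```python
import re

# Illustrative patterns only; these are assumptions, not the research team's definitions.
LINK_RE = re.compile(r"https?://\S+")
CURRENCY_RE = re.compile(r"[$€£]|\b(?:usd|eur|gbp|btc|bitcoin)\b", re.IGNORECASE)
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")
ODD_PUNCT_RE = re.compile(r"[!?]{2,}")  # runs such as "!!" or "?!"


def extract_features(body: str) -> dict:
    """Compute a handful of structure and content variables for one email body."""
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    sentences = [s for s in SENTENCE_END_RE.split(body) if s.strip()]
    words = body.split()
    links = LINK_RE.findall(body)
    first_link = LINK_RE.search(body)
    return {
        # Structure variables
        "num_sentences": len(sentences),
        "avg_sentence_words": len(words) / max(len(sentences), 1),
        "avg_paragraph_words": len(words) / max(len(paragraphs), 1),
        # If there is no link, fall back to the full body length.
        "chars_before_first_link": first_link.start() if first_link else len(body),
        # Content variables
        "num_links": len(links),
        "link_density": len(links) / max(len(words), 1),
        "currency_refs": len(CURRENCY_RE.findall(body)),
        "odd_punctuation": len(ODD_PUNCT_RE.findall(body)),
    }


if __name__ == "__main__":
    sample = (
        "Your account is locked!! Pay $500 in bitcoin now.\n\n"
        "Click http://example.test/login to continue."
    )
    for name, value in extract_features(sample).items():
        print(f"{name}: {value}")
```

None of these values depends on a particular topic, sender, or domain, which is why features of this kind survive changes in the lure.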

Language processing also identifies words and phrases associated with each category, and these contribute to the four scores.
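As a rough illustration of how phrase matches could feed the four category scores, the sketch below counts hits against a small lexicon and normalizes them. The lexicon entries, the category keys, and the normalization are hypothetical; the article does not describe the language processing actually used, which is likely more sophisticated than substring matching.

```python
import re

# Hypothetical phrase lists, invented for illustration only.
LEXICON = {
    "extortion": ("bitcoin wallet", "pay within", "recorded you"),
    "solicitation": ("business proposal", "investment opportunity"),
    "phishing": ("verify your account", "log in", "password expires"),
    "spam": ("unsubscribe", "limited offer"),
}


def phrase_scores(body: str) -> dict:
    """Return each category's share of the phrase matches found in the body."""
    # Lowercase and collapse whitespace so phrases match across line breaks.
    text = re.sub(r"\s+", " ", body.lower())
    hits = {
        category: sum(text.count(phrase) for phrase in phrases)
        for category, phrases in LEXICON.items()
    }
    total = sum(hits.values())
    # With no matches at all, every category scores 0.0.
    return {category: (n / total if total else 0.0) for category, n in hits.items()}


if __name__ == "__main__":
    print(phrase_scores("Please log in to verify your account."))
```

In a fuller system these phrase scores would be one input among many, combined with the structure and content variables rather than used on their own.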

Because it does not rely on specific content or sender addresses, this approach deals with new content and new email addresses better than many other approaches, and it is demonstrably effective. For example, the classifier has identified many phishing emails related to COVID-19 despite being trained on data from before the pandemic. One such email encouraged employees to log in and contribute to their company’s COVID relief fund.

Additionally, a profile of typical behavior can be developed for each sender by tracking these scores over time. By comparing the inducement scores of a new email with the scores of that sender's previous emails (using both probabilistic and substantiality metrics), we can obtain an “inducement shift” score. We can pass this score to an autonomous-response platform as a supplement to the inducement score, because a large shift might indicate an account takeover, which deserves a more robust response.
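The article does not define the probabilistic and substantiality metrics, so the following is only one possible reading, sketched under stated assumptions: "substantiality" is taken as the absolute change from the sender's historical mean, and the probabilistic part is approximated by how many standard deviations that change represents. The function name, category keys, `min_std` floor, and the way the two terms are combined are all hypothetical.

```python
from statistics import mean, pstdev

CATEGORIES = ("extortion", "solicitation", "phishing", "spam")


def inducement_shift(history, new_scores, min_std=0.05):
    """Score how far a new email departs from a sender's past inducement scores.

    history: list of per-email score dicts for one sender (at least one entry).
    new_scores: score dict for the incoming email.
    Returns a value in [0, 1) when all scores lie in [0, 1].
    """
    shift = 0.0
    for category in CATEGORIES:
        past = [scores[category] for scores in history]
        baseline = mean(past)
        # Floor the spread so a perfectly consistent sender does not divide by zero.
        spread = max(pstdev(past), min_std)
        size = abs(new_scores[category] - baseline)  # how large the change is
        surprise = size / spread                     # how unusual it is for this sender
        unusualness = surprise / (1.0 + surprise)    # squash to [0, 1)
        # Report the category with the largest combined shift.
        shift = max(shift, size * unusualness)
    return shift


if __name__ == "__main__":
    usual = {"extortion": 0.0, "solicitation": 0.1, "phishing": 0.05, "spam": 0.1}
    history = [dict(usual) for _ in range(5)]
    suspicious = dict(usual, phishing=0.9)
    print(inducement_shift(history, usual))
    print(inducement_shift(history, suspicious))
```

Multiplying the two terms means a shift scores highly only when it is both large and out of character for that sender, which matches the intent of flagging a possible account takeover rather than ordinary variation.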