Matryoshka Bloom filters quickly and efficiently determine rarity

A novel data structure for the efficient storage and retrieval of statistics on how common strings are in specific environments.

Download this research paper

We consider the rarity of certain strings when analyzing the behavior of devices and users. The strings might be hostnames or domain names in HTTP requests, email addresses, or executable names. We can determine their rarity by observing network traffic, making lists of known entities over time, and comparing new observations to old.

Matryoshka Bloom filters form the basis of a new method to determine rarity by finding out how often a string appears in these lists. A Bloom filter is a method of probabilistically determining if a specific string is a member of a set. A Bloom filter contains an array of bits and computes several hashes for each string to produce a list of indices that point to elements of this array. If these elements are all ones, it is highly probable that the string appears in the set. If a single bit is set to zero, the string does not appear in the set. The Bloom filter uses much less memory than a hash set would use to determine set membership, at the cost of a small but non-zero probability of a false positive. Choosing an appropriate set of filter parameters (such as array size and number of hashes) can reduce this cost.

However, a standard Bloom filter can only give a binary answer to the question of membership. By structuring the popularity data as a sequence of nested sets, each of which is a subset of the next (like Matryoshka dolls), we can construct a series of Bloom filters. These filters are able to give each string a rarity score when queried (between zero and the number of “popularity” sets available). This structure increases the fault tolerance of the filters and reduces the scope for false positives. Additionally, the small size of the filters allows us to distribute the information needed to perform these lookups among devices on a distributed deployment. This reduces bandwidth requirements to keep our rarity data up-to-date.