Data Structure: Bloom Filter
Is this element present in the set? Bloom filter can provide a quicker way to check whether or not the given element is in the set. It is a space efficient data structure because it a bit array and doesn’t save actual elements. However, the downside is it may give false positive result.
What is a Bloom Filter?
A bloom filter is a probabilistic data structure which was first introduced by Burton Howard Bloom in 1970. It can quickly tell if an element is either definitely not present in the set or probably present in the set.
Data structure
A bloom filter is a bit array of m bits. An empty bloom filter is initially set all bits to 0. When an element is added to the set, bits of corresponding indices which are generated by multiple hash functions will be set to 1. Therefore, it stores hashed data by change value of bits at the corresponding indices rather than store actual elements. It is very space efficient.
Insert Elements
When an element is added to the bloom filter, it must be hashed using multiple independent hash functions. Each hash function will generate a corresponding index in the bit array. Then, it will set those indices to 1.
For example, we want check if a url “https://testurl1.com” is existed in the set. First, we hash the given url using three hash functions and then get 3 corresponding indices, (0, 4 ,7). Then, the value of those indices in the bit array are set to 1.
Lookup Elements
When we want to check set membership of an element, we need to hash the given element using the same hash functions that we use to hash an insert element and get corresponding indices. If all the bits at those indices are 1, the element is probably existed in the set. However, if any bit at those indices is 0, the element is definitely not existed in the set.
For example, when we check whether or not url “https://testurl1.com” is present in the set, we hash the url and get corresponding indices to check. The values at those indices in the bit array are (1, 1, 1). Therefore, this url probably in the set.
On the other hand, when we check if url “https://testurl2.com” is present in the set, we observed that values at those indices are (1, 1, 0). One of bits at those indices are 0, so this url is definitely not in the set.
Why False Positive?
Why bloom filter may give false positive result? Assume in our bloom filter, we have two urls, “https://testurl1.com” and “https://testurl2.com” in the set. (see below image) If we want to check whether or not “https://testurl3.com” is in the set, we hash the url and get corresponding indices, (0, 7 ,9). Then look at the bit array, it shows that all bits at those indices have been set to 1 by previous urls. The bloom filter still consider “https://testurl3.com” as a set member because all the bits at indices, (0, 7, 9) are 1. Thus, It gives a false positive result. ( https://testurl3.com” is NOT in the set, so this result is not current.)
As the added elements are increased in the bloom filter, the more bits are filled up, the higher the false positive rate.
How to decrease false positive rate?
1. Larger bit array
A larger bit array will have less false positives. The smaller bit array will suffer more false positives. Because the smaller bit array will be filled up quicker.
I implemented a simple bloom filter using go bloom package. Built different sizes of bloom filters and insert ten urls(from “https://testurl1.com” to “https://testurl10.com”) to all bloom filters. Then test ten not existed urls (from “https://testurl11.com” to “https://testurl20.com”) to check false positive results.
The result shows that the smaller the size of the filter, the false positive rate is higher.
2. More hash functions
Although using more hash functions can reduce the probability of collision, the more hash functions we use, the quicker the bit array will be filled up. Thus, as we insert more elements to the bloom filter, the probability of collision may get worsen.
I built bloom filters with different number of hash functions and insert ten urls(from “https://testurl1.com” to “https://testurl10.com”) to all bloom filters. Then test ten not existed urls (from “https://testurl11.com” to “https://testurl20.com”) to check false positive results.
The result shows that the more hash functions doesn’t decrease false positive rate efficiently.
The performance of a bloom filter is determined by the size of the bit array, hash functions you choose, the number of hash functions you use and how many data you want to store.
You can access the source code here.