Having finished Coursera’s Algorithms course recently, I went looking for more interesting algorithms to implement. I found one data structure that can store a large amount of data, uses less memory than the competition, has a short and concise implementation, and offers a chance to practice some probability theory: the Bloom filter.
You can use a Bloom filter when you need a fast answer to a question like “have I seen this item before?”, usually to avoid an otherwise costly operation to locate the item.
For example, imagine a server whose data is stored across the network. When a request comes in, it’s worth checking whether the item even exists before making a network call to retrieve it.
A Bloom filter uses a bit array to track which items it has seen before. Each item is saved by turning a few bits from 0 to 1. The exact bits are determined by running the item through a few hash functions and mapping each resulting hash to a number from 0 to m − 1, where m is the size of the bit array.
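To make that concrete, here is a minimal sketch in Java. The class name and the hashing scheme (re-hashing the item with the round number appended) are my own illustration, not a production-quality design:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: adding an item sets k bits; a lookup
// reports a possible match only if all k bits are set. False positives
// are possible, false negatives are not.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // size of the bit array
    private final int k; // number of hash functions

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Map the item's i-th hash to a bit index in [0, m).
    // Appending the round number stands in for k real hash functions.
    private int index(String item, int i) {
        return Math.floorMod((item + "#" + i).hashCode(), m);
    }

    void add(String item) {
        for (int i = 0; i < k; i++) bits.set(index(item, i));
    }

    boolean mightContain(String item) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(item, i))) return false;
        return true;
    }
}
```

Because an item’s bits are always set on `add`, a lookup for an item that was added can never miss; only the reverse error (a false positive) can occur.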
The other data structure you might use for this purpose is a hash table. However, because a hash table needs to store the complete key, it uses a lot more memory: with 1 billion keys of 255 bytes each, we need at least 255GB. That’s more than most EC2 machines have. A comparable Bloom filter only needs about 12GB to track the same 1 billion keys.
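The exact size depends on the false-positive rate you are willing to accept. The standard sizing formulas (recapped in the articles listed below) can be sketched in a few lines; the 1% rate here is an example value I chose, not one from this post:

```java
// Standard Bloom filter sizing:
//   m = -n * ln(p) / (ln 2)^2   bits needed for n items at false-positive rate p
//   k = (m / n) * ln 2          optimal number of hash functions
class BloomSizing {
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    static int optimalHashes(long n, long m) {
        return (int) Math.max(1, Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000_000L; // one billion keys
        double p = 0.01;         // example 1% false-positive rate
        long m = optimalBits(n, p);
        System.out.printf("bits: %d (~%.1f GB), hash functions: %d%n",
                m, m / 8.0 / 1e9, optimalHashes(n, m));
    }
}
```

Tightening the target rate p increases m, so the gigabyte figure moves with the error rate you pick.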
One gotcha when computing bit indices with shifts: in languages such as Java (and JavaScript), the shift distance for a 32-bit integer is masked to its low five bits, so `1 << 128` equals `1`. In C++, shifting by at least the width of the type is undefined behavior, and the same operation would typically have resulted in 0.
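Assuming Java as the host language (C# and JavaScript mask shift distances the same way), the masking can be observed directly:

```java
class ShiftDemo {
    public static void main(String[] args) {
        // Java masks an int shift distance to its low 5 bits,
        // so 128 & 31 == 0 and shifting by 128 is a no-op.
        System.out.println(1 << 128); // prints 1
        System.out.println(1 << 1);   // prints 2
    }
}
```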
While researching Bloom filters, I found these articles useful:
- Wikipedia is a good starting point.
- Bill Mill’s Blog has a nice visualization of Bloom filters and a recap of the formulas for choosing n, m, and k.
- Ilya Sterin’s Blog summarizes the paper demonstrating that only two hash functions can achieve results similar to k independent hash functions.
- Aldo Cortesi’s Blog contains good rules of thumb for picking n, m, and k.
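The two-hash-function trick summarized in Ilya Sterin’s post (due to Kirsch and Mitzenmacher) derives all k bit indices as g_i(x) = h1(x) + i·h2(x) mod m. A hedged sketch, where the second hash is my own stand-in:

```java
// Kirsch–Mitzenmacher double hashing: simulate k hash functions
// with two, via g_i(x) = h1(x) + i * h2(x) (mod m), which yields
// asymptotically the same false-positive rate as k independent hashes.
class DoubleHashing {
    static int[] indices(String item, int k, int m) {
        int h1 = item.hashCode();
        // An assumed, cheap second hash for illustration; forced odd
        // so consecutive indices use a varying stride.
        int h2 = Integer.reverse(h1) | 1;
        int[] out = new int[k];
        for (int i = 0; i < k; i++)
            out[i] = Math.floorMod(h1 + i * h2, m);
        return out;
    }
}
```

The appeal of the trick is speed: a Bloom filter hashes each item only twice, however large k is.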