Swiping Right on Bloom Filters
Anyone who has used a mobile dating app has been asked to swipe through a list of faces, picking out the ones they want to talk to and kicking the rest to the digital curb. Swiping is simple and easy, but serving up those millions of faces to phones around the world is not.
Building the infrastructure to support tens of millions of swipes every day presents a substantial number of technical challenges. At OkCupid, we recently tackled the challenge of storing all of our users’ swipes more efficiently.
Our users swipe over 25 million times a day
Because our users swipe over 25 million times a day, simply remembering who we’ve already shown is not a trivial task. By using Bloom filters, a ‘sketch’ data structure which is exceptionally space efficient, we were able to reduce the memory we used on our system by 40%. In absolute terms, we saved 1.2 TB of RAM. That 1.2 TB savings made up nearly a third of the total memory capacity of all our matching servers at the time.
Though Bloom filters are incredibly efficient, they typically cannot simply be dropped in as a replacement for a tree or hash table due to the limited set of operations they can perform, as well as their approximate nature. Many problems will take a bit of massaging to get the Bloom filter to be useful. In our case, that meant we had to store only a subset of our swipes, the ‘passes’. The rest of this post goes into the why’s and the how’s of our journey to get Bloom filters to work for us.
Bloom Filters 101
Bloom filters store a large set of objects in a manner that consumes less memory than if that set were stored naively. In brief, a Bloom filter works by storing only a few bits of hashed information for each item it stores, rather than the entire item itself. Bloom filters are described in detail in other posts on the Internet, so we recommend reading one of those if you want the nitty gritty of the inner workings.
It may seem surprising, given the incredible efficiency of Bloom filters, that they aren’t used more commonly. However, sketch data structures such as Bloom filters work by trading incredible space efficiency for a substantial list of restrictions on how you can use the stored data. For Bloom filters in particular, these restrictions are as follows:
- You must not need to enumerate your set, only check set intersection.
- You must be able to tolerate false positives. That is, sometimes the Bloom filter will tell you an item is in the set that is not
- You must know the approximate size of the set beforehand 1 2
These are the constraints of the classical Bloom filter, and they illustrate why it is far from a panacea for all memory problems. We were very excited when we realized that, despite these constraints, Bloom filters were still a perfect fit for the problem of storing swipes.
First, for those unfamiliar with the concept of swiping in dating apps, a quick primer: we show you a picture of a user3, and you decide whether you like them or not. If you like them, you swipe their picture off to the right, and if not, you swipe them off to the left.
To power this feature, we need to keep a list of everyone that you’ve ever swiped on, either as a like or a pass. That way, when you come back for more people to swipe, we don’t show you duplicates. On top of this, we need to integrate the ‘already swiped’ list with our substantial matching infrastructure.
Swipes were taking up 1.9 TB of memory
The good and the bad news for us is that people love swiping. We get 25 million swipes every day, which is great for trying to match up our users, but not so great for our servers, dutifully storing all these swipes in memory. A year and a half after this feature launched, swipes were taking up approximately 1.9 TB of memory on our servers4.
Though Bloom filters are far more efficient than our current storage mechanism, using them to store all the swipes was not possible. This is because we DO need to be able to enumerate all of the ‘likes’ that users have recorded, to display features such as ‘who you like’ and ‘who likes you’. However, it turns out that we have no need to enumerate all of the pass votes, since users tend to be much less interested in who passed on them than who liked them. Since passes make up about 70% of swipes, we used a Bloom filter for just the pass swipes.
False positives also prevent us from using the Bloom filter to store ‘like’ swipes. If we stored ‘likes’ in the Bloom filter, a false positive might mean that we highlight users as ‘matches’ who have never even seen you. However, for ‘pass’ swipes, getting a false positive out of the Bloom filter will only mean that they are incorrectly filtered out of the user’s ‘to vote’ queue. This is both invisible and relatively harmless to the end user’s experience.
Putting it all together
To actually implement the Bloom filter storage, we had to address the problem of sizing. A classic Bloom filter needs to be sized when it is created, which means we need to know the approximate number of items it will store ahead of time5. Unfortunately, we don’t know how many swipes we’ll need to store in the Bloom filter ahead of time, since users will always be swiping away, adding more stuff for us to store.
Our solution to this was relatively simple and straightforward. Each new user starts with a small, fixed size Bloom filter. When that user has swiped enough that they exceed that size, we reload all of their swipe data from the database and rebuild the Bloom filter at double the size. This process is repeated as needed. Since we needed the code to rebuild the Bloom filter from scratch anyway (for server restarts), this solution was easy to write and easy to understand.
We were able to save 1.2 TB of RAM on our servers
When we actually launched the Bloom filter, the results were even better than we anticipated. We were able to save 1.2 TB of RAM on our servers, which amounted to roughly 40% of the total memory of our match system at the time. We capped our false positive rate at 0.5%, meaning that we would only spuriously filter about 1 in 200 users6. Even better, the set of users who are incorrectly filtered will change every time the Bloom filter was resized, meaning that as you swipe more, you uncover people who might have previously been filtered out by this system.
Bloom filters are a great example of a specialized data structure that can be very powerful in the situations it’s designed to handle. Finding and actually applying this structure in a real world situation is rarely as straightforward as it seems like it should be, but the rewards are always worthwhile.
Technically, any number of items can fit into any size Bloom filter, if you don’t care what the false positive rate is. Most practical uses warrant keeping the false positive rate relatively low, however. ↩
On top of all these constraints, it’s worth emphasizing that a Bloom filter is only a memory optimization, not a speed optimization. It will generally be slower to query than a hash table. ↩
We actually show a decent portion of the user’s profile as well, but we have some hard data that says that in the aggregate, people don’t care. ↩
The astute reader may wonder why we don’t use an on-disk database rather than storing everything in memory. The short answer is speed. In a simpler system, we could use caching and/or precalculated results to meet our speed requirements. However, our matching system is unfortunately difficult to cache, meaning our results need to be calculated in real time. Thus, the next simplest solution is to keep the data we need in memory, and optimize that data to be as space-efficient as we can. ↩
One of the main extensions of Bloom filters is relaxing this constraint. The tradeoff is that the data structures that result are somewhat more complicated, which makes them both harder to correctly implement, and harder to reason about in a production environment. ↩
We say that we ‘capped’ the false positive rate because this is the false positive rate when each Bloom filter is close to ‘full’. The actual measured false positive rate is around 0.02%, because many of the Bloom filters are not nearly close to being full, and the improvement in false positive rate is better than linear. ↩