Evaluating Perceptual Image Hashes at OkCupid
At OkCupid (we're hiring!), we sometimes run into users and bots that re-use the same photos we've seen before. While sometimes we can simply check if a photo has been seen (and banned) before, we often run into the case where a banned photo has been manipulated slightly -- so an exact comparison won't work. Even a slight shift in the image would completely ruin an exact match. This is a problem we identified years ago, and we had a partial solution using a simple perceptual hash function. However over the years we've become familiar with the failings of the current system. The need for a more robust approach magnified once we made photos a requirement for new users.
There are a few ways to approach this problem, however the most common way is using a perceptual hash. For the uninitiated, a hash in general is (according to the almighty Wikipedia) a "function that can be used to map data of arbitrary size to data of fixed size", e.g. turning an image into a bitstring that represents that image.
Traditionally this means a cryptographic hash (e.g. MD5, SHA-256), in which the slightest change in the input will produce a drastic change in the output. However, for our purposes, we want nearly the opposite -- a slight change in the input (like, shifting or rotating the image a bit) should produce little to no change in the output. This is where the idea of a perceptual image hash comes in.
Ultimately a perceptual image hash seeks to stay largely invariant in its output to slight changes in the input, while producing a small (i.e., manageable) output. Armed with these sorts of functions, we should be able to check if a photo uploaded to the site is the same one that's just been slightly changed in an attempt to get around exact image checks.
The rest of this post details an exploration of four different perceptual hash algorithms that fit this format, under what transformations they fail to stay invariant and which ones perform better overall. While we already had a perceptual hash system in place, it's time for an upgrade!
For the rest of this blog post we'll be discussing perceptual image hash functions that take an arbitrary image and produce a fixed size bitstring. This allows us to just use the Hamming distance between bitstrings in our evaluations. Theoretically a perceptual hash could produce a dense representation of an image (i.e. a vector of floats), but that makes comparisons more complicated.
There's two main things we evaluate here:
- Image Transforms (or attacks), i.e. what transformations would attackers apply to the images in order to try and fool a detection system
- Perceptual Hash function accuracy
The goal of this evaluation is to determine how well perceptual hashes differentiate between entirely distinct image pairs and image pairs which are just an image and its transformed version.
We set this up as measuring distances in two cases:
- Positive case -- the distances between images and the transformed versions of those images
- Negative case -- the distances between distinct original images
We work with a sample of about 10k images on the site to balance getting this done in a reasonable amount of time and statistical significance, and from this sample we run 10 trials of each case on these images. For the negative case we actually sample 1k image pairs (and get their hash distances), whereas for the positive case we just evaluate a ton of transformations and their distances and then sample them down to 1k examples. This gives us a sense of how reliable each metric is -- although generally we didn't observe much significant variation.
One important thing to add here is that these images are deliberately filtered to make sure none of them came from spammers to prevent re-used spam photos from being treated as distinct image pairs. Additionally, we take the extra step of removing any non-unique photos (by comparing their exact hashes).
We identified five typical image transformations that our hardworking, tireless moderation team sees most commonly when dealing with spammers.
- Rotation (about the origin)
- Gamma correction
- Adding a border (or reverse crop)
- Salt & Pepper noise
|Crop||95-99||85-94||60-84||% of image kept|
|S&P Noise||0.5-2.5||2.6-5||5.1-10||% of image|
Each of these transformations have a parameter that controls the intensity of the operation -- e.g. how many degrees to rotate. For simplicity of analysis, we specified three buckets of each of these parameters: mild, medium, and 🔥 extreme 🔥. Whenever a transformation is applied, the bucket is specified and the parameter is uniformly sampled from that bucket (see the table above for those buckets). In most of these analyses, each transformation / bucket is computed three separate times because more data is always better.
But wait there's mirr..ore! For each transformation application we flip a metaphorical coin and mirror the transformed image across its vertical axis. We skip the horizontal axis since it's just not seen as much in the wild -- one might imagine that upside down pictures are a pretty obvious spam giveaway.
Perceptual Hash functions
We won't go too much into the specifics of each perceptual hash function in this post since they're well detailed in the references provided. For simplicity we limited ourselves to evaluating four common perceptual hash functions:
- Difference Hash (denoted
- Average Hash (denoted
- Perceptual (DCT) Hash (denoted
- Wavelet Hash (denoted
The implementations used here are all provided by Johannes Buchner's excellent and convenient Python ImageHash library.
Normalized Hamming distance
Since each hash function is reducing the image to a bitstring, we use the Hamming distance to compare them. This is pretty much a fancy way of saying that we just measure how many bits are different between each hash.
In order to compare hash functions that produce different size bitstrings, we also normalize against the length of the bitstrings produced (given that we're not comparing bitstrings between hash functions, which wouldn't make a lot of sense anyways). So for example, when you see the histograms further down the line take note that the distances (i.e. the x-axis) are from zero to one. Effectively this is the ratio of bits that are different between the two bitstrings to the length of those bitstrings.
How do we measure the quality of a perceptual hash function? The primary paradigm that we'll evaluate these functions against is by looking at the distributions of distances in the positive and negative cases for specific scenarios. Essentially, what we want is for the distribution (or histogram) of distances in each case to be fairly separate. There are a few ways to approach this, but we'll just consider three metrics:
- EMD distance (or Wasserstein distance as all the deep learning people mysteriously started calling it). We want this to be large.
- Intersection-over-Union (IoU) of the two histograms. We want this to be pretty small.
- Maximal F1-score at an optimal distance threshold. We want this to be real close to 1.
EMD may be the most consistent measure here since, say you have two histograms that are separate and want to compare them to two histograms which are also separate but further apart. Under the optimal F1 score and the IoU, both cases would already be optimal -- however EMD would favor the latter case.
We bring in the F1-score to give us a threshold, which is discussed a bit more later. The easiest way to think of this is sliding that threshold along the x-axis, and finding that which separates the histograms optimally.
The histogram IoU is provided as an extra comparison, and basically measures directly how much each histogram overlaps with the other.
Let's say we set it up so that we're evaluating all of the parameter buckets, all of the hashes, and all of the transformations. In this scenario -- if you're looking for a quick answer here -- the best options according to the metrics appear to be either the average or wavelet hashes, each performing about the same.
|Hash||EMD Distance||Histogram IoU||Optimal F1-Score||@ Optimal Threshold|
However, this can be deceiving -- and the histograms for each hash function evaluated illustrate why (less overlap is better).
As a baseline, look at the distribution of hashes when you use a cryptographic hash (which was very helpful to look at when debugging this evaluation).
We can see in this case that a cryptographic hash produces completely overlapping normal distributions centered around 0.5, which is as expected. I would then advise interpreting all of the other histograms in this light -- the negative class should look like that distribution, and the positive class would ideally be entirely shifted to the left.
So at first glance you can see why the metrics favor the wavelet and average hashes -- it's pretty clear that both the perceptual and difference hashes have a more significant overlap of negative and positive distances. However, we have to consider the actual application -- if we want to find the closest hash to a given picture's hash, we want to be as insulated from false positives as possible. This means that the spread of negative distances seen in the wavelet and average hashes could produce more false positives in a case where we're trying to find the closest match to a given image!
So to take away some main findings from this evaluation with everything turned on (i.e. the hardest case), we see that for the positive case:
diff_hashhave bimodal positive distance distributions, and are clearly fooled by something we're doing, since one of the modes is near the expected random distance.
average_hashhave what looks like sort of long-tail distance distributions that have less mass as distance increases, implying that less transformations 'confuse' them.
And for the negative case:
diff_hashboth have tight distributions around the 0.5 mark, behaving similarly to the cryptographic hash case.
average_hashboth have distributions that, while normal and centered around the expected random distance, spread pretty far in both directions. I don't have a great answer for why something like this happens, but my best guess is that by capturing the information that they do, they're somehow actually doing a better job of checking a higher level of image similarity (e.g. similar color palettes). This is probably worth investigating further.
Based on the above, I think it's fair to categorize
diff_hash as (relatively) high precision, low recall perceptual hashes, and
average_hash as high recall, low precision hashes.
What transforms do certain hashes fail at?
So let's say we're incredibly concerned with that spread of negative case distances, and want to know what sacrifices we're going to have to make to use
Invariance to Mirroring
The first thing we might want to accept that we can't really handle is the mirror operation, as shown by comparing the median distance in the positive case when the transformed image is mirrored to when it is not.
|Hash||Mirroring Off||Mirroring On|
It appears that, while every hash does significantly better when the transformed image is unmirrored,
dct_hash appear to be entirely fooled with their median distance approaching that of the expected random distance.
So with this, and the fact that you can only really mirror an image once, we can accept a lack of mirroring. Without mirroring, we get histograms that look more reasonable for those two low recall hashes.
With the mirroring removed however, the table below shows that the other hashes also get a significant metrics boost -- but of course, the problem with the negative case distributions remains.
|Hash||EMD Distance||Histogram IoU||Optimal F1-Score||@ Optimal Threshold|
Bringing down the spice level
One other thing to consider here is that these evaluations are all run on all of the transform parameter buckets. What if we assume that none of our attackers want to rotate images beyond like 5 degrees or crop less than 95% of the image? And are allergic to mirrors?
|Hash||EMD Distance||Histogram IoU||Optimal F1-Score||@ Optimal Threshold|
Check out those metrics above! It looks like some of them have achieved full separability!
In this case, we can clearly see that
dct_hash get a significant boost in their metrics. Since this (of course) doesn't change the spread of distances for the negative case, if we're only concerned with the mild transform case then we might have a decent threshold to work with that doesn't produce too many false positives!
For brevity we'll focus on
diff_hash for discussing this, since I've found that
dct_hash behaves similarly. Below is the distance histogram comparison for the mild case.
We can use the distance threshold given to us by optimizing the F1-score to give us a cutoff (for those applications where we need one) such that any image pair whose distance lies below the threshold is considered the same, and any above is a distinct image.
If we look at the mild case, we notice that there are hashes that get nearly perfect F-1 score. This obviously only really applies to the mild case, but what transformation sacrifices do we have to make to use this threshold? Is it all transforms? Or do we just lose invariance to some of the more extreme transformations and not all?
So let's look at this for
diff_hash: when we take the threshold we get from only looking at the mild transformation parameters, what's the percentage of positive cases we end up treating as distinct image pairs for the rest of the transformations?
Apparently most of the time we rotate beyond 15 degrees it completely falls apart (example below) -- and the same can be said for cropping down to less than 85% of the image. It doesn't do great at the medium case either, which for rotation is from 6 to 15 degrees!
We took a look at that failure chance for different hashes --
wavelet_hash fail in the same places, however to a lesser extent -- think 80% not 96% of the time.
Below is a quick example to illustrate what a 30 degree rotation looks like, using a picture of Alice -- our director of support.
Alice is to OkCupid like Tom is to MySpace.
How long does this take anyways?
|Hash Name||Average Hash Time per image (ms)||Stddev|
The slowest overall is the wavelet hash, with the rest of them pretty much clocking in at the same amount of time taken.
Just as a quick note -- generally one of the slowest steps here is going to be resizing the image to the tiny hash size that most of these perceptual hashes use. This may seem weird at first, but it makes sense -- every other operation is generally working on a tiny (by modern computer standards) 16x16 array, so even those transformations with a high time complexity can take next to no time, whereas even a well optimized resize will have to work with a much larger image. This can also explain the variance in time each hash takes, since we don't resize images to a fixed size beforehand.
Nonetheless, each hash is pretty fast so it's not too much of a concern, especially when it can be done asynchronously from any sort of image uploading.
We investigated some other stuff that didn't produce much worthy of its own section here, so here's some quick notes instead:
- Increasing the bitstring length (or hash size) led to significantly worse results for most of the perceptual hashes, likely because it's simply capturing too much detail.
- For the wavelet hash I tried a bunch of different image scales and wavelet strategies, none of them really made any significant difference metrics-wise except for using Daubechies wavelets instead of Haar -- which performed significantly worse.
Conclusions & Future Work
The main draw that we've gotten from this is that there really isn't just one hashing algorithm that handles everything we need. It is likely that we'll have to judiciously combine multiple perceptual hashes to get a workable solution. However, each of these hashes is rather cheap to compute (and I would argue could be further optimized), which means that computing multiple hashes per image is probably a viable basis to build an image similarity engine on.
One might, for example, use a low precision hash as an initial check followed by a high precision hash to separate out cases that might need review from those that are definitely the same image.
Some thoughts for future work:
- There are a ton of newer interesting hashes that are worth consideration.
- There's quite a few more ways to transform images that we see in the wild (e.g. adding captions, superimposing over other images with transparency, compression artifacts) that should be evaluated.
- It's definitely a bit weird which hashes behave similarly to each other. There's probably something meaningful going on there that's worth looking into.
- Evaluate / design a gating architecture as briefly described above
- We could investigate what characteristics of given image pairs cause the low precision hashes to be unreliable / give them a low distance, and use that to determine which algorithm's hash distances can be trusted for a given image pair.
- The end-to-end approach of training a Convolutional Net to embed images such that the positive / negative case distances separate nicely is probably the cool new hypebeast way forward. This would have the added advantage of being able to handle whatever transformations you can think of at training time.