Estimating Set Membership With a Bloom Filter

Have you ever found yourself in a situation where you needed to keep track of whether or not you’ve seen an item, but the number of items you have to keep track of is either gigantic or completely unbounded?

This comes up a lot in “massive data” situations (like internet search engines), but it can also come up in games sometimes – like maybe having a monster in an MMO game remember which players have tried to attack it before so it can be hostile when it sees them again.

One solution to the game example could be to keep a list of all the players who have attacked the monster, but that list can get huge, eating up a lot of memory and making it slow to search for a specific player.

You might instead decide to only remember the last 20 players who attacked the monster, but that still uses a good chunk of memory if you have a lot of monsters tracking this information, and 20 may be so low that the monster forgets people who commonly attack it by the time it sees them again.

Enter The Bloom Filter

There is actually an interesting solution to this problem, using a probabilistic data structure called a bloom filter – and before you ask, no, it has nothing to do with graphics. It was invented by a guy named Burton Howard Bloom in 1970.

A bloom filter basically consists of M bits of storage and K hash functions.

To illustrate an example, let’s say that we have 10 bits and only 1 hash.

When we insert an item, we hash it and then take that hash value modulo 10 to get a pseudo-random (but deterministic) value 0-9. That value is the index of the bit we want to set to true. Once that bit is set, the item is considered inserted into the set.

When we want to ask if an item exists in a given bloom filter, we again hash the item and take the hash modulo 10 to get the 0-9 value. If that bit is false, we can say with certainty that the item is NOT in the set, but if it’s true, we can only say it MIGHT be in the set.

If the bit is true, we can’t say for sure that the item is part of the set, because something else may have hashed to the same bit and set it, so we have to consider it a maybe, not a definite yes. In fact, with only 10 bits and 1 good hash function, any two distinct items have a 10% chance of landing on the same bit.

To be a little more sure that our “maybe” is a yes and not a false positive, we can do two things: increase the number of bits we use for the set (which reduces collisions), or increase the number of hashes we do (although past a certain point, adding more hashes starts decreasing accuracy again).

If we add a second hash, all that means is that we will do a second hash, get a second 0-9 value, and write a one to that bit AS WELL when we insert an item. When checking if an item exists in a set, we also READ that bit to see if it’s true.

For N hashes, when inserting an item into a bloom filter, you get (up to) N different bits that you need to set to 1.

When checking if an item exists in a bloom filter, you get (up to) N different bits, all of which need to be 1 to consider the item in the set.
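To make that concrete, here is a minimal sketch of the insert/query logic. Like the full example code at the end of this post, it fakes K hashes by xoring one std::hash value against K salt values; the sizes, salts and function names here are just toy values for illustration.

#include <bitset>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>

const size_t c_numBits = 10;
const size_t c_numHashes = 2;
const size_t c_salts[c_numHashes] = { 0x6ff3f8ef, 0x9b565007 };
std::bitset<c_numBits> g_bits;

void Insert (const std::string& item)
{
    const size_t rawHash = std::hash<std::string>()(item);
    for (size_t i = 0; i < c_numHashes; ++i)
        g_bits[(rawHash ^ c_salts[i]) % c_numBits] = true; // set (up to) K bits
}

bool MightContain (const std::string& item)
{
    const size_t rawHash = std::hash<std::string>()(item);
    for (size_t i = 0; i < c_numHashes; ++i)
    {
        if (!g_bits[(rawHash ^ c_salts[i]) % c_numBits])
            return false; // a definite no
    }
    return true; // only a maybe!
}

int main ()
{
    Insert("moo");
    printf("moo: %s\n", MightContain("moo") ? "maybe" : "no");
    printf("unf: %s\n", MightContain("unf") ? "maybe" : "no");
    return 0;
}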

There are mathematical formulas for figuring out how many bits and hashes you need to get a desired confidence that your maybes are in fact yeses, and there are also formulas for estimating, for a bloom filter in a given runtime state, the probability that a maybe is in fact a yes.

Depending on your needs and usage case, you may treat your “maybe” as a “yes” and move on, or you could treat a “maybe” as a time when you need to do a more expensive test (like, a disk read?). In those situations, the “no” is the valuable answer, since it means you can skip the more expensive test.

Estimating Item Count In Bloom Filters

Besides being able to test an item for membership, you can also estimate how many items are in any given bloom filter:

EstimatedCount = -(NumBits * ln(1 - BitsOn / NumBits)) / NumHashes

Where BitsOn is the count of how many bits are set to 1 and ln is natural log.
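For example, with made-up numbers: if a bloom filter with 80 bits and 10 hashes has 21 of its bits set to 1, then

EstimatedCount = -(80 * ln(1 - 21/80)) / 10 ≈ 2.44

so the filter probably holds 2 or 3 items.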

Set Operations On Bloom Filters

If you have two bloom filters, you can do some interesting set operations on them.

Union

One operation you can do with two bloom filters is union them, which means that if you have a bloom filter A and bloom filter B, you end up with a third bloom filter C that contains all the unique items from both A and B.

Besides being able to test this third bloom filter to see if it contains specific items, you can also ask it for an estimated count of objects it contains, which is useful for helping figure out how similar the sets are.

Essentially, if A estimates having 50 items and B estimates having 50 items, and their union, C, estimates having 51 items, that means that the items in A and B were almost all the same (probably).

To union two bloom filters, you just do a bitwise OR on every bit in A and B to get the bits for C. Simple and fast to calculate.
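As a sketch, here’s what that looks like with the byte array storage used in the example code later in this post (assuming both filters use the same bit count, hash count and salts; the result is meaningless otherwise):

#include <array>
#include <cstddef>
#include <cstdint>

template <size_t NUMBYTES>
std::array<uint8_t, NUMBYTES> UnionBits (const std::array<uint8_t, NUMBYTES>& a, const std::array<uint8_t, NUMBYTES>& b)
{
    std::array<uint8_t, NUMBYTES> result;
    for (size_t i = 0; i < NUMBYTES; ++i)
        result[i] = a[i] | b[i]; // a bit is on in C if it's on in A OR B
    return result;
}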

Intersection

Another operation you can do with two bloom filters is to calculate their intersection. The intersection of two sets is just the items that the two sets have in common.

Again, besides being able to test this third bloom filter for membership of items, you can also use it to get an estimated count of objects to help figure out how similar the sets are.

Similarly to the union, if the intersection count is small compared to the counts of the individual bloom filters that went into it, the sets were probably not very similar.

There are two ways you can calculate the intersection of two bloom filters.

The first way is to do a bitwise AND on every bit in A and B to get the bits for C. Again, simple and fast to calculate.

The other way just involves some simple math:

Count(Intersection(A,B)) = (Count(A) + Count(B)) - Count(Union(A,B))
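Using the numbers from the union example above: if Count(A) = 50, Count(B) = 50 and Count(Union(A,B)) = 51, then

Count(Intersection(A,B)) = (50 + 50) - 51 = 49

which again says the two sets have almost all of their items in common.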

Whichever method is better depends on your needs and your usage case.

Jaccard Index

Just like I talked about in the KMV post two posts ago, you can also calculate an estimated Jaccard Index for two bloom filters, which will give you a value between 0 and 1 that tells you how similar two sets are.

If the Jaccard index is 1, the sets contain exactly the same items. If the Jaccard index is 0, the sets are completely different. If the Jaccard index is 0.5, the size of their intersection is half the size of their union.

To calculate the estimated Jaccard Index, you just use this simple formula:

Jaccard Index = Count(Intersection(A,B)) / Count(Union(A,B))
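Continuing the same made-up numbers, with an intersection count of 49 and a union count of 51:

Jaccard Index = 49 / 51 ≈ 0.96

which is close to 1, agreeing that the two sets are nearly identical.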

Estimating False Positive Rate

The more items you insert into a bloom filter, the higher the false positive rate gets, until it reaches 100%, at which point all the bits are set and the filter will never say no when asked if it contains an item.

To combat this, you may want to calculate your error rate at runtime and spit out a warning if it starts getting too high, so that you know you need to adjust the number of bits or number of hashes, or at the very least alert the user that the reliability of the answers has degraded.

Here is the formula to calculate the false positive rate of a bloom filter at runtime:

ErrorRate = (1 - e^(-NumHashes * NumItems / NumBits))^NumHashes

NumItems is the number of unique items that have been inserted into the bloom filter. If you somehow know that exact value, you can use it, but in the more likely case that you don’t, you can use the estimated unique item count formula described above.
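For example, the code at the end of this post uses 80 bits and 10 hashes, and inserts 3 items into set 1, which works out to

ErrorRate = (1 - e^(-10 * 3 / 80))^10 ≈ 0.3127^10 ≈ 0.0009%

which lines up with the very low false positive rate the program reports.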

Managing the Error Rate of False Positives

As I mentioned above, there are formulas to figure out how many bits and how many hashes you need to store a certain number of items with a specific maximum error rate.

You can work this out in advance by figuring out about how many items you expect to see.

First, you calculate the ideal bit count:

NumBits = -(NumItems * ln(DesiredFalsePositiveProbability)) / (ln(2)^2)

Where NumItems is the number of items you expect to put into the bloom filter, and DesiredFalsePositiveProbability is the error rate you want when the bloom filter has the expected number of items in it. The error rate will be lower until the item count reaches NumItems.

Next, you calculate the ideal number of hashes:

NumHashes = (NumBits / NumItems) * ln(2)

Then, you just create your bloom filter, using the specified number of bits and hashes.
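For example, plugging in the numbers used at the end of the example program below (10 expected items, a 5% desired false positive rate):

NumBits = -(10 * ln(0.05)) / (ln(2)^2) ≈ 62.4
NumHashes = (62.4 / 10) * ln(2) ≈ 4.3

So roughly 63 bits and 4 hashes would be ideal, which is fewer bits and far fewer hashes than the 80 bits and 10 hashes the example code actually uses. More on that when we look at the program’s output.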

Example Code

Here is some example bloom filter C++ code with all the bells and whistles. Instead of using multiple hashes, I just grabbed some random numbers and xor them against the hash of each item to make more deterministic but pseudo-random numbers (that’s what a hash does too, after all). If you want to use a bloom filter in a more serious usage case, you may want to actually use multiple hashes, and you probably want to use a better hashing algorithm than std::hash.

#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <set>
#include <string>
 
// If you get a compile error here, remove the "class" from this enum definition.
// It just means your compiler doesn't support enum classes yet.
enum class EHasItem
{
    e_no,
    e_maybe
};
 
// The CBloomFilter class
template <typename T, unsigned int NUMBYTES, unsigned int NUMHASHES, typename HASHER = std::hash<T>>
class CBloomFilter
{
public:
    // constants
    static const unsigned int c_numHashes = NUMHASHES;
    static const unsigned int c_numBits = NUMBYTES*8;
    static const unsigned int c_numBytes = NUMBYTES;
 
    // friends
    template <typename T2, unsigned int NUMBYTES2, unsigned int NUMHASHES2, typename HASHER2>
    friend float UnionCountEstimate (const CBloomFilter<T2, NUMBYTES2, NUMHASHES2, HASHER2>& left, const CBloomFilter<T2, NUMBYTES2, NUMHASHES2, HASHER2>& right);
    template <typename T2, unsigned int NUMBYTES2, unsigned int NUMHASHES2, typename HASHER2>
    friend float IntersectionCountEstimate (const CBloomFilter<T2, NUMBYTES2, NUMHASHES2, HASHER2>& left, const CBloomFilter<T2, NUMBYTES2, NUMHASHES2, HASHER2>& right);
 
    // constructor
    CBloomFilter (const std::array<size_t, NUMHASHES>& randomSalt)
        : m_storage()              // initialize bits to zero
        , m_randomSalt(randomSalt) // store off the random salt to use for the hashes
    { }
 
    // interface
    void AddItem (const T& item)
    {
        const size_t rawItemHash = HASHER()(item);
        for (unsigned int index = 0; index < c_numHashes; ++index)
        {
            const size_t bitIndex = (rawItemHash ^ m_randomSalt[index])%c_numBits;
            SetBitOn(bitIndex);
        }
    }
 
    EHasItem HasItem (const T& item) const
    {
        const size_t rawItemHash = HASHER()(item);
        for (unsigned int index = 0; index < c_numHashes; ++index)
        {
            const size_t bitIndex = (rawItemHash ^ m_randomSalt[index])%c_numBits;
            if (!IsBitOn(bitIndex))
                return EHasItem::e_no;
        }
        return EHasItem::e_maybe;
    }
 
    // probabilistic interface
    float CountEstimate () const
    {
        // estimates how many items are actually stored in this set, based on how many
        // bits are set to true, how many bits there are total, and how many hashes
        // are done for each item.
        return -((float)c_numBits * std::log(1.0f - ((float)CountBitsOn() / (float)c_numBits))) / (float)c_numHashes;
    }
 
    float FalsePositiveProbability (size_t numItems = -1) const
    {
        // calculates the expected error.  Since this relies on how many items are in
        // the set, you can either pass in the number of items, or use the default
        // argument value, which means to use the estimated item count
        float numItemsf = numItems == -1 ? CountEstimate() : (float)numItems;
        return pow(1.0f - std::exp(-(float)c_numHashes * numItemsf / (float)c_numBits),(float)c_numHashes);
    }
     
private:
 
    inline void SetBitOn (size_t bitIndex)
    {
        const size_t byteIndex = bitIndex / 8;
        const uint8_t byteValue = 1 << (bitIndex%8);
        m_storage[byteIndex] |= byteValue;
    }
 
    inline bool IsBitOn (size_t bitIndex) const
    {
        const size_t byteIndex = bitIndex / 8;
        const uint8_t byteValue = 1 << (bitIndex%8);
        return ((m_storage[byteIndex] & byteValue) != 0);
    }
 
    size_t CountBitsOn () const
    {
        // likely a more efficient way to do this but ::shrug::
        size_t count = 0;
        for (size_t index = 0; index < c_numBits; ++index)
        {
            if (IsBitOn(index))
                ++count;
        }
        return count;
    }
     
    // storage of bits
    std::array<uint8_t, NUMBYTES> m_storage;
 
    // Storage of random salt values
    // It could be neat to use constexpr and __TIME__ to make compile time random numbers.
    // That isn't available til like c++17 or something though sadly.
    const std::array<size_t, NUMHASHES>& m_randomSalt;
};
 
// helper functions
template <typename T, unsigned int NUMBYTES, unsigned int NUMHASHES, typename HASHER>
float UnionCountEstimate (const CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER>& left, const CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER>& right)
{
    // returns an estimated count of the unique items if both lists were combined
    // example: (1,2,3) union (2,3,4) = (1,2,3,4) which has a union count of 4
    CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER> temp(left.m_randomSalt);
    for (unsigned int index = 0; index < left.c_numBytes; ++index)
        temp.m_storage[index] = left.m_storage[index] | right.m_storage[index];
     
    return temp.CountEstimate();
}
 
template <typename T, unsigned int NUMBYTES, unsigned int NUMHASHES, typename HASHER>
float IntersectionCountEstimate (const CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER>& left, const CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER>& right)
{
    // returns an estimated count of the number of unique items that are shared in both sets
    // example: (1,2,3) intersection (2,3,4) = (2,3) which has an intersection count of 2
    CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER> temp(left.m_randomSalt);
    for (unsigned int index = 0; index < left.c_numBytes; ++index)
        temp.m_storage[index] = left.m_storage[index] & right.m_storage[index];
     
    return temp.CountEstimate();
}
 
float IdealBitCount (unsigned int numItemsInserted, float desiredFalsePositiveProbability)
{
    // given how many items you plan to insert, and a target false positive probability at that count, this returns how many bits
    // of flags you should use.
    return (float)-(((float)numItemsInserted*log(desiredFalsePositiveProbability)) / (log(2)*log(2)));
}
 
float IdealHashCount (unsigned int numBits, unsigned int numItemsInserted)
{
    // given how many bits you are using for storage, and how many items you plan to insert, this is the optimal number of hashes to use
    return ((float)numBits / (float)numItemsInserted) * (float)log(2.0f);
}
 
// random numbers from random.org
// https://www.random.org/cgi-bin/randbyte?nbytes=40&format=h
// in 64 bit mode, size_t is 64 bits, not 32.  The random numbers below will be all zero in the upper 32 bits!
static const std::array<size_t, 10> s_randomSalt =
{
    0x6ff3f8ef,
    0x9b565007,
    0xea709ce4,
    0xf7d5cbc7, 
    0xcb7e38e1,
    0xd54b5323,
    0xbf679080,
    0x7fb78dee,
    0x540c9e8a,
    0x89369800
};
 
// data for adding and testing in our list
static const char *s_dataList1[] =
{
    "hello!",
    "blah!",
    "moo",
    nullptr
};
 
static const char *s_dataList2[] =
{
    "moo",
    "hello!",
    "mooz",
    "kitty",
    "here is a longer string just cause",
    nullptr
};
 
static const char *s_askList[] =
{
    "moo",
    "hello!",
    "unf",
    "boom",
    "kitty",
    "mooz",
    "blah!",
    nullptr
};
 
// driver program
void WaitForEnter ()
{
    printf("nPress Enter to quit");
    fflush(stdin);
    getchar();
}
 
int main (void)
{
    CBloomFilter<std::string, 10, 10> set1(s_randomSalt);
    CBloomFilter<std::string, 10, 10> set2(s_randomSalt);
    std::set<std::string> actualSet1;
    std::set<std::string> actualSet2;

    printf("Creating 2 bloom filter sets with %u bytes of flags (%u bits), and %u hashes.\n\n", set1.c_numBytes, set1.c_numBits, set1.c_numHashes);
 
    // create data set 1
    unsigned int index = 0;
    while (s_dataList1[index] != nullptr)
    {
        printf("Adding to set 1: \"%s\"\n", s_dataList1[index]);
        set1.AddItem(s_dataList1[index]);
        actualSet1.insert(s_dataList1[index]);
        index++;
    }
 
    // create data set 2
    printf("\n");
    index = 0;
    while (s_dataList2[index] != nullptr)
    {
        printf("Adding to set 2: \"%s\"\n", s_dataList2[index]);
        set2.AddItem(s_dataList2[index]);
        actualSet2.insert(s_dataList2[index]);
        index++;
    }
 
    // query each set to see if they think that they contain various items
    printf("\n");
    index = 0;
    while (s_askList[index] != nullptr)
    {
        printf("Exists: \"%s\"? %s & %s (actually %s & %s)\n",
            s_askList[index],
            set1.HasItem(s_askList[index]) == EHasItem::e_maybe ? "maybe" : "no",
            set2.HasItem(s_askList[index]) == EHasItem::e_maybe ? "maybe" : "no",
            actualSet1.find(s_askList[index]) != actualSet1.end() ? "yes" : "no",
            actualSet2.find(s_askList[index]) != actualSet2.end() ? "yes" : "no");
        index++;
    }

    // show false positive rates
    printf("\nFalse positive probability = %0.2f%% & %0.2f%%\n", set1.FalsePositiveProbability()*100.0f, set2.FalsePositiveProbability()*100.0f);
    printf("False positive probability at 10 items = %0.2f%%\n", set1.FalsePositiveProbability(10)*100.0f);
    printf("False positive probability at 25 items = %0.2f%%\n", set1.FalsePositiveProbability(25)*100.0f);
    printf("False positive probability at 50 items = %0.2f%%\n", set1.FalsePositiveProbability(50)*100.0f);
    printf("False positive probability at 100 items = %0.2f%%\n", set1.FalsePositiveProbability(100)*100.0f);

    // show ideal bit counts and hashes
    const unsigned int itemsInserted = 10;
    const float desiredFalsePositiveProbability = 0.05f;
    const float idealBitCount = IdealBitCount(itemsInserted, desiredFalsePositiveProbability);
    const float idealHashCount = IdealHashCount((unsigned int)idealBitCount, itemsInserted);
    printf("\nFor %u items inserted and a desired false positive probability of %0.2f%%\nYou should use %0.2f bits of storage and %0.2f hashes\n",
        itemsInserted, desiredFalsePositiveProbability*100.0f, idealBitCount, idealHashCount);

    // get the actual union
    std::set<std::string> actualUnion;
    std::for_each(actualSet1.begin(), actualSet1.end(), [&actualUnion] (const std::string& s) {
        actualUnion.insert(s);
    });
    std::for_each(actualSet2.begin(), actualSet2.end(), [&actualUnion] (const std::string& s) {
        actualUnion.insert(s);
    });

    // get the actual intersection
    std::set<std::string> actualIntersection;
    std::for_each(actualSet1.begin(), actualSet1.end(), [&actualIntersection, &actualSet2] (const std::string& s) {
        if (actualSet2.find(s) != actualSet2.end())
            actualIntersection.insert(s);
    });

    // calculate the actual jaccard index
    float actualJaccardIndex = (float)actualIntersection.size() / (float)actualUnion.size();

	// show the estimated and actual counts, and error of estimations
	printf("nSet1: %0.2f estimated, %u actual.  Error: %0.2f%%n",
		set1.CountEstimate(),
		actualSet1.size(),
		100.0f * ((float)set1.CountEstimate() - (float)actualSet1.size()) / (float)actualSet1.size()
	);
	printf("Set2: %0.2f estimated, %u actual.  Error: %0.2f%%n",
		set2.CountEstimate(),
		actualSet2.size(),
		100.0f * ((float)set2.CountEstimate() - (float)actualSet2.size()) / (float)actualSet2.size()
	);

	float estimatedUnion = UnionCountEstimate(set1, set2);
	float estimatedIntersection = IntersectionCountEstimate(set1, set2);
	float estimatedJaccardIndex = estimatedIntersection / estimatedUnion;
	printf("Union: %0.2f estimated, %u actual.  Error: %0.2f%%n",
		estimatedUnion,
		actualUnion.size(),
		100.0f * (estimatedUnion - (float)actualUnion.size()) / (float)actualUnion.size()
	);
	printf("Intersection: %0.2f estimated, %u actual.  Error: %0.2f%%n",
		estimatedIntersection,
		actualIntersection.size(),
		100.0f * (estimatedIntersection - (float)actualIntersection.size()) / (float)actualIntersection.size()
	);
	printf("Jaccard Index: %0.2f estimated, %0.2f actual.  Error: %0.2f%%n",
		estimatedJaccardIndex,
		actualJaccardIndex,
		100.0f * (estimatedJaccardIndex - actualJaccardIndex) / actualJaccardIndex
	);
 
    WaitForEnter();
}

And here is the output of that program:

In the output, you can see that all the existence checks were correct. All the no’s were actually no’s like they should be, but also, all the maybe’s were actually present, so there were no false positives.

The estimated counts were a little off but were fairly close. The first list was estimated at 2.4 items, when it actually had 3. The second list was estimated at 4.44 items when it actually had 5 items.

It’s reporting a very low false positive rate, which falls in line with the fact that we didn’t see any false positives. The projected false positive rates at 10, 25, 50 and 100 items show us that the set doesn’t have a whole lot more capacity if we want to keep the error rate low.

The union, intersection and Jaccard index estimates had pretty low error, but the error was definitely larger than the false positive rate.

Interestingly, if you look at the part which reports the ideal bit and hash count for 10 items, it says that we should actually use FEWER hashes than we do, and a couple fewer bits. You can experiment by changing the number of hashes to 4 and seeing that the error rate goes down. The example code actually uses TOO MANY hashes for the number of items we plan on inserting, and it’s hurting our false positive rate.

Interesting Idea

I was chatting with a friend, Dave, who said he was using a bloom filter like structure to help make sure he didn’t try the same genetic algorithm permutation more than once. An issue with that is that hash collisions could thwart evolution by incorrectly disallowing a wanted permutation just because it happened to hash to the same value as a permutation already tried. To help with this, he biased against permutations found in the data structure instead of completely blocking them: if a permutation was marked as “maybe seen”, he’d give it some % chance of being allowed to happen “again” anyway.

Unfortunately, the idea turned out to be impractical in general. He had about 40 bits of genetic information, which is about 1 trillion possible unique items (2^40).

Even storing only 1 billion items (1000 times fewer) with a 5% false positive rate would require about 750MB of bits.

Loosening the requirement to a 25% false positive rate still requires 350MB, and 75% requires 70MB. Even allowing a 99% false positive rate requires 2.5MB, and we are still 1000 times too small.

So, for 1 trillion permutations, the size of the bloom filter is unfortunately far too large and he ended up going a different route.

The technique of rolling a random number when you get a maybe is pretty neat though, so I wanted to mention it (:

Next Up

We’ve now talked about a probabilistic unique value counter (KMV) that can count the number of unique objects seen.

We then talked about a probabilistic set structure (Bloom Filter) that can keep track of set membership of objects.

How about being able to probabilistically store a list of non unique items?

One way to do this would be to keep a count for each item seen, instead of only a bit saying whether it’s present or not. When you have a count per item, this is known as a multiset, or by the more familiar term: histogram.

If you change a bloom filter to have a count instead of a single bit, that’s called a counting bloom filter and completely works.
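Here is a minimal sketch of that counting idea, reusing the salt-xor hashing trick (and a few salt values) from the example code above; the sizes and function names are just illustration values.

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

const size_t c_numCounters = 80;
const size_t c_numHashes = 4;
const size_t c_salts[c_numHashes] = { 0x6ff3f8ef, 0x9b565007, 0xea709ce4, 0xf7d5cbc7 };
std::array<uint32_t, c_numCounters> g_counters = {};

void Insert (const std::string& item)
{
    const size_t rawHash = std::hash<std::string>()(item);
    for (size_t i = 0; i < c_numHashes; ++i)
        g_counters[(rawHash ^ c_salts[i]) % c_numCounters]++; // increment a counter instead of setting a bit
}

uint32_t CountEstimate (const std::string& item)
{
    // collisions can only inflate counters, never deflate them, so the
    // smallest counter the item hashes to is an upper bound on its true count
    const size_t rawHash = std::hash<std::string>()(item);
    uint32_t smallest = UINT32_MAX;
    for (size_t i = 0; i < c_numHashes; ++i)
        smallest = std::min(smallest, g_counters[(rawHash ^ c_salts[i]) % c_numCounters]);
    return smallest;
}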

There’s a better technique for doing this though called count min sketch. Look for a post soon!

Links

Wikipedia: Bloom Filter
Interactive Bloom Filter Demo

Check this out, it’s a physical implementation of a bloom filter (WAT?!)
Wikipedia: Superimposed code