Avoiding The Performance Hazzards of std::function

There are two performance implications of using std::function that might surprise you:

When calling a std::function, it does a virtual function call.
When assigning a lambda with significant captures to a std::function, it will do a dynamic memory allocation!

Here is an example of what I mean:

    array i;
    auto A = [=] () -> int {
        return (i[0] + i[1] * i[2] + i[3]) ^ i[4];
    };

    // A dynamic memory allocation of sizeof(A) + some change is done on assignment
    // sizeof(A) = 20 on my machine, and it ends up doing a 28 byte allocation from the heap.
    function fA = A; 

    // A virtual function is called to be able to execute A()
    int ret = fA();

The reason for the virtual function call is because under the hood, a lambda creates a class that contains the code and all the captured data you specified for capture. Since other callers of your std::function have no idea what the type of that object is (it’s only visible in the scope the lambda was defined in, and is not a human friendly name), the std::function has to use a virtual function to “blindly” call into from the outside, to perform the correct work on the inside. This technique is called “type erasure” and you can read more about it in the links section at the bottom of this post!

The reason for the dynamic memory allocation is that std::function is generic and has no idea how much data you captured, but that captured data has to exist inside of a std::function and also has to be able to be copied as a std::function is moved around or copied to other locations. It needs the storage space for your captures.

That is unfortunate in the cases of when you don’t plan on using the lambda in those ways though. Your data is already on the stack, it would be great if it could just use THAT copy of the captures and avoid the allocation.

If you capture only a small amount (in the implementations I’ve seen, typically something around 8 bytes max… so 2 pointers or references worth), it doesn’t have to allocate, as it has a small static buffer for use in these “small capture” cases. That’s a nice optimization, but the allocations can still sneak in and bite us when we least expect it, so we have to be careful.

Avoid Both Problems With Template Parameters

If you are passing a lambda to some other code to execute, you can avoid both problems by using templated functions like below:

template <typename T>
void RunLambda (const T &lambda)
{
  // No virtual function calls or dynamic memory allocations!
  lambda();
}

You could then call that function like this and it’ll deduce the type of the lambda as the template parameter:

RunLambda([] () {printf("hello world!");});

The reason why this works, is because your lambda does have an actual type (even if we don’t know what it is, and can’t really type what it is). The template function is able to make a version of itself for the specific type that the lamdba is, and it can make a custom shrink wrapped function to call it.

In my experiences, it’s also able to inline the lambda code & template function, which is pretty awesome since then it’s basically no extra overhead at all.

The problem going this route though, is that you hit the usual problems of templated functions: without jumping through extra hoops, you have to put the implementation of the function in the header file.

The rest of the solutions i present get rid of the allocation, but the virtual call remains. The allocation is the real monster here though, and if you really have to avoid the virtual function call too, you can always resort to using template functions.

Solution 1: Wrapping

Like I said above, a lambda makes a type for you behind the scenes that contains your code and capture storage.

If this type is small enough, if won’t require dynamic allocation to hold it in a std::function.

So, let’s say we have a lambda with a large capture and storing it in a std::function would require a dynamic allocation.

What if we capture that lambda in a new lambda, but capture it by reference, and inside this lambda we call the other lambda?

Our new lambda now has only a single capture, which is by reference, so has a size of 4.

That’s small enough to store in a std::function without needing a dynamic allocation.

Weird huh?

	array i;
	auto A = [=]() -> int {
		return (i[0] + i[1] * i[2] + i[3]) ^ i[4];
	}; // sizeof(A) == 20
	auto B = [&]() -> int {
		return A();
	}; // sizeof(B) = 4!

	// no allocation needed, B is small enough to use the static storage buffer!
	function fB(B);

Notes:

When using this technique, you can’t return B (or fB) from the function. Once the function exits, those guys are garbage and using them will cause a crash. This is the usual case when capturing locals by reference though, so nothing new there!
I tried to come up with a helper object / function to do this wrapping for you, to make it easier and more concise to do this. I hit a problem in that in my generic template code, I was able to deduce the return type of the lambda being passed in, but not the argument types. I’m sure it’s possible, but I wasn’t able to get it to work.

Solution 2: Stack Allocation

Due to the fact that you can give a std::function a custom allocator to use, you can solve this problem by having the std::function allocate from the stack (or a small pool allocator if that was more desirable for you!).

Unfortunately we can’t naively use something like alloca() to allocate from the stack, because when a function returns that alloca() is used in, that stack memory is released.

However, you COULD make a custom allocator, give it a buffer for it to give out memory from, and allocate THAT buffer on the stack (or just define it on the stack).

You’ll find the source code to that solution in the sample code below.

One problem with the solution as I made it though, is that you have to manually declare an allocator, define a buffer, pass it to this allocator, then pass that allocator to the std::function.

That is a lot of manual steps which could go wrong.

You could probably clean it up with a helper macro or something, but that still isn’t nearly as clean as the wrapping solution.

Also, the amount of memory that is allocated seems somewhat arbitrarily larger than the size of your type. If you look at the sample code and output of that code, in one case it wants 8 more bytes than the size of the type, and in another, it wants 12 bytes more. Coming up with the right sized buffer can be tricky and problematic which makes this solve a bit uglier.

This is because it isn’t always just your type that needs to be allocated, but is usually your type wrapped in some other type, that adds a little bit of it’s own storage.

A good way to fix this would be to have the allocator store a type “T” object inside of it, and just give out that object when asked for memory, but unfortunately, due to the way allocators are used and converted from allocating one type to another, I was unable to get that working.

Solution 3: std::ref() / std::cref()

There is a third way I’ve found to solve the allocation problem that seems like it’s the official solve that the STL gods would recommend.

Problem is, in my testing it was available in visual studio 2013, but not visual studio 2010. Boo hoo!

std::ref() and std::cref() return reference wrappers (costant ref wrapper in the cref case) which can hold arbitrary types as references. If you put your large capture lambda into one of these functions and give it to std::function, there’s a std::function constructor which is able to take this reference, and use that instead of allocating more memory.

It’s very elegant and simple to do, check it out:

	array i;
	auto A = [=]() -> int {
		return (i[0] + i[1] * i[2] + i[3]) ^ i[4];
	};

	// no allocation, std::function stores a reference to A instead of A itself
	function fA(ref(A));

Again to re-iterate though, this works in vs 2013 but not 2010. If you are stuck in 2010, I recommend manually wrapping your lambdas like in solution 1

Sample Code

#include 
#include 
using namespace std;

#include 

// prototype for the function that runs our lambdas
int RunLambda(const function& f);

// ======================================= Allocators =======================================

template 
class CAllocator : public allocator 
{
public:
	template
	struct rebind
	{
		typedef CAllocator other;
	};

	T* allocate(size_type n, const void *hint = 0)
	{
		printf("!!CAllocator was asked for %u bytes, dynamically allocating memory.n", n*sizeof(T));
		return allocator::allocate(n, hint);
	}

	void deallocate(pointer p, size_type n)
	{
		return allocator::deallocate(p, n);
	}

	CAllocator() throw() : allocator() { }
	CAllocator(const CAllocator &a) throw() : allocator(a) { }
	template 
	CAllocator(const CAllocator<U> &a) throw() : allocator(a) { }
	~CAllocator() throw() { }
};

template 
class CStackAllocator : public allocator 
{
public:
	template
	struct rebind
	{
		typedef CStackAllocator other;
	};

	T* allocate(size_type n, const void *hint = 0)
	{
		int a = sizeof(T);
		printf("  CStackAllocator was asked for %u bytes, returning buffer given.n", n*sizeof(T));
		return (T*)m_buffer;
	}

	void deallocate(pointer p, size_type n)
	{
	}

	CStackAllocator() throw() : allocator(), m_buffer(nullptr), m_bufferSize(0) { }
	CStackAllocator(const CStackAllocator &a) throw() : allocator(a), m_buffer(a.m_buffer), m_bufferSize(a.m_bufferSize) { }
	template 
	CStackAllocator(const CStackAllocator<U> &a) throw() : allocator(a), m_buffer(a.m_buffer), m_bufferSize(a.m_bufferSize) { }
	~CStackAllocator() throw() { }

	void SetBuffer(void *buffer, size_t bufferSize)
	{
		m_buffer = buffer;
		m_bufferSize = bufferSize;
	}

	void   *m_buffer;
	size_t    m_bufferSize;

	// It would be neat to be able to do the below instead, but it complains about trying to use an undefined type unfortunately.
	// It would be less error prone and require fewer manual steps by the caller.
	// unsigned char m_buffer[sizeof(T)];
};

// ======================================= Lambda Tests =======================================

void LambdaTest_Normal()
{
	printf("=====LambdaTest_Normal()=====n");
	printf("Large Capture Lambda A:n");
	array i;
	i[0] = 2;
	i[1] = 57;
	i[2] = i[0] ^ i[1];
	i[3] = i[1] ^ i[2];
	i[4] = i[3] ^ i[2] + i[0] ^ i[1];
	auto A = [=]() -> int {
		return (i[0] + i[1] * i[2] + i[3]) ^ i[4];
	};
	printf("  sizeof(A) = %un", sizeof(A));
	printf("  creating allocatorAn");
	CAllocator allocatorA;
	printf("  storing A in fAn");
	function fA(A, allocatorA);
	printf("  RunLambda(fA)= %un", RunLambda(fA));
	printf("n");

	// Allocation, no good for perf intensive code!
	// Also a virtual function call, but unavoidable when using std::functions
}

void LambdaTest_Wrap()
{
	printf("=====LambdaTest_Wrap()=====n");
	printf("Lambda B which captures A by reference:n");
	array i;
	i[0] = 2;
	i[1] = 57;
	i[2] = i[0] ^ i[1];
	i[3] = i[1] ^ i[2];
	i[4] = i[3] ^ i[2] + i[0] ^ i[1];
	auto A = [=]() -> int {
		return (i[0] + i[1] * i[2] + i[3]) ^ i[4];
	};
	auto B = [&]() -> int {
		return A();
	};
	printf("  sizeof(B) = %un", sizeof(B));
	printf("  creating allocatorBn");
	CAllocator allocatorB;
	printf("  storing B in fBn");
	function fB(B, allocatorB);
	printf("  RunLambda(fB)= %un", RunLambda(fB));
	printf("n");

	// Could alternately do this to create fB more concisely while still avoiding allocation
	// function fB([&A]()->int {return A();});

	// It would be nice to make some kind of helper object / function to wrap
	// lambdas instead of having to wrap them twice yourself.  Couldn't figure out
	// a way to do that unfortunately.  decltype(A()) gives you the return type,
	// but figuring out the parameter type(s) is more difficult.
}

void LambdaTest_StackAlloc()
{
	printf("=====LambdaTest_StackAlloc()=====n");
	printf("Large Capture Lambda A:n");
	array i;
	i[0] = 2;
	i[1] = 57;
	i[2] = i[0] ^ i[1];
	i[3] = i[1] ^ i[2];
	i[4] = i[3] ^ i[2] + i[0] ^ i[1];
	auto A = [=]() -> int {
		return (i[0] + i[1] * i[2] + i[3]) ^ i[4];
	};
	printf("  sizeof(A) = %un", sizeof(A));
	printf("  creating stack allocatorAn");
	CStackAllocator allocatorA;
	unsigned char buffer[sizeof(A) + 12];
	printf("  Giving stack allocator %u byte buffern", sizeof(buffer));
	allocatorA.SetBuffer(buffer, sizeof(buffer));
	printf("  storing A in fAn");
	function fA(A, allocatorA);
	printf("  RunLambda(fA)= %un", RunLambda(fA));
	printf("n");

	// It would have been nice for the allocator to hold it's own storage, since it
	// has to live on the stack anyways, and has knowledge of the type it's allocating
	// for (so knows the type size).  Hit some issues trying that though unfortunately.
}

void LambdaTest_Ref()
{
	printf("=====LambdaTest_Ref()=====n");
	printf("Large Capture Lambda A:n");
	array i;
	i[0] = 2;
	i[1] = 57;
	i[2] = i[0] ^ i[1];
	i[3] = i[1] ^ i[2];
	i[4] = i[3] ^ i[2] + i[0] ^ i[1];
	auto A = [=]() -> int {
		return (i[0] + i[1] * i[2] + i[3]) ^ i[4];
	};
	printf("  sizeof(A) = %un", sizeof(A));
	printf("  creating allocatorAn");
	CAllocator allocatorA;
	printf("  storing ref(A) in fAn");
	function fA(ref(A), allocatorA);
	printf("  RunLambda(fA)= %un", RunLambda(fA));
	printf("n");

	// Nice and simple, this seems to be the correct solution, so long as you have access to
	// std::ref / std::cref!
	// vs2010 has it, but it seems to be partially implemented, as the above doesn't work.
	// vs2013 has it and it works wonderfully though.
}

// ======================================= Driver Program =======================================

void WaitForEnter()
{
	printf("Press Enter to quit");
	fflush(stdin);
	(void)getchar();
}

int main(int argc, char **argv)
{
	LambdaTest_Normal();
	LambdaTest_Wrap();
	LambdaTest_StackAlloc();
	LambdaTest_Ref();
	WaitForEnter();
	return 0;
}

// Pretend like this function is in another cpp file, where it can't be a templated function
int RunLambda(const function& f)
{
	return f() * 3;
}

And here is the example output of that run in visual studio 2013. Same output in debug and release:

Links

Two good reads about type erasure:
cplusplus.com: C++ Type Erasure
Andrzej’s C++ blog: Type erasure — Part I

If you have any improvements to any of the above solutions, or other solutions of your own, post a comment and let us know!

Count Min Sketch: A Probabilistic Histogram

Count min sketch is a probabilistic histogram that was invented in 2003 by Graham Cormode and S. Muthukrishnan.

It’s a histogram in that it can store objects (keys) and associated counts.

It’s probabilistic in that it lets you trade space and computation time for accuracy.

The count min sketch is pretty similar to a bloom filter, except that instead of storing a single bit to say whether an object is in the set, the count min sketch allows you to keep a count per object. You can read more about bloom filters here: Estimating Set Membership With a Bloom Filter.

It’s called a “sketch” because it’s a smaller summarization of a larger data set.

Inserting an Item

The count min sketch is just a 2 dimensional array, with size of W x D. The actual data type in the array depends on how much storage space you want. It could be an unsigned char, it could be 4 bits, or it could be a uint64 (or larger!).

Each row (value of D) uses a different hash function to map objects to an index of W.

To insert an object, for each row you hash the object using that row’s hash function to get the W index for that object, then increment the count at that position.

In this blog post, I’m only going to be talking about being able to add items to the count min sketch. There are different rules / probabilities / etc for count min sketches that can have objects removed, but you can check out the links at the bottom of this post for more information about that!

Getting an Item Count

When you want to get the count of how many times an item has been added to the count min sketch, you do a similar operation as when you insert.

For each row, you hash the object being asked about with that row’s hash function to get the W index and then get the value for that row at the W index.

This will give you D values and you just return the smallest one.

The reason for this, is because due to hash collisions of various hash functions, your count in a specific slot may have been incorrectly incremented extra times due to other objects hashing to the same object. If you take the minimum value you’ve seen across all rows, you are guaranteed to be taking the value that has the least number of hash collisions, so is guaranteed to be most correct, and in fact guaranteed to be greater than or equal to the actual answer – but never lower than the actual answer.

Dot Product (Inner Product)

If you read the last post on using dot product with histograms to gauge similarity, you might be wondering if you can do a dot product between two count min sketch objects.

Luckily yes, you can! They need to have the same W x D dimensions and they need to use the same hash functions per row, but if that’s true, you can calculate a dot product value very easily.

If you have two count min sketch objects A and B that you want to calculate the dot product for, you dot product each row (D index) of the two count min sketch objects. This will leave you with D dot products and you just return the smallest one. This guarantees that the dot product value you calculate will have the fewest hash collisions (so will be most accurate), and will also guarantee that the estimate is greater that or equal to the actual answer, but will never be lower.

To Normalize Or Not To Normalize

There is a caveat here though with doing a dot product between two count min sketch objects. If you do a normalized dot product (normalize the vectors before doing a dot product, or dividing the answer by the length of the two vectors multiplied together), the guarantee that the dot product is greater than or equal to the true answer no longer holds!

The reason for this is that the formula for doing a normalized dot product is like this:
normalized dot product = dot(A,B) / (length(A)*length(B))

In a count min sketch, the dot(A,B) estimate is guaranteed to greater than or equal to the true value.

The length of a vector is also guaranteed to be greater than or equal to the length of the true vector (the vector made from the actual histogram values).

This means that the numerator and the denominator BOTH have varying levels of overestimation in them. Overestimation in the numerator makes the normalized dot product estimate larger, while overestimation in the denominator makes the normalized dot product estimate smaller.

The result is that a normalized dot product estimate can make no guarantee about being greater than or equal to the true value!

This may or may not be a problem for your situation. Doing a dot product with unnormalized vectors still gives you a value that you can use to compare “similarity values” between histograms, but it has slightly different meaning than a dot product with normalized vectors.

Specifically, if the counts are much larger in one histogram versus another (such as when doing a dot product between multiple large text documents and a small search term string), the “weight” of the larger counts will count for more.

That means if you search for “apple pie”, a 100 page novel that mentions apples 10 times will be a better match than a 1/2 page recipe for apple pie!

When you normalize histograms, it makes it so the counts are “by percentage of the total document length”, which would help our search correctly find that the apple pie recipe is more relevant.

In other situations, you might want to let the higher count weigh stronger even though the occurrences are “less dense” in the document.

It really just depends on what your usage case is.

Calculating W & D

There are two parameters (values) used when calculating the correct W and D dimensions of a count min sketch, for the desired accuracy levels. The parameters are ε (epsilon) and δ (delta).

ε (Epsilon) is “how much error is added to our counts with each item we add to the cm sketch”.

δ (Delta) is “with what probability do we want to allow the count estimate to be outside of our epsilon error rate”

To calculate W and D, you use these formulas:

W = ⌈e/ε⌉
D = ⌈ln (1/δ)⌉

Where ln is “natural log” and e is “euler’s constant”.

Accuracy Guarantees

When querying to get a count for a specific object (also called a “point query”) the accuracy guarantees are:

True Count <= Estimated Count
Estimated Count <= True Count + ε * Number Of Items Added
There is a δ chance that #2 is not true

When doing an unnormalized dot product, the accuracy guarantees are:

True Dot Product <= Estimated Dot Product
Estimated Dot Product <= True Dot Product + ε * Number Of Items Added To A * Number Of Items Added To B
There is a δ chance that #2 is not true

Conservative Update

There is an alternate way to implement adding an item to the cm sketch, which results in provably less error. That technique is called a “Conservative Update”.

When doing a conservative update, you first look at the values in each row that you would normally increment and keep track of the smallest value that you’ve seen. You then only increment the counters that have that smallest value.

The reason this works is because we only look at the smallest value across all rows when doing a look up. So long as the smallest value across all rows increases when you insert an object, you’ve satisfied the requirements to make a look up return a value that is greater than or equal to the true value. The reason this conservative update results in less error is because you are writing to fewer values, which means that there are fewer hash collisions happening.

While this increases accuracy, it comes at the cost of extra logic and processing time needed when doing an update, which may or may not be appropriate for your needs.

Example Runs

The example program is a lot like the program from the last post which implemented some search engine type functionality.

This program also shows you some count estimations to show you that functionality as well.

The first run of the program is with normalized vectors, the second run of the program is with unnormalized vectors, and the third run of the program, which is most accurate, is with unnormalized vectors and conservative updates.

First Run: Normalized Vectors, Regular Updates

Second Run: Unnormalized Vectors, Regular Updates

Third Run: Unnormalized Vectors, Conservative Updates

Example Code

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

const float c_eulerConstant = (float)std::exp(1.0);

// The CCountMinSketch class
template <typename TKEY, typename TCOUNTTYPE, unsigned int NUMBUCKETS, unsigned int NUMHASHES, typename HASHER = std::hash>
class CCountMinSketch
{
public:
	typedef CCountMinSketch TType;

	CCountMinSketch ()
		: m_countGrid { } // init counts to zero
		, m_vectorLengthsDirty(true)
	{ }

	static const unsigned int c_numBuckets = NUMBUCKETS;
	static const unsigned int c_numHashes = NUMHASHES;
	typedef TCOUNTTYPE TCountType;

	void AddItem (bool conservativeUpdate, const TKEY& item, const TCOUNTTYPE& count)
	{
		// this count min sketch is only supporting positive counts
		if (count = 0!n");
			return;
		}

		// remember that our vector lengths are inaccurate
		m_vectorLengthsDirty = true;

		// if doing a conservative update, only update the buckets that are necesary
		if (conservativeUpdate)
		{
			// find what the lowest valued bucket is and calculate what our new lowest
			// value should be
			TCOUNTTYPE lowestValue = GetCount(item) + count;

			// make sure every bucket has at least the lowest value it should have
			size_t rawHash = HASHER()(item);
			for (unsigned int i = 0; i < NUMHASHES; ++i)
			{
				size_t hash = std::hash()(rawHash ^ std::hash()(i));
				TCOUNTTYPE value = m_countGrid[i][hash%NUMBUCKETS];
				if (value < lowestValue)
					m_countGrid[i][hash%NUMBUCKETS] = lowestValue;
			}
		}
		// else do a normal update
		else
		{
			// for each hash, find what bucket this item belongs in, and add the count to that bucket
			size_t rawHash = HASHER()(item);
			for (unsigned int i = 0; i < NUMHASHES; ++i)
			{
				size_t hash = std::hash()(rawHash ^ std::hash()(i));
				m_countGrid[i][hash%NUMBUCKETS] += count;
			}
		}
	}

	TCOUNTTYPE GetCount (const TKEY& item)
	{
		// for each hash, get the value for this item, and return the smalles value seen
		TCOUNTTYPE ret = 0;
		size_t rawHash = HASHER()(item);
		for (unsigned int i = 0; i < NUMHASHES; ++i)
		{
			size_t hash = std::hash()(rawHash ^ std::hash()(i));
			if (i == 0 || ret > m_countGrid[i][hash%NUMBUCKETS])
				ret = m_countGrid[i][hash%NUMBUCKETS];
		}
		return ret;
	}

	void CalculateVectorLengths ()
	{
		// if our vector lengths were previously calculated, no need to do anything
		if (!m_vectorLengthsDirty)
			return;

		// calculate vector lengths of each hash
		for (unsigned int hash = 0; hash < NUMHASHES; ++hash)
		{
			m_vectorLengths[hash] = 0.0f;
			for (unsigned int bucket = 0; bucket < NUMBUCKETS; ++bucket)
				m_vectorLengths[hash] += (float)m_countGrid[hash][bucket] * (float)m_countGrid[hash][bucket];
			m_vectorLengths[hash] = sqrt(m_vectorLengths[hash]);
		}

		// remember that our vector lengths have been calculated
		m_vectorLengthsDirty = false;
	}

	friend float HistogramDotProduct (TType& A, TType& B, bool normalize)
	{
		// make sure the vector lengths are accurate. No cost if they were previously calculated
		A.CalculateVectorLengths();
		B.CalculateVectorLengths();

		// whatever hash has the smallest dot product is the most correct
		float ret = 0.0f;
		bool foundValidDP = false;
		for (unsigned int hash = 0; hash < NUMHASHES; ++hash)
		{
			// if either vector length is zero, don't consider this dot product a valid result
			// we cant normalize it, and it will be zero anyways
			if (A.m_vectorLengths[hash] == 0.0f || B.m_vectorLengths[hash] == 0.0f)
				continue;

			// calculate dot product of unnormalized vectors
			float dp = 0.0f;
			for (unsigned int bucket = 0; bucket  dp)
			{
				ret = dp;
				foundValidDP = true;
			}
		}
		return ret;
	}

private:
	typedef std::array TBucketList;
	typedef std::array TTable;

	TTable m_countGrid;
	bool m_vectorLengthsDirty;
	std::array m_vectorLengths;
};

// Calculate ideal count min sketch parameters for your needs.
unsigned int CMSIdealNumBuckets (float error)
{
	return (unsigned int)std::ceil((float)(c_eulerConstant / error));
}

unsigned int CMSIdealNumHashes (float probability)
{
	return (unsigned int)std::ceil(log(1.0f / probability));
}

typedef std::string TKeyType;
typedef unsigned char TCountType;

typedef CCountMinSketch THistogramEstimate;
typedef std::unordered_map THistogramActual;

// These one paragraph stories are from http://birdisabsurd.blogspot.com/p/one-paragraph-stories.html

// The Dino Doc : http://birdisabsurd.blogspot.com/2011/11/dino-doc.html (97 words)
const char *g_storyA =
"The Dino Doc:n"
"Everything had gone according to plan, up 'til this moment. His design team "
"had done their job flawlessly, and the machine, still thrumming behind him, "
"a thing of another age, was settled on a bed of prehistoric moss. They'd "
"done it. But now, beyond the protection of the pod and facing an enormous "
"tyrannosaurus rex with dripping jaws, Professor Cho reflected that, had he "
"known of the dinosaur's presence, he wouldn't have left the Chronoculator - "
"and he certainly wouldn't have chosen "Stayin' Alive", by The Beegees, as "
"his dying soundtrack. Curse his MP3 player";

// The Robot: http://birdisabsurd.blogspot.com/2011/12/robot.html (121 words)
const char *g_storyB =
"The Robot:n"
"The engineer watched his robot working, admiring its sense of purpose.It knew "
"what it was, and what it had to do.It was designed to lift crates at one end "
"of the warehouse and take them to the opposite end.It would always do this, "
"never once complaining about its place in the world.It would never have to "
"agonize over its identity, never spend empty nights wondering if it had been "
"justified dropping a promising and soul - fulfilling music career just to "
"collect a bigger paycheck.And, watching his robot, the engineer decided that "
"the next big revolution in the robotics industry would be programming "
"automatons with a capacity for self - doubt.The engineer needed some company.";

// The Internet: http://birdisabsurd.blogspot.com/2011/11/internet.html (127 words)
const char *g_storyC =
"The Internet:n"
"One day, Sandra Krewsky lost her mind.Nobody now knows why, but it happened - "
"and when it did, Sandra decided to look at every page on the Internet, "
"insisting that she wouldn't eat, drink, sleep or even use the washroom until "
"the job was done. Traps set in her house stalled worried family members, and by "
"the time they trounced the alligator guarding her bedroom door - it managed to "
"snap her neighbour's finger clean off before going down - Sandra was already "
"lost… though the look of despair carved in her waxen features, and the cat "
"video running repeat on her flickering computer screen, told them everything "
"they needed to know.She'd seen too much. She'd learned that the Internet "
"played for keeps.";

void WaitForEnter ()
{
	printf("nPress Enter to quit");
	fflush(stdin);
	getchar();
}

template 
void ForEachWord (const std::string &source, L& lambda)
{
	size_t prev = 0;
	size_t next = 0;

	while ((next = source.find_first_of(" ,.-":n", prev)) != std::string::npos)
	{
		if ((next - prev != 0))
		{
			std::string word = source.substr(prev, next - prev);
			std::transform(word.begin(), word.end(), word.begin(), ::tolower);
			lambda(word);
		}
		prev = next + 1;
	}

	if (prev < source.size())
	{
		std::string word = source.substr(prev);
		std::transform(word.begin(), word.end(), word.begin(), ::tolower);
		lambda(word);
	}
}

void PopulateHistogram (THistogramEstimate &histogram, const char *text, bool conservativeUpdate)
{
	ForEachWord(text, [&](const std::string &word) {
		histogram.AddItem(conservativeUpdate, word, 1);
	});
}

void PopulateHistogram (THistogramActual &histogram, const char *text)
{
	ForEachWord(text, [&histogram](const std::string &word) {
		histogram[word]++;
	});
}

float HistogramDotProduct (THistogramActual &A, THistogramActual &B, bool normalize)
{
	// Get all the unique keys from both histograms
	std::set keysUnion;
	std::for_each(A.cbegin(), A.cend(), [&keysUnion](const std::pair& v)
	{
		keysUnion.insert(v.first);
	});
	std::for_each(B.cbegin(), B.cend(), [&keysUnion](const std::pair& v)
	{
		keysUnion.insert(v.first);
	});

	// calculate and return the normalized dot product!
	float dotProduct = 0.0f;
	float lengthA = 0.0f;
	float lengthB = 0.0f;
	std::for_each(keysUnion.cbegin(), keysUnion.cend(),
		[&A, &B, &dotProduct, &lengthA, &lengthB]
		(const TKeyType& key)
		{
			// if the key isn't found in either histogram ignore it, since it will be 0 * x which is
			// always anyhow.  Make sure and keep track of vector length though!
			auto a = A.find(key);
			auto b = B.find(key);

			if (a != A.end())
				lengthA += (float)(*a).second * (float)(*a).second;

			if (b != B.end())
				lengthB += (float)(*b).second * (float)(*b).second;

			if (a == A.end())
				return;

			if (b == B.end())
				return;

			// calculate dot product
			dotProduct += ((float)(*a).second * (float)(*b).second);
		}
	);

	// if we don't need to normalize, return the unnormalized value we have right now
	if (!normalize)
		return dotProduct;

	// normalize if we can
	if (lengthA * lengthB <= 0.0f)
		return 0.0f;

	lengthA = sqrt(lengthA);
	lengthB = sqrt(lengthB);
	return dotProduct / (lengthA * lengthB);
}

template 
const char *CalculateError (const T&estimate, const T&actual)
{
	float error = 100.0f * ((float)estimate - (float)actual) / (float)actual;
	if (std::isnan(error) || std::isinf(error))
		return "undef";
	
	// bad practice to return a static local string, dont do this in production code!
	static char ret[256];
	sprintf(ret, "%0.2f%%", error);
	return ret;
}

int main (int argc, char **argv)
{
	// settings
	const bool c_normalizeDotProducts = false;
	const bool c_conservativeUpdate = true;

	// show settings and implication
	printf("Dot Products Normalized? %sn",
		c_normalizeDotProducts
			? "Yes! estimate could be  actual"
			: "No! estimate <= actual");

	printf("Conservative Updates? %snn",
		c_conservativeUpdate
			? "Yes! Reduced error"
			: "No! normal error");	

	// populate our probabilistic histograms.
	// Allocate memory for the objects so that we don't bust the stack for large histogram sizes!
	std::unique_ptr TheDinoDocEstimate = std::make_unique();
	std::unique_ptr TheRobotEstimate = std::make_unique();
	std::unique_ptr TheInternetEstimate = std::make_unique();
	PopulateHistogram(*TheDinoDocEstimate, g_storyA, c_conservativeUpdate);
	PopulateHistogram(*TheRobotEstimate, g_storyB, c_conservativeUpdate);
	PopulateHistogram(*TheInternetEstimate, g_storyC, c_conservativeUpdate);

	// populate our actual count histograms for comparison
	THistogramActual TheDinoDocActual;
	THistogramActual TheRobotActual;
	THistogramActual TheInternetActual;
	PopulateHistogram(TheDinoDocActual, g_storyA);
	PopulateHistogram(TheRobotActual, g_storyB);
	PopulateHistogram(TheInternetActual, g_storyC);

	// report whether B or C is a closer match for A
	float dpABEstimate = HistogramDotProduct(*TheDinoDocEstimate, *TheRobotEstimate, c_normalizeDotProducts);
	float dpACEstimate = HistogramDotProduct(*TheDinoDocEstimate, *TheInternetEstimate, c_normalizeDotProducts);
	float dpABActual = HistogramDotProduct(TheDinoDocActual, TheRobotActual, c_normalizeDotProducts);
	float dpACActual = HistogramDotProduct(TheDinoDocActual, TheInternetActual, c_normalizeDotProducts);
	printf(""The Dino Doc" vs ...n");
	printf("  "The Robot"    %0.4f (actual %0.4f) Error: %sn", dpABEstimate, dpABActual, CalculateError(dpABEstimate, dpABActual));
	printf("  "The Internet" %0.4f (actual %0.4f) Error: %snn", dpACEstimate, dpACActual, CalculateError(dpACEstimate, dpACActual));
	if (dpABEstimate > dpACEstimate)
		printf("Estimate: "The Dino Doc" and "The Robot" are more similarn");
	else
		printf("Estimate: "The Dino Doc" and "The Internet" are more similarn");
	if (dpABActual > dpACActual)
		printf("Actual:   "The Dino Doc" and "The Robot" are more similarn");
	else
		printf("Actual:   "The Dino Doc" and "The Internet" are more similarn");

	// let the user do a search engine style query for our stories!
	char searchString[1024];
	printf("nplease enter a search string:n");
	searchString[0] = 0;
	scanf("%[^n]", searchString);

	struct SSearchResults
	{
		SSearchResults(const std::string& pageName, float rankingEstimated, float rankingActual)
			: m_pageName(pageName)
			, m_rankingEstimated(rankingEstimated)
			, m_rankingActual(rankingActual)
		{ }

		bool operator  other.m_rankingEstimated;
		}

		std::string m_pageName;
		float       m_rankingEstimated;
		float       m_rankingActual;
	};
	std::vector results;

	// preform our search and gather our results!
	std::unique_ptr searchEstimate = std::make_unique();
	THistogramActual searchActual;
	PopulateHistogram(*searchEstimate, searchString, c_conservativeUpdate);
	PopulateHistogram(searchActual, searchString);
	results.push_back(
		SSearchResults(
			"The Dino Doc",
			HistogramDotProduct(*TheDinoDocEstimate, *searchEstimate, c_normalizeDotProducts),
			HistogramDotProduct(TheDinoDocActual, searchActual, c_normalizeDotProducts)
		)
	);
	results.push_back(
		SSearchResults(
			"The Robot",
			HistogramDotProduct(*TheRobotEstimate, *searchEstimate, c_normalizeDotProducts),
			HistogramDotProduct(TheRobotActual, searchActual, c_normalizeDotProducts)
		)
	);
	results.push_back(
		SSearchResults(
			"The Internet",
			HistogramDotProduct(*TheInternetEstimate, *searchEstimate, c_normalizeDotProducts),
			HistogramDotProduct(TheInternetActual, searchActual, c_normalizeDotProducts)
		)
	);
	std::sort(results.begin(), results.end());

	// show the search results
	printf("nSearch results sorted by estimated relevance:n");
	std::for_each(results.begin(), results.end(), [](const SSearchResults& result) {
		printf("  "%s" : %0.4f (actual %0.4f) Error: %sn",
			result.m_pageName.c_str(),
			result.m_rankingEstimated,
			result.m_rankingActual,
			CalculateError(result.m_rankingEstimated, result.m_rankingActual)
		);
	});

	// show counts of search terms in each story (estimated and actual)
	printf("nEstimated counts of search terms in each story:n");
	std::for_each(searchActual.cbegin(), searchActual.cend(), [&] (const std::pair& v)
	{
		// show key
		printf(""%s"n", v.first.c_str());

		// the dino doc
		TCountType estimate = TheDinoDocEstimate->GetCount(v.first.c_str());
		TCountType actual = 0;
		auto it = TheDinoDocActual.find(v.first.c_str());
		if (it != TheDinoDocActual.end())
			actual = it->second;
		printf("  "The Dino Doc" %u (actual %u) Error: %sn", estimate, actual, CalculateError(estimate, actual));

		// the robot
		estimate = TheRobotEstimate->GetCount(v.first.c_str());
		actual = 0;
		it = TheRobotActual.find(v.first.c_str());
		if (it != TheRobotActual.end())
			actual = it->second;
		printf("  "The Robot"    %u (actual %u) Error: %sn", estimate, actual, CalculateError(estimate, actual));

		// the internet
		estimate = TheInternetEstimate->GetCount(v.first.c_str());
		actual = 0;
		it = TheInternetActual.find(v.first.c_str());
		if (it != TheInternetActual.end())
			actual = it->second;
		printf("  "The Internet" %u (actual %u) Error: %sn", estimate, actual, CalculateError(estimate, actual));
	});

	// show memory use
	printf("nThe above used %u buckets and %u hashes with %u bytes per countn",
		THistogramEstimate::c_numBuckets, THistogramEstimate::c_numHashes, sizeof(THistogramEstimate::TCountType));
	printf("Totaling %u bytes of storage for each histogramnn",
		THistogramEstimate::c_numBuckets * THistogramEstimate::c_numHashes * sizeof(THistogramEstimate::TCountType));
	
	// show a probabilistic suggestion
	float error = 0.1f;
	float probability = 0.01f;
	printf("You should use %u buckets and %u hashes for...n", CMSIdealNumBuckets(error), CMSIdealNumHashes(probability));
	printf("true count <= estimated count <= true count + %0.2f * Items ProcessednWith probability %0.2f%%n", error, (1.0f - probability)*100.0f);
	
	WaitForEnter();
	return 0;
}

Links

If you use this in production code, you should probably use a better quality hash.

The rabbit hole on this stuff goes deeper, so if you want to know more, check out these links!

Wikipedia: Count Min Sketch
Count Min Sketch Full Paper
Count Min Sketch AT&T Research Paper
Another CMS paper
And another, with some more info like range query details

Next up I’ll be writing about hyperloglog, which does the same thing as KMV (K-Minimum Values) but is better at it!

Writing a Basic Search Engine AKA Calculating Similarity of Histograms with Dot Product

I came across this technique while doing some research for the next post and found it so interesting that it seemed to warrant it’s own post.

Histograms and Multisets

Firstly, histogram is just a fancy word for a list of items that have an associated count.

If I were to make a histogram of the contents of my office, I would end up with something like:

Books = 20
Phone = 1
Headphones = 2
Sombrero = 1 (thanks to the roaming tequilla cart, but that’s another story…)
VideoGames = 15
(and so on)

Another term for a histogram is multiset, so if you see that word, just think of it as being the same thing.

Quick Dot Product Refresher

Just to make sure we are on the same page, using dot product to get the angle between two vectors is as follows:

cos(theta) = (A * B) / (||A||*||B||)

Or in coder-eese, if A and B are vectors of any dimension:

cosTheta = dot(A,B) / (length(A)*length(B))

To actually do the “dot” portion, you just multiply the X’s by the X’s, the Y’s by the Y’s, the Z’s by the Z’s etc, and add them all up to get a single scalar value. For a 3d vector it would look like this:

dot(A,B) = A.x*B.x + A.y*B.y + A.z*B.z

The value from the formulas above will tell you how similar the direction of the two vectors are.

If the value is 1, that means they are pointing the exact same way. if the value is 0 that means they are perpendicular. If the value is -1 that means they are pointing the exact opposite way.

Note that you can forgo the division by lengths in that formula, and just look at whether the result is positive, negative, or zero, if that’s enough information for your needs. We’ll be using the full on normalized vector version above in this post today though.

For a deeper refresher on dot product check out this link:
Wikipedia: Dot Product

Histogram Dot Products

Believe it or not, if you treat the counts in a histogram as an N dimensional vector – where N is the number of categories in the histogram – you can use the dot product to gauge similarity between the contents of two histograms by using it to get the angular difference between the vectors!

In the normal usage case, histograms have counts that are >= 0, which ends up meaning that two histogram vectors can only be up to 90 degrees apart. That ends up meaning that the result of the dot product of these normalized vectors is going to be from 0 to 1, where 0 means they have nothing in common, and 1 means they are completely the same.

This is similar to the Jaccard Index mentioned in previous posts, but is different. In fact, this value isn’t even linear (although maybe putting it through acos and dividing by pi/2 may make it suitably linear!), as it represents the cosine of an angle, not a percentage of similarity. It’s still useful though. If you have histogram A and are trying to see if histogram B or C is a closer match to A, you can calculate this value for the A/B pair and the A/C pair. The one with the higher value is the more similar pairing.

Another thing to note about this technique is that order doesn’t matter. If you are trying to compare two multisets where order matters, you are going to have to find a different algorithm or try to pull some shenanigans – such as perhaps weighting the items in the multiset based on the order they were in.

Examples

Let’s say we have two bags of fruit and we want to know how similar the bags are. The bags in this case represent histograms / multisets.

In the first bag, we have 3 apples. In the second bag, we have 2 oranges.

If we have a 2 dimensional vector where the first component is apples and the second component is oranges, we can represent our bags of fruit with these vectors:

Bag1 = [3, 0]
Bag2 = [0, 2]

Now, let’s dot product the vectors:

Dot Product = Bag1.Apples * Bag2.Apples + Bag1.Oranges * Bag2.Oranges
= 3*0 + 0*2
= 0

We would normally divide our answer by the length of the Bag1 vector and then the length of the Bag2 vector, but since it’s zero we know that won’t change the value.

From this, we can see that Bag1 and Bag2 have nothing in common!

What if we added an apple to bag 2 though? let’s do that and try the process again.

Bag1 = [3,0]
Bag2 = [1,2]
Dot Product = Bag1.Apples * Bag2.Apples + Bag1.Oranges * Bag2.Oranges
= 3*1 + 0*2
= 3

Next up, we need to divide the answer by the length of our Bag1 vector which is 3, multiplied the length of our Bag2 vector which is the square root of 5.

Cosine Angle (Similarity Value) = 3 / (3 * sqrt(5))
= ~0.45

Lets add an orange to bag 1. That ought to make it more similar to bag 2, so should increase our similarity value of the two bags. Let’s find out if that’s true.

Bag1 = [3,1]
Bag2 = [1,2]
Dot Product = Bag1.Apples * Bag2.Apples + Bag1.Oranges * Bag2.Oranges
= 3*1 + 1*2
= 5

Next, we need to divide that answer by the length of the bag 1 vector which is the square root of 10, multiplied by the length of the bag 2 vector which is the square root of 5.

Cosine Angle (Similarity Value) = 5 / (sqrt(10) * sqrt(5))
= ~0.71

So yep, adding an orange to bag 1 made the two bags more similar!

Example Code

Here’s a piece of code that uses the dot product technique to be able to gauge the similarity of text. It uses this to compare larger blocks of text, and also uses it to allow you to search the blocks of text like a search engine!

#include 
#include 
#include 
#include 
#include 
 
typedef std::unordered_map THistogram;
 
// These one paragraph stories are from http://birdisabsurd.blogspot.com/p/one-paragraph-stories.html
 
// The Dino Doc : http://birdisabsurd.blogspot.com/2011/11/dino-doc.html
const char *g_storyA =
"The Dino Doc:n"
"Everything had gone according to plan, up 'til this moment. His design team "
"had done their job flawlessly, and the machine, still thrumming behind him, "
"a thing of another age, was settled on a bed of prehistoric moss. They'd "
"done it. But now, beyond the protection of the pod and facing an enormous "
"tyrannosaurus rex with dripping jaws, Professor Cho reflected that, had he "
"known of the dinosaur's presence, he wouldn't have left the Chronoculator - "
"and he certainly wouldn't have chosen "Stayin' Alive", by The Beegees, as "
"his dying soundtrack. Curse his MP3 player";
 
// The Robot: http://birdisabsurd.blogspot.com/2011/12/robot.html
const char *g_storyB =
"The Robot:n"
"The engineer watched his robot working, admiring its sense of purpose.It knew "
"what it was, and what it had to do.It was designed to lift crates at one end "
"of the warehouse and take them to the opposite end.It would always do this, "
"never once complaining about its place in the world.It would never have to "
"agonize over its identity, never spend empty nights wondering if it had been "
"justified dropping a promising and soul - fulfilling music career just to "
"collect a bigger paycheck.And, watching his robot, the engineer decided that "
"the next big revolution in the robotics industry would be programming "
"automatons with a capacity for self - doubt.The engineer needed some company.";
 
// The Internet: http://birdisabsurd.blogspot.com/2011/11/internet.html
const char *g_storyC =
"The Internet:n"
"One day, Sandra Krewsky lost her mind.Nobody now knows why, but it happened - "
"and when it did, Sandra decided to look at every page on the Internet, "
"insisting that she wouldn't eat, drink, sleep or even use the washroom until "
"the job was done. Traps set in her house stalled worried family members, and by "
"the time they trounced the alligator guarding her bedroom door - it managed to "
"snap her neighbour's finger clean off before going down - Sandra was already "
"lost… though the look of despair carved in her waxen features, and the cat "
"video running repeat on her flickering computer screen, told them everything "
"they needed to know.She'd seen too much. She'd learned that the Internet "
"played for keeps.";
 
void WaitForEnter ()
{
    printf("nPress Enter to quit");
    fflush(stdin);
    getchar();
}
 
template 
void ForEachWord (const std::string &source, L& lambda)
{
    size_t prev = 0;
    size_t next = 0;
 
    while ((next = source.find_first_of(" ,.-":n", prev)) != std::string::npos)
    {
        if ((next - prev != 0))
        {
            std::string word = source.substr(prev, next - prev);
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            lambda(word);
        }
        prev = next + 1;
    }
 
    if (prev < source.size())
    {
        std::string word = source.substr(prev);
        std::transform(word.begin(), word.end(), word.begin(), ::tolower);
        lambda(word);
    }
}
 
void PopulateHistogram (THistogram &histogram, const char *text)
{
    ForEachWord(text, [&histogram](const std::string &word) {
        histogram[word] ++;
    });
}
 
float HistogramDotProduct (THistogram &A, THistogram &B)
{
    // Get all the unique keys from both histograms
    std::set keysUnion;
    std::for_each(A.cbegin(), A.cend(), [&keysUnion](const std::pair& v)
    {
        keysUnion.insert(v.first);
    });
    std::for_each(B.cbegin(), B.cend(), [&keysUnion](const std::pair& v)
    {
        keysUnion.insert(v.first);
    });
 
    // calculate and return the normalized dot product!
    float dotProduct = 0.0f;
    float lengthA = 0.0f;
    float lengthB = 0.0f;
    std::for_each(keysUnion.cbegin(), keysUnion.cend(),
        [&A, &B, &dotProduct, &lengthA, &lengthB]
        (const std::string& key)
        {
            // if the key isn't found in either histogram ignore it, since it will be 0 * x which is
            // always anyhow.  Make sure and keep track of vector length though!
            auto a = A.find(key);
            auto b = B.find(key);
 
            if (a != A.end())
                lengthA += (float)(*a).second * (float)(*a).second;
 
            if (b != B.end())
                lengthB += (float)(*b).second * (float)(*b).second;
 
            if (a == A.end())
                return;
 
            if (b == B.end())
                return;
 
            // calculate dot product
            dotProduct += ((float)(*a).second * (float)(*b).second);        
        }
    );
 
    // normalize if we can
    if (lengthA * lengthB  dpAC)
        printf(""The Dino Doc" and "The Robot" are more similarn");
    else
        printf(""The Dino Doc" and "The Internet" are more similarn");
 
    // let the user do a search engine style query for our stories!
    char searchString[1024];
    printf("nplease enter a search string:n");
    scanf("%[^n]", searchString);
 
    // preform our search and gather our results!
    THistogram search;
    PopulateHistogram(search, searchString);
 
    struct SSearchResults
    {
        SSearchResults (const std::string& pageName, float ranking, const char* pageContent)
            : m_pageName(pageName)
            , m_ranking(ranking)
            , m_pageContent(pageContent)
        { }
 
        bool operator  other.m_ranking;
        }
 
        std::string m_pageName;
        float       m_ranking;
        const char* m_pageContent;
    };
    std::vector results;
 
    results.push_back(SSearchResults("The Dino Doc", HistogramDotProduct(TheDinoDoc, search), g_storyA));
    results.push_back(SSearchResults("The Robot", HistogramDotProduct(TheRobot, search), g_storyB));
    results.push_back(SSearchResults("The Internet", HistogramDotProduct(TheInternet, search), g_storyC));
    std::sort(results.begin(), results.end());
 
    // show the search results
    printf("nSearch results sorted by relevance:n");
    std::for_each(results.begin(), results.end(), [] (const SSearchResults& result) {
        printf("  "%s" : %0.4fn", result.m_pageName.c_str(), result.m_ranking);
    });
 
    // show the most relevant result
    printf("n-----Best Result-----n%sn",results[0].m_pageContent);
 
    WaitForEnter();
    return 0;
}

Here is an example run of this program:

Improvements

Our “search engine” code does work but is pretty limited. It doesn’t know that “cat” and “Kitty” are basically the same, and also doesn’t know that the words “and”, “the” & “it” are not important.

I’m definitely not an expert in these matters, but to solve the first problem you might try making a “thesaurus” lookup table, where maybe whenever it sees “kitten”, “kitty”, “feline”, it translates it to “cat” before putting it into the histogram. That would make it smarter at realizing all those words mean the same thing.

For the second problem, there is a strange technique called tf–idf that seems to work pretty well, although people haven’t really been able to rigorously figure out why it works so well. Check out the link to it at the bottom of this post.

Besides just using this technique for text comparisons, I’ve read mentions that this histogram dot product technique is used in places like machine vision and object classification.

It is a pretty neat technique 😛

Links

Wikipedia: Cosine Similarity – The common name for this technique
Easy Bar Chart Creator – Used to make the example graphs
Wikipedia: tf-idf – Used to automagically figure out which words in a document are important and which aren’t

Estimating Set Membership With a Bloom Filter

Have you ever found yourself in a situation where you needed to keep track of whether or not you’ve seen an item, but the number of items you have to keep track of is either gigantic or completely unbounded?

This comes up a lot in “massive data” situations (like internet search engines), but it can also come up in games sometimes – like maybe having a monster in an MMO game remember which players have tried to attack it before so it can be hostile when it sees them again.

One solution to the game example could be to keep a list of all players the monster had been attacked by, but that will get huge and eat up a lot of memory and take time to search the list for a specific player.

You might instead decide you only want to keep the last 20 players the monster had been attacked by, but that still uses up a good chunk of memory if you have a lot of different monsters tracking this information for themselves, and 20 may be so low, that the monster will forget people who commonly attack it by the time it sees them again.

Enter The Bloom Filter

There is actually an interesting solution to this problem, using a probabilistic data structure called a bloom filter – and before you ask, no, it has nothing to do with graphics. It was invented by a guy named Burton Howard Bloom in 1970.

A bloom filter basically has M number of bits as storage and K number of hashes.

To illustrate an example, let’s say that we have 10 bits and only 1 hash.

When we insert an item, what we want to do is hash that item and then modulus that has value against 10 to get a pseudo-random (but deterministic) value 0-9. That value is the index of the bit we want to set to true. After we set that bit to true, we can consider that item is inserted into the set.

When we want to ask if an item exists in a given bloom filter, what we do is again hash the item and modulus it against 10 to get the 0-9 value again. If that bit is set to false, we can say that item is NOT in the set with certainty, but if it’s true we can only say it MIGHT be in the set.

If the bit is true, we can’t say that for sure it’s part of the set, because something else may have hashed to the same bit and set it, so we have to consider it a maybe, not a definite yes. In fact, with only 10 bits and 1 hash function, with a good hash function, there is a 10% chance that any item we insert will result in the same bit being set.

To be a little more sure of our “maybe” being a yes, and not a false positive, we can do two things… we can increase the number of bits we use for the set (which will reduce collisions), or we can increase the number of hashes we do (although there comes a point where adding more hashes starts decreasing accuracy again).

If we add a second hash, all that means is that we will do a second hash, get a second 0-9 value, and write a one to that bit AS WELL when we insert an item. When checking if an item exists in a set, we also READ that bit to see if it’s true.

For N hashes, when inserting an item into a bloom filter, you get (up to) N different bits that you need to set to 1.

When checking if an item exists in a bloom filter, you get (up to) N different bits, all of which need to be 1 to consider the item in the set.

There are a few mathematical formulas you can use to figure out how many bits and hashes you would want to use to get a desired reliability that your maybes are in fact yeses, and also there are some formulas for figuring out the probability that your maybe is in fact a yes for any given bloom filter in a runtime state.

Depending on your needs and usage case, you may treat your “maybe” as a “yes” and move on, or you could treat a “maybe” as a time when you need to do a more expensive test (like, a disk read?). In those situations, the “no” is the valuable answer, since it means you can skip the more expensive test.

Estimating Item Count In Bloom Filters

Besides being able to test an item for membership, you can also estimate how many items are in any given bloom filter:

EstimatedCount = -(NumBits * ln(1 – BitsOn / NumBits)) / NumHashes

Where BitsOn is the count of how many bits are set to 1 and ln is natural log.

Set Operations On Bloom Filters

If you have two bloom filters, you can do some interesting set operations on them.

Union

One operation you can do with two bloom filters is union them, which means that if you have a bloom filter A and bloom filter B, you end up with a third bloom filter C that contains all the unique items from both A and B.

Besides being able to test this third bloom filter to see if it contains specific items, you can also ask it for an estimated count of objects it contains, which is useful for helping figure out how similar the sets are.

Essentially, if A estimates having 50 items and B estimates having 50 items, and their union, C, estimates having 51 items, that means that the items in A and B were almost all the same (probably).

How you union two bloom filters is you just do a bitwise OR on every bit in A and B to get the bits for C. Simple and fast to calculate.

Intersection

Another operation you can do with two bloom filters is to calculate their intersection. The intersection of two sets is just the items that the two sets have in common.

Again, besides being able to test this third bloom filter for membership of items, you can also use it to get an estimated count of objects to help figure out how similar the sets are.

Similary to the union, you can reason that if the intersection count is small compared to the counts of the individual bloom filters that went into it, that they were probably not very similar.

There are two ways you can calculate the intersection of two bloom filters.

The first way is to do a bitwise AND on every bit in A and B to get the bits for C. Again, super simple and faster to calculate.

The other way just involves some simple math:

Count(Intersection(A,B)) = (Count(A) + Count(B)) – Count(Union(A,B))

Whichever method is better depends on your needs and your usage case.

Jaccard Index

Just like I talked about in the KMV post two posts ago, you can also calculate an estimated Jaccard Index for two bloom filters, which will give you a value between 0 and 1 that tells you how similar two sets are.

If the Jaccard index is 1, that means the sets are the same and contain all the same items. If the Jaccard index is 0, that means the sets are completely different. If the Jaccard index is 0.5 that means that they have half of their items in common.

To calculate the estimated Jaccard Index, you just use this simple formula:

Jaccard Index = Count(Intersection(A,B)) / Count(Union(A,B))

Estimating False Positive Rate

The more items you insert into a bloom filter the higher the false positive rate gets, until it gets to 100% which means all the bits are set and if you ask if it contains an item, it will never say no.

To combat this, you may want to calculate your error rate at runtime and maybe spit out a warning if the error rate starts getting too high, so that you know you need to adjust the number of bits or number of hashes, or at very least you can alert the user that the reliability of the answers has degraded.

Here is the formula to calculate the false positive rate of a bloom filter at runtime:

ErrorRate = (1 – e^(-NumHashes * NumItems / NumBits))^NumHashes

NumItems is the number of unique items that have been inserted into the bloom filter. If you know that exact value somehow you can use that real value, but in the more likely case, you won’t know the exact value, so you can use the estimated unique item count formula described above to get an estimated unique count.

Managing the Error Rate of False Positives

As I mentioned above, there are formulas to figure out how many bits and how many hashes you need, to store a certain number of items with a specific maximum error rate.

You can work this out in advance by figuring out about how many items you expect to see.

First, you calculate the ideal bit count:

NumBits = – (NumItems * ln(DesiredFalsePositiveProbability)) / (ln(2)^2)

Where NumItems is the number of items you expect to put into the bloom filter, and DesiredFalsePositiveProbability is the error rate you want when the bloom filter has the expected number of items in it. The error rate will be lower until the item count reaches NumItems.

Next, you calculate the ideal number of hashes:

NumHashes = NumBits / NumItems * ln(2)

Then, you just create your bloom filter, using the specified number of bits and hashes.

Example Code

Here is some example bloom filter c++ code with all the bells and whistles. Instead of using multiple hashes, I just grabbed some random numbers and xor them against the hash of each item to make more deterministic but pseudo-random numbers (that’s what a hash does too afterall). If you want to use a bloom filter in a more serious usage case, you may want to actually use multiple hashes, and you probably want to use a better hashing algorithm than std::hash.

#include 
#include 
#include 
#include 
#include 
#include 
#include 
 
// If you get a compile error here, remove the "class" from this enum definition.
// It just means your compiler doesn't support enum classes yet.
enum class EHasItem
{
    e_no,
    e_maybe
};
 
// The CBloomFilter class
template <typename T, unsigned int NUMBYTES, unsigned int NUMHASHES, typename HASHER = std::hash>
class CBloomFilter
{
public:
    // constants
    static const unsigned int c_numHashes = NUMHASHES;
    static const unsigned int c_numBits = NUMBYTES*8;
    static const unsigned int c_numBytes = NUMBYTES;
 
    // friends
    template 
    friend float UnionCountEstimate(const CBloomFilter& left, const CBloomFilter& right);
	template 
	friend float IntersectionCountEstimate (const CBloomFilter& left, const CBloomFilter& right);
 
    // constructor
    CBloomFilter (const std::array& randomSalt)
        : m_storage()              // initialize bits to zero
        , m_randomSalt(randomSalt) // store off the random salt to use for the hashes
    { }
 
    // interface
    void AddItem (const T& item)
    {
        const size_t rawItemHash = HASHER()(item);
        for (unsigned int index = 0; index < c_numHashes; ++index)
        {
            const size_t bitIndex = (rawItemHash ^ m_randomSalt[index])%c_numBits;
            SetBitOn(bitIndex);
        }
    }
 
    EHasItem HasItem (const T& item) const
    {
        const size_t rawItemHash = HASHER()(item);
        for (unsigned int index = 0; index < c_numHashes; ++index)
        {
            const size_t bitIndex = (rawItemHash ^ m_randomSalt[index])%c_numBits;
            if (!IsBitOn(bitIndex))
                return EHasItem::e_no;
        }
        return EHasItem::e_maybe;
    }
 
    // probabilistic interface
    float CountEstimate () const
    {
        // estimates how many items are actually stored in this set, based on how many
        // bits are set to true, how many bits there are total, and how many hashes
        // are done for each item.
        return -((float)c_numBits * std::log(1.0f - ((float)CountBitsOn() / (float)c_numBits))) / (float)c_numHashes;
    }
 
    float FalsePositiveProbability (size_t numItems = -1) const
    {
        // calculates the expected error.  Since this relies on how many items are in
        // the set, you can either pass in the number of items, or use the default
        // argument value, which means to use the estimated item count
        float numItemsf = numItems == -1 ? CountEstimate() : (float)numItems;
        return pow(1.0f - std::exp(-(float)c_numHashes * numItemsf / (float)c_numBits),(float)c_numHashes);
    }
     
private:
 
    inline void SetBitOn (size_t bitIndex)
    {
        const size_t byteIndex = bitIndex / 8;
        const uint8_t byteValue = 1 << (bitIndex%8);
        m_storage[byteIndex] |= byteValue;
    }
 
    inline bool IsBitOn (size_t bitIndex) const
    {
        const size_t byteIndex = bitIndex / 8;
        const uint8_t byteValue = 1 << (bitIndex%8);
        return ((m_storage[byteIndex] & byteValue) != 0);
    }
 
    size_t CountBitsOn () const
    {
        // likely a more efficient way to do this but ::shrug::
        size_t count = 0;
        for (size_t index = 0; index < c_numBits; ++index)
        {
            if (IsBitOn(index))
                ++count;
        }
        return count;
    }
     
    // storage of bits
    std::array m_storage;
 
    // Storage of random salt values
    // It could be neat to use constexpr and __TIME__ to make compile time random numbers.
    // That isn't available til like c++17 or something though sadly.
    const std::array& m_randomSalt;
};
 
// helper functions
template 
float UnionCountEstimate (const CBloomFilter& left, const CBloomFilter& right)
{
    // returns an estimated count of the unique items if both lists were combined
    // example: (1,2,3) union (2,3,4) = (1,2,3,4) which has a union count of 4
    CBloomFilter temp(left.m_randomSalt);
    for (unsigned int index = 0; index < left.c_numBytes; ++index)
        temp.m_storage[index] = left.m_storage[index] | right.m_storage[index];
     
    return temp.CountEstimate();
}
 
template 
float IntersectionCountEstimate (const CBloomFilter& left, const CBloomFilter& right)
{
    // returns an estimated count of the number of unique items that are shared in both sets
    // example: (1,2,3) intersection (2,3,4) = (2,3) which has an intersection count of 2
    CBloomFilter temp(left.m_randomSalt);
    for (unsigned int index = 0; index < left.c_numBytes; ++index)
        temp.m_storage[index] = left.m_storage[index] & right.m_storage[index];
     
    return temp.CountEstimate();
}
 
float IdealBitCount (unsigned int numItemsInserted, float desiredFalsePositiveProbability)
{
    // given how many items you plan to insert, and a target false positive probability at that count, this returns how many bits
    // of flags you should use.
    return (float)-(((float)numItemsInserted*log(desiredFalsePositiveProbability)) / (log(2)*log(2)));
}
 
float IdealHashCount (unsigned int numBits, unsigned int numItemsInserted)
{
    // given how many bits you are using for storage, and how many items you plan to insert, this is the optimal number of hashes to use
    return ((float)numBits / (float)numItemsInserted) * (float)log(2.0f);
}
 
// random numbers from random.org
// https://www.random.org/cgi-bin/randbyte?nbytes=40&format=h
// in 64 bit mode, size_t is 64 bits, not 32.  The random numbers below will be all zero in the upper 32 bits!
static const std::array s_randomSalt =
{
    0x6ff3f8ef,
    0x9b565007,
    0xea709ce4,
    0xf7d5cbc7, 
    0xcb7e38e1,
    0xd54b5323,
    0xbf679080,
    0x7fb78dee,
    0x540c9e8a,
    0x89369800
};
 
// data for adding and testing in our list
static const char *s_dataList1[] =
{
    "hello!",
    "blah!",
    "moo",
    nullptr
};
 
static const char *s_dataList2[] =
{
    "moo",
    "hello!",
    "mooz",
    "kitty",
    "here is a longer string just cause",
    nullptr
};
 
static const char *s_askList[] =
{
    "moo",
    "hello!",
    "unf",
    "boom",
    "kitty",
    "mooz",
    "blah!",
    nullptr
};
 
// driver program
void WaitForEnter ()
{
    printf("nPress Enter to quit");
    fflush(stdin);
    getchar();
}
 
void main(void)
{
    CBloomFilter set1(s_randomSalt);
    CBloomFilter set2(s_randomSalt);
	std::set actualSet1;
	std::set actualSet2;
 
    printf("Creating 2 bloom filter sets with %u bytes of flags (%u bits), and %u hashes.nn", set1.c_numBytes, set1.c_numBits, set1.c_numHashes);
 
	// create data set 1
    unsigned int index = 0;
    while (s_dataList1[index] != nullptr)
    {
        printf("Adding to set 1: "%s"n", s_dataList1[index]);
        set1.AddItem(s_dataList1[index]);
		actualSet1.insert(s_dataList1[index]);
        index++;
    }
 
	// create data set 2
    printf("n");
    index = 0;
    while (s_dataList2[index] != nullptr)
    {
        printf("Adding to set 2: "%s"n", s_dataList2[index]);
        set2.AddItem(s_dataList2[index]);
		actualSet2.insert(s_dataList2[index]);
        index++;
    }
 
	// query each set to see if they think that they contain various items
    printf("n");
    index = 0;
    while (s_askList[index] != nullptr)
    {
        printf("Exists: "%s"? %s & %s (actually %s & %s)n",
            s_askList[index],
            set1.HasItem(s_askList[index]) == EHasItem::e_maybe ? "maybe" : "no",
            set2.HasItem(s_askList[index]) == EHasItem::e_maybe ? "maybe" : "no",
			actualSet1.find(s_askList[index]) != actualSet1.end() ? "yes" : "no",
			actualSet2.find(s_askList[index]) != actualSet2.end() ? "yes" : "no");
        index++;
    }

	// show false positive rates
    printf ("nFalse postive probability = %0.2f%% & %0.2f%%n", set1.FalsePositiveProbability()*100.0f, set2.FalsePositiveProbability()*100.0f);
    printf ("False postive probability at 10 items = %0.2f%%n", set1.FalsePositiveProbability(10)*100.0f);
    printf ("False postive probability at 25 items = %0.2f%%n", set1.FalsePositiveProbability(25)*100.0f);
    printf ("False postive probability at 50 items = %0.2f%%n", set1.FalsePositiveProbability(50)*100.0f);
    printf ("False postive probability at 100 items = %0.2f%%n", set1.FalsePositiveProbability(100)*100.0f);

	// show ideal bit counts and hashes.
    const unsigned int itemsInserted = 10;
    const float desiredFalsePositiveProbability = 0.05f;
    const float idealBitCount = IdealBitCount(itemsInserted, desiredFalsePositiveProbability);
    const float idealHashCount = IdealHashCount((unsigned int)idealBitCount, itemsInserted);
    printf("nFor %u items inserted and a desired false probability of %0.2f%%nYou should use %0.2f bits of storage and %0.2f hashesn",
        itemsInserted, desiredFalsePositiveProbability*100.0f, idealBitCount, idealHashCount);

	// get the actual union
	std::set actualUnion;
	std::for_each(actualSet1.begin(), actualSet1.end(), [&actualUnion] (const std::string& s) {
		actualUnion.insert(s);
	});
	std::for_each(actualSet2.begin(), actualSet2.end(), [&actualUnion] (const std::string& s) {
		actualUnion.insert(s);
	});

	// get the actual intersection
	std::set actualIntersection;
	std::for_each(actualSet1.begin(), actualSet1.end(), [&actualIntersection,&actualSet2] (const std::string& s) {
		if (actualSet2.find(s) != actualSet2.end())
			actualIntersection.insert(s);
	});

	// caclulate actual jaccard index
	float actualJaccardIndex = (float)actualIntersection.size() / (float)actualUnion.size();

	// show the estimated and actual counts, and error of estimations
	printf("nSet1: %0.2f estimated, %u actual.  Error: %0.2f%%n",
		set1.CountEstimate(),
		actualSet1.size(),
		100.0f * ((float)set1.CountEstimate() - (float)actualSet1.size()) / (float)actualSet1.size()
	);
	printf("Set2: %0.2f estimated, %u actual.  Error: %0.2f%%n",
		set2.CountEstimate(),
		actualSet2.size(),
		100.0f * ((float)set2.CountEstimate() - (float)actualSet2.size()) / (float)actualSet2.size()
	);

	float estimatedUnion = UnionCountEstimate(set1, set2);
	float estimatedIntersection = IntersectionCountEstimate(set1, set2);
	float estimatedJaccardIndex = estimatedIntersection / estimatedUnion;
	printf("Union: %0.2f estimated, %u actual.  Error: %0.2f%%n",
		estimatedUnion,
		actualUnion.size(),
		100.0f * (estimatedUnion - (float)actualUnion.size()) / (float)actualUnion.size()
	);
	printf("Intersection: %0.2f estimated, %u actual.  Error: %0.2f%%n",
		estimatedIntersection,
		actualIntersection.size(),
		100.0f * (estimatedIntersection - (float)actualIntersection.size()) / (float)actualIntersection.size()
	);
	printf("Jaccard Index: %0.2f estimated, %0.2f actual.  Error: %0.2f%%n",
		estimatedJaccardIndex,
		actualJaccardIndex,
		100.0f * (estimatedJaccardIndex - actualJaccardIndex) / actualJaccardIndex
	);
 
    WaitForEnter();
}

And here is the output of that program:

In the output, you can see that all the existence checks were correct. All the no’s were actually no’s like they should be, but also, all the maybe’s were actually present, so there were no false positives.

The estimated counts were a little off but were fairly close. The first list was estimated at 2.4 items, when it actually had 3. The second list was estimated at 4.44 items when it actually had 5 items.

It’s reporting a very low false positive rate, which falls in line with the fact that we didn’t see any false positives. The projected false positive rates at 10, 25, 50 and 100 items show us that the set doesn’t have a whole lot more capacity if we want to keep the error rate low.

The union, intersection and jaccard index error rate was pretty low, but the error rate was definitely larger than the false positive rate.

Interestingly, if you look at the part which reports the ideal bit and hash count for 10 items, it says that we should actually use FEWER hashes than we do and a couple less bits. You can actually experiment by changing the number of hashes to 4 and seeing that the error rate goes down. In the example code we are actually using TOO MANY hashes, and it’s hurting our probability rate, for the number of items we plan on inserting.

Interesting Idea

I was chatting with a friend Dave, who said he was using a bloom filter like structure to try and help make sure he didn’t try the same genetic algorithm permutation more than once. An issue with that is that hash collisions could thwart the ability for evolution to happen correctly by incorrectly disallowing a wanted permutation from happening just because it happened to hash to the same value as another permutation already tried. To help this situation, he just biased against permutations found in the data structure, instead of completely blocking them out. Basically, if the permutation was marked as “maybe seen” in the data structure, he’d give it some % chance that it would allow it to happen “again” anyways.

Unfortunately, the idea in general turned out to be impractical. He had about 40 bits of genetic information which is about 1 trillion unique items (2^40).

for being able to store only 1 billion items – which is 1000 times smaller – with 5% false positive rate, that would require about 750MB of storage of bits.

Dropping the requirement to being 25% false positive rate, it still requires 350MB, and to 75% requires 70MB. Even at 99% false positive rate allowed, it requires 2.5MB, and we are still 1000 times too small.

So, for 1 trillion permutations, the size of the bloom filter is unfortunately far too large and he ended up going a different route.

The technique of rolling a random number when you get a maybe is pretty neat though, so wanted to mention it (:

Next Up

We’ve now talked about a probabilistic unique value counter (KMV) that can count the number of unique objects seen.

We then talked about a probabilistic set structure (Bloom Filter) that can keep track of set membership of objects.

How about being able to probabilistically store a list of non unique items?

One way to do this would be to have a count with each item seen instead of only having a bit of whether it’s present or not. When you have a count per item, this is known as a multiset, or the more familiar term: histogram.

If you change a bloom filter to have a count instead of a single bit, that’s called a counting bloom filter and completely works.

There’s a better technique for doing this though called count min sketch. Look for a post soon!

Links

Wikipedia: Bloom Filter
Interactive Bloom Filter Demo

Check this out, it’s a physical implementation of a bloom filter (WAT?!)
Wikipedia: Superimposed code

Getting Strongly Typed Typedefs Using Phantom Types

In C++, typedefs have a lot of really handy uses.

They can make code easier to read:


    // ugly code:
    std::map<std::string, size_t> wordCounts;
    for (std::map<std::string, size_t>::iterator it = wordCounts.begin(); it != wordCounts.end(); ++it)
    {
        ...
    }

    // typedefs
    typedef std::map<std::string, size_t> TWordCounts; 
    typedef TWordCounts::iterator TWordCountInterator;

    // prettier, simplified code:
    TWordCounts wordCounts;
    for (TWordCountInterator it = wordCounts.begin(); it != wordCounts.end(); ++it)
    {
        ...
    }

And, if you find yourself wanting to change the std::map above to a std::unordered_map, typedefs can make your code easier to maintain since you only have one place to change instead of several (yes, decltype can help too!).

Typedefs can bite you sometimes though if you aren’t careful. Look at the below, where the code’s diligent author made friendly typedefs for model and animation ids:

    typedef unsigned int TModelID;
    typedef unsigned int TAnimationID;

    TModelID id = GetModelID();
    SetAnimationID(id); // BUG!  SetAnimation() wants a TAnimationID!

Wouldn’t it be great if you could make it so even though TModelID and TAnimationID are the same underlying type, the compiler could give you a compile error when you tried to use the wrong one?

One way to solve this problem is to use C++ 11’s feature enum classes, but that isn’t the solve I want to show you today (that technique is pretty straightforward anyways…). The solve I want to show you uses a spooky technique called phantom types.

Phantom Types

Here is how you might see a phantom type in the wild:

template <typename PHANTOM_TYPE>
struct SUInt
{
public:
    SUInt (unsigned int value) : m_value(value) { }
    inline unsigned int& Value () { return &m_value; }
private:
    unsigned int m_value;
};

You might be asking yourself “why??” or “what??” or perhaps even “I wonder what the wife will be making for dinner tonight?”.

The neat thing about that is that you can create dummy types to use as the template parameter, and then if you try to mix and match objects that use a different template parameter you’ll get a compile error instead of a subtle run time bug. Fail hard and fail early!

Here’s how to use phantom types to solve our issue:

// our phantom-type-using struct
template <typename PHANTOM_TYPE>
struct SUInt
{
public:
    SUInt (unsigned int value) : m_value(value) { }
    inline unsigned int& Value () { return &m_value; }
private:
    unsigned int m_value;
};

// some "tag types" to use as phantom types
struct SModelID {};
struct SAnimationID {};

// our strongly typed typedefs, which both are really just unsigned ints
typedef SUInt<SModelID> TModelID;
typedef SUInt<SAnimationID> TAnimationID;

TModelID GetModelID ()
{
    TModelID ret(3);
    return ret;
}

void SetAnimationID (TAnimationID animId)
{
    // do something with the animation id
}

int main(int argc, char **argv)
{
    TModelID id = GetModelID();
    SetAnimationID(id);
    return 0;
}

If you try to compile that you get a very lovely compile error.

error C2664: ‘SetAnimationID’ : cannot convert parameter 1 from ‘TModelID’ to ‘TAnimationID’

And taking it a step further, you could abstract the unsigned int away to make this code, in case you want to change the underlying type of either ID in the future:

// our phantom-type-using struct
template <typename T, typename PHANTOM_TYPE>
struct SStronglyTypedType
{
public:
    SStronglyTypedType (T value) : m_value(value) { }
    inline T& Value () { return &m_value; }
private:
    T m_value;
};

// some "tag types" to use as phantom types
struct SModelID {};
struct SAnimationID {};

// our strongly typed typedefs, which both are really just unsigned ints
typedef SStronglyTypedType<unsigned int, SModelID> TModelID;
typedef SStronglyTypedType<unsigned int, SAnimationID> TAnimationID;

TModelID GetModelID ()
{
    TModelID ret(3);
    return ret;
}

void SetAnimationID (TAnimationID animId)
{
    // do something with the animation id
}

int main(int argc, char **argv)
{
    TModelID id = GetModelID();
    SetAnimationID(id);
    return 0;
}

Trying to compile that code again gives the compile error:

error C2664: ‘SetAnimationID’ : cannot convert parameter 1 from ‘TModelID’ to ‘TAnimationID’

If you are going to be using complex objects with this setup, you are going to want to more efficiently handle copy construction and such (check links section for more info), but the above solves our problem. We can no longer mix up model id’s and animation id’s, even though they are both just really an unsigned int.

And best of all, except for the copy construction type issues, this code adds no overhead in memory or run time processing costs.

Other Uses

Besides the above, this strongly typed phantom type stuff has some other interesting uses.

Units – If you have some functions that take units of time in seconds, and others that take it in milliseconds, you could use this setup to make sure you never passed the wrong type to the wrong function. You could perhaps even make implicit conversion to fix it for you, if you desired that.
Validation – This was originally shown to me as a way to make sure that raw data from the user was validated before being used. Raw data from the user would use one phantom type, and validation functions would convert it to a different, validated phantom type. In this way, you could be sure that no unvalidated (or unsanitized!) data went to something that was vulnerable.
Deterministic Calculations – You could use this technique if you have a deterministic simulation that you need to run the same way every time. These phantom types could ensure no “non deterministic” data got into your equations – such as data from user input, or the clock, or based on frame rate, etc.

Links

Here is the info about copy construction (etc) that i mentioned. Every C++ programmer should know these things!
Sticky Bits: The Rule of The Big Three (and a half) – Resource Management in C++
Sticky Bits: The Rule of The Big Four (and a half) – Move Semantics and Resource Management

Estimating Counts of Distinct Values with KMV

A few months ago I saw what I’m about to show you for the first time, and my jaw dropped. I’m hoping to share that experience with you right now, as a I share a quick intro to probabilistic algorithms, starting with KMV, otherwise known as K-Minimum Values, which is a Distinct Value (DV) Sketch, or a DV estimator.

Thanks to Ben Deane for exposure to this really interesting area of computer science.

The Setup

Let’s say that you needed a program to count the number of unique somethings that you’ve encountered. The unique somethings could be unique visitors on a web page, unique logins into a server within a given time period, unique credit card numbers, or anything else that has a way for you to get some unique identifier.

One way to do this would be to keep a list of all the things you’ve encountered before, only inserting a new item if it isn’t already in the list, and then count the items in the list when you want a count of unique items. The obvious downside here is that if the items are large and/or there are a lot of them, your list can get pretty large. In fact, it can take an unbounded amount of memory to store this list, which is no good.

A better solution may be instead of storing the entire item in the list, maybe instead you make a 32 bit hash of the item and then add that hash to a list if it isn’t already in the list, knowing that you have at most 2^32 (4,294,967,296) items in the list. It’s a bounded amount of memory, which is an improvement, but it’s still pretty large. The maximum memory requirement there is 2^32 * 4 which is 17,179,869,184 bytes or 16GB.

You may even do one step better and just decide to have 2^32 bits of storage, and when you encounter a specific 32 bit hash, use that as an index and set that specific bit to 1. You could then count the number of bits set and use that as your answer. That only takes 536,870,912 bytes, or 512MB. A pretty big improvement over 16GB but still pretty big. Also, try counting the number of bits set in a 512MB region of memory. That isn’t the fastest operation around 😛

We made some progress by moving to hashes, which put a maximum size amount of memory required, but that came at the cost of us introducing some error. Hashes can have collisions, and when that occurs in our scenarios above, we have no way of knowing that two items hashing to the same value are not the same item.

Sometimes it’s ok though that we only have an approximate answer, because getting an exact answer may require resources we just don’t have – like infinite memory to hold an infinitely large list, and then infinite computing time to count how many items there are. Also notice that the error rate is tuneable if we want to spend more memory. If we used a 64 bit hash instead of a 32 bit hash, our hash collisions would decrease, and our result would be more accurate, at the cost of more memory used.

The Basic Idea

Let’s try something different, let’s hash every item we see, but only store the lowest valued hash we encounter. We can then use that smallest hash value to estimate how many unique items we saw. That’s not immediately intuitive, so here’s the how and the why…

When we put something into a hash function, what we are doing really is getting a deterministic pseudo random number to represent the item. If we put the same item in, we’ll get the same number out every time, and if we put in a different item, we should get a different number out, even if the second item is really similar to the first item. Interestingly the numbers we get out should have no bearing on the nature of the items we put into the hash. Even if the items we put in are similar, the output should be just as different (on average) as if we put in items that were completely different.

This is one of the properties of a good hash function, that it’s output is (usually) evenly distributed, no matter the nature of the input. What that means is that if we put N items into a hash function, those items are on average going to be evenly spaced in the output of the hash, regardless of how similar or different they were before going into the hash function.

Using this property, the distance from zero to the smallest hash we’ve ever seen can be treated as a representative estimate of the average distance between each item that we hashed. So, to get the number of items in the hash, we convert the smallest hash into a percentage from 0 to 1 of the total hash space (for uint32, convert to float and divide by (float)2^32), and then we can use this formula:

numItems = (1 / percentage) – 1

To understand why we subtract the 1, imagine that our minimum hash value is 0.5. If that is an accurate representation of the space between values, that means that we only have 1 value, right in the middle. But, if we didvide 1 by 0.5 we get 2. We have 2 REGIONS, but we have only 1 item in the list, so we need to subtract 1.

As another example imagine that our minimum hash is 0.3333 and that it is an accurate representation of the space between values. If we divide 1 by 0.3333, we get 3. We do have 3 regions if we cut the whole space into 3 parts, but we only have 2 items (we made 2 cuts to make 3 regions).

Reducing Error

This technique doesn’t suffer too much from hash collisions (so long as your hash function is a decent hash function), but it does have a different sort of problem.

As you might guess, sometimes the hash function might not play nice, and you could hash only a single item, get a hash of 1 and ruin the count estimation. So, there is possibility for error in this technique.

To help reduce error, you ultimately need information about more ranges so that you can combine the multiple pieces of information together to get a more accurate estimate.

Here are some ways to gather information about more ranges:

Keep the lowest and highest hash seen instead of just the lowest. This doubles the range information you have since you’d know the range 0-min and max-1.
Keep the lowest N hashes seen instead of only the lowest. This multiplies the amount of range information you have by N.
Instead of keeping the lowest value from a single hash, perform N hashes instead and keep the lowest value seen from each hash. Again, this multiplies the amount of range information you have by N.
Alternately, just salt the same hash N different ways (with high quality random numbers) instead of using N different hash functions, but still keep the lowest seen from each “value channel”. Multiplies info by N again.
Also a possibility, instead of doing N hashes or N salts, do a single hash, and xor that hash against N different high quality random numbers to come up with N deterministic pseudorandom numbers per item. Once again, still keep the lowest hash value seen from each “value channel”. Multiplies info by N again
Mix the above however you see fit!

Whatever route you go, the ultimate goal is to just get information about more ranges to be able to combine them together.

In vanilla KMV, the second option is used, which is to keep the N lowest hashes seen, instead of just a single lowest hash seen. Thus the full name for KMV: K-Minimum Values.

Combining Info From Multiple Ranges

When you have the info from multiple ranges and need to combine that info, it turns out that using the harmonic mean is the way to go because it’s great at filtering out large values in data that don’t fit with the rest of the data.

Since we are using division to turn the hash value into an estimate (1/percentValue-1), unusually small hashes will result in exponentially larger values, while unusually large hashes will not affect the math as much, but also likely will be thrown out before we ever see them since they will likely not be the minimum hash that we’ve seen.

I don’t have supporting info handy, but from what I’ve been told, the harmonic mean is provably better than both the geometric mean and the regular plain old vanilla average (arithmetic mean) in this situation.

So, to combine information from the multiple ranges you’ve gathered, you turn each range into a distinct value estimate (by calculating 1/percentValue-1) and then putting all those values through the mean equation of your choice (which ought to be harmonic mean, but doesn’t strictly have to be!). The result will be your final answer.

Set Operations

Even though KMV is just a distinct value estimator that estimates a count, there are some interesting probabilistic set operations that you can do with it as well. I’ll be talking about using the k min value technique for gathering information from multiple ranges, but if you use some logic you should be able to figure out how to make it work when you use any of the other techniques.

Jaccard Index

Talking about set operations, I want to start with a concept called the Jaccard index (sometimes called the Jaccard similarity coefficient).

If you have 2 sets, the Jaccard index is calculated by:

Jaccard Index = count(intersection(A,B)) / count(union(A,B))

Since the union of A and B is the combined list of all items in those sets, and the intersection of A and B is the items that they have in common, you can see that if the sets have all items in common, the Jaccard index will be 1 and if the sets have no items in common, the Jaccard index will be 0. If you have some items in common it will be somewhere between 0 and 1. So, the Jaccard Index is just a measurement of how similar two sets are.

Union

If you have the information for two KMV objects, you can get an estimate to the number of unique items there if you were to union them together, even though you don’t have much of the info about the items that are in the set.

To do a union, you just combine the minimum value lists, and remove the K largest ones, so that you are left with the K minimum values from both sets.

You then do business as usual to estimate how many items are in that resulting set.

If you think about it, this makes sense, because if you tracked the items from both lists in a third KMV object, you’d end up with the same K minimums as if you took the K smallest values from both sets individually.

Note that if the two KMV objects are of different size, due to K being different sizes, or because either one isn’t completely filled with K minimum values, you should use the smaller value of K as your union set K size.

Intersection

Finding an estimated count of intersections between two sets is a little bit different.

Basically, having the K minimum hashes seen from both lists, you can think of that as a kind of random sample of a range in both lists. You can then calculate the Jaccard index for that range you have for the two sets (by dividing the size of the intersection by the size of the union), and then use that Jaccard index to estimate an intersection count for the entire set based on the union count estimate.

You can do some algebra to the Jaccard index formula above to get this:

count(intersection(A,B)) = Jaccard Index * count(union(A,B))

Just like with union, if the two KMV objects are of different size, due to K being different sizes, or because either one isn’t completely filled with K minimum values, you should use the smaller value of K.

Sample Code

#include 
#include 
#include 
#include 
#include 
#include 

// The CKMVCounter class
template <typename T, unsigned int NUMHASHES, typename HASHER = std::hash>
class CKMVUniqueCounter
{
public:
	// constants
	static const unsigned int c_numHashes = NUMHASHES;
	static const size_t c_invalidHash = (size_t)-1;

	// constructor
	CKMVUniqueCounter ()
	{
		// fill our minimum hash values with the maximum possible value
		m_minHashes.fill(c_invalidHash);
		m_largestHashIndex = 0;
	}

	// interface
	void AddItem (const T& item)
	{
		// if the new hash isn't smaller than our current largest hash, do nothing
		const size_t newHash = HASHER()(item);
		if (m_minHashes[m_largestHashIndex] <= newHash)
			return;

		// if the new hash is already in the list, don't add it again
		for (unsigned int index = 0; index < c_numHashes; ++index)
		{
			if (m_minHashes[index] == newHash)
				return;
		}

		// otherwise, replace the largest hash
		m_minHashes[m_largestHashIndex] = newHash;

		// and find the new largest hash
		m_largestHashIndex = 0;
		for (unsigned int index = 1; index  m_minHashes[m_largestHashIndex])
				m_largestHashIndex = index;
		}
	}

	// probabilistic interface
	void UniqueCountEstimates (float &arithmeticMeanCount, float &geometricMeanCount, float &harmonicMeanCount)
	{
		// calculate the means of the count estimates.  Note that if there we didn't get enough items
		// to fill our m_minHashes array, we are just ignoring the unfilled entries.  In production
		// code, you would probably just want to return the number of items that were filled since that
		// is likely to be a much better estimate.
		// Also, we need to sort the hashes before calculating uniques so that we can get the ranges by
		// using [i]-[i-1] instead of having to search for the next largest item to subtract out
		SortHashes();
		arithmeticMeanCount = 0.0f;
		geometricMeanCount = 1.0f;
		harmonicMeanCount = 0.0f;
		int numHashes = 0;
		for (unsigned int index = 0; index < c_numHashes; ++index)
		{
			if (m_minHashes[index] == c_invalidHash)
				continue;
			numHashes++;
			float countEstimate = CountEstimate(index);
			arithmeticMeanCount += countEstimate;
			geometricMeanCount *= countEstimate;
			harmonicMeanCount += 1.0f / countEstimate;
		}
		arithmeticMeanCount = arithmeticMeanCount / (float)numHashes;
		geometricMeanCount = pow(geometricMeanCount, 1.0f / (float)numHashes);
		harmonicMeanCount /= (float)numHashes;
		harmonicMeanCount = 1.0f / harmonicMeanCount;
	}

	// friends
	template 
	friend CKMVUniqueCounter KMVUnion (
		const CKMVUniqueCounter& a,
		const CKMVUniqueCounter& b
	);

	template 
	friend float KMVJaccardIndex (
		const CKMVUniqueCounter& a,
		const CKMVUniqueCounter& b
	);

private:

	unsigned int NumHashesSet () const
	{
		unsigned int ret = 0;
		for (unsigned int index = 0; index  0 ? m_minHashes[hashIndex- 1] : 0;
		const float percent = (float)(currentHash-lastHash) / (float)((size_t)-1);
		return (1.0f / percent) - 1.0f;
	}
	
	// the minimum hash values
	std::array m_minHashes;
	size_t m_largestHashIndex;
};

// Set interface
template 
CKMVUniqueCounter KMVUnion (
	const CKMVUniqueCounter& a,
	const CKMVUniqueCounter& b
)
{
	// gather the K smallest hashes seen, where K is the smaller, removing duplicates
	std::set setMinHashes;
	std::for_each(a.m_minHashes.begin(), a.m_minHashes.end(), [&setMinHashes](size_t v) {setMinHashes.insert(v); });
	std::for_each(b.m_minHashes.begin(), b.m_minHashes.end(), [&setMinHashes](size_t v) {setMinHashes.insert(v); });
	std::vector minHashes(setMinHashes.begin(), setMinHashes.end());
	std::sort(minHashes.begin(),minHashes.end());
	minHashes.resize(std::min(a.NumHashesSet(), b.NumHashesSet()));

	// create and return the new KMV union object
	CKMVUniqueCounter ret;
	for (unsigned int index = 0; index < minHashes.size(); ++index)
		ret.m_minHashes[index] = minHashes[index];
	ret.m_largestHashIndex = ret.c_numHashes - 1;
	return ret;
}

template 
float KMVJaccardIndex (
	const CKMVUniqueCounter& a,
	const CKMVUniqueCounter& b
)
{
	size_t smallerK = std::min(a.NumHashesSet(), b.NumHashesSet());

	size_t matches = 0;
	for (size_t ia = 0; ia < smallerK; ++ia)
	{
		for (size_t ib = 0; ib < smallerK; ++ib)
		{
			if (a.m_minHashes[ia] == b.m_minHashes[ib])
			{
				matches++;
				break;
			}
		}
	}

	return (float)matches / (float)smallerK;
}

// data to add to the lists
const char *s_boyNames[] =
{
	"Loki",
	"Alan",
	"Paul",
	"Stripes",
	"Shelby",
	"Ike",
	"Rafael",
	"Sonny",
	"Luciano",
	"Jason",
	"Brent",
	"Jed",
	"Lesley",
	"Randolph",
	"Isreal",
	"Charley",
	"Valentin",
	"Dewayne",
	"Trent",
	"Abdul",
	"Craig",
	"Andre",
	"Brady",
	"Markus",
	"Randolph",
	"Isreal",
	"Charley",
	"Brenton",
	"Herbert",
	"Rafael",
	"Sonny",
	"Luciano",
	"Joshua",
	"Ramiro",
	"Osvaldo",
	"Monty",
	"Mckinley",
	"Colin",
	"Hyman",
	"Scottie",
	"Tommy",
	"Modesto",
	"Reginald",
	"Lindsay",
	"Alec",
	"Marco",
	"Dee",
	"Randy",
	"Arthur",
	"Hosea",
	"Laverne",
	"Bobbie",
	"Damon",
	"Les",
	"Cleo",
	"Robt",
	"Rick",
	"Alonso",
	"Teodoro",
	"Rodolfo",
	"Ryann",
	"Miki",
	"Astrid",
	"Monty",
	"Mckinley",
	"Colin",
	nullptr
};

const char *s_girlNames[] =
{
	"Chanel",
	"Colleen",
	"Scorch",
	"Grub",
	"Anh",
	"Kenya",
	"Georgeann",
	"Anne",
	"Inge",
	"Georgeann",
	"Anne",
	"Inge",
	"Analisa",
	"Ligia",
	"Chasidy",
	"Marylee",
	"Lashandra",
	"Frida",
	"Katie",
	"Alene",
	"Brunilda",
	"Zoe",
	"Shavon",
	"Anjanette",
	"Daine",
	"Sheron",
	"Hilary",
	"Felicitas",
	"Cristin",
	"Ressie",
	"Tynisha",
	"Annie",
	"Sharilyn",
	"Astrid",
	"Charise",
	"Gregoria",
	"Angelic",
	"Lesley",
	"Mckinley",
	"Lindsay",
	"Shanelle",
	"Karyl",
	"Trudi",
	"Shaniqua",
	"Trinh",
	"Ardell",
	"Doreen",
	"Leanna",
	"Chrystal",
	"Treasa",
	"Dorris",
	"Rosalind",
	"Lenore",
	"Mari",
	"Kasie",
	"Ann",
	"Ryann",
	"Miki",
	"Lasonya",
	"Olimpia",
	"Shelby",
	"Lesley",
	"Mckinley",
	"Lindsay",
	"Dee",
	"Bobbie",
	"Cleo",
	"Leanna",
	"Chrystal",
	"Treasa",
	nullptr
};

// driver program
void WaitForEnter ()
{
	printf("nPress Enter to quit");
	fflush(stdin);
	getchar();
}

void main(void)
{
	// how many min values all these KMV objects keep around
	static const int c_numMinValues = 15;

	printf("Using %u minimum valuesnn", c_numMinValues);

	// =====================  Boy Names  =====================
	// put our data into the KVM counter
	CKMVUniqueCounter boyCounter;
	unsigned int index = 0;
	while (s_boyNames[index] != nullptr)
	{
		boyCounter.AddItem(s_boyNames[index]);
		index++;
	}

	// get our count estimates
	float arithmeticMeanCount, geometricMeanCount, harmonicMeanCount;
	boyCounter.UniqueCountEstimates(arithmeticMeanCount, geometricMeanCount, harmonicMeanCount);

	// get our actual unique count
	std::set actualBoyUniques;
	index = 0;
	while (s_boyNames[index] != nullptr)
	{
		actualBoyUniques.insert(s_boyNames[index]);
		index++;
	}

	// print the results!
	printf("Boy Names:n%u actual uniquesn", actualBoyUniques.size());
	float actualCount = (float)actualBoyUniques.size();
	printf("Estimated counts and percent error:n  Arithmetic Mean: %0.2ft%0.2f%%n"
		"  Geometric Mean : %0.2ft%0.2f%%n  Harmonic Mean  : %0.2ft%0.2f%%n",
		arithmeticMeanCount, 100.0f * (arithmeticMeanCount - actualCount) / actualCount,
		geometricMeanCount, 100.0f * (geometricMeanCount - actualCount) / actualCount,
		harmonicMeanCount, 100.0f * (harmonicMeanCount - actualCount) / actualCount);

	// =====================  Girl Names  =====================
	// put our data into the KVM counter
	CKMVUniqueCounter girlCounter;
	index = 0;
	while (s_girlNames[index] != nullptr)
	{
		girlCounter.AddItem(s_girlNames[index]);
		index++;
	}

	// get our count estimates
	girlCounter.UniqueCountEstimates(arithmeticMeanCount, geometricMeanCount, harmonicMeanCount);

	// get our actual unique count
	std::set actualGirlUniques;
	index = 0;
	while (s_girlNames[index] != nullptr)
	{
		actualGirlUniques.insert(s_girlNames[index]);
		index++;
	}

	// print the results!
	printf("nGirl Names:n%u actual uniquesn", actualGirlUniques.size());
	actualCount = (float)actualGirlUniques.size();
	printf("Estimated counts and percent error:n  Arithmetic Mean: %0.2ft%0.2f%%n"
		"  Geometric Mean : %0.2ft%0.2f%%n  Harmonic Mean  : %0.2ft%0.2f%%n",
		arithmeticMeanCount, 100.0f * (arithmeticMeanCount - actualCount) / actualCount,
		geometricMeanCount, 100.0f * (geometricMeanCount - actualCount) / actualCount,
		harmonicMeanCount, 100.0f * (harmonicMeanCount - actualCount) / actualCount);

	// =====================  Set Operations  =====================

	// make the KMV union and get our count estimates
	CKMVUniqueCounter boyGirlUnion = KMVUnion(boyCounter, girlCounter);
	boyGirlUnion.UniqueCountEstimates(arithmeticMeanCount, geometricMeanCount, harmonicMeanCount);

	// make the actual union
	std::set actualBoyGirlUnion;
	std::for_each(actualBoyUniques.begin(), actualBoyUniques.end(),
		[&actualBoyGirlUnion](const std::string& s)
		{
			actualBoyGirlUnion.insert(s);
		}
	);
	std::for_each(actualGirlUniques.begin(), actualGirlUniques.end(),
		[&actualBoyGirlUnion](const std::string& s)
		{
			actualBoyGirlUnion.insert(s);
		}
	);

	// print the results!
	printf("nUnion:n%u actual uniques in unionn", actualBoyGirlUnion.size());
	actualCount = (float)actualBoyGirlUnion.size();
	printf("Estimated counts and percent error:n  Arithmetic Mean: %0.2ft%0.2f%%n"
		"  Geometric Mean : %0.2ft%0.2f%%n  Harmonic Mean  : %0.2ft%0.2f%%n",
		arithmeticMeanCount, 100.0f * (arithmeticMeanCount - actualCount) / actualCount,
		geometricMeanCount, 100.0f * (geometricMeanCount - actualCount) / actualCount,
		harmonicMeanCount, 100.0f * (harmonicMeanCount - actualCount) / actualCount);

	// calculate estimated jaccard index
	float estimatedJaccardIndex = KMVJaccardIndex(boyCounter, girlCounter);

	// calculate actual jaccard index and actual intersection
	size_t actualIntersection = 0;
	std::for_each(actualBoyUniques.begin(), actualBoyUniques.end(),
		[&actualGirlUniques, &actualIntersection] (const std::string &s)
		{
			if (actualGirlUniques.find(s) != actualGirlUniques.end())
				actualIntersection++;
		}
	);
	float actualJaccardIndex = (float)actualIntersection / (float)actualBoyGirlUnion.size();

	// calculate estimated intersection
	float estimatedIntersection = estimatedJaccardIndex * (float)actualBoyGirlUnion.size();

	// print the intersection and jaccard index information
	printf("nIntersection:n%0.2f estimated, %u actual.  Error: %0.2f%%n",
		estimatedIntersection,
		actualIntersection,
		100.0f * (estimatedIntersection - (float)actualIntersection) / (float)actualIntersection);

	printf("nJaccard Index:n%0.2f estimated, %0.2f actual.  Error: %0.2f%%n",
		estimatedJaccardIndex,
		actualJaccardIndex,
		100.0f * (estimatedJaccardIndex-actualJaccardIndex) / actualJaccardIndex
	);

	WaitForEnter();
}

Here’s the output of the program:

Upgrading from Set to Multi Set

Interestingly, if you keep a count of how many times you’ve seen each hash in the k min hash, you can upgrade this algorithm from being a set algorithm to a multi set algorithm and get some other interesting information from it.

Where a set is basically a list of unique items, a multi set is a set of unique items that also have a count associated with them. In this way, you can think of a multiset as just a list of items which may appear more than once.

Upgrading KMV to a multi set algorithm lets you do some new and interesting things where instead of getting information only about unique counts, you can get information about non unique counts too. But to re-iterate, you still keep the ability to get unique information as well, so it is kind of like an upgrade, if you are interested in multiset information.

Links

Want more info about this technique?

Sketch of the Day: K-Minimum Values

Sketch of the Day: Interactive KMV web demo

K-Minimum Values: Sketching Error, Hash Functions, and You

Wikipedia: MinHash – Jaccard similarity and minimum hash values

All Done

I wanted to mention that even though this algorithm is a great, intuitive introduction into probabilistic algorithms, there are actually much better distinct value estimators in use today, such as one called HyperLogLog which seems to be the current winner. Look for a post on HyperLogLog soon 😛

KMV is better than other algorithms at a few things though. Specifically, from what I’ve read, that it can extend to multiset makes it very useful, and also it is much easier to calculate intersections versus other algorithms.

I also wanted to mention that there are interesting usage cases for this type of algorithms in game development, but where these probabilistic algorithms really shine is in massive data situations like google, amazon, netflix etc. If you go out searching for more info on this stuff, you’ll probably be led down many big data / web dev rabbit holes, because that’s where the best information about this stuff resides.

Lastly, I wanted to mention that I’m using the C++ std::hash built in hash function. I haven’t done a lot of research to see how it compares to other hashes, but I’m sure, much like rand(), the built in functionality leaves much to be desired for “power user” situations. In other words, if you are going to use this in a realistic situation, you probably are better off using a better hashing algorithm. If you need a fast one, you might look into the latest murmurhash variant!

More probabilistic algorithm posts coming soon, so keep an eye out!