This series of posts is aimed at examining if and how ad hoc code crafted for a specific static (unchanging / constant) data set can run faster than typical generic run time data structures. I think the answer is an obvious “yes, we can often do better”, but these posts explore the details of the problem space and look at how and when we might do better.
The last post explored switch statement performance compared to array access performance.
A switch statement is just a way of telling the compiler how we want to map integral inputs to either some code to run, or some value to return. It’s up to the compiler how to make that happen.
Because compiler makers presumably want their compilers to generate fast code, it seems like a switch statement should be able to match the speed of an array, since a switch statement could very well be implemented as an array when all it is doing is returning different values based on input. Maybe it could even beat an array in the case of a sparse array, an array with many duplicates, or other situations.
In practice, this doesn’t seem to be the case though, and switch statements are actually quite a bit slower than arrays from the experimentation I’ve done. The main part of the overhead seems to be that it always does a jump (goto) based on the input you are switching on. It can do some intelligent things to find the right location to jump to, but if all you are doing is returning a value, it doesn’t seem smart enough to do a memory read from an array and return that value, instead of doing a jump.
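To make that concrete, here is a sketch of a value-returning switch next to the array it is morally equivalent to. The values and names here are made up for illustration; the point is that the compiler could in principle turn the first into the second, but in practice still emits a jump:

```cpp
#include <cassert>

// A value-returning switch. The compiler could in principle compile this
// down to a single table load, but in practice it still does a jump.
int ValueFromSwitch(int input)
{
    switch (input)
    {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        case 3: return 40;
        default: return 0;
    }
}

// The array the switch above is morally equivalent to: one memory read,
// no jump.
static const int g_valueTable[4] = { 10, 20, 30, 40 };

int ValueFromArray(int input)
{
    return g_valueTable[input];
}
```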
You can read a nice analysis on how switch statements are compiled on the microsoft compiler here: Something You May Not Know About the Switch Statement in C/C++.
Today we are going to be analyzing how hash tables fare against switch statements, arrays, and a few other things.
I ran these tests in x86/x64 debug/release in visual studio 2015.
I got a list of 100 random words from http://www.randomwordgenerator.com/ and made sure they were all lowercase. I associated an integer value with them, from 1 to 100. My tests are all based on the string being the key and the integer being the value.
I have that data stored/represented in several ways for performing lookups:
- std::unordered_map using crc32 hash function.
- std::unordered_map using crc32 hash function modulo 337 and salted with 1147287 to prevent collisions.
- SwitchValue() switches on crc32 of input string.
- SwitchValueValidate() switches on crc32 of input string but does a single strcmp to handle possibly invalid input.
- SwitchValueMinimized() switches on crc32 of input string modulo 337 and salted with 1147287 to prevent collisions.
- SwitchValueMinimizedValidate() like SwitchValueMinimized() but does a single strcmp to handle possibly invalid input.
- g_SwitchValueMinimizedArray, the array version of SwitchValueMinimized().
- g_SwitchValueMinimizedArrayValidate, the array version of SwitchValueMinimizedValidate().
- BruteForceByStartingLetter() switches on first letter, then brute force strcmp’s words beginning with that letter.
- BruteForce() brute force strcmp’s all words.
The non validating switch statement functions have an __assume(0) in their default case to remove the overhead of testing for invalid values. This is to make them as fast as possible for the cases when you will only be passing valid values in. If ever that contract was broken, you’d hit undefined behavior, so the performance boost comes at a cost. The Validate versions of the switch functions don’t do this, as they are meant to take possibly invalid input in, and handle it gracefully. Both validating and not validating input are common use cases so I wanted to represent both in the performance analysis.
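Here is a rough sketch of that non-validating pattern. The case values are made up, and `__builtin_unreachable()` stands in for `__assume(0)` on non-MSVC compilers:

```cpp
#include <cassert>
#include <cstdint>

// Telling the compiler the default case can never happen lets it skip the
// range check. Passing an unlisted value is undefined behavior.
#ifdef _MSC_VER
    #define UNREACHABLE() __assume(0)
#else
    #define UNREACHABLE() __builtin_unreachable()
#endif

int SwitchValueNoValidate(std::uint32_t hash)
{
    switch (hash)
    {
        case 17u: return 1;
        case 42u: return 2;
        case 97u: return 3;
        default: UNREACHABLE(); // contract: callers only pass valid hashes
    }
}
```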
Here are the tests done:
- In Order – looks up all strings in order and sums the associated values.
- Shuffled – looks up all strings in random order and sums the associated values.
- Pre-Hashed Keys In Order – looks up all strings in order and sums the associated values, using pre-hashed keys.
- Pre-Hashed Keys Shuffled – looks up all strings in random order and sums the associated values, using pre-hashed keys.
The last two tests only apply to the value lookups which can take pre-hashed keys. For instance, g_SwitchValueMinimizedArray can be indexed by a key that was hashed before the program ran, but a std::unordered_map cannot be indexed by a hash value that was calculated in advance.
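Here is a minimal sketch of the pre-hashed lookup idea, using a simple djb2-style constexpr hash as a stand-in for the crc32 the tests actually use (the hash function and table contents are assumptions, not the post's data):

```cpp
#include <cassert>
#include <cstdint>

// djb2-style constexpr hash, standing in for crc32, so keys can be hashed
// at compile time.
constexpr std::uint32_t HashString(const char* s, std::uint32_t h = 5381u)
{
    return *s ? HashString(s + 1, h * 33u + (std::uint32_t)(unsigned char)*s) : h;
}

constexpr std::uint32_t kTableSize = 337; // same table size the post uses

int g_values[kTableSize] = {}; // would be filled from the static data set

// Because the key was hashed before run time, a lookup is a single read.
int LookupPreHashed(std::uint32_t preHashedKey)
{
    return g_values[preHashedKey];
}
```

A caller would compute `HashString("someword") % kTableSize` at compile time (or offline) and pass that result in, so no hashing happens at lookup time.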
Each of those tests was done 5,000 times in a row to make performance differences stand out more, and that full amount of time is the time reported. That process was done 50 times to give both an average (a mean) and a standard deviation to show how much the time samples differed.
The source code for the tests can be found here:
Here are the results, in milliseconds. The values in parentheses are the standard deviations, which are also in milliseconds.
In Order
Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.
| | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
|---|---|---|---|---|
| std::map | 7036.77 (126.41) | 7070.18 (155.49) | 33.02 (2.68) | 35.40 (1.43) |
| std::unordered_map | 4235.31 (24.41) | 4261.36 (145.16) | 19.97 (0.45) | 20.12 (0.62) |
| std::unordered_map crc32 | 4236.38 (80.72) | 4275.36 (116.65) | 24.36 (0.47) | 23.47 (0.86) |
| std::unordered_map crc32 minimized | 4034.50 (12.72) | 4323.67 (170.55) | 26.39 (0.50) | 23.68 (0.71) |
| SwitchValue() | 123.28 (0.98) | 144.29 (4.91) | 6.81 (0.30) | 5.47 (0.29) |
| SwitchValueValidate() | 127.59 (1.22) | 147.41 (5.20) | 8.84 (0.35) | 7.99 (0.36) |
| SwitchValueMinimized() | 128.83 (0.95) | 151.48 (4.66) | 8.28 (0.38) | 10.18 (0.37) |
| SwitchValueMinimizedValidate() | 132.44 (1.02) | 159.85 (6.73) | 12.65 (0.40) | 10.89 (0.36) |
| g_SwitchValueMinimizedArray | 104.15 (1.13) | 122.94 (5.98) | 7.68 (0.36) | 6.08 (0.36) |
| g_SwitchValueMinimizedArrayValidate | 107.75 (1.07) | 120.75 (2.80) | 10.49 (0.37) | 8.95 (0.32) |
| BruteForceByStartingLetter() | 19.92 (0.63) | 22.01 (0.86) | 4.85 (0.24) | 5.81 (0.26) |
| BruteForce() | 118.65 (1.09) | 140.20 (2.28) | 31.53 (0.56) | 46.47 (0.83) |
Shuffled
Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.
| | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
|---|---|---|---|---|
| std::map | 7082.92 (214.13) | 6999.90 (193.82) | 32.14 (0.59) | 34.20 (0.62) |
| std::unordered_map | 4155.85 (133.00) | 4221.84 (124.70) | 20.21 (0.42) | 20.09 (0.47) |
| std::unordered_map crc32 | 4286.44 (95.39) | 4300.81 (64.37) | 24.55 (0.57) | 23.06 (0.57) |
| std::unordered_map crc32 minimized | 4186.27 (75.35) | 4111.73 (43.36) | 26.36 (0.56) | 23.65 (0.54) |
| SwitchValue() | 127.93 (3.85) | 137.63 (1.31) | 6.97 (0.32) | 5.47 (0.27) |
| SwitchValueValidate() | 131.46 (2.34) | 141.38 (1.47) | 8.92 (0.38) | 7.86 (0.37) |
| SwitchValueMinimized() | 133.03 (2.93) | 145.74 (1.50) | 9.16 (0.37) | 10.50 (0.41) |
| SwitchValueMinimizedValidate() | 135.47 (2.27) | 151.58 (1.48) | 12.13 (0.40) | 10.13 (0.43) |
| g_SwitchValueMinimizedArray | 106.38 (2.70) | 118.61 (3.73) | 8.18 (0.31) | 5.19 (0.29) |
| g_SwitchValueMinimizedArrayValidate | 109.32 (2.34) | 120.94 (3.02) | 10.49 (0.55) | 9.00 (0.40) |
| BruteForceByStartingLetter() | 20.45 (0.92) | 21.64 (0.76) | 4.90 (0.31) | 5.87 (0.32) |
| BruteForce() | 120.70 (2.16) | 140.95 (1.71) | 32.50 (0.47) | 45.90 (0.79) |
Pre-hashed In Order
Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.
| | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
|---|---|---|---|---|
| SwitchValue() | 12.49 (0.61) | 13.23 (0.37) | 1.94 (0.17) | 1.81 (0.12) |
| SwitchValueValidate() | 17.08 (1.06) | 16.72 (0.57) | 4.32 (0.30) | 4.05 (0.21) |
| SwitchValueMinimized() | 11.83 (0.69) | 12.06 (0.51) | 1.29 (0.13) | 1.58 (0.17) |
| SwitchValueMinimizedValidate() | 16.02 (0.84) | 15.84 (0.66) | 3.25 (0.24) | 3.47 (0.27) |
| g_SwitchValueMinimizedArray | 1.23 (0.06) | 1.15 (0.10) | 0.00 (0.00) | 0.00 (0.00) |
| g_SwitchValueMinimizedArrayValidate | 4.21 (0.32) | 2.99 (0.20) | 2.45 (0.17) | 2.66 (0.20) |
Pre-hashed Shuffled
Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.
| | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
|---|---|---|---|---|
| SwitchValue() | 12.96 (1.37) | 13.45 (0.47) | 1.84 (0.11) | 1.81 (0.16) |
| SwitchValueValidate() | 16.27 (2.01) | 16.57 (0.63) | 2.65 (0.19) | 2.85 (0.17) |
| SwitchValueMinimized() | 11.75 (0.63) | 12.15 (0.45) | 1.07 (0.07) | 1.06 (0.11) |
| SwitchValueMinimizedValidate() | 16.44 (0.99) | 16.33 (0.58) | 3.43 (0.18) | 3.41 (0.22) |
| g_SwitchValueMinimizedArray | 1.13 (0.06) | 1.18 (0.10) | 0.32 (0.05) | 0.31 (0.04) |
| g_SwitchValueMinimizedArrayValidate | 4.50 (0.32) | 3.31 (0.18) | 2.82 (0.16) | 3.29 (0.18) |
There’s a lot of data, but here are the things I found most interesting or relevant to what I’m looking at (generic data structures vs ad hoc code for data).
Tests 1 and 2
std::map and std::unordered_map are very, very slow in debug, as you might expect. It would be interesting to look deeper and see what it is that they are doing in debug to slow them down so much.
There is some tribal knowledge in the C++ world that says not to use std::map and to use std::unordered_map instead, but I was surprised to see just how slow std::map was. In x64 release, std::map took about 75% of the time that brute force did, and in win32 release, it took the same time or was slower! std::map isn’t hash based; you give it a comparison function that returns -1, 0, or 1, meaning less than, equal, or greater than. Even so, you have to wonder how the heck the algorithm can be so slow that brute force is a comparable replacement for lookup times!
It’s interesting to see that everything I tried (except brute force) was significantly faster than both std::map and std::unordered_map. That saddens me a little bit, but to be fair, the use case I’m going after is a static data structure with fast lookup speeds, which isn’t what unordered_map aims to solve. This just goes to show that yes, if you have static data that you want fast lookup times for, making ad hoc code or rolling your own read-only data structure can give you significant performance wins, and can also help with memory issues (fragmentation and wasted allocations that will never be used).
It was surprising to see that switching on the first letter and brute forcing the strings with the same first letter did so well. That is one of the faster results, competing with SwitchValue() for top dog. The interesting thing though is that BruteForceByStartingLetter() gracefully handles invalid input, while SwitchValue() does not and has undefined behavior, so another point goes to BruteForceByStartingLetter().
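For reference, here is a sketch of the shape of BruteForceByStartingLetter(). The words and values are made up for illustration; the real tests use 100 random words valued 1 to 100:

```cpp
#include <cassert>
#include <cstring>

// Bucket by first letter, then strcmp only within that bucket. Unknown
// words fall through to 0, so invalid input is handled gracefully.
int BruteForceByStartingLetter(const char* word)
{
    switch (word[0])
    {
        case 'a':
            if (!std::strcmp(word, "apple"))  return 1;
            if (!std::strcmp(word, "anchor")) return 2;
            break;
        case 'b':
            if (!std::strcmp(word, "bread"))  return 3;
            break;
        case 'c':
            if (!std::strcmp(word, "cat"))    return 4;
            if (!std::strcmp(word, "candle")) return 5;
            break;
    }
    return 0; // not found: invalid input
}
```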
Tests 3 and 4
These tests were done with pre-hashed keys to simulate an ideal setup.
If you have a static key to value data structure and have the ability to make ad hoc code for your specific static data, chances are pretty good that you’ll also be able to pre-hash whatever keys you are going to be looking up so you don’t have to hash them at run time. Also, if you are doing multiple lookups with a single key for some reason, you may opt to calculate the hash only on the first lookup, and then from there re-use the hashed key.
These tests simulated those situations.
As expected, the perf results on these tests are much better than those that hash the key on demand for each lookup. Less work done at runtime means better performance.
Based on the results of the last blog post – that array lookups are super fast – you probably aren’t surprised to see that g_SwitchValueMinimizedArray is the winner for performance by far.
It is so fast that the in order case doesn’t even register any time having been taken. This is probably a little misleading, because the in order tests (and even the shuffled tests) are very cache friendly. In reality, you probably would have more cache misses and it wouldn’t be quite as cheap as what is being reported, but it would still be super fast compared to the other options.
In second place comes SwitchValueMinimized() which is the switch statement function version of g_SwitchValueMinimizedArray. Arrays still beat switch statements, as we found out in the last post!
In third place comes SwitchValue(), which is the same as SwitchValueMinimized() but has sparser values used in the switch statement, which makes it more difficult for the compiler to generate efficient code. For instance, having the full range of 32 bits as case statement values, and having them all be pseudo random numbers (because they are the result of a hash!), rules out the ability for the compiler to make a jump table array, or find any patterns in the numbers. The SwitchValueMinimized() function on the other hand has only 337 possible values, and so even though the values are sparse (there are 100 items in those 337 possible values), it’s a small enough number that a jump table array can be used without issues.
After that comes all the validated versions of the tests. It makes sense that they would be slower, because they do all the same work, and then some additional work (strcmp) to ensure that the input is valid.
Getting The Fastest Results
If you have some static data that maps keys to values, and you need it to be fast for doing lookups, it looks like writing something custom is the way to go.
The absolutely fastest way to do it is to make an array out of your data items and then pre-process (or compile time process) any places that do a lookup, to convert keys to array indices. Then, at run time, you only need to do an array lookup to a known index to get your data, which is super fast. If your data has duplicates, you might also be able to make the keys which point at duplicate data instead just point at the same array index, to de-duplicate your data.
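Here is a minimal sketch of that setup, with hypothetical names and values; imagine a code generator produced the enum from the key strings during a pre-process step:

```cpp
#include <cassert>

// Keys have already been converted to indices before run time. Duplicate
// data can be de-duplicated by pointing two keys at the same index.
enum DataIndex
{
    kIndex_Apple  = 0,
    kIndex_Bread  = 1,
    kIndex_Cat    = 2,
    kIndex_Anchor = 0, // same data as apple: shares its index
    kIndexCount   = 3
};

static const int g_data[kIndexCount] = { 10, 20, 30 };

// What "look up apple" compiles down to at run time: one indexed read.
int LookupApple() { return g_data[kIndex_Apple]; }
```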
If doing that is too complex, or too much work, a low tech and low effort way to handle the problem seems to be to break your data up into buckets, possibly based on their first letter, and then do a brute force search (or something else) among the smaller number of items in the bucket.
In fact, that second method is sort of like a hard coded trie which is only two levels deep.
If you needed to do some hashing at runtime, finding a faster hash function (that also worked in constexpr, or at least had a constexpr version!) could help you get closer to the pre-hashed keys timings. The good news is the hash doesn’t have to be particularly good. It just has to be fast and have no collisions for the input values you wish to use. That seems like something where brute force searching simple hash functions with various salt values may give you the ideal solution, but probably would take a very, very long time to find what you are looking for. You might notice that the default hash used for std::unordered_map is actually faster than the crc32 implementation I used.
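Here is a sketch of what that brute force salt search could look like, with a toy multiplicative hash and a toy word list standing in for crc32 and the real 100 words:

```cpp
#include <cassert>
#include <cstdint>

// Simple stand-in hash: any fast hash with a salt parameter works here.
std::uint32_t SaltedHash(const char* s, std::uint32_t salt)
{
    std::uint32_t h = salt;
    for (; *s; ++s)
        h = h * 31u + (std::uint32_t)(unsigned char)*s;
    return h;
}

// True if no two words collide modulo tableSize for this salt.
bool SaltIsCollisionFree(const char* const* words, int count,
                         std::uint32_t tableSize, std::uint32_t salt)
{
    bool used[1024] = {}; // assumes tableSize <= 1024
    for (int i = 0; i < count; ++i)
    {
        std::uint32_t slot = SaltedHash(words[i], salt) % tableSize;
        if (used[slot])
            return false;
        used[slot] = true;
    }
    return true;
}

// Brute force: try salts until one gives a collision-free mapping.
std::uint32_t FindSalt(const char* const* words, int count,
                       std::uint32_t tableSize)
{
    std::uint32_t salt = 0;
    while (!SaltIsCollisionFree(words, count, tableSize, salt))
        ++salt;
    return salt;
}
```

For a small list and a table size a bit larger than the word count this finds a salt quickly; scaling it up to a full 32-bit hash over a big word list is where the “very, very long time” comes in.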
Of course, we also learned what NOT to do. Don’t use brute force, and don’t use std::map. Using std::unordered_map isn’t super awful compared to those solutions, but you can do a lot better if you want to.
This fast key to value lookup might sound like a contrived need to some people, but in game development (I am a game developer!), there is commonly the concept of a game database, where you look up data about things (how much damage does this unit do?) by looking up a structure based on a unique ID that is a string, named by a human. So, in game dev, which also has high performance needs, optimizing this usage case can be very helpful. There is a little bit more talk about game data needs here: Game Development Needs Data Pipeline Middleware.
Is Code Faster Than Data?
I still think ad hoc code for data structures can often be faster than generic data structures, and the experiments on this post have positive indications of that.
Another way I think ad hoc code could be helpful is when you have hierarchical and/or heterogeneous data structures. By that I mean data structures which have multiple levels, where each level may actually have different needs for how best to look up data in it, and in fact, even siblings on the same level may have different needs for how best to look up data in them.
In these cases, you could make some data types which had virtual functions to handle the nature of the data needing different solutions at each level, but those virtual function calls and abstractions add up.
I think it’d be superior to have hard coded code that says “oh, you want index 4 of the root array? Ok, that means you are going to binary search this list next”. Of course, that code needs to be generated by another program to be effective. If a human has to make sure all that code stays up to date, it’ll be a ton of work, and it’ll be broken, causing very subtle, hard-to-reproduce bugs.
A downside I can see to ad hoc code solutions is possibly thrashing the instruction cache more. Not sure if that’d be an issue in practice, it’d be interesting to try more complex data structures and see how it goes.
Also, it might be difficult to have good heuristics to figure out what is best in which situations. I could see a utility possibly generating different variations of code and running them to see which was most performant. Seems like it’d be a challenge to get 100% right all the time, but our experiments make it seem like it’d be easy to do significantly better than generic algorithms which are meant to be dynamic at runtime.
I also think that more complex data structures are more likely to get benefit of having custom code made for them. Simple ones less likely so. It’s hard to beat an array lookup. That being said, the unbeatable simple data structures make great building blocks for the more complex ones (;
It probably would also be good to look into memory usage a bit more to see how ad hoc code compares to generic algorithms. If ad hoc code is much faster but uses more memory, that’d have to be a conscious decision to make when weighing the trade offs.
Maybe in the future, the C++ standards will allow for static data structure types that you have to initialize with compile time constants (allowing constexpr), that are optimized for lookup times since they can’t have any inserts or deletes? I wonder how much demand there would be for that?
Here’s a good read on some other string friendly data structures:
Data Structures for Strings
Twitter user @ores brought up two interesting points:
- It would be interesting to see how gperf performs in this situation. If it makes a faster minimal perfect hash function, it’ll get us closer to the pre-hashed keys timings.
- It would be interesting to time scripting languages to see if for them code is faster than data or not. Another interesting aspect of this would be to look at a JIT compiled scripting language like lua-jit. The thing that makes JIT interesting is that it can compile for your specific CPU, instead of having to compile for a more generic set of CPU features. That gives it the opportunity to make code which will perform better on your specific machine.