The last post explored compile time hashing a bit and at the end showed a way to use compile time code to assist a string to enum function, by having a minimal perfect hash that was guaranteed at compile time not to have any hash collisions.
That was pretty interesting to me, and made me wonder what else was possible with compile time (or compile time assisted) data structures. This would be mainly for static data structures that were read only. We could get creative and allow for deletes, inserts and data edits, but we’ll probably come back to that in a future post.
In pursuing other compile time data structures, there is one big question that would be easy to make assumptions about: Is custom code for compile time data structure lookup functionality faster than data?
In other words, could we hard code a hash table key to value lookup that was more performant than a general case hash table which was meant to be used by any data?
The answer to this is “usually it can be” IMO, since we could do lots of things specific to the data, like finding cheaper / better hash functions for the specific data input we were expecting. Since we aren’t doing inserts or deletes, or data modification, we could also allocate just the right amount of memory, in one shot, which would reduce memory waste and memory fragmentation.
But let's do some experiments and see. Let's start off by comparing how a switch statement performs against some simple array types.
I ran these tests in x86/x64 debug/release in Visual Studio 2015.
I made 5000 random uint32’s and stored them in these containers:
- C style array
- dynamic memory
- Switch statement function
- Switch statement function with __assume(0) in default case
The __assume(0) in the switch default case is a hint to the optimizer that it can skip the bounds check normally needed to handle values not present in the switch statement. It's Microsoft specific, but there are equivalents in other compilers. Note that if you switch on a value that would normally be handled by the default case, you get undefined behavior. It's a speed up, but it comes at a price. You can read about it here: msdn: __assume. This test code never passes values that would trigger the default case, so it's arguably a good candidate for __assume, if you believe there are any good usages for __assume.
For a neat read on some details of undefined behavior check this out: Undefined behavior can result in time travel (among other things, but time travel is the funkiest)
I did the following tests.
- Sequential Sum: Add up the values at the indices 0-4999, in order. For the switch function, it was called with indices 0-4999 in order.
- Shuffle Sum: Same as sequential sum, but in a shuffled order.
- Sparse Sum: Same as shuffle sum, but only doing the first 1000 indices. An alternate switch function was used which only contained values for those first 1000 indices, simulating the ability for us to strip out values never actually needed, to see if that has an impact on performance.
Timing was measured using std::chrono::high_resolution_clock, timing each test done 1,000 times in a row to make timing differences more apparent. I did this 50 times for each test to get an average and a standard deviation for those 50 samples.
The source code for the tests is at:
The results are below. Times are in milliseconds. The number in parentheses is the standard deviation, also in milliseconds.
Sequential Sum: Sum 5,000 numbers 1,000 times, in order.
|                          | Debug Win32   | Debug x64     | Release Win32 | Release x64  |
|--------------------------|---------------|---------------|---------------|--------------|
| std::vector              | 177.16 (4.68) | 57.87 (6.48)  | 1.23 (0.25)   | 0.30 (0.46)  |
| std::array               | 94.53 (5.84)  | 53.61 (3.96)  | 1.25 (0.29)   | 0.34 (0.48)  |
| C array                  | 8.67 (0.48)   | 9.90 (0.57)   | 1.22 (0.37)   | 0.30 (0.46)  |
| dynamic memory           | 8.96 (0.29)   | 10.01 (0.77)  | 1.27 (0.40)   | 0.30 (0.46)  |
| switch function          | 148.50 (3.48) | 115.01 (4.83) | 47.53 (3.86)  | 43.53 (1.75) |
| switch function assume 0 | 135.96 (1.22) | 95.54 (2.83)  | 46.33 (0.97)  | 38.91 (0.72) |
Shuffle Sum: Sum 5,000 numbers 1,000 times, in random order.
|                          | Debug Win32   | Debug x64      | Release Win32 | Release x64  |
|--------------------------|---------------|----------------|---------------|--------------|
| std::vector              | 179.82 (8.13) | 46.75 (1.05)   | 2.78 (0.42)   | 1.41 (0.19)  |
| std::array               | 89.94 (1.23)  | 39.47 (1.07)   | 2.58 (0.29)   | 1.42 (0.19)  |
| C array                  | 8.59 (0.41)   | 8.63 (0.49)    | 2.55 (0.25)   | 1.39 (0.21)  |
| dynamic memory           | 8.23 (0.26)   | 8.41 (0.50)    | 2.73 (0.25)   | 1.40 (0.20)  |
| switch function          | 151.83 (0.95) | 116.37 (16.91) | 55.74 (1.97)  | 48.20 (0.56) |
| switch function assume 0 | 142.07 (0.97) | 103.01 (11.17) | 55.45 (2.00)  | 42.09 (0.79) |
Sparse Sum: Sum 1,000 of the 5,000 numbers 1,000 times, in random order.
|                          | Debug Win32  | Debug x64    | Release Win32 | Release x64 |
|--------------------------|--------------|--------------|---------------|-------------|
| std::vector              | 34.86 (0.46) | 9.35 (0.48)  | 0.54 (0.14)   | 0.26 (0.25) |
| std::array               | 17.88 (0.45) | 7.93 (0.27)  | 0.52 (0.10)   | 0.25 (0.25) |
| C array                  | 1.72 (0.39)  | 1.70 (0.46)  | 0.52 (0.10)   | 0.24 (0.25) |
| dynamic memory           | 1.67 (0.45)  | 1.70 (0.35)  | 0.55 (0.18)   | 0.26 (0.25) |
| switch function          | 29.10 (0.49) | 18.88 (0.92) | 13.42 (0.43)  | 9.51 (0.43) |
| switch function assume 0 | 26.74 (0.52) | 18.19 (0.39) | 12.15 (0.50)  | 9.23 (0.42) |
It’s interesting to see that in debug Win32, std::array is about 10 times slower than a C array or dynamic memory, and std::vector is about 20 times slower. In debug x64, std::array is only about 5 times slower, and std::vector is only a little slower than std::array. In release, std::vector and std::array are more in line with the cost of a C array or dynamic memory.
It’s expected that std::array and std::vector are slower in debug because they do things like check indices for being out of bounds. They also have a lot more code associated with them which may not “optimize away” in debug, like it does in release (including inlining functions). It’s nice to see that they become basically free abstractions in release though. It gives us more confidence that they can be used in performance critical areas without too much extra worry, as long as we are ok with the debug slowdowns and want or need the debug functionality the types give us.
It’s also interesting to see that adding the __assume(0) to the default case in the switch statements actually improves performance a measurable amount in most tests.
But, onto the sad news…
The switch function IS NOT faster than the simple array access, and in fact is quite a bit slower!
Why is it so much slower?
One reason is the function call overhead associated with each lookup. When I marked the function as inline, the compiler refused to inline it. When I marked it as __forceinline to force the issue, the compiler took A LONG TIME in the “generating code” phase; after 10 minutes it still wasn’t done, so I killed it as unacceptably slow. I tried to get an apples to apples comparison by putting the test loop inside the function, but the optimizer realized it could run the function once and multiply its result by the number of test repeats I wanted to do. Fooling the optimizer meant doing things that were no longer an apples to apples comparison, and they made the results slower. So, I’m leaving it at this: in the release assembly, you can see the call to the switch function, where there is no call with e.g. dynamic memory. Function calls are not free, so no, a function call is not faster than no function call!
The second reason is how switch statements work compared to how arrays work. The switch statement first does a comparison to see if the value is within range of the maximum valued case — in other words, a range check, like what std::array and std::vector do in debug, except that the switch does it in both debug and release — and then it does a jmp to a code location which returns a constant value.
An array on the other hand just calculates the offset of a memory value based on the index passed in and the object size, and then reads that memory address.
Arrays have more direct access to the values than switch statements do.
It’s true that the switch statements could be optimized to behave more like arrays, with the whole function replaced by a memory lookup, but the optimizer doesn’t make that optimization here.
Here’s a good read on details of how switch statements are actually implemented in msvc under various conditions:
Something You May Not Know About the Switch Statement in C/C++
On the memory usage front, the simplest array (not worrying about things like geometric dynamic growth policies or memory padding) takes 20,000 bytes to store 5000 uint32’s.
In Win32 (I didn’t try x64, but I’m sure it’s similar), gutting the switch functions down to a single case made a huge difference in executable size, which shows us that switch statements are bad for memory efficiency too:
Debug went from 340 KB (348,160 bytes) to 244 KB (249,856 bytes).
Release went from 191 KB (196,096 bytes) to 133 KB (136,192 bytes).
I’d like to see how switch statements compare to hash tables. I think switch statements will still be slower, but it would be good to investigate the details and confirm.
This also makes me think that the “String To Enum” code in the last post, which makes use of a switch statement, is probably not the ideal code for doing that, as far as performance and memory usage go.
I think that instead of being switch statement based, it should probably be array based.
However, the idea that we can have compile time assurance that our hash has no collisions is really nice and we get that from doing case statements on run time hashes of our full set of valid inputs. It would be nice to find a way to leverage that while still getting the speed we want, as automatically as possible. At worst, code generation could be used, but it would be nicer to do something less rigid / heavy handed.
Lastly, I still believe that “ad hoc” custom fit code for specific data lookups has the best potential for performance. I want to keep looking into it to see about finding some good avenues for success.
Have any info to share? Please do!
Thanks for reading and I hope you find this useful or at least interesting.