# Evaluating Polynomials with the GPU Texture Sampler

This is an extension of a paper I wrote which shows how to use the linear texture sampling capabilities of the GPU to calculate points on Bezier curves. You store the control points in the texture, then sample along the texture’s diagonal to get points on the curve:
GPU Texture Sampler Bezier Curve Evaluation

I’ve been thinking about the items in the “future work” section and found some interesting things regarding polynomials, logic gates, surfaces and volumes. This is the first post, which deals with evaluating polynomials.

# Evaluating Polynomials

One of the main points of my paper was that N-linear interpolation (linear, bilinear, trilinear, etc) can be used to evaluate the De Casteljau algorithm since both things are just linear interpolations of linear interpolations. (Details on bilinear interpolation here: Bilinear Filtering & Bilinear Interpolation).

This meant that it was also able to calculate Bernstein Polynomials (aka the algebraic form of Bezier curves), since Bernstein polynomials are equivalent to the De Casteljau algorithm.

I started looking around to see what would happen if you messed around with the De Casteljau algorithm a bit, like interpolate at one level by
$t^2$ or $t*0.5+0.5$ or by a constant or by another variable completely. My hope was that I’d be able to make the technique more generic and open it up to a larger family of equations, so people weren’t limited to just Bernstein polynomials.

That opened up a pretty deep rabbit hole on polynomial blossoming and something called Symmetric Multiaffine Functions. There are some great links in the answer here:
Math Stack Exchange: Modifying and Generalizing the De Casteljau Algorithm

In the end, it turned out to be pretty simple though. It turns out that any polynomial can be converted back and forth from “Power Basis” (which looks like $Ax^2+Bx+C$) to “Bernstein Basis” (which looks like $A(1-t)^2+B(1-t)t+Ct^2$) so long as they are the same degree.

This isn’t the result I was expecting but it is a nice result because it’s simple. I think there is more to be explored by sampling off the diagonal, and using different t values at different stages of interpolation, but this result is worth sharing.

By the way, you could also use curve fitting to try and approximate a higher degree function with a lower degree one, but for this post, I’m only going to be talking about exact conversion from Bernstein polynomials to Power polynomials.

Since we can convert power basis polynomials to Bernstein polynomials, and the technique already works for Bernstein polynomials, that means that if we have some random polynomial, say $y=2x^3+4x+2$, that we can make this technique work for that too. The technique got a little closer to arbitrary equation evaluation. Neat!

# Converting Power Basis to Bernstein Basis

I found the details of the conversion process at Polynomial Evaluation and Basis Conversion which was linked to by Math Stack Exchange: Convert polynomial curve to Bezier Curve control points.

This is best explained working through examples, so let’s start by converting a quadratic polynomial from power basis to Bernstein basis.

$y=2x^2+8x+3$

The first thing we do is write the coefficients vertically, starting with the $x^0$ coefficient, then the $x^1$ coefficient, and continuing up to the highest degree term $x^n$:

$\begin{array}{c} 3 \\ 8 \\ 2 \\ \end{array}$

Next, we need to divide by the Binomial Coefficients (aka the row of Pascal’s Triangle which has the same number of items as we have coefficients). In this case we need to divide by: 1,2,1.

$\begin{array}{c|c} 3 & 3 / 1 = 3 \\ 8 & 8 / 2 = 4 \\ 2 & 2 / 1 = 2 \\ \end{array}$

Now we generate a difference table backwards. It's hard to explain in words, but if you look at the table, each value is the sum of the value to its left and the value below that one.

$\begin{array}{c|c|c|c} 3 & 3 / 1 = 3 & 7 & 13 \\ 8 & 8 / 2 = 4 & 6 & \\ 2 & 2 / 1 = 2 & & \\ \end{array}$

We are all done. The control points for the Bezier curve are on the top row (ignoring the leftmost column). They are 3,7,13, which means the following two equations are equal. The first is in power basis, the second is in Bernstein basis.

$y=2x^2+8x+3$
$y=3(1-x)^2+14(1-x)x+13x^2$

Note: don’t forget that Bezier curves multiply the control points by the appropriate row in Pascal’s triangle. That’s where the 14 comes from in the middle term of the Bernstein polynomial. We are multiplying the control points 3,7,13 by the row in Pascal’s triangle 1,2,1 to get the final coefficients of 3,14,13.

Let’s have Wolfram Alpha help us verify that they are equal.

Yep, they are equal! If you look at the legend of the graph, Wolfram Alpha actually converted the Bernstein form back to power basis, and you can see that they are exactly equivalent.

You can also write the Bernstein form like the below, which I prefer, using $t$ instead of $x$ and also setting $s=1-t$.

$y=3s^2+14st+13t^2$

## Cubic Function

A cubic function is not that much harder than a quadratic function. After this, you should see the pattern and be able to convert any degree easily.

$y=5x^3+9x-4$

Again, the first thing we do is write the coefficients vertically, starting with the constant term. Note that we don't have an $x^2$ term, so its coefficient is 0.

$\begin{array}{c} -4 \\ 9 \\ 0 \\ 5 \\ \end{array}$

We next divide by the Pascal’s triangle row 1,3,3,1.

$\begin{array}{c|c} -4 & -4 / 1 = -4 \\ 9 & 9 / 3 = 3 \\ 0 & 0 / 3 = 0 \\ 5 & 5 / 1 = 5 \\ \end{array}$

Now, make the difference table going backwards again:

$\begin{array}{c|c|c|c|c} -4 & -4 / 1 = -4 & -1 & 2 & 10 \\ 9 & 9 / 3 = 3 & 3 & 8 & \\ 0 & 0 / 3 = 0 & 5 & & \\ 5 & 5 / 1 = 5 & & & \\ \end{array}$

Our Bezier control points are along the top: -4,-1,2,10. Keeping in mind that the coefficients for a cubic Bezier curve are multiplied by 1,3,3,1, we can write the Bernstein form and put it next to our original formula:

$y=5x^3+9x-4$
$y=-4(1-x)^3-3(1-x)^2x+6(1-x)x^2+10x^3$

Let’s check in wolfram alpha again:
Wolfram Alpha: graph y=5x^3+9x-4, y=-4(1-x)^3-3x(1-x)^2+6x^2(1-x)+10x^3, from 0 to 1

And here it is in the cleaner form:

$y=-4s^3-3s^2t+6st^2+10t^3$

## Some Notes On Calculating Polynomials with the Texture Sampler

You may notice that in the comparison graphs I only plotted the graphs from 0 to 1 on the x axis (aka the t axis). The equations are actually equivalent outside of that range as well, but the technique from my paper only works in the 0 to 1 range because it relies on built in hardware pixel interpolation. This may sound like a big limitation, but if you know the minimum and maximum value of x that you want to plug into your equation at runtime, you can convert your x into a percent between those values, get the resulting polynomial, convert it to Bernstein form, and set up the texture. Then at runtime, you convert your input parameter into that percent when you do the lookup. In other words, you squeeze the parts of the function you care about into the 0 to 1 range.

Another issue you will probably hit is that standard RGBA8 textures have only 8 bits per channel and can only store values between 0 and 1. Since the texture is supposed to be storing your control points, that is bad news.

One way to get around this is to find the largest coefficient value and divide all of the coefficients by it. This will put the coefficients into the 0 to 1 range (assuming they are non-negative), so they can be stored in your texture. After sampling the texture, you multiply the result by that scaling value to get the correct answer.

Scaling won't help if you have both negative and positive coefficients though. To handle negative coefficients, you could map the 0 to 1 space to be from -1 to 1, similar to how we often do it with normal maps and other signed data stored in textures. After doing the lookup you'd have to unmap it too, of course.

You could also solve negative values and scaling problems by squishing the y axis into the 0 to 1 space by subtracting the minimum and dividing by the maximum minus the minimum, similarly to how we squished the x range into 0 to 1.

If you instead move to an RGBAF32 texture, you’ll have a full 32 bit float per color channel and won’t have problems with either large values or negative values. You will still have to deal with x only going from 0 to 1 though.

I also want to mention that the hardware texture interpolation works in a X.8 fixed point format. There are more details in my paper, but that means that you’ll get some jagged looking artifacts on your curve instead of a smoothly varying value. If that is a problem for you in practice, my paper talks about a few ways to mitigate that issue.

Before moving on, I wanted to mention that it’s easy to support rational polynomials using this method as well. A rational polynomial is when you divide one polynomial by another polynomial, and relates to rational Bezier curves, where you divide one curve by another curve (aka you give weights to control points). Rational curves are more powerful and in fact you can perfectly represent sine and cosine with a quadratic rational polynomial. More info on that in my paper.

To calculate rational polynomials, you just encode the numerator polynomial in one color channel, and the denominator polynomial in another color channel. After you sample the texture and get the result of your calculation, you divide the numerator value by the denominator value. It costs one division in your shader code, but that’s pretty cheap for the power it gives you!

Regarding the texture size requirements to store a polynomial of a specific degree…

Every dimension of the texture, and every color channel in that texture, adds a degree.

However, to get the benefit of the degree increase from the color channel, you need to do a little more math in the shader – check my paper for more details!

So, if you wanted to store a quadratic polynomial in a texture, you would need either a 2d texture with 1 color channel, or you could do it with a 1d texture that had 2 color channels.

If you wanted to store a cubic polynomial in a texture, you could use a 3d texture with 1 color channel, or a 2d texture with two color channels (there would be some waste here) or a 1d texture with three color channels.

For a polynomial that had a maximum degree term of 6, you could use a 3d volume texture that had 3 color channels: RGB.

If you need to evaluate a very high degree polynomial, you can actually take multiple texture samples and combine them.

For instance, if you had a 2d texture with a single color channel, each texture read would get you a quadratic. If you do two reads to get two quadratics, and then linearly interpolate between them, you end up with a cubic.

That isn't a very high degree curve, but it's an easy case for seeing how multiple reads combine.

Taking this up to RGBA 3d volume textures, a single texture read will get you a curve of degree 6. If you do another read, it will take it to degree 7. Another read gets you to 8, another to 9, etc.

With support for 4d textures, an RGBA texture read would give you a degree 7 curve. Another read would boost it to 8, another to 9, another to 10, etc.

Regarding the specific sizes of the textures, in all cases the texture size is “2” on each dimension because we are always just linearly interpolating within a hyper cube of pixel values. You can increase the size of the texture for piecewise curves, check out the paper for more details on that and other options.

## Closing

Hopefully you found this useful or interesting!

There may not have been much new information in here for the more math inclined people, but I still think it's worthwhile to explicitly show how the technique works for both Bernstein polynomials as well as the more common power basis polynomials.

I still think it would be interesting to look at what happens when you sample off of the diagonal, and also what happens if you use different values at different stages of the interpolation. As an example, instead of just looking up a texture at (t,t) for the (u,v) value to get a quadratic curve point, what if we look up by (t,t^2)? At first blush, it seems like by doing that we may be able to boost a curve to a higher degree, maybe at the cost of some reduced flexibility for the specific equations we can evaluate?

Next up I’ll be writing up some more extensions to the paper involving logic gates, surfaces, and volumes.

Have any feedback, questions or interesting ideas? Let me know!

# The Secret to Writing Fast Code / How Fast Code Gets Slow

This is a “soft tech” post. If that isn’t your thing, don’t worry, I’ll be returning to some cool “hard tech” and interesting algorithms after this. I’ve been abusing the heck out of the GPU texture sampler lately, so be on the lookout for some posts on that soon (;

I’m about to show you some of the fastest code there is. It’s faster than the fastest real time raytracer, it’s faster than Duff’s Device.

Heck, despite the fact that it runs on a classical computer, it runs faster than Shor’s Algorithm which uses quantum computing to factor integers so quickly that it breaks modern cryptographic algorithms.

This code also runs faster than Grover’s Algorithm which is another quantum algorithm that can search an unsorted list in O(sqrt(N)).

Even when compiled in debug it runs faster than all of those things.

Are you ready? Here it is…

// Some of the fastest code the world has ever seen
int main (int argc, char **argv)
{
    return 0;
}


Yes, the code does nothing and that is precisely why it runs so fast.

# The Secret to Writing Fast Code

The secret to writing fast code, no matter what you are writing, is simple: don't do anything that is too slow.

Let's say you started with a main() function like I showed above and you decided you want to make a real time raytracer that runs on the CPU.

First thing you do is figure out what frame rate you want it to run at, at the desired resolution. From there, you know how many milliseconds you have to render each frame, and now you have a defined budget you need to stay inside of. If you stay in that budget, you’ll consider it a real time raytracer. If you go outside of that budget, it will no longer be real time, and will be a failed program.

You may get camera control working and primary rays intersecting a plane, and find you’ve used 10% of your budget and 90% of the budget remains. So far so good.

Next up you add some spheres and boxes, diffuse and specular shade them with a directional light and a couple point lights. You find that you’ve used 40% of your budget, and 60% remains. We are still looking good.

Next you decide you want to add reflection and refraction, allowing up to 3 ray bounces. You find you are at 80% of your budget and are still looking good. We are still running fast enough to be considered real time.

Now you say to yourself “You know what? I’m going to do 4x super sampling for anti aliasing!”, so you shoot 4 rays out per pixel instead of 1 and average them.

You profile and uh oh! You are at 320% of your budget! Your ray tracer is no longer real time!

What do you do now? Well, hopefully it’s obvious: DON’T DO THAT, IT’S TOO SLOW!

So you revert it and maybe drop in some FXAA as a post processing pass on your render each frame. Now you are at 95% of your budget let’s say.

Now you may want to add another feature, but with only 5% of your budget left you probably don’t have much performance to spare to do it.

So, you implement whatever it is and find that you are at 105% of your budget.

Unlike the 4x super sampling, which was 220% over budget, this new feature being only 5% over budget isn't THAT much. At this point you could profile something that already exists (maybe even your new feature) and see if you can improve its performance. Or, if you can find some clever solution that gives you a performance boost at the cost of things you don't care about, you can do that to get some performance back. This is a big part of the job as a successful programmer / software engineer – making trade offs where you gain benefits you care about, at the cost of things you do not care about.

At this point, you can also decide if this new feature is more desired than any of the existing features. If it is, and you can cut an old feature you don’t care about anymore, go for it and make the trade.

Rinse and repeat this process with new features and functionality until you have the features you want, that fit within the performance budget you have set.

Follow this recipe and you too will have your very own real time raytracer (BTW, related: Making a Ray Traced Snake Game in Shadertoy).

Maintaining a performance budget isn’t magic. It’s basically subtractive synthesis. Carve time away from your performance budget by adding a feature, then optimize or remove features if you are over budget. Rinse and repeat until the sun burns out.

Ok, so if it’s so easy, why do we EVER have performance problems?

## How Fast Code Gets Slow

Performance problems come up when we are not paying attention. Sometimes we cause them for ourselves, and sometimes things outside of our control cause them.

The biggest way we cause performance problems for ourselves is by NOT MEASURING.

If you don’t know how your changes affect performance, and performance is something you care about, you are going to have a bad time.

If you care about performance, measure it regularly! Profile before and after your changes and compare the differences. Have automated tests that profile your software and report the results. Understand how your code behaves in the best and worst case. Watch out for algorithms that sometimes take a lot longer than their average case. Stable algorithms make for stable experiences (and stable frame rates in games). This is because algorithms with "perf spikes" will sometimes have their spikes line up on the same frame, giving you a more erratic frame rate, which makes your game seem much worse than a stable but lower frame rate would.

But, again, performance problems aren't always the programmer's fault. Sometimes things outside of our control change and cause us perf problems.

For example, let's say that you are tasked with writing some very light database software which keeps track of all employees' birthdays.

Maybe you use a hash map to store birthdays. The key is the string of the person’s name, and the value is a unix epoch timestamp.

Simple and to the point. Not over-engineered.

Now, someone else has a great idea – we have this database software you wrote, what if we use it to keep track of all of our customers and end user birthdays as well?

So, while you are out on vacation, they make this happen. You come back and the “database” software you made is running super slow. There are hundreds of thousands of people stored in the database, and it takes several seconds to look up a single birthday. OUCH!

So hotshot, looks like your code isn’t so fast huh? Actually no, it’s just that your code was used for something other than the original intended usage case. If this was included in the original specs, you would have done something different (and more complex) to handle this need.

This was an exaggerated example, but this sort of thing happens ALL THE TIME.

If you are working on a piece of software, and the software requirements change, it could turn any of your previous good decisions into poor decisions in light of the new realities.

However, you likely don’t have time to go back and re-think and possibly re-work every single thing you had written up to that point. You move onward and upward, a little more heavy hearted.

The target moved, causing your code to rot a bit, and now things are likely in a less than ideal situation. You wouldn’t have planned for the code you have with the info you have now, but it’s the code you do have, and the code you have to stick with for the time being.

Every time that happens, you incur a little more tech debt / code complexity and likely performance problems as well.

You’ll find that things run a little slower than they should, and that you spend more time fighting symptoms with small changes and somewhat arbitrary rules – like telling people not to use name lengths more than 32 characters for maximum performance of your birthday database.

Unfortunately change is part of life, and very much part of software development, and it’s impossible for anyone to fully predict what sort of changes might be coming.

Those changes are often due to business decisions (feedback on the product, jockeying for a new position in the marketplace, etc), so are ultimately what give us our paychecks and are ultimately good things. Take it from me, who has worked at ~7 companies in 15 years. Companies that don't change/adapt die.

So, change sucks for our code, but it’s good for our wallets and keeps us employed 😛

Eventually, the less than ideal choices of the past will reach some threshold where something has to be done about them. This will likely happen at the point where it's easier to refactor some code than to keep fighting the problems it creates by being less than ideal, or when something that really NEEDS to happen CAN'T happen without more effort than the refactor would take.

When that happens, the refactor comes in, where you DO get to go back and rethink your decisions, with knowledge of the current realities.

The great thing about the refactor is that you probably have a lot of stuff that your code is doing which it doesn’t really even NEED to be doing.

Culling that dead functionality feels great, and it’s awesome watching your code become simple again. It’s also nice not having to explain why that section of code behaves the way it does (poorly) and the history of it coming to be. “No really, I do know better, but…!!!”

One of the best feelings as a programmer is looking at a complex chunk of code that has been a total pain, pressing the delete key, and getting a little bit closer back to the fastest code in the world:

// Some of the fastest code the world has ever seen
int main (int argc, char **argv)
{
    return 0;
}


PS: Another quality of a successful engineer is being able to constantly improve software as it's touched. If you are working in an area of code and you see something ugly that can be fixed quickly and easily, do it while you are there. Since the only constant in software development is change, and change causes code quality to continually degrade, make yourself a force of continual code improvement and help reverse the flow of code into the trash can.

## Engines

In closing, I want to talk about game engines – 3rd party game engines, and re-using an engine from a different project. This also applies to using middleware.

Existing engines are great in that when you and your team know how to use them, you can get things set up very quickly. It lets you hit the ground running.

However, no engine is completely generic. No engine is completely flexible.

That means that when you use an existing engine, there will be some amount of features and functionality which were made without your specific usage case in mind.

You will be stuck in the world where from day 1 you are incurring the tech debt type problems I describe above, but you will likely be unable or unwilling to refactor everything to suit your needs specifically.

I don’t mention this to say that engines are bad. Lots of successful games have used engines made by other people, or re-used engines from previous projects.

However, it’s a different kind of beast using an existing engine.

Instead of making things that suit your needs, and then using them, you’ll be spending your time figuring out how to use the existing puzzle pieces to do what you want. You’ll also be spending time backtracking as you hit dead ends, or where your first cobbled together solution didn’t hold up to the new realities, and you need to find a new path to success that is more robust.

Just something to be aware of when you are looking at licensing or re-using an engine, and thinking that it’ll solve all your problems and be wonderful. Like all things, it comes at a cost!

Using an existing engine does put you ahead of the curve: At day 1 you already have several months of backlogged technical debt!

Unfortunately, business realities mean we can't all just write brand new engines all the time. It's unsustainable.

Agree / Disagree / Have something to say?

# Minimizing Code Complexity by Programming Declaratively

Writing good code is something all programmers aspire to, but the definition of what actually makes good code can be a bit tricky to pin down. The idea of good code varies from person to person, from language to language, and also between problem domains. Web services, embedded devices and game programming are a few software domains that all have very different needs, and so also have very different software development styles, methods and best practices.

I truly believe that we are in the stone ages of software development (ok, maybe the bronze age?), and that 100 years from now, people will be doing things radically differently than we do today because they (or we) will have figured out better best practices, and the languages of the day will usher people towards increased success with decreased effort.

This post is on something called declarative programming. The idea is nothing new, as Prolog from 1972 is a declarative language, but the idea of declarative programming is something I don't think is talked about enough in the context of code quality.

By the end of this read, I hope you will agree that programming declaratively by default is a good best practice that pertains to all languages and problem domains. If not, leave a comment and let me know why!

## Declarative vs Imperative Programming

Declarative programming is when you write code that says what to do. Imperative programming is when you write code that says how to do it.

Below is some C++ code written imperatively. How long does it take you to figure out what the code is doing?

int values[4] = { 8, 23, 2, 4 };
int sum = 0;
for (int i = 0; i < 4; ++i)
    sum += values[i];
int temp = values[0];
for (int i = 0; i < 3; ++i)
    values[i] = values[i + 1];
values[3] = temp;


Hopefully it didn’t take you very long to understand the code, but you had to read it line by line and reason about what each piece was doing. It may not be difficult, but it wasn’t trivial.

Here is the same code with some comments, which helps it be understandable more quickly, assuming the comments haven’t become out of date (:

// Declare array
int values[4] = { 8, 23, 2, 4 };

// Calculate sum
int sum = 0;
for (int i = 0; i < 4; ++i)
    sum += values[i];

// Rotate array items one slot left.
int temp = values[0];
for (int i = 0; i < 3; ++i)
    values[i] = values[i + 1];
values[3] = temp;


Here is some declarative code that does the same thing:

int values[4] = { 8, 23, 2, 4 };
int sum = SumArray(values);
RotateArrayIndices(values, -1);


The code is a lot quicker and easier to understand. In fact the comments aren’t even needed anymore because the code is basically what the comments were.

Comments are often declarative, saying what to do right next to the imperative code that says how to do it. If your code is also declarative though, there is no need for the declarative comments because they are redundant! In fact, if you decide to start trying to write code more declaratively, one way to do so is if you ever find yourself writing a declarative comment to explain what some code is doing, wrap it in a new function, or see if there is an existing function you ought to be using instead.

As a quick tangent, you can use the newer C++ features to make code more declarative, like the below. You arguably should be doing that when possible if your code base uses the STL, a custom STL implementation, or an in house STL type replacement, but I want to stress that this is a completely separate topic from whether or not we should be using new C++ features. Some folks not used to the STL will find the below hard to read compared to the last example, which takes away from the main point. So, if you aren't a fan of the STL due to its readability (I agree!) or its performance characteristics (I also agree!), don't worry about it. For people on the other side of the fence, you can take this as a pro STL argument, as it does make code more declarative, if the readability and perf issues aren't impacting you.

std::array<int,4> values = { 8, 23, 2, 4 };
int sum = std::accumulate(values.begin(), values.end(), 0);
std::rotate(values.begin(), values.begin() + 1, values.end());


## We Already Live in a Semi-Declarative World

When reading the tip about using (declarative) comments as a hint for when to break some functionality out into another function, you may be thinking to yourself: “Wait, isn’t that just the rule about keeping functions small, like to a few lines per function?”

Yeah, that is true. There is overlap between that rule and writing declarative code. IMO declarative code is a more general version of that rule. That rule is part of making code declarative, and gives some of the benefits, but isn’t the whole story.

The concept of D.R.Y. – "Don't Repeat Yourself" – also ends up making your code more declarative. When you are repeating yourself, it's often because you are either duplicating code, or because there is boilerplate code that must be added in multiple places to make something work. By applying D.R.Y. and making a single authoritative source of your information or work, you end up taking imperative details out of the code, thus making what remains more declarative. For more information on that, check out this post of mine: Macro Lists For The Win.

## TDD

If your particular engineering culture uses TDD (test driven development), you may also say “Hey, this isn’t actually anything special, this is also what you get when you use TDD.”

Yes, that is also true. Test driven development forces you to write code such that each individual unit of work is broken up into its own contained, commonly stateless, function or object.

It’s suggested that the biggest value of TDD comes not from the actual testing, but from how TDD forces you to organize your code into small logical units of work, that are isolatable from the rest of the code.

In other words, TDD forces you to make smaller functions that do exactly what they say by their name and nothing else. Sound familiar? Yep, that is declarative programming.

## Compilers, Optimizers and Interpreters

The whole goal of compilers, optimizers and interpreters is to make it so you the coder can be more declarative and less imperative.

Compilers make it so you don’t have to write assembly (assembly being just about as imperative as you can get!). You can instead write higher level concepts about what you want done – like loop over an array and sum up the values – instead of having to write the assembly (or machine code!) to load values into memory or registers, do work, and write them back out to memory or registers.

Similarly, the whole goal of an optimizer is to take code where you describe what you want to happen, and find a way to do the equivalent functionality in a faster way. In other words, you give the WHAT and it figures out the HOW. That is declarative programming.

Interestingly, switch statements are declarative as well. You tell the compiler what values to test for at run time, but leave it up to the compiler to figure out how to test for them. The compiler can decide at compile time whether to use binary searching, if/else if statements, or other tricks to make an efficient lookup for the values you've provided.

Surprised to hear that? Give this a read: Something You May Not Know About the Switch Statement in C/C++

Similarly, profile guided optimization (PGO) is a way for the optimizer to know how your code actually behaves at runtime, to get a better idea at what machine code it ought to generate to be more optimal. Again, it’s taking your more declarative high level instructions, and creating imperative low level instructions that actually handle the HOW of doing what your code wants to do in a better way.

## C#

If you’ve spent any time using C#, I’ll bet you’ve come to the same conclusion I have: If it takes you more than one line of code to do a single logical unit of work (read a file into a string, sort a list, etc), then you are probably doing it wrong, and there is probably some built in functionality already there to do it for you.

When used “correctly”, C# really tends to be declarative.

## C++ Advancements Towards Being Declarative

In the old days of C, there were no constructors or destructors. This meant that you had to code carefully and initialize and deinitialize things at just the right times.

These were imperative details that if you got wrong, would cause bugs and make a bad day for you and the users of your software.

C++ improved on this problem by adding constructors and destructors. You could now put these imperative details off in another section and then not worry about it in the bulk of the code. C++ made C code more declarative by letting you focus more on the WHAT to do, and less on HOW to do it, in every line of code.

In more recent years, we’ve seen C++ get a lot of upgrades, many of which make C++ more declarative. In other words, common things are now getting language and/or STL library support.

For one, many operations that people used to write by hand are now built in – such as std::sort or std::qsort. You no longer have to write out a sorting algorithm imperatively; you just call std::sort and move on.

Another really good example of C++ getting more declarative is lambdas. Lambdas look fancy and new, but they are really just a syntactic shortcut for something we could do all along. When you make a lambda, the compiler makes a class for you that overloads the parentheses operator, has storage for your captures, and initializes that storage. A struct like this is called a functor, and functors existed long before lambdas entered C++. The only difference is that if you want a functor now, you don’t have to go through a bunch of nitty gritty imperative details to define the functor class. You just define a lambda and move on.

## Domain Specific Languages

Domain specific languages – aka DSLs – exist to let people write code meant for specific purposes. Some examples of DSLs are:

• HTML – what we use to make static webpages
• XSLT – a language to transform XML data into other data
• SQL – a language to query information from databases
• Regex – a language to search text

Believe it or not, “DSL” is essentially a synonym for “declarative programming language”.

HTML for instance completely cuts out things like loops, memory allocation and image loading, and lets you just specify how a web page should look. HTML is declarative because you deal only with the issues in the problem space, not with the imperative details of how to make it all happen.

It’s similar for the others in the list, and other DSLs not on the list. They all try to remove complexity you don’t care about to try and distill a language that deals only with the things in the problem space.

## Our Job As Programmers

As programmers, only one part of our job is to make “good code” that is easy to read and easy to maintain – and many non programmers would laugh to hear that we spend so much time thinking about that.

The other part of our job is the end result of what our program does. This is what non programmers focus more heavily on of course, and is ultimately what makes software successful or not – at least in the short term. Software needs to do good stuff well to be successful, but if you don’t make good code, you are going to sink your business in bugs, inflexibility, maintenance costs, etc.

Programmers mainly either write code for use by other programmers (such as libraries and APIs), or they make software for use by other people.

In both cases, the job is really that we are trying to hide away imperative details (implementation complexity) and give our customers a domain specific language to do what they want to do in the easiest and best way possible. It’s very important in both cases that the users of your API or the users of your software don’t have to deal with things outside the problem space. They want to work declaratively, saying only what to do, and have our software handle the imperative details of how to do it. They paid for the software so they didn’t have to deal with those details.

As an example, when you work in an excel spreadsheet and it does an average of a row of columns, it doesn’t make you decide whether it should use SIMD instructions to do the math or scalar instructions. You don’t really care, and it almost certainly doesn’t matter enough to anyone using excel which to do, so excel just does whatever it does internally to give you what you asked for.

It can be a challenge knowing what needs to be hidden away when making an API or software for users, but that comes from understanding what it is that your customers actually need and what they are trying to do, which is already a super important step.

The good news is that you don’t have to perfectly match the customers needs to improve their lives. Any imperative detail that you can remove is a win. I’m not talking about taking away abilities that people want and should have, I’m talking about removing “chores”, especially ones that if done wrong can cause problems – like nulling out a pointer after deleting it, or initializing all members of a class when creating an object, or the details of loading an image into memory.

None of this should really be that surprising to anyone, but hopefully thinking of these things in a declarative vs imperative way formalizes and generalizes some ideas.

## Why Wouldn’t You Program Declaratively?

Purely declarative programming means that you only describe the things you care about and nothing else. If this is true, why wouldn’t you ALWAYS program declaratively? In fact, why do imperative languages even exist? Why would you ever want to waste time dealing with what you by definition did not care about?

Well, for one, it’s impossible to nail down what it is exactly that people do and do not care about, especially in something like a programming language which is likely to be used for lots of different purposes by lots of different people. It’s been real eye opening seeing the diverse needs of the C++ community in recent years for instance. As a C++ game programmer, surrounded by primarily C++ game programmers, I thought I knew what the language needed, but there are lots of things I never considered because I don’t have a need for them, unlike some other C++ programmers out there.

Another big point is that declarative languages by definition are a sort of black box. You tell it what to do but not how. It has to figure out the details of how to do it in a good way. The problem is that the compiler (or similar process) has limited abilities to make these choices, and also has limited information about the problem space.

For instance, a declarative language may let you work with a set and say “put item X into the set” and “does item Y exist in this set?”. You can imagine it could perhaps use a hash table, where each hash bucket was a linked list of values. This way, when you queried if the item Y was in the set, it could hash it, then do comparisons against whatever items were in that bucket.

That implementation is fairly reasonable for many programs.

What if instead, you want to keep a set of unique visitors to a popular website, like say google.com? That set is going to use a ton of memory and/or be very slow because it’s going to be HUGE.

In that case, someone is likely to go with a probabilistic algorithm perhaps (maybe a bloom filter), where it’s ok that the answer isn’t exactly right, because the memory usage and computation time drops off significantly going probabilistic, and actually makes the feature possible.

The declarative language is likely not going to be smart enough to figure out that it should use a probabilistic algorithm, and nobody ever told it that it could.

Sure, you could add probabilistic set support to the declarative language, and people could specifically ask for it when they need it (they care about it, so it should be part of the DSL), but we could make this argument about many other things. The point is just that without super smart AI and lots more information (and freedom to make decisions independently of humans), a declarative language is going to be pretty limited in how well it can do in all situations.

Because of this, it’s important for the programmer to be able to profile processing time and other resource usage, and be able to “go imperative” where needed to address any problems that come up.

This is similar to how when writing C++, when we REALLY NEED TO, we can write some inline assembly. The C++ is the more declarative language, that allows you to write more imperative assembly when you need to.

It’s important to note that I’m not saying that declarative programming is inherently slower than imperative programming though. Declarative languages can actually be much faster and more efficient with resources believe it or not. In the example at the beginning of the post where I used std::rotate to replace a loop that moved items in an array, it’s possible that std::rotate uses a memmove to move the bulk of the items, instead of an item by item copy like what I coded. That would be a much better solution, especially for large array sizes.

So, declarative programming isn’t necessarily slower than imperative programming, but, for the times it isn’t doing well enough, we need a way to turn off “auto pilot” mode and give imperative instructions for how to do something better.

In more human terms: If you asked someone to go get the mail, you likely will say “can you get my mail? here’s the key and it’s in box 62.”. You wouldn’t tell the person how to walk to the door, open it, walk out, close it, etc. However, if there were special instructions such as “please check the package locker too”, you would give those details.

Basically, you give only the details that are needed, as simply as possible, but reserve the right to give as fine grained details as needed, when they are needed.

So, I propose this:

• We as programmers ought to be programming declaratively by default, only resorting to imperative programming when we need to.
• Our job is to empower our customers to work declaratively by making good DSLs (aka interfaces) for them, but we should remember that it might be important to let them also go more imperative when needed.

Here are some interesting links about managing code complexity and writing high quality code:
Functions Should Be Short And Sweet, But Why?
Bitsquid: Managing Coupling
Thoughts on Declarative and Imperative Languages
Declarative vs. Imperative Programming for the Web

# Low Tech Homomorphic Encryption

Homomorphic encryption is a special type of encryption that lets you do calculations on encrypted values as if they weren’t encrypted. One reason it’s desired is that secure computing could be done in the cloud, if practical homomorphic encryption were available.

Homomorphic encryption has been a hot research topic since 2009, when Craig Gentry figured out a way to do it while working on his PhD. Since then, people have been working on making it better, faster and more efficient.

You can read more about a basic incarnation of his ideas in my blog posts:
Super Simple Symmetric Leveled Homomorphic Encryption Implementation
Improving the Security of the Super Simple Symmetric Leveled Homomorphic Encryption Implementation

This post is about a low tech type of homomorphic encryption that anyone can easily do and understand. There is also some very simple C++ that implements it.

This idea may very well be publicly known, but I’m not aware of any places that talk about it. I may just be ignorant of them though so ::shrug::

## Quick Update

I’ve gotten some feedback on this article, the most frequent being that this is obfuscation, not encryption. I think that’s a fair assessment, as the secret value you are trying to protect is in no way transformed, but is just hidden. This post could easily be titled Homomorphic Obfuscation, and perhaps should be.

To see other feedback and responses to this post, check out the reddit links at the bottom!

## The Idea

The idea is actually super simple:

1. Take the value you want to encrypt.
2. Hide it in a list of a bunch of other random values, and remember where it is in the list. The position in the list is your key.
3. Send this list to an untrusted party.
4. They do the same calculation on every item in the list and send it back.
5. Since you know which value was your secret value, you know which answer is the one you care about.

At the end of that process, you have the resulting value, and they have no idea what value was your secret value. You have done, by definition, homomorphic encryption!

There is a caveat of course… they know that your secret value was ONE of the values on the list.

## Security Details

The thing here is that this scheme is a sliding scale between resource usage (computation time, RAM, network bandwidth, etc) and security.

The list size is your security parameter in this case.

A larger list of random values means that it takes longer to transfer the data, more memory to store it, it takes longer to do the homomorphic computations, but the untrusted party is less sure about what your secret value is.

On the other hand, a shorter list is faster to transmit, easier to store, quicker to compute with, but the untrusted party has a better idea what your secret value is.

For maximal security you can just take this to the extreme – if your secret value is a 32 bit floating point number, you could make a list with all possible 2^32 floating point numbers in it, have them do the calculation and send it back. You can even do an optimization here and not even generate or send the list, but rather just have the person doing the calculations generate the full 2^32 float list, do the calculations, and send you the results.

That gets pretty big pretty fast though. That list would actually be 16 gigabytes, but the untrusted party would know almost nothing about your value, other than it can be represented by a 32 bit floating point number.

Depending on your security needs, you might be ok with shortening your list a bit to bring that number down. Making your list only be one million numbers long (999,999 random numbers and one value you actually care about), your list is only 3.8 megabytes.

## Some Interesting Abilities

Using this homomorphic encryption, like other homomorphic encryption, you can do computation involving multiple encrypted values. AKA you could multiply two encrypted values together. To do this, you are going to need to encrypt all values involved using the same key. In other words, they are going to have to be at the same index in each of their respective lists of random numbers.

Something else that is interesting is that you can also encode MULTIPLE secret values in your encrypted value list. You could have 1 secret value at index 50 and another at index 100 for instance. Doing this, you get a sort of homomorphic SIMD setup.

Homomorphic SIMD is actually a real thing in other homomorphic encryption methods as well. Check out this paper for instance:
Fully Homomorphic SIMD Operations

The only problem with homomorphic SIMD is that adding more secret values to the same encrypted list decreases the security, since there are more values in the list that you don’t want other people to know about.

You can of course also modify encrypted values by unencrypted values. You could multiply an encrypted value by 3, by multiplying every value in the list by 3.

## Extending to Public Key Cryptography

If you wanted to use asymmetric key cryptography (public/private keys) instead of symmetric key cryptography, that is doable here too.

What you would do is have the public key public as per usual, and that key would be used in a public key algorithm to encrypt the index of the secret value in the random list.

Doing this, the person who has the private key would be able to receive the list and encrypted index, decrypt the index, and then get the secret value out using that index.

## Sample Code Tests

The sample code only does Symmetric key encryption, and does these 3 tests:

1. Encrypts two floating point numbers into a single list, SIMD style, does an operation on the encrypted values, then decrypts and verifies the results.
2. Does the same with two sets of floats (three floats in each set), to show how you can make encrypted values interact with each other. Does the operation, then decrypts and verifies the results.
3. Encrypts three values of a 3 byte structure, does an operation on the encrypted values, then decrypts and verifies the results.

All secret data was hidden in lists of 10,000,000 random values. That made the first two tests (the ones done with 4 byte floats) have encrypted files of 38.1MB (40,000,000 bytes), and the last test (the one done with a 3 byte struct) had a file size of 28.6 MB (30,000,000 bytes).

Here are the timings of the above tests:

## Sample Code

/*

Written by Alan Wolfe
http://blog.demofox.org
Tweets by Atrix256

*/

#pragma once
#include <vector>
#include <random>
#include <algorithm>
#include <cstdio>
#include <cstdint>
#include <limits>

// A static class with template functions in it.
// A namespace would be nice, except I want to hide some things as private.
class LTHE
{
public:

//=================================================================================
template <typename T>
static bool Encrypt (std::vector<T> values, size_t listSize, const char* fileName, std::vector<size_t>& keys, bool generateKeys = true)
{
// Make sure we have a list that is at least as long as the values we want to encrypt
if (values.size() > listSize)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): values.size() > listSize.\n");
return false;
}

// Generate a list of keys if we are told to
// Ideally you want to take the first M items of a cryptographically secure shuffle
// of N items.
// This could be done with format preserving encryption or some other method
// to make it not roll and check, and also more secure random.
if (generateKeys)
{
keys.clear();
for (size_t i = 0, c = values.size(); i < c; ++i)
{
size_t newKey;
do
{
newKey = RandomInt<size_t>(0, listSize - 1);
}
while (std::find(keys.begin(), keys.end(), newKey) != keys.end());
keys.push_back(newKey);
}
}

// make a file of random values, size of T, count of <listSize>
FILE *file = fopen(fileName, "w+b");
if (!file)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.\n", fileName);
return false;
}

// Note: this may not be the most efficient way to generate this much random data or
// write it all to the file.
// In a real crypto usage case, you'd want a crypto secure random number generator.
// You'd also want to make sure the random numbers had the same properties as your
// input values to help anonymize them better.
// Like if your numbers are not whole numbers, you don't want to generate only whole numbers.
// Or if your numbers are salaries, you may not want purely random values, but more "salaryish"
// looking numbers.
// You could alternately just do all 2^N possible values which would definitely anonymize
// the values you wanted to encrypt.  This is maximum security, but also takes most
// memory and most processing time.
size_t numUint32s = (listSize * sizeof(T)) / sizeof(uint32_t);
size_t numExtraBytes = (listSize * sizeof(T)) % sizeof(uint32_t);
for (size_t i = 0; i < numUint32s; ++i)
{
uint32_t value = RandomInt<uint32_t>();
if (fwrite(&value, sizeof(value), 1, file) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write random numbers (uint32s).\n");
fclose(file);
return false;
}
}
for (size_t i = 0; i < numExtraBytes; ++i)
{
uint8_t value = RandomInt<uint8_t>();
if (fwrite(&value, sizeof(value), 1, file) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write random numbers (extra bytes).\n");
fclose(file);
return false;
}
}

// Now put the values in the file where they go, based on their key
for (size_t i = 0, c = values.size(); i < c; ++i)
{
long pos = (long)(keys[i] * sizeof(T));
if (fseek(file, pos, SEEK_SET) != 0)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not fseek.\n");
fclose(file);
return false;
}
if (fwrite(&values[i], sizeof(values[i]), 1, file) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write secret value.\n");
fclose(file);
return false;
}
}

// close file and return success
fclose(file);
return true;
}

//=================================================================================
template <typename T, typename LAMBDA>
static bool TransformHomomorphically (const char* srcFileName, const char* destFileName, const LAMBDA& function)
{
// open the source and dest file if we can
FILE *srcFile = fopen(srcFileName, "rb");
if (!srcFile)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.\n", srcFileName);
return false;
}
FILE *destFile = fopen(destFileName, "w+b");
if (!destFile)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.\n", destFileName);
fclose(srcFile);
return false;
}

// Process the data in the file and write it back out.
// This could be done much better.
// We could read more from the file at once.
// We could use SIMD.
// We could do this on the GPU for large data sets and longer transformations! Assuming data transfer time isn't too prohibitive.
// We could decouple the disk access from processing, so it was reading and writing while it was processing.
const size_t c_bufferSize = 1024;
std::vector<T> dataBuffer;
dataBuffer.resize(c_bufferSize);
do
{
// read data from the source file
size_t elementsRead = fread(&dataBuffer[0], sizeof(T), c_bufferSize, srcFile);

// transform the data
for (size_t i = 0; i < elementsRead; ++i)
dataBuffer[i] = function(dataBuffer[i]);

// write the transformed data to the dest file
if (fwrite(&dataBuffer[0], sizeof(T), elementsRead, destFile) != elementsRead)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write transformed elements.\n");
fclose(srcFile);
fclose(destFile);
return false;
}
}
while (!feof(srcFile));

// close files and return success
fclose(srcFile);
fclose(destFile);
return true;
}

//=================================================================================
template <typename T, typename LAMBDA>
static bool TransformHomomorphically (const char* src1FileName, const char* src2FileName, const char* destFileName, const LAMBDA& function)
{
// open the source and dest file if we can
FILE *srcFile1 = fopen(src1FileName, "rb");
if (!srcFile1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.\n", src1FileName);
return false;
}
FILE *srcFile2 = fopen(src2FileName, "rb");
if (!srcFile2)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.\n", src2FileName);
fclose(srcFile1);
return false;
}
FILE *destFile = fopen(destFileName, "w+b");
if (!destFile)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.\n", destFileName);
fclose(srcFile1);
fclose(srcFile2);
return false;
}

// Process the data in the file and write it back out.
// This could be done much better.
// We could read more from the file at once.
// We could use SIMD.
// We could do this on the GPU for large data sets and longer transformations! Assuming data transfer time isn't too prohibitive.
// We could decouple the disk access from processing, so it was reading and writing while it was processing.
const size_t c_bufferSize = 1024;
std::vector<T> dataBuffer1, dataBuffer2;
dataBuffer1.resize(c_bufferSize);
dataBuffer2.resize(c_bufferSize);
do
{
// read data from the source files
size_t elementsRead1 = fread(&dataBuffer1[0], sizeof(T), c_bufferSize, srcFile1);
size_t elementsRead2 = fread(&dataBuffer2[0], sizeof(T), c_bufferSize, srcFile2);
if (elementsRead1 != elementsRead2)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Different numbers of elements in each file!\n");
fclose(srcFile1);
fclose(srcFile2);
fclose(destFile);
return false;
}

// transform the data
for (size_t i = 0; i < elementsRead1; ++i)
dataBuffer1[i] = function(dataBuffer1[i], dataBuffer2[i]);

// write the transformed data to the dest file
if (fwrite(&dataBuffer1[0], sizeof(T), elementsRead1, destFile) != elementsRead1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write transformed elements.\n");
fclose(srcFile1);
fclose(srcFile2);
fclose(destFile);
return false;
}
}
while (!feof(srcFile1));

// close files and return success
fclose(srcFile1);
fclose(srcFile2);
fclose(destFile);
return true;
}

//=================================================================================
template <typename T>
static bool Decrypt (const char* fileName, std::vector<T>& values, std::vector<size_t>& keys)
{
// Open the file if we can
FILE *file = fopen(fileName, "rb");
if (!file)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.\n", fileName);
return false;
}

// Read the values from the file.  The key is their location in the file.
values.clear();
for (size_t i = 0, c = keys.size(); i < c; ++i)
{
long pos = (long)(keys[i] * sizeof(T));
if (fseek(file, pos, SEEK_SET) != 0)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not fseek.\n");
fclose(file);
return false;
}
T value;
if (fread(&value, sizeof(T), 1, file) != 1)
{
fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not decrypt value for key.\n");
fclose(file);
return false;
}
values.push_back(value);
}

// Close file and return success
fclose(file);
return true;
}

private:
template <typename T>
static T RandomInt (T min = std::numeric_limits<T>::min(), T max = std::numeric_limits<T>::max())
{
static std::random_device rd;
static std::mt19937 mt(rd());
static std::uniform_int_distribution<T> dist(min, max);
return dist(mt);
}
};


And here is the test program, main.cpp:

#include <stdio.h>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <chrono>
#include "LTHE.h"

//=================================================================================
// times a block of code
struct SBlockTimer
{
SBlockTimer()
{
m_start = std::chrono::high_resolution_clock::now();
}

~SBlockTimer()
{
std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
printf("    %0.2f seconds\n", seconds.count());
}

std::chrono::high_resolution_clock::time_point m_start;
};

//=================================================================================
float TransformDataUnitary (float& value)
{
return (float)sqrt(value * 2.17f + 0.132);
}

//=================================================================================
float TransformDataBinary (float& value1, float value2)
{
return (float)sqrt(value1 * value1 + value2 * value2);
}

//=================================================================================
struct SStruct
{
uint8_t x, y, z;

static SStruct Transform (const SStruct& b)
{
SStruct ret;
ret.x = b.x * 2;
ret.y = b.y * 3;
ret.z = b.z * 4;
return ret;
}

bool operator != (const SStruct& b) const
{
return b.x != x || b.y != y || b.z != z;
}
};

//=================================================================================
int Test_FloatUnitaryOperation ()
{
printf("\n----- " __FUNCTION__ " -----\n");

// Encrypt the data
printf("Encrypting data:  ");
std::vector<float> secretValues = { 3.14159265359f, 435.0f };
std::vector<size_t> keys;
{
SBlockTimer timer;
if (!LTHE::Encrypt(secretValues, 10000000, "Encrypted.dat", keys))
{
fprintf(stderr, "Could not encrypt data.\n");
return -1;
}
}

// Transform the data
printf("Transforming data:");
{
SBlockTimer timer;
if (!LTHE::TransformHomomorphically<float>("Encrypted.dat", "Transformed.dat", TransformDataUnitary))
{
fprintf(stderr, "Could not transform encrypted data.\n");
return -2;
}
}

// Decrypt the data
printf("Decrypting data:  ");
std::vector<float> decryptedValues;
{
SBlockTimer timer;
if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
{
fprintf(stderr, "Could not decrypt data.\n");
return -3;
}
}

// Verify the data
printf("Verifying data:   ");
{
SBlockTimer timer;
for (size_t i = 0, c = secretValues.size(); i < c; ++i)
{
if (TransformDataUnitary(secretValues[i]) != decryptedValues[i])
{
fprintf(stderr, "decrypted value mismatch!\n");
return -4;
}
}
}

return 0;
}

//=================================================================================
int Test_FloatBinaryOperation ()
{
printf("\n----- " __FUNCTION__ " -----\n");

// Encrypt the data
printf("Encrypting data:  ");
std::vector<float> secretValues1 = { 3.14159265359f, 435.0f, 1.0f };
std::vector<float> secretValues2 = { 1.0f, 5.0f, 9.0f };
std::vector<size_t> keys;
{
SBlockTimer timer;
if (!LTHE::Encrypt(secretValues1, 10000000, "Encrypted1.dat", keys))
{
fprintf(stderr, "Could not encrypt data.\n");
return -1;
}
if (!LTHE::Encrypt(secretValues2, 10000000, "Encrypted2.dat", keys, false)) // reuse the keys made for secretValues1
{
fprintf(stderr, "Could not encrypt data.\n");
return -1;
}
}

// Transform the data
printf("Transforming data:");
{
SBlockTimer timer;
if (!LTHE::TransformHomomorphically<float>("Encrypted1.dat", "Encrypted2.dat", "Transformed.dat", TransformDataBinary))
{
fprintf(stderr, "Could not transform encrypted data.\n");
return -2;
}
}

// Decrypt the data
printf("Decrypting data:  ");
std::vector<float> decryptedValues;
{
SBlockTimer timer;
if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
{
fprintf(stderr, "Could not decrypt data.\n");
return -3;
}
}

// Verify the data
printf("Verifying data:   ");
{
SBlockTimer timer;
for (size_t i = 0, c = secretValues1.size(); i < c; ++i)
{
if (TransformDataBinary(secretValues1[i], secretValues2[i]) != decryptedValues[i])
{
fprintf(stderr, "decrypted value mismatch!\n");
return -4;
}
}
}

return 0;
}

//=================================================================================
int Test_StructUnitaryOperation ()
{
printf("\n----- " __FUNCTION__ " -----\n");

// Encrypt the data
printf("Encrypting data:  ");
std::vector<SStruct> secretValues = { {0,1,2},{ 3,4,5 },{ 6,7,8 } };
std::vector<size_t> keys;
{
SBlockTimer timer;
if (!LTHE::Encrypt(secretValues, 10000000, "Encrypted.dat", keys))
{
fprintf(stderr, "Could not encrypt data.\n");
return -1;
}
}

// Transform the data
printf("Transforming data:");
{
SBlockTimer timer;
if (!LTHE::TransformHomomorphically<SStruct>("Encrypted.dat", "Transformed.dat", SStruct::Transform))
{
fprintf(stderr, "Could not transform encrypted data.\n");
return -2;
}
}

// Decrypt the data
printf("Decrypting data:  ");
std::vector<SStruct> decryptedValues;
{
SBlockTimer timer;
if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
{
fprintf(stderr, "Could not decrypt data.\n");
return -3;
}
}

// Verify the data
printf("Verifying data:   ");
{
SBlockTimer timer;
for (size_t i = 0, c = secretValues.size(); i < c; ++i)
{
if (SStruct::Transform(secretValues[i]) != decryptedValues[i])
{
fprintf(stderr, "decrypted value mismatch!\n");
return -4;
}
}
}

return 0;
}

//=================================================================================
int main (int argc, char **argv)
{
// test doing an operation on a single encrypted float
int ret = Test_FloatUnitaryOperation();
if (ret != 0)
{
system("pause");
return ret;
}

// test doing an operation on two encrypted floats
ret = Test_FloatBinaryOperation();
if (ret != 0)
{
system("pause");
return ret;
}

// test doing an operation on a single 3 byte struct
ret = Test_StructUnitaryOperation();
if (ret != 0)
{
system("pause");
return ret;
}

printf("\nAll Tests Passed!\n\n");
system("pause");
return 0;
}


If you found this post interesting or useful, or you have anything to add or talk about, let me know!

Reddit discussion:
r/programming
r/cryptography

# A Data Point for MSVC vs Clang Code Generation

I’m a Windows PC game developer using MSVC 2015 Update 1, working in C++.

More and more, I hear people talk about how great clang is, and that it generates much better code than MSVC, among other niceties.

I have been chalking it up to maybe just being a fad, but keeping my feelers out there to see if I can get some concrete comparative info.

Well, I’ve stumbled on some of that concrete comparative info and want to share it with anyone else who is wondering if clang really is better than MSVC. This is just one data point, but it feels like it’s just the tip of the iceberg.

I saw Jason Turner share an example of this and wasn’t really sure what he was getting at, so I clicked the link to check it out (http://godbolt.org/g/WDpPYq).

It turned out to be very relevant to my interest, because his particular example is comparing a value to a bunch of compile time constants. That is basically the core of what I’ve been looking into with my last few posts asking whether code or data was faster!

This first particular example is comparing a compile time constant to other compile time constants, so the code completely melts away and at runtime just returns the compile time calculated result. That isn’t very interesting, but it is nice to see that clang did so much at compile time. FWIW MSVC was able to do this all at compile time as well, so they are even so far.

What is more interesting is what happens when you test a compile time unknown against compile time constants. Let’s check it out… (https://godbolt.org/g/dKBDSK)

What the assembly does is subtract 1 from the input (it’s unsigned, so 0 wraps around to the max unsigned value), and then compares the result against 5 to know whether it’s in the group or not. Clang realized the numbers were consecutive and so made a nice optimization.

In this case, MSVC did similar in release x64:

#include <initializer_list>

template<typename U, typename ... T>
bool one_of(U&& u, T && ... t)
{
bool match = false;
(void)std::initializer_list<bool>{ (match = match || u == t)... };
return match;
}

int main(int argc, char** argv)
{
00007FF62DB61000  lea         eax,[rcx-1]
00007FF62DB61003  cmp         eax,4
00007FF62DB61006  setbe       al
return one_of(argc, 1, 2, 3, 4, 5);
00007FF62DB61009  movzx       eax,al
}
00007FF62DB6100C  ret


But in x86 release it did a bunch of if/else if/else if’s!

#include <initializer_list>

template<typename U, typename ... T>
bool one_of(U&& u, T && ... t)
{
bool match = false;
(void)std::initializer_list<bool>{ (match = match || u == t)... };
return match;
}

int main(int argc, char** argv)
{
00331002  in          al,dx
return one_of(argc, 1, 2, 3, 4, 5);
00331003  mov         eax,dword ptr [argc]
00331006  cmp         eax,1
00331009  je          main+26h (0331026h)
0033100B  cmp         eax,2
0033100E  je          main+26h (0331026h)
00331010  cmp         eax,3
00331013  je          main+26h (0331026h)
00331015  cmp         eax,4
00331018  je          main+26h (0331026h)
0033101A  cmp         eax,5
0033101D  je          main+26h (0331026h)
0033101F  xor         al,al
00331021  movzx       eax,al
}
00331024  pop         ebp
00331025  ret
return one_of(argc, 1, 2, 3, 4, 5);
00331026  mov         al,1
00331028  movzx       eax,al
}
0033102B  pop         ebp
0033102C  ret


You are probably asking “what does clang do in x86?” Well, it turns out it does the same thing as in x64; it doesn’t fall back to if/else if/else if like MSVC does (proof: add -m32 in godbolt. https://godbolt.org/g/khnrtO). One point to clang!

What if the numbers are not consecutive though? It turns out clang can actually switch to using a binary search! (https://godbolt.org/g/iBkqja)

MSVC on the other hand just does a bunch of if/else if/else if tests, in both x86 release and x64 release.

#include <initializer_list>

template<typename U, typename ... T>
bool one_of(U&& u, T && ... t)
{
bool match = false;
(void)std::initializer_list<bool>{ (match = match || u == t)... };
return match;
}

int main(const int argc, const char *[])
{
00007FF6C05A1000  cmp         ecx,1AB42h
00007FF6C05A1006  je          main+63h (07FF6C05A1063h)
00007FF6C05A1008  cmp         ecx,40Fh
00007FF6C05A100E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1010  cmp         ecx,0B131h
00007FF6C05A1016  je          main+63h (07FF6C05A1063h)
00007FF6C05A1018  cmp         ecx,93BBh
00007FF6C05A101E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1020  cmp         ecx,121Bh
00007FF6C05A1026  je          main+63h (07FF6C05A1063h)
00007FF6C05A1028  cmp         ecx,0EE9h
00007FF6C05A102E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1030  cmp         ecx,0E1Fh
00007FF6C05A1036  je          main+63h (07FF6C05A1063h)
00007FF6C05A1038  cmp         ecx,995h
00007FF6C05A103E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1040  cmp         ecx,5FEh
00007FF6C05A1046  je          main+63h (07FF6C05A1063h)
00007FF6C05A1048  cmp         ecx,5BFh
00007FF6C05A104E  je          main+63h (07FF6C05A1063h)
00007FF6C05A1050  cmp         ecx,5
00007FF6C05A1053  je          main+63h (07FF6C05A1063h)
00007FF6C05A1055  cmp         ecx,0FFFEh
00007FF6C05A105B  je          main+63h (07FF6C05A1063h)
return one_of(argc, 1471, 2453, 3817, 45361, 37819, 109378, 1534, 4635, 1039, 3615, 5, 65534);
00007FF6C05A105D  xor         al,al
00007FF6C05A105F  movzx       eax,al
}
00007FF6C05A1062  ret
return one_of(argc, 1471, 2453, 3817, 45361, 37819, 109378, 1534, 4635, 1039, 3615, 5, 65534);
00007FF6C05A1063  mov         al,1
00007FF6C05A1065  movzx       eax,al
}
00007FF6C05A1068  ret


## Closing

This is just one data point about how clang is better than MSVC, but it is a data point. I’m betting there are more if we looked for them.

This makes me wonder how switch statements do in clang vs MSVC, and also makes me wonder if clang ever uses jump tables or more advanced data structures in either switch statements, or other code that does comparisons against a potentially large number of compile time constants. Those thoughts are driven by the things seen in this article: Something You May Not Know About the Switch Statement in C/C++

The examples above used C++14 level C++ to implement “one_of”. If you can use C++17 level C++, you can also implement it this way, which also does the binary search (Also written by Jason Turner):
https://godbolt.org/g/RZgjRQ

PS wouldn’t it be nice if godbolt supported MSVC, so we could do this sort of analysis on MSVC code? It’s in the works, but unsure when it’ll be available. Apparently licensing isn’t the issue, so let’s hope it comes sooner rather than later! If you want it, maybe ping @mattgodbolt and let him know how much you want that functionality (:

Have any other clang vs MSVC info? If so, I’d love to hear about it!

# Is Code Faster Than Data? Examining Hash Tables

This series of posts is aimed at examining if and how ad hoc code crafted for a specific static (unchanging / constant) data set can run faster than typical generic run time data structures. I think the answer is an obvious “yes, we can often do better”, but these posts explore the details of the problem space and explore how and when we might do better.

The last post explored switch statement performance compared to array access performance.

A switch statement is just a way of telling the compiler how we want to map integral inputs to either some code to run, or some value to return. It’s up to the compiler how to make that happen.

Because compiler makers presumably want to make their compiler generate fast code, it seems like a switch statement should be able to match the speed of an array since a switch statement could very well be implemented as an array when all it is doing is returning different values based on input. Maybe it could even beat an array, in the case of a sparse array, an array with many duplicates, or other situations.

In practice, this doesn’t seem to be the case though, and switch statements are actually quite a bit slower than arrays from the experimentation I’ve done. The main part of the overhead seems to be that it always does a jump (goto) based on the input you are switching on. It can do some intelligent things to find the right location to jump to, but if all you are doing is returning a value, it doesn’t seem smart enough to do a memory read from an array and return that value, instead of doing a jump.

You can read a nice analysis on how switch statements are compiled on the microsoft compiler here: Something You May Not Know About the Switch Statement in C/C++.

Today we are going to be analyzing how hash tables fare against switch statements, arrays, and a few other things.

## Testing Details

I ran these tests in x86/x64 debug/release in visual studio 2015.

I got a list of 100 random words from http://www.randomwordgenerator.com/ and made sure they were all lowercase. I associated an integer value with them, from 1 to 100. My tests are all based on the string being the key and the integer being the value.

I have that data stored/represented in several ways for performing lookups:

1. std::map.
2. std::unordered_map.
3. std::unordered_map using crc32 hash function.
4. std::unordered_map using crc32 hash function modulo 337 and salted with 1147287 to prevent collisions.
5. SwitchValue() switches on crc32 of input string.
6. SwitchValueValidate() switches on crc32 of input string but does a single strcmp to handle possibly invalid input.
7. SwitchValueMinimized() switches on crc32 of input string modulo 337 and salted with 1147287 to prevent collisions.
8. SwitchValueMinimizedValidate() like SwitchValueMinimized() but does a single strcmp to handle possibly invalid input.
9. g_SwitchValueMinimizedArray, the array version of SwitchValueMinimized().
10. g_SwitchValueMinimizedArrayValidate, the array version of SwitchValueMinimizedValidate().
11. BruteForceByStartingLetter() switches on first letter, then brute force strcmp’s words beginning with that letter.
12. BruteForce() brute force strcmp’s all words.

The non validating switch statement functions have an __assume(0) in their default case to remove the overhead of testing for invalid values. This is to make them as fast as possible for the cases when you will only be passing valid values in. If ever that contract was broken, you’d hit undefined behavior, so the performance boost comes at a cost. The Validate versions of the switch functions don’t do this, as they are meant to take possibly invalid input in, and handle it gracefully. Both validating and not validating input are common use cases so I wanted to represent both in the performance analysis.

Here are the tests done:

1. In Order – looks up all strings in order and sums the associated values.
2. Shuffled – looks up all strings in random order and sums the associated values.
3. Pre-Hashed Keys In Order – looks up all strings in order and sums the associated values, using pre-hashed keys.
4. Pre-Hashed Keys Shuffled – looks up all strings in random order and sums the associated values, using pre-hashed keys.

The second two tests only apply to the value lookups which can take pre-hashed keys. For instance, g_SwitchValueMinimizedArray can be indexed by a key that was hashed before the program ran, but a std::unordered_map cannot be indexed by a hash value that was calculated in advance.

Each of those tests was done 5,000 times in a row to make performance differences stand out more, and that full amount of time is the time reported. That process was done 50 times to give both an average (a mean) and a standard deviation, to show how much the time samples differed.

The source code for the tests can be found here:
Github: Atrix256/RandomCode/HashVsSwitch

## Results

Here are the results, in milliseconds. The values in parentheses are the standard deviations, which are also in milliseconds.

In Order

Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.

|  | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
| --- | --- | --- | --- | --- |
| std::map | 7036.77 (126.41) | 7070.18 (155.49) | 33.02 (2.68) | 35.40 (1.43) |
| std::unordered_map | 4235.31 (24.41) | 4261.36 (145.16) | 19.97 (0.45) | 20.12 (0.62) |
| std::unordered_map crc32 | 4236.38 (80.72) | 4275.36 (116.65) | 24.36 (0.47) | 23.47 (0.86) |
| std::unordered_map crc32 minimized | 4034.50 (12.72) | 4323.67 (170.55) | 26.39 (0.50) | 23.68 (0.71) |
| SwitchValue() | 123.28 (0.98) | 144.29 (4.91) | 6.81 (0.30) | 5.47 (0.29) |
| SwitchValueValidate() | 127.59 (1.22) | 147.41 (5.20) | 8.84 (0.35) | 7.99 (0.36) |
| SwitchValueMinimized() | 128.83 (0.95) | 151.48 (4.66) | 8.28 (0.38) | 10.18 (0.37) |
| SwitchValueMinimizedValidate() | 132.44 (1.02) | 159.85 (6.73) | 12.65 (0.40) | 10.89 (0.36) |
| g_SwitchValueMinimizedArray | 104.15 (1.13) | 122.94 (5.98) | 7.68 (0.36) | 6.08 (0.36) |
| g_SwitchValueMinimizedArrayValidate | 107.75 (1.07) | 120.75 (2.80) | 10.49 (0.37) | 8.95 (0.32) |
| BruteForceByStartingLetter() | 19.92 (0.63) | 22.01 (0.86) | 4.85 (0.24) | 5.81 (0.26) |
| BruteForce() | 118.65 (1.09) | 140.20 (2.28) | 31.53 (0.56) | 46.47 (0.83) |

Shuffled

Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation.

|  | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
| --- | --- | --- | --- | --- |
| std::map | 7082.92 (214.13) | 6999.90 (193.82) | 32.14 (0.59) | 34.20 (0.62) |
| std::unordered_map | 4155.85 (133.00) | 4221.84 (124.70) | 20.21 (0.42) | 20.09 (0.47) |
| std::unordered_map crc32 | 4286.44 (95.39) | 4300.81 (64.37) | 24.55 (0.57) | 23.06 (0.57) |
| std::unordered_map crc32 minimized | 4186.27 (75.35) | 4111.73 (43.36) | 26.36 (0.56) | 23.65 (0.54) |
| SwitchValue() | 127.93 (3.85) | 137.63 (1.31) | 6.97 (0.32) | 5.47 (0.27) |
| SwitchValueValidate() | 131.46 (2.34) | 141.38 (1.47) | 8.92 (0.38) | 7.86 (0.37) |
| SwitchValueMinimized() | 133.03 (2.93) | 145.74 (1.50) | 9.16 (0.37) | 10.50 (0.41) |
| SwitchValueMinimizedValidate() | 135.47 (2.27) | 151.58 (1.48) | 12.13 (0.40) | 10.13 (0.43) |
| g_SwitchValueMinimizedArray | 106.38 (2.70) | 118.61 (3.73) | 8.18 (0.31) | 5.19 (0.29) |
| g_SwitchValueMinimizedArrayValidate | 109.32 (2.34) | 120.94 (3.02) | 10.49 (0.55) | 9.00 (0.40) |
| BruteForceByStartingLetter() | 20.45 (0.92) | 21.64 (0.76) | 4.90 (0.31) | 5.87 (0.32) |
| BruteForce() | 120.70 (2.16) | 140.95 (1.71) | 32.50 (0.47) | 45.90 (0.79) |

Pre-hashed In Order

Look up all strings in sequential order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.

|  | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
| --- | --- | --- | --- | --- |
| SwitchValue() | 12.49 (0.61) | 13.23 (0.37) | 1.94 (0.17) | 1.81 (0.12) |
| SwitchValueValidate() | 17.08 (1.06) | 16.72 (0.57) | 4.32 (0.30) | 4.05 (0.21) |
| SwitchValueMinimized() | 11.83 (0.69) | 12.06 (0.51) | 1.29 (0.13) | 1.58 (0.17) |
| SwitchValueMinimizedValidate() | 16.02 (0.84) | 15.84 (0.66) | 3.25 (0.24) | 3.47 (0.27) |
| g_SwitchValueMinimizedArray | 1.23 (0.06) | 1.15 (0.10) | 0.00 (0.00) | 0.00 (0.00) |
| g_SwitchValueMinimizedArrayValidate | 4.21 (0.32) | 2.99 (0.20) | 2.45 (0.17) | 2.66 (0.20) |

Pre-hashed Shuffled

Look up all strings in random order and sum the associated values. Repeat 5,000 times to get a timing sample. Take 50 timing samples and report average and std deviation. Uses pre-hashed keys for lookups.

|  | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
| --- | --- | --- | --- | --- |
| SwitchValue() | 12.96 (1.37) | 13.45 (0.47) | 1.84 (0.11) | 1.81 (0.16) |
| SwitchValueValidate() | 16.27 (2.01) | 16.57 (0.63) | 2.65 (0.19) | 2.85 (0.17) |
| SwitchValueMinimized() | 11.75 (0.63) | 12.15 (0.45) | 1.07 (0.07) | 1.06 (0.11) |
| SwitchValueMinimizedValidate() | 16.44 (0.99) | 16.33 (0.58) | 3.43 (0.18) | 3.41 (0.22) |
| g_SwitchValueMinimizedArray | 1.13 (0.06) | 1.18 (0.10) | 0.32 (0.05) | 0.31 (0.04) |
| g_SwitchValueMinimizedArrayValidate | 4.50 (0.32) | 3.31 (0.18) | 2.82 (0.16) | 3.29 (0.18) |

## Observations

There’s a lot of data, but here’s the things I found most interesting or relevant to what I’m looking at (generic data structures vs ad hoc code for data).

Tests 1 and 2

std::map and std::unordered_map are very, very slow in debug, as you might expect. It would be interesting to look deeper and see what it is that they are doing in debug to slow them down so much.

There is some tribal knowledge in the C++ world that says to not use std::map and to use std::unordered_map instead, but I was surprised to see just how slow std::map was. In x64 release, std::map took about 75% of the time that brute force did, and in win32 release, it took the same time or was slower! std::map isn’t hash based; you give it a comparison function that says whether one key is less than another, and it keeps the keys ordered. Even so, you have to wonder how the heck the algorithm can be so slow that brute force is a comparable replacement for lookup times!

It’s interesting to see that everything I tried (except brute force) was significantly faster than both std::map and std::unordered_map. That saddens me a little bit, but to be fair, the usage case I’m going after is a static data structure that has fast lookup speeds, which isn’t what unordered_map aims to solve. This just goes to show that yes, if you have static data that you want fast lookup times for, making ad hoc code or rolling your own read only data structure can give you significant performance wins, and can also help with memory issues (fragmentation, and wasted allocations that will never be used).

It was surprising to see that switching on the first letter and brute forcing the strings with the same first letter did so well. That is one of the faster results, competing with SwitchValue() for top dog. The interesting thing though is that BruteForceByStartingLetter() gracefully handles invalid input, while SwitchValue() does not and has undefined behavior, so another point goes to BruteForceByStartingLetter().

Tests 3 and 4

These tests were done with pre-hashed keys to simulate an ideal setup.

If you have a static key to value data structure and have the ability to make ad hoc code for your specific static data, chances are pretty good that you’ll also be able to pre-hash whatever keys you are going to be looking up so you don’t have to hash them at run time. Also, if you are doing multiple lookups with a single key for some reason, you may opt to calculate the hash only on the first lookup, and then from there re-use the hashed key.

These tests simulated those situations.

As expected, the perf results on these tests are much better than those that hash the key on demand for each lookup. Less work done at runtime means better performance.

Based on the results of the last blog post – that array lookups are super fast – you probably aren’t surprised to see that g_SwitchValueMinimizedArray is the winner for performance by far.

It is so fast that the in order case doesn’t even register any time having been taken. This is probably a little misleading, because the in order tests (and even the shuffled tests) are very cache friendly. In reality, you would probably have more cache misses and it wouldn’t be quite as cheap as what is being reported, but it would still be super fast compared to the other options.

In second place comes SwitchValueMinimized() which is the switch statement function version of g_SwitchValueMinimizedArray. Arrays still beat switch statements, as we found out in the last post!

In third place comes SwitchValue(), which is the same as SwitchValueMinimized() but has sparser values used in the switch statement, which make it more difficult for the compiler to generate efficient code. For instance, having the full range of 32 bits as case statement values, and having them all be pseudo random numbers (because they are the result of a hash!) rules out the ability for the compiler to make a jump table array, or find any patterns in the numbers. The SwitchValueMinimized() function on the other hand has only 337 possible values, and so even though the values are sparse (there are 100 items in those 337 possible values), it’s a small enough number that a jump table array could be used without issues.

After that comes all the validated versions of the tests. It makes sense that they would be slower, because they do all the same work, and then some additional work (strcmp) to ensure that the input is valid.

## Getting The Fastest Results

If you have some static data that maps keys to values, and you need it to be fast for doing lookups, it looks like writing something custom is the way to go.

The absolutely fastest way to do it is to make an array out of your data items and then pre-process (or compile time process) any places that do a lookup, to convert keys to array indices. Then, at run time, you only need to do an array lookup at a known index to get your data, which is super fast. If your data has duplicates, you might also be able to make the keys which point at duplicate data just point at the same array index, to de-duplicate your data.

If doing that is too complex, or too much work, a low tech and low effort way to handle the problem seems to be to break your data up into buckets, possibly based on their first letter, and then doing brute force (or something else) to do the lookup among the fewer number of items.

In fact, that second method is sort of like a hard coded trie which is only two levels deep.

If you needed to do some hashing at runtime, finding a faster hash function (that also worked in constexpr, or at least had a constexpr version!) could help you get closer to the pre-hashed keys timings. The good news is the hash doesn’t have to be particularly good. It just has to be fast and have no collisions for the input values you wish to use. That seems like something where brute force searching simple hash functions with various salt values may give you the ideal solution, but probably would take a very, very long time to find what you are looking for. You might notice that the default hash used for std::unordered_map is actually faster than the crc32 implementation I used.

Of course, we also learned what NOT to do. Don’t use brute force, and don’t use std::map. Using std::unordered_map isn’t super awful compared to those solutions, but you can do a lot better if you want to.

## Why This?

This fast key to value lookup might sound like a contrived need to some people, but in game development (I am a game developer!), there is commonly the concept of a game database, where you look up data about things (how much damage does this unit do?) by looking up a structure based on a unique ID that is a string, named by a human. So, in game dev, which also has high performance needs, optimizing this usage case can be very helpful. There is a little bit more talk about game data needs here: Game Development Needs Data Pipeline Middleware.

## Is Code Faster Than Data?

I still think ad hoc code for data structures can often be faster than generic data structures, and the experiments on this post have positive indications of that.

Another way I think ad hoc code could be helpful is when you have hierarchical and/or heterogeneous data structures. By that I mean data structures which have multiple levels, where each level may actually have different needs for how best to look up data in it, and in fact, even siblings on the same level may have different needs for how best to look up data in them.

In these cases, you could make some data types which had virtual functions to handle the nature of the data needing different solutions at each level, but those virtual function calls and abstractions add up.

I think it’d be superior to have hard coded code that says “oh, you want index 4 of the root array? ok, that means you are going to binary search this list next”. Of course, that code needs to be generated by another program to be effective. If a human has to make sure all that code stays up to date, it’ll be a ton of work, and it’ll be broken, making very subtle hard to reproduce bugs.

A downside I can see to ad hoc code solutions is possibly thrashing the instruction cache more. I’m not sure if that’d be an issue in practice; it’d be interesting to try more complex data structures and see how it goes.

Also, it might be difficult to have good heuristics to figure out what is best in which situations. I could see a utility possibly generating different variations of code and running them to see which was most performant. It seems like it’d be a challenge to get 100% right all the time, but our experiments make it seem like it’d be easy to do significantly better than generic algorithms which are meant to be dynamic at runtime.

I also think that more complex data structures are more likely to benefit from having custom code made for them, and simple ones less so. It’s hard to beat an array lookup. That being said, the unbeatable simple data structures make great building blocks for the more complex ones (;

It probably would also be good to look into memory usage a bit more to see how ad hoc code compares to generic algorithms. If ad hoc code is much faster but uses more memory, that’d have to be a conscious decision to make when weighing the trade offs.

Maybe in the future, the C++ standards will allow for static data structure types that you have to initialize with compile time constants (allowing constexpr), that are optimized for lookup times since they can’t have any inserts or deletes? I wonder how much demand there would be for that?

Here’s a good read on some other string friendly data structures:
Data Structures for Strings

Twitter user @ores brought up two interesting points:

1. It would be interesting to see how gperf performs in this situation. If it makes a faster minimal perfect hash function, it’ll get us closer to the pre-hashed keys timings.
2. It would be interesting to time scripting languages to see if for them code is faster than data or not. Another interesting aspect of this would be to look at a JIT compiled scripting language like lua-jit. The thing that makes JIT interesting is that it can compile for your specific CPU, instead of having to compile for a more generic set of CPU features. That gives it the opportunity to make code which will perform better on your specific machine.

# Is Code Faster Than Data? Switch Statements vs. Arrays

The last post explored compile time hashing a bit and at the end showed a way to use compile time code to assist a string to enum function, by having a minimal perfect hash that was guaranteed at compile time not to have any hash collisions.

That was pretty interesting to me, and made me wonder what else was possible with compile time (or compile time assisted) data structures. This would be mainly for static data structures that were read only. We could get creative and allow for deletes, inserts and data edits, but we’ll probably come back to that in a future post.

In pursuing other compile time data structures, there is one big question that would be easy to make assumptions about: Is custom code for compile time data structure lookup functionality faster than data?

In other words, could we hard code a hash table key to value lookup that was more performant than a general case hash table which was meant to be used by any data?

The answer to this is “usually it can be” IMO, since we could do lots of things specific to the data, like finding cheaper / better hash functions for the specific data input we were expecting. Since we aren’t doing inserts or deletes, or data modification, we could also allocate just the right amount of memory, in one shot, which would reduce memory waste and memory fragmentation.

But let’s do some experiments and see. Let’s start off by comparing how a switch statement performs compared to some simple array types.

## Testing Details

I ran these tests in x86/x64 debug/release in visual studio 2015.

I made 5000 random uint32’s and stored them in these containers:

• std::vector
• std::array
• C style array
• dynamic memory
• Switch statement function
• Switch statement function with __assume(0) in default case

The __assume(0) in the switch default case is a hint to the optimizer to not do the usual bounds checking needed for handling a value not present in the switch statement. It’s Microsoft specific, but there are equivalents in other compilers (such as __builtin_unreachable in gcc and clang). Note that if you switch on a value that would usually be handled by the default case, you get undefined behavior. It’s a speed up, but comes at a price. You can read about it here: msdn: __assume. This test code never passes values that would trigger the default case, so it is arguably a good candidate for __assume, if you believe there are any good usages for __assume.

For a neat read on some details of undefined behavior check this out: Undefined behavior can result in time travel (among other things, but time travel is the funkiest)

I did the following tests.

• Sequential Sum: Add up the values at the indices 0-4999, in order. For the switch function, it was called with indices 0-4999 in order
• Shuffle Sum: Same as sequential sum, but in a shuffled order.
• Sparse Sum: Same as shuffle sum, but only doing the first 1000 indices. An alternate switch function was used which only contained values for those first 1000 indices, simulating the ability for us to strip out values never actually needed, to see if that has an impact on performance.

Timing was measured using std::chrono::high_resolution_clock, timing each test done 1,000 times in a row to make timing differences more apparent. I did this 50 times for each test to get an average and a standard deviation for those 50 samples.

The source code for the tests is at:
Github: Atrix256/RandomCode/ArrayAccessSpeeds

## Results

The results are below. Times are in milliseconds. The number in parentheses is the standard deviation, also in milliseconds.

Sequential Sum: Sum 5,000 numbers 1,000 times, in order.

|  | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
| --- | --- | --- | --- | --- |
| std::vector | 177.16 (4.68) | 57.87 (6.48) | 1.23 (0.25) | 0.30 (0.46) |
| std::array | 94.53 (5.84) | 53.61 (3.96) | 1.25 (0.29) | 0.34 (0.48) |
| C array | 8.67 (0.48) | 9.90 (0.57) | 1.22 (0.37) | 0.30 (0.46) |
| dynamic memory | 8.96 (0.29) | 10.01 (0.77) | 1.27 (0.40) | 0.30 (0.46) |
| switch function | 148.50 (3.48) | 115.01 (4.83) | 47.53 (3.86) | 43.53 (1.75) |
| switch function assume 0 | 135.96 (1.22) | 95.54 (2.83) | 46.33 (0.97) | 38.91 (0.72) |

Shuffle Sum: Sum 5,000 numbers 1,000 times, in random order.

|  | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
| --- | --- | --- | --- | --- |
| std::vector | 179.82 (8.13) | 46.75 (1.05) | 2.78 (0.42) | 1.41 (0.19) |
| std::array | 89.94 (1.23) | 39.47 (1.07) | 2.58 (0.29) | 1.42 (0.19) |
| C array | 8.59 (0.41) | 8.63 (0.49) | 2.55 (0.25) | 1.39 (0.21) |
| dynamic memory | 8.23 (0.26) | 8.41 (0.50) | 2.73 (0.25) | 1.40 (0.20) |
| switch function | 151.83 (0.95) | 116.37 (16.91) | 55.74 (1.97) | 48.20 (0.56) |
| switch function assume 0 | 142.07 (0.97) | 103.01 (11.17) | 55.45 (2.00) | 42.09 (0.79) |

Sparse Sum: Sum 1,000 of the 5,000 numbers 1,000 times, in random order.

|  | Debug Win32 | Debug x64 | Release Win32 | Release x64 |
| --- | --- | --- | --- | --- |
| std::vector | 34.86 (0.46) | 9.35 (0.48) | 0.54 (0.14) | 0.26 (0.25) |
| std::array | 17.88 (0.45) | 7.93 (0.27) | 0.52 (0.10) | 0.25 (0.25) |
| C array | 1.72 (0.39) | 1.70 (0.46) | 0.52 (0.10) | 0.24 (0.25) |
| dynamic memory | 1.67 (0.45) | 1.70 (0.35) | 0.55 (0.18) | 0.26 (0.25) |
| switch function | 29.10 (0.49) | 18.88 (0.92) | 13.42 (0.43) | 9.51 (0.43) |
| switch function assume 0 | 26.74 (0.52) | 18.19 (0.39) | 12.15 (0.50) | 9.23 (0.42) |

## Observations

It’s interesting to see that std::array is 10 times slower than a c array or dynamic memory, in debug win32, and that std::vector is 20 times slower. In debug x64 std::array is only about 5 times slower, and std::vector is only a little bit slower than std::array. In release, std::vector and std::array are more in line with the cost of a c array, or dynamic memory.

It’s expected that std::array and std::vector are slower in debug because they do things like check indices for being out of bounds. They also have a lot more code associated with them which may not “optimize away” in debug, like it does in release (including inlining functions). It’s nice to see that they become basically free abstractions in release though. It gives us more confidence that they can be used in performance critical areas without too much extra worry, as long as we are ok with the debug slowdowns and want or need the debug functionality the types give us.

It’s also interesting to see that adding the __assume(0) to the default case in the switch statements actually improves performance a measurable amount in most tests.

The switch function IS NOT faster than the simple array access, and in fact is quite a bit slower!

Why is it so much slower?

One reason is the function call overhead associated with each lookup. When I marked the function as inline, the compiler refused to inline it. When I marked it as __forceinline to force it to inline, the compiler took a very long time in the “generating code” phase; I left it for 10 minutes, it still wasn’t done, so I killed it. I tried a bit to get an apples to apples comparison by putting the test code into the function, but the optimizer ended up realizing it could run the function once and multiply its result by the number of test repeats I wanted to do. Trying to fool the optimizer meant doing things that weren’t quite an apples to apples comparison anymore, and made the results get slower. So, I’m leaving it at this: in the release assembly, you can see the function call to the switch function, where you don’t have one with e.g. dynamic memory. Function calls are not free, so no, the version with a function call is not faster than the version without one!

The second reason is how switch statements work compared to how arrays work. The switch statement first does a comparison to see if the value is within range of the maximum valued case (in other words, it does a range check, like what std::array and std::vector do in debug! It does this in debug and release both.), and then it does a jmp to a location of code where it then does a return with a constant value.

An array on the other hand just calculates the offset of a memory value based on the index passed in and the object size, and then reads that memory address.

Arrays have more direct access to the values than switch statements do.

It’s true that the switch statements could be optimized to behave more like arrays, and the whole function could go away in lieu of a memory lookup, but the optimizer doesn’t make that optimization here.

Here’s a good read on details of how switch statements are actually implemented in msvc under various conditions:
Something You May Not Know About the Switch Statement in C/C++

On the memory usage front, the simplest array (not worrying about things like geometric dynamic growth policies or memory padding) takes 20,000 bytes to store 5000 uint32’s.

In win32 (didn’t try x64 but I’m sure it’s similar), gutting the switch functions to only have a single case made a huge difference in executable size which shows us that switch statements are bad for memory efficiency too:
Debug went from 340 KB (348,160 bytes) to 244 KB (249,856 bytes).
Release went from 191 KB (196,096 bytes) to 133 KB (136,192 bytes).

## More Questions

I’d like to see how switch statements compare to hash tables. I think switch statements will still be slower, but it would be good to investigate the details and confirm.

This also makes me think that the “String To Enum” code in the last post, which makes use of a switch statement, is probably not the ideal code for doing that, as far as performance and memory usage go.

I think that instead of being switch statement based, it should probably be array based.

However, the idea that we can have compile time assurance that our hash has no collisions is really nice and we get that from doing case statements on run time hashes of our full set of valid inputs. It would be nice to find a way to leverage that while still getting the speed we want, as automatically as possible. At worst, code generation could be used, but it would be nicer to do something less rigid / heavy handed.

Lastly, I still believe that “ad hoc” custom fit code for specific data lookups have the best potential for performance. I want to keep looking into it to see about finding some good avenues for success.

Have any info to share? Please do!

Thanks for reading, and I hope you find this useful, or at least interesting.

# Exploring Compile Time Hashing

Never put off til run time what can be done at compile time.

C++11 gave us constexpr which lets us make C++ code that the compiler can run during compilation, instead of at runtime.

This is great because now we can use C++ to do some things that we previously had to use macros or templates for.

As with many of the newish C++ features, it feels like there are some rough edges to work out with constexpr, but it adds a lot of new exciting capabilities.

In this post we will explore some new possibilities, and get a better feel for this new world of compile time code execution. There are going to be some unexpected surprises along the way 😛

## Testing Details

The code in this post will be using some compile time crc code I found at Github Gist: oktal/compile-time-crc32.cc. I haven’t tested it for correctness or speed, but it serves the purpose of being a compile time hash implementation that allows us to explore things a bit.
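I won’t reproduce the gist’s crc32 here, but to make the mechanics concrete, here is a minimal constexpr string hash. This is 32 bit FNV-1a, used purely as a stand-in: the hash values appearing in this post’s assembly come from the gist’s crc32, not from this function:

```cpp
#include <cstdint>

// Stand-in C++11 constexpr string hash (32 bit FNV-1a). C++11 constexpr
// functions are limited to a single return statement, hence the recursion.
constexpr uint32_t Fnv1a (const char* s, uint32_t h = 2166136261u)
{
    return *s ? Fnv1a(s + 1, (h ^ uint32_t(uint8_t(*s))) * 16777619u) : h;
}

// Assigning to a constexpr variable forces the hash to happen at compile time.
constexpr uint32_t c_hashHello = Fnv1a("Hello");
static_assert(Fnv1a("Hello") != Fnv1a("World"), "distinct strings hash differently here");
```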

I’ve compiled and analyzed the code in this post in visual studio 2015, in both debug/release and x86/x64. There are differences in behavior between debug and release of course, but x86 and x64 behaved the same. If you have different results with different compilers or different code, please share!

With that out of the way, onto the fun!

We are going to be looking at:

1. Simple Compile Time Hashing Behavior
2. Compile Time Hash Switching
3. Leveraging Jump Tables
4. Perfect Hashing
5. Minimally Perfect Hashing
6. Compile Time Assisted String To Enum

## Simple Compile Time Hashing Behavior

Let’s analyze some basic cases of trying to do compile time hashing:

    const char *hello1String = "Hello1";
unsigned int hashHello1 = crc32(hello1String);  // 1) Always Run Time.
unsigned int hashHello2 = crc32("Hello2");      // 2) Always Run Time.

// 3) error C2131: expression did not evaluate to a constant
//const char *hello3String = "Hello3";
//constexpr unsigned int hashHello3 = crc32(hello3String);
constexpr unsigned int hashHello4 = crc32("Hello4");  // 4) Debug: Run Time.  Release: Compile Time

printf("%X %X %X %X\n", hashHello1, hashHello2, hashHello4, crc32("hello5"));  // 5) Always Run Time. (!!!)


Let’s take a look at the assembly for the above code when compiled in debug. The assembly line calls to crc32 are highlighted for clarity.

    const char *hello1String = "Hello1";
00007FF717B71C3E  lea         rax,[string "Hello1" (07FF717B7B124h)]
00007FF717B71C45  mov         qword ptr [hello1String],rax
unsigned int hashHello1 = crc32(hello1String);  // 1) Always Run Time.
00007FF717B71C49  xor         edx,edx
00007FF717B71C4B  mov         rcx,qword ptr [hello1String]
00007FF717B71C4F  call        crc32 (07FF717B710C3h)
00007FF717B71C54  mov         dword ptr [hashHello1],eax
unsigned int hashHello2 = crc32("Hello2");      // 2) Always Run Time.
00007FF717B71C57  xor         edx,edx
00007FF717B71C59  lea         rcx,[string "Hello2" (07FF717B7B12Ch)]
00007FF717B71C60  call        crc32 (07FF717B710C3h)
00007FF717B71C65  mov         dword ptr [hashHello2],eax

// 3) error C2131: expression did not evaluate to a constant
//const char *hello3String = "Hello3";
//constexpr unsigned int hashHello3 = crc32(hello3String);
constexpr unsigned int hashHello4 = crc32("Hello4");  // 4) Debug: Run Time.  Release: Compile Time
00007FF717B71C68  xor         edx,edx
00007FF717B71C6A  lea         rcx,[string "Hello4" (07FF717B7B134h)]
00007FF717B71C71  call        crc32 (07FF717B710C3h)
00007FF717B71C76  mov         dword ptr [hashHello4],eax

printf("%X %X %X %X\n", hashHello1, hashHello2, hashHello4, crc32("hello5"));  // 5) Always Run Time. (!!!)
00007FF717B71C79  xor         edx,edx
00007FF717B71C7B  lea         rcx,[string "hello5" (07FF717B7B13Ch)]
00007FF717B71C82  call        crc32 (07FF717B710C3h)
00007FF717B71C87  mov         dword ptr [rsp+20h],eax
00007FF717B71C8B  mov         r9d,0BECA76E1h
00007FF717B71C91  mov         r8d,dword ptr [hashHello2]
00007FF717B71C95  mov         edx,dword ptr [hashHello1]
00007FF717B71C98  lea         rcx,[string "%X %X %X %X\n" (07FF717B7B150h)]
00007FF717B71C9F  call        printf (07FF717B711FEh)


As you can see, there is a “call crc32” in the assembly for every place that we call crc32 in the c++ code – 4 crc32 calls in the c++, and 4 crc32 calls in the asm. That means that all of these crc32 calls happen at run time while in debug mode.

I can sort of see the reasoning for always doing constexpr at runtime in debug mode, since you probably want to be able to step through constexpr functions to see how they operate. I’d bet that is the reasoning here.

Let’s see what it compiles to in release. Release is a little bit harder to understand since the optimizations make it difficult/impossible to pair the c++ lines with the asm.

00007FF68DC010BA  lea         rdx,[string "Hello1"+1h (07FF68DC02211h)]
00007FF68DC010C1  mov         ecx,7807C9A2h
00007FF68DC010C6  call        crc32_rec (07FF68DC01070h)
00007FF68DC010CB  lea         rdx,[string "Hello2"+1h (07FF68DC02219h)]
00007FF68DC010D2  mov         ecx,7807C9A2h
00007FF68DC010D7  mov         edi,eax
00007FF68DC010D9  call        crc32_rec (07FF68DC01070h)
00007FF68DC010DE  lea         rdx,[string "hello5"+1h (07FF68DC02221h)]
00007FF68DC010E5  mov         ecx,4369E96Ah
00007FF68DC010EA  mov         ebx,eax
00007FF68DC010EC  call        crc32_rec (07FF68DC01070h)
00007FF68DC010F1  mov         r9d,0BECA76E1h
00007FF68DC010F7  mov         dword ptr [rsp+20h],eax
00007FF68DC010FB  mov         r8d,ebx
00007FF68DC010FE  lea         rcx,[string "%X %X %X %X\n" (07FF68DC02228h)]
00007FF68DC01105  mov         edx,edi
00007FF68DC01107  call        printf (07FF68DC01010h)


We can see that in release, there are still 3 calls to crc32 which means that only one hash actually happens at compile time.

From the assembly we can easily see that “Hello1”, “Hello2” and “hello5” are hashed at runtime, since those strings appear as parameters to the crc32 calls.

That leaves only “Hello4”, which means that is the one that got hashed at compile time. You can actually see the constant value 0xBECA76E1 being moved into register r9d. If you step through the code in debug mode, you can see that the value of hashHello4 is in fact 0xBECA76E1, so that “move constant into register” is the result of our hash happening at compile time. Pretty neat, right?

I was actually surprised to see how many hashes still happen at run time though, especially the one that is a parameter to printf. I can’t think of any reason they would need to remain at run time instead of being done at compile time, other than this compiler not aggressively moving whatever it can to compile time. I really wish it worked that way, and in my opinion it should. Maybe in the future we’ll see compilers move more in that direction.

## Compile Time Hash Switching

Something neat about being able to do hashing at compile time, is that you can use the result of a compile time hash as a case value in a switch statement!

Let’s explore that a bit:

    unsigned int hash = crc32("Hello1");  // 1) Run Time.
constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 2) Debug: Run Time. Release: Not calculated at all.
switch (hash) { // 3) Uses variable on stack
case hashTestHello2: {  // 4) Compile Time Constant.
printf("A\n");
break;
}
case crc32("Hello3"): {  // 5) Compile Time Constant.
printf("B\n");
break;
}
// 6) error C2196: case value '1470747604' already used
/*
case crc32("Hello2"): {
printf("C\n");
break;
}
*/
default: {
printf("C\n");
break;
}
}


Something interesting to note is that if you have duplicate cases in your switch statement, due to things hashing to the same value (either duplicate inputs, or actual hash collisions), you will get a compile time error. This might come in handy; let’s come back to that later.

Let’s look at the assembly code in debug:

    unsigned int hash = crc32("Hello1");  // 1) Run Time.
00007FF77B4119FE  xor         edx,edx
00007FF77B411A00  lea         rcx,[string "Hello1" (07FF77B41B124h)]
00007FF77B411A07  call        crc32 (07FF77B4110C3h)
00007FF77B411A0C  mov         dword ptr [hash],eax
constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 2) Debug: Run Time. Release: Not calculated at all.
00007FF77B411A0F  xor         edx,edx
00007FF77B411A11  lea         rcx,[string "Hello2" (07FF77B41B12Ch)]
00007FF77B411A18  call        crc32 (07FF77B4110C3h)
00007FF77B411A1D  mov         dword ptr [hashTestHello2],eax
switch (hash) { // 3) Uses variable on stack
00007FF77B411A20  mov         eax,dword ptr [hash]
00007FF77B411A23  mov         dword ptr [rbp+0F4h],eax
00007FF77B411A29  cmp         dword ptr [rbp+0F4h],20AEE342h
00007FF77B411A33  je          Snippet_CompileTimeHashSwitching1+71h (07FF77B411A51h)
00007FF77B411A35  cmp         dword ptr [rbp+0F4h],57A9D3D4h
00007FF77B411A3F  je          Snippet_CompileTimeHashSwitching1+63h (07FF77B411A43h)
00007FF77B411A41  jmp         Snippet_CompileTimeHashSwitching1+7Fh (07FF77B411A5Fh)
case hashTestHello2: {  // 4) Compile Time Constant.
printf("A\n");
00007FF77B411A43  lea         rcx,[string "A\n" (07FF77B41B144h)]
00007FF77B411A4A  call        printf (07FF77B4111FEh)
break;
00007FF77B411A4F  jmp         Snippet_CompileTimeHashSwitching1+8Bh (07FF77B411A6Bh)
}
case crc32("Hello3"): {  // 5) Compile Time Constant.
printf("B\n");
00007FF77B411A51  lea         rcx,[string "B\n" (07FF77B41B160h)]
00007FF77B411A58  call        printf (07FF77B4111FEh)
break;
00007FF77B411A5D  jmp         Snippet_CompileTimeHashSwitching1+8Bh (07FF77B411A6Bh)
}
// 6) error C2196: case value '1470747604' already used
/*
case crc32("Hello2"): {
printf("Cn");
break;
}
*/
default: {
printf("C\n");
00007FF77B411A5F  lea         rcx,[string "C\n" (07FF77B41B164h)]
00007FF77B411A66  call        printf (07FF77B4111FEh)
break;
}
}


We can see that the hash for “Hello1” and “Hello2” are both calculated at run time, and that the switch statement uses the stack variable [hash] to move the value into a register to do the switch statement.

Interestingly though, in the two cmp (compare) instructions we can see it comparing the hash against constant values. 0x20AEE342 is the hash value of “Hello3” and 0x57A9D3D4 is the hash value of “Hello2”, so it ended up doing those hashes at compile time, even though we are in debug mode. This is because case values must be known at compile time.

It’s interesting to see though that the compiler calculates hashTestHello2 at runtime, even though the only place we use it (in the case statement), it puts a compile time constant from a compile time hash. Odd.

Let’s see what happens in release:

00007FF7FA9B10B4  lea         rdx,[string "Hello1"+1h (07FF7FA9B2211h)]
00007FF7FA9B10BB  mov         ecx,7807C9A2h
00007FF7FA9B10C0  call        crc32_rec (07FF7FA9B1070h)
00007FF7FA9B10C5  cmp         eax,20AEE342h
00007FF7FA9B10CA  je          main+49h (07FF7FA9B10F9h)
00007FF7FA9B10CC  cmp         eax,57A9D3D4h
00007FF7FA9B10D1  je          main+36h (07FF7FA9B10E6h)
00007FF7FA9B10D3  lea         rcx,[string "C\n" (07FF7FA9B2220h)]
00007FF7FA9B10DA  call        printf (07FF7FA9B1010h)
// ... More asm code below, but not relevant


Release is a little more lean, which is nice. We calculate the hash of “Hello1” at runtime, then compare it against the constant values of our compile time hashes for “Hello2” and “Hello3”. That is all that is done at runtime, which is more in line with what we’d like to see the compiler do. It’s still a little bit lame that it didn’t see that “Hello1” was a compile time constant being hashed and do that hash at compile time too, but at least it got most of the hashing to happen at compile time.

What if we change the “hash” variable to be constexpr?

    // Release: this function just prints "C\n" and exits.  All code melted away at compile time!
constexpr unsigned int hash = crc32("Hello1");  // 1) Debug: Run Time
constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 2) Debug: Run Time
switch (hash) { // 3) Debug: Compile Time Constant.
case hashTestHello2: {   // 4) Debug: Compile Time Constant.
printf("A\n");
break;
}
case crc32("Hello3"): {   // 5) Debug: Compile Time Constant.
printf("B\n");
break;
}
default: {
printf("C\n");
break;
}
}


Let’s check it out in debug first:

constexpr unsigned int hash = crc32("Hello1");  // 1) Debug: Run Time
00007FF71F7A1ABE  xor         edx,edx
00007FF71F7A1AC0  lea         rcx,[string "Hello1" (07FF71F7AB124h)]
00007FF71F7A1AC7  call        crc32 (07FF71F7A10C3h)
00007FF71F7A1ACC  mov         dword ptr [hash],eax
constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 2) Debug: Run Time
00007FF71F7A1ACF  xor         edx,edx
switch (hash) { // 3) Debug: Compile Time Constant.
00007FF71F7A1AE0  mov         dword ptr [rbp+0F4h],0CEA0826Eh
00007FF71F7A1AEA  cmp         dword ptr [rbp+0F4h],20AEE342h
00007FF71F7A1AF4  je          Snippet_CompileTimeHashSwitching2+72h (07FF71F7A1B12h)
00007FF71F7A1AF6  cmp         dword ptr [rbp+0F4h],57A9D3D4h
00007FF71F7A1B00  je          Snippet_CompileTimeHashSwitching2+64h (07FF71F7A1B04h)
00007FF71F7A1B02  jmp         Snippet_CompileTimeHashSwitching2+80h (07FF71F7A1B20h)
case hashTestHello2: {   // 4) Debug: Compile Time Constant.
printf("A\n");
00007FF71F7A1B04  lea         rcx,[string "A\n" (07FF71F7AB144h)]
00007FF71F7A1B0B  call        printf (07FF71F7A11FEh)
break;
00007FF71F7A1B10  jmp         Snippet_CompileTimeHashSwitching2+8Ch (07FF71F7A1B2Ch)
}
case crc32("Hello3"): {   // 5) Debug: Compile Time Constant.
printf("B\n");
00007FF71F7A1B12  lea         rcx,[string "B\n" (07FF71F7AB160h)]
00007FF71F7A1B19  call        printf (07FF71F7A11FEh)
break;
00007FF71F7A1B1E  jmp         Snippet_CompileTimeHashSwitching2+8Ch (07FF71F7A1B2Ch)
}
default: {
printf("C\n");
00007FF71F7A1B20  lea         rcx,[string "C\n" (07FF71F7AB164h)]
00007FF71F7A1B27  call        printf (07FF71F7A11FEh)
break;
}
}


The code does a runtime hash of “Hello1” and “Hello2”. Then, for the switch, it moves the compile time hash value of “Hello1” (0xCEA0826E) into memory, and compares it against the compile time hash value of “Hello3” (0x20AEE342) and then the compile time hash value of “Hello2” (0x57A9D3D4).

Now let’s check release:

00007FF7B7B91074  lea         rcx,[string "C\n" (07FF7B7B92210h)]
00007FF7B7B9107B  call        printf (07FF7B7B91010h)


Awesome! It was able to do ALL calculation at compile time, and only do the printf at run time. Neat!

As one final test, let’s see what happens when we put a crc32 call straight into the switch statement.

    constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 1) Debug: Run Time. Release: Not calculated at all.
switch (crc32("Hello1")) {  // 2) Always Run Time (!!!)
case hashTestHello2: {  // 3) Compile Time Constant.
printf("A\n");
break;
}
case crc32("Hello3"): {  // 4) Compile Time Constant.
printf("B\n");
break;
}
default: {
printf("C\n");
break;
}
}


Here’s the debug assembly:

    constexpr unsigned int hashTestHello2 = crc32("Hello2"); // 1) Debug: Run Time. Release: Not calculated at all.
00007FF72ED51B7E  xor         edx,edx
00007FF72ED51B80  lea         rcx,[string "Hello2" (07FF72ED5B12Ch)]
00007FF72ED51B87  call        crc32 (07FF72ED510C3h)
00007FF72ED51B8C  mov         dword ptr [hashTestHello2],eax
switch (crc32("Hello1")) {  // 2) Always Run Time (!!!)
00007FF72ED51B8F  xor         edx,edx
00007FF72ED51B91  lea         rcx,[string "Hello1" (07FF72ED5B124h)]
00007FF72ED51B98  call        crc32 (07FF72ED510C3h)
00007FF72ED51B9D  mov         dword ptr [rbp+0D4h],eax
00007FF72ED51BA3  cmp         dword ptr [rbp+0D4h],20AEE342h
00007FF72ED51BAF  cmp         dword ptr [rbp+0D4h],57A9D3D4h
00007FF72ED51BB9  je          Snippet_CompileTimeHashSwitching3+5Dh (07FF72ED51BBDh)
00007FF72ED51BBB  jmp         Snippet_CompileTimeHashSwitching3+79h (07FF72ED51BD9h)
case hashTestHello2: {  // 3) Compile Time Constant.
printf("A\n");
00007FF72ED51BBD  lea         rcx,[string "A\n" (07FF72ED5B144h)]
00007FF72ED51BC4  call        printf (07FF72ED511FEh)
break;
00007FF72ED51BC9  jmp         Snippet_CompileTimeHashSwitching3+85h (07FF72ED51BE5h)
}
case crc32("Hello3"): {  // 4) Compile Time Constant.
printf("B\n");
00007FF72ED51BCB  lea         rcx,[string "B\n" (07FF72ED5B160h)]
00007FF72ED51BD2  call        printf (07FF72ED511FEh)
break;
00007FF72ED51BD7  jmp         Snippet_CompileTimeHashSwitching3+85h (07FF72ED51BE5h)
}
default: {
printf("C\n");
00007FF72ED51BD9  lea         rcx,[string "C\n" (07FF72ED5B164h)]
00007FF72ED51BE0  call        printf (07FF72ED511FEh)
break;
}
}


It was able to do the case value crc’s at compile time, but the other two it did at runtime. Not surprising for debug. Let’s check release:

00007FF6A46D10B4  lea         rdx,[string "Hello1"+1h (07FF6A46D2211h)]
00007FF6A46D10BB  mov         ecx,7807C9A2h
00007FF6A46D10C0  call        crc32_rec (07FF6A46D1070h)
00007FF6A46D10C5  cmp         eax,20AEE342h
00007FF6A46D10CA  je          main+49h (07FF6A46D10F9h)
00007FF6A46D10CC  cmp         eax,57A9D3D4h
00007FF6A46D10D1  je          main+36h (07FF6A46D10E6h)
00007FF6A46D10D3  lea         rcx,[string "C\n" (07FF6A46D2220h)]
00007FF6A46D10DA  call        printf (07FF6A46D1010h)
// ... More asm code below, but not relevant


It did the hash for “Hello1” at run time (why??), but it did the others at compile time. A bit disappointing that it couldn’t make the “Hello1” hash happen at compile time, but we saw this behavior before, so nothing new there.

## Leveraging Jump Tables

In the above switch statements, the tests for hashes were very much “If hash is a, do this, else if hash is b, do that”. It was an if/else if/else if style chain.

Switch statements actually have the ability to become jump tables though, which let them get to the right case value with fewer comparisons.

Check out this code:

    // Debug: Jump Table
// Release: Just does the constant case, everything else goes away
unsigned int i = 3;
switch (i) {
case 0: printf("A\n"); break;
case 1: printf("B\n"); break;
case 2: printf("C\n"); break;
case 3: printf("D\n"); break;
case 4: printf("E\n"); break;
case 5: printf("F\n"); break;
case 6: printf("G\n"); break;
case 7: printf("H\n"); break;
default: printf("None\n"); break;
}


Here’s the debug assembly:

    // Debug: Jump Table
// Release: Just does the constant case, everything else goes away
unsigned int i = 3;
00007FF69CF9238E  mov         dword ptr [i],3
switch (i) {
00007FF69CF92395  mov         eax,dword ptr [i]
00007FF69CF92398  mov         dword ptr [rbp+0D4h],eax
00007FF69CF9239E  cmp         dword ptr [rbp+0D4h],7
00007FF69CF923A5  ja          $LN11+0Eh (07FF69CF92434h)
00007FF69CF923AB  mov         eax,dword ptr [rbp+0D4h]
00007FF69CF923B1  lea         rcx,[__ImageBase (07FF69CF80000h)]
00007FF69CF923B8  mov         eax,dword ptr [rcx+rax*4+1244Ch]
00007FF69CF923BF  add         rax,rcx
00007FF69CF923C2  jmp         rax
case 0: printf("A\n"); break;
00007FF69CF923C4  lea         rcx,[string "A\n" (07FF69CF9B144h)]
00007FF69CF923CB  call        printf (07FF69CF911FEh)
00007FF69CF923D0  jmp         $LN11+1Ah (07FF69CF92440h)
case 1: printf("B\n"); break;
00007FF69CF923D2  lea         rcx,[string "B\n" (07FF69CF9B160h)]
00007FF69CF923D9  call        printf (07FF69CF911FEh)
00007FF69CF923DE  jmp         $LN11+1Ah (07FF69CF92440h)
case 2: printf("C\n"); break;
00007FF69CF923E0  lea         rcx,[string "C\n" (07FF69CF9B164h)]
00007FF69CF923E7  call        printf (07FF69CF911FEh)
00007FF69CF923EC  jmp         $LN11+1Ah (07FF69CF92440h)
case 3: printf("D\n"); break;
00007FF69CF923EE  lea         rcx,[string "D\n" (07FF69CF9B168h)]
00007FF69CF923F5  call        printf (07FF69CF911FEh)
00007FF69CF923FA  jmp         $LN11+1Ah (07FF69CF92440h)
case 4: printf("E\n"); break;
00007FF69CF923FC  lea         rcx,[string "E\n" (07FF69CF9B16Ch)]
00007FF69CF92403  call        printf (07FF69CF911FEh)
00007FF69CF92408  jmp         $LN11+1Ah (07FF69CF92440h)
case 5: printf("F\n"); break;
00007FF69CF9240A  lea         rcx,[string "F\n" (07FF69CF9B170h)]
00007FF69CF92411  call        printf (07FF69CF911FEh)
00007FF69CF92416  jmp         $LN11+1Ah (07FF69CF92440h)
case 6: printf("G\n"); break;
00007FF69CF92418  lea         rcx,[string "G\n" (07FF69CF9B174h)]
00007FF69CF9241F  call        printf (07FF69CF911FEh)
00007FF69CF92424  jmp         $LN11+1Ah (07FF69CF92440h)
case 7: printf("H\n"); break;
00007FF69CF92426  lea         rcx,[string "H\n" (07FF69CF9B178h)]
00007FF69CF9242D  call        printf (07FF69CF911FEh)
00007FF69CF92432  jmp         $LN11+1Ah (07FF69CF92440h)
default: printf("None\n"); break;
00007FF69CF92434  lea         rcx,[string "None\n" (07FF69CF9AFDCh)]
00007FF69CF9243B  call        printf (07FF69CF911FEh)
}


Near the top, it checks whether the value we are switching on is greater than 7, and if so, jumps to 07FF69CF92434h, which is the “default” case where it prints “None”. If the number is not greater than 7, it calculates an address based on the value we are switching on, and then jumps to it (the “jmp rax”).

Instead of testing for every possible value in an if/else if/else if type setup, it can go IMMEDIATELY to the code associated with the particular case statement. This is a jump table, and it can be a big speed improvement if you have a lot of cases, or if the code is called a lot.

I’ll spare you the release assembly. The compiler can tell that this is a compile time known result, so it only emits the printf for the “D” without any of the other logic.

Let’s go back to our crc32 code and try to leverage a jump table:

    // Debug / Release: Does hash and jump table.
// Note: It AND's against 7 and then tests to see if i is greater than 7 (!!!)
unsigned int i = crc32("Hello") & 7;
switch (i) {
case 0: printf("A\n"); break;
case 1: printf("B\n"); break;
case 2: printf("C\n"); break;
case 3: printf("D\n"); break;
case 4: printf("E\n"); break;
case 5: printf("F\n"); break;
case 6: printf("G\n"); break;
case 7: printf("H\n"); break;
default: printf("None\n"); break;
}


I’ll show you just the debug assembly for brevity. It does the hash and jump table at runtime in both debug and release. Interestingly, even though we do &7 on the hash value, the switch statement STILL makes sure that the value being switched on is not greater than 7. It can never be greater than 7, and that could be known at compile time, but it still checks. This is true even of the release assembly!

    // Debug / Release: Does hash and jump table.
// Note: It AND's against 7 and then tests to see if i is greater than 7 (!!!)
unsigned int i = crc32("Hello") & 7;
00007FF7439C24CE  xor         edx,edx
00007FF7439C24D0  lea         rcx,[string "Hello" (07FF7439CB180h)]
00007FF7439C24D7  call        crc32 (07FF7439C10C3h)
00007FF7439C24DC  and         eax,7
00007FF7439C24DF  mov         dword ptr [i],eax
switch (i) {
00007FF7439C24E2  mov         eax,dword ptr [i]
00007FF7439C24E5  mov         dword ptr [rbp+0D4h],eax
00007FF7439C24EB  cmp         dword ptr [rbp+0D4h],7
00007FF7439C24F2  ja          $LN11+0Eh (07FF7439C2581h)
00007FF7439C24F8  mov         eax,dword ptr [rbp+0D4h]
00007FF7439C24FE  lea         rcx,[__ImageBase (07FF7439B0000h)]
00007FF7439C2505  mov         eax,dword ptr [rcx+rax*4+12598h]
00007FF7439C250F  jmp         rax
case 0: printf("A\n"); break;
00007FF7439C2511  lea         rcx,[string "A\n" (07FF7439CB144h)]
00007FF7439C2518  call        printf (07FF7439C11FEh)
00007FF7439C251D  jmp         $LN11+1Ah (07FF7439C258Dh)
case 1: printf("B\n"); break;
00007FF7439C251F  lea         rcx,[string "B\n" (07FF7439CB160h)]
00007FF7439C2526  call        printf (07FF7439C11FEh)
00007FF7439C252B  jmp         $LN11+1Ah (07FF7439C258Dh)
case 2: printf("C\n"); break;
00007FF7439C252D  lea         rcx,[string "C\n" (07FF7439CB164h)]
00007FF7439C2534  call        printf (07FF7439C11FEh)
00007FF7439C2539  jmp         $LN11+1Ah (07FF7439C258Dh)
case 3: printf("D\n"); break;
00007FF7439C253B  lea         rcx,[string "D\n" (07FF7439CB168h)]
00007FF7439C2542  call        printf (07FF7439C11FEh)
00007FF7439C2547  jmp         $LN11+1Ah (07FF7439C258Dh)
case 4: printf("E\n"); break;
00007FF7439C2549  lea         rcx,[string "E\n" (07FF7439CB16Ch)]
00007FF7439C2550  call        printf (07FF7439C11FEh)
00007FF7439C2555  jmp         $LN11+1Ah (07FF7439C258Dh)
case 5: printf("F\n"); break;
00007FF7439C2557  lea         rcx,[string "F\n" (07FF7439CB170h)]
00007FF7439C255E  call        printf (07FF7439C11FEh)
00007FF7439C2563  jmp         $LN11+1Ah (07FF7439C258Dh)
case 6: printf("G\n"); break;
00007FF7439C2565  lea         rcx,[string "G\n" (07FF7439CB174h)]
00007FF7439C256C  call        printf (07FF7439C11FEh)
00007FF7439C2571  jmp         $LN11+1Ah (07FF7439C258Dh)
case 7: printf("H\n"); break;
00007FF7439C2573  lea         rcx,[string "H\n" (07FF7439CB178h)]
00007FF7439C257A  call        printf (07FF7439C11FEh)
00007FF7439C257F  jmp         $LN11+1Ah (07FF7439C258Dh)
default: printf("None\n"); break;
00007FF7439C2581  lea         rcx,[string "None\n" (07FF7439CAFDCh)]
00007FF7439C2588  call        printf (07FF7439C11FEh)
}


If we make “i” be a constexpr variable, it doesn’t affect debug, but in release, it is able to melt away all the code and just prints the correct case.

00007FF6093F1074  lea         rcx,[string "A\n" (07FF6093F2210h)]
00007FF6093F107B  call        printf (07FF6093F1010h)


## Perfect Hashing

Perfect hashing is when you have a known set of inputs, and a hash function such that there is no collision between any of the inputs.

Perfect hashing can be great for being able to turn complex objects into IDs for faster lookup times, but the IDs are not usually going to be contiguous. One ID might be 3, and another might be 45361. Perfect hashing can still be useful though.

The code below shows how some compile time perfect hashing could be achieved.

    // Debug: Does have some sort of jump table setup, despite the cases not being contiguous.
// Release: prints "A\n".  All other code melts away at compile time.
static const unsigned int c_numBuckets = 16;
static const unsigned int c_salt = 1337;

constexpr unsigned int i = crc32("Identifier_A", c_salt) % c_numBuckets;
switch (i) {
case (crc32("Identifier_A", c_salt) % c_numBuckets): printf("A\n"); break;
case (crc32("Identifier_B", c_salt) % c_numBuckets): printf("B\n"); break;
case (crc32("Identifier_C", c_salt) % c_numBuckets): printf("C\n"); break;
case (crc32("Identifier_D", c_salt) % c_numBuckets): printf("D\n"); break;
case (crc32("Identifier_E", c_salt) % c_numBuckets): printf("E\n"); break;
case (crc32("Identifier_F", c_salt) % c_numBuckets): printf("F\n"); break;
case (crc32("Identifier_G", c_salt) % c_numBuckets): printf("G\n"); break;
case (crc32("Identifier_H", c_salt) % c_numBuckets): printf("H\n"); break;
default: printf("None\n"); break;
}


One nice thing about doing perfect hashing at compile time like this is that if you ever have a hash collision, you’ll have a duplicate case in your switch statement and will get a compile error. This means you are guaranteed that your perfect hash is valid at runtime. With perfect hashes that aren’t verified at compile time, you could easily add some more valid inputs and end up with a hash collision, leading to hard to track, subtle bugs where two input values share the same index for whatever data they read and/or write.

You might notice that I am doing %16 on the hash instead of %8, even though there are only 8 items I want to test against.

The reason for that is hash collisions.

When you hash a string you are effectively getting a pseudo random number back that will always be the same for when that string is given as input.

For the above to work correctly and let me modulus against 8, I would have to roll an 8 sided dice 8 times and get no repeats.

There are 8! (8 factorial) different ways to roll that 8 sided dice 8 times and not get any collisions.

There are 8^8 different ways to roll the 8 sided dice 8 times total.

To get the chance that we roll an 8 sided dice 8 times and get no repeats (no hash collisions), the probability is 8! / 8^8 or 0.24%.

The salt in the crc32 function allows us to effectively re-roll our dice. Each salt value is a different roll of the dice.

How that fits in is that 0.24% of all salt values should let us be able to do %8 on our hashes and not have a hash collision.

Those odds are pretty bad, but they aren’t SUPER bad. We could just brute force search for a good salt value to use and then hard code the salt in, like I did in the example above.

Unfortunately, the probability I calculated above only works if the hash function gives uniformly distributed output and is “truly random”.

In practice no hash function or pseudo random number generator is, and in fact this crc32 function has NO salt value that avoids all collisions! I brute force tested all 2^32 (4.2 billion) possible salt values and came up with no salt that worked!

To get around that problem, instead of trying to get a perfect fit of 8 hashes into values 0-7 without collisions, I opted to go for 8 hashes into values 0-15 with no collisions. That changes the odds for the better, and there are in fact many salts that satisfy that.

It’s the equivalent of rolling a 16 sided dice 8 times without repeats.

Thinking about things a bit differently than before, the first roll has a 100% chance of not being a duplicate. The next roll has a 15/16 chance of not being a duplicate. The next has a 14/16 chance and so on, until the 8th roll, which has a 9/16 chance.

Multiplying those all together, we end up with a 12.08% chance of rolling a 16 sided dice 8 times and not getting a duplicate. That means about 12% of the salts (roughly 1 in 8) won’t produce collisions when we use 16 buckets, which makes it much easier for us to find a salt to use.

Looking at the disassembly in debug, we can see the jump table is miraculously intact! This is great because now we can get compile time assured perfect hashing of objects, and can use a jump table to convert a runtime object into a perfect hash result.

Note that in release, the code melts away and just does the “correct printf”.

// Debug: Does have some sort of jump table setup, despite the cases not being continuous.
// Release: prints "A\n".  All other code melts away at compile time.
static const unsigned int c_numBuckets = 16;
static const unsigned int c_salt = 1337;

constexpr unsigned int i = crc32("Identifier_A", c_salt) % c_numBuckets;
00007FF7CE651E5E  mov         edx,539h
00007FF7CE651E63  lea         rcx,[string "Identifier_A" (07FF7CE65B190h)]
00007FF7CE651E6A  call        crc32 (07FF7CE6510C3h)
00007FF7CE651E6F  xor         edx,edx
00007FF7CE651E71  mov         ecx,10h
00007FF7CE651E76  div         eax,ecx
00007FF7CE651E78  mov         eax,edx
00007FF7CE651E7A  mov         dword ptr [i],eax
switch (i) {
00007FF7CE651E7D  mov         dword ptr [rbp+0D4h],4
00007FF7CE651E87  cmp         dword ptr [rbp+0D4h],0Eh
00007FF7CE651E8E  ja          $LN11+0Eh (07FF7CE651F1Dh)
00007FF7CE651E94  mov         eax,dword ptr [rbp+0D4h]
00007FF7CE651E9A  lea         rcx,[__ImageBase (07FF7CE640000h)]
00007FF7CE651EA1  mov         eax,dword ptr [rcx+rax*4+11F34h]
00007FF7CE651EA8  add         rax,rcx
00007FF7CE651EAB  jmp         rax
case (crc32("Identifier_A", c_salt) % c_numBuckets): printf("A\n"); break;
00007FF7CE651EAD  lea         rcx,[string "A\n" (07FF7CE65B144h)]
00007FF7CE651EB4  call        printf (07FF7CE6511FEh)
00007FF7CE651EB9  jmp         $LN11+1Ah (07FF7CE651F29h)
case (crc32("Identifier_B", c_salt) % c_numBuckets): printf("B\n"); break;
00007FF7CE651EBB  lea         rcx,[string "B\n" (07FF7CE65B160h)]
00007FF7CE651EC2  call        printf (07FF7CE6511FEh)
00007FF7CE651EC7  jmp         $LN11+1Ah (07FF7CE651F29h)
case (crc32("Identifier_C", c_salt) % c_numBuckets): printf("C\n"); break;
00007FF7CE651EC9  lea         rcx,[string "C\n" (07FF7CE65B164h)]
00007FF7CE651ED0  call        printf (07FF7CE6511FEh)
00007FF7CE651ED5  jmp         $LN11+1Ah (07FF7CE651F29h)
case (crc32("Identifier_D", c_salt) % c_numBuckets): printf("D\n"); break;
00007FF7CE651ED7  lea         rcx,[string "D\n" (07FF7CE65B168h)]
00007FF7CE651EDE  call        printf (07FF7CE6511FEh)
00007FF7CE651EE3  jmp         $LN11+1Ah (07FF7CE651F29h)
case (crc32("Identifier_E", c_salt) % c_numBuckets): printf("E\n"); break;
00007FF7CE651EE5  lea         rcx,[string "E\n" (07FF7CE65B16Ch)]
00007FF7CE651EEC  call        printf (07FF7CE6511FEh)
00007FF7CE651EF1  jmp         $LN11+1Ah (07FF7CE651F29h)
case (crc32("Identifier_F", c_salt) % c_numBuckets): printf("F\n"); break;
00007FF7CE651EF3  lea         rcx,[string "F\n" (07FF7CE65B170h)]
00007FF7CE651EFA  call        printf (07FF7CE6511FEh)
00007FF7CE651EFF  jmp         $LN11+1Ah (07FF7CE651F29h)
case (crc32("Identifier_G", c_salt) % c_numBuckets): printf("G\n"); break;
00007FF7CE651F01  lea         rcx,[string "G\n" (07FF7CE65B174h)]
00007FF7CE651F08  call        printf (07FF7CE6511FEh)
00007FF7CE651F0D  jmp         $LN11+1Ah (07FF7CE651F29h)
case (crc32("Identifier_H", c_salt) % c_numBuckets): printf("H\n"); break;
00007FF7CE651F0F  lea         rcx,[string "H\n" (07FF7CE65B178h)]
00007FF7CE651F16  call        printf (07FF7CE6511FEh)
00007FF7CE651F1B  jmp         $LN11+1Ah (07FF7CE651F29h)
default: printf("None\n"); break;
00007FF7CE651F1D  lea         rcx,[string "None\n" (07FF7CE65AFDCh)]
00007FF7CE651F24  call        printf (07FF7CE6511FEh)
}

## Minimally Perfect Hashing

Minimally perfect hashing is like perfect hashing, except the results are contiguous values. If you have 8 possible inputs to your minimally perfect hash function, you are going to get as output 0-7. The order of what inputs map to which outputs isn’t strictly defined unless you want to go through a lot of extra effort to make it be that way though.

This is even more useful than perfect hashing, as you can hash a (known good) input and use the result as an index into an array, or similar!

For more info on MPH, check out my post on it: O(1) Data Lookups With Minimal Perfect Hashing

The code below is a way of doing compile time assisted minimally perfect hashing:

// Debug / Release:
//   Runs crc32 at runtime only for "i".  The cases are compile time constants as per usual.
//   Does a jumptable type setup for the switch and does fallthrough to do multiple increments to get the right ID.
//
// Release with constexpr on i:
//   does the printf with a value of 2.  The rest of the code melts away.
static const unsigned int c_numBuckets = 16;
static const unsigned int c_salt = 1337;
static const unsigned int c_invalidID = -1;

unsigned int i = crc32("Identifier_F", c_salt) % c_numBuckets;

unsigned int id = c_invalidID;
switch (i) {
case (crc32("Identifier_A", c_salt) % c_numBuckets): ++id;
case (crc32("Identifier_B", c_salt) % c_numBuckets): ++id;
case (crc32("Identifier_C", c_salt) % c_numBuckets): ++id;
case (crc32("Identifier_D", c_salt) % c_numBuckets): ++id;
case (crc32("Identifier_E", c_salt) % c_numBuckets): ++id;
case (crc32("Identifier_F", c_salt) % c_numBuckets): ++id;
case (crc32("Identifier_G", c_salt) % c_numBuckets): ++id;
case (crc32("Identifier_H", c_salt) % c_numBuckets): ++id;
// the two lines below are implicit behavior of how this code works
// break;
// default: id = c_invalidID; break;
}

printf("id = %i\n", id);

Here’s the debug assembly for the above. The release does similar, so I’m not showing it.

// Debug / Release:
//   Runs crc32 at runtime only for "i".  The cases are compile time constants as per usual.
//   Does a jumptable type setup for the switch and does fallthrough to do multiple increments to get the right ID.
//
// Release with constexpr on i:
//   does the printf with a value of 2.  The rest of the code melts away.
static const unsigned int c_numBuckets = 16;
static const unsigned int c_salt = 1337;
static const unsigned int c_invalidID = -1;

unsigned int i = crc32("Identifier_F", c_salt) % c_numBuckets;
00007FF79E481D0E  mov         edx,539h
00007FF79E481D13  lea         rcx,[string "Identifier_F" (07FF79E48B1E0h)]
00007FF79E481D1A  call        crc32 (07FF79E4810C3h)
00007FF79E481D1F  xor         edx,edx
00007FF79E481D21  mov         ecx,10h
00007FF79E481D26  div         eax,ecx
00007FF79E481D28  mov         eax,edx
00007FF79E481D2A  mov         dword ptr [i],eax
unsigned int id = c_invalidID;
00007FF79E481D2D  mov         dword ptr [id],0FFFFFFFFh
switch (i) {
00007FF79E481D34  mov         eax,dword ptr [i]
00007FF79E481D37  mov         dword ptr [rbp+0F4h],eax
00007FF79E481D3D  cmp         dword ptr [rbp+0F4h],0Eh
00007FF79E481D44  ja          $LN11+8h (07FF79E481D9Fh)
00007FF79E481D46  mov         eax,dword ptr [rbp+0F4h]
00007FF79E481D4C  lea         rcx,[__ImageBase (07FF79E470000h)]
00007FF79E481D53  mov         eax,dword ptr [rcx+rax*4+11DB8h]
00007FF79E481D5D  jmp         rax
case (crc32("Identifier_A", c_salt) % c_numBuckets): ++id;
00007FF79E481D5F  mov         eax,dword ptr [id]
00007FF79E481D62  inc         eax
00007FF79E481D64  mov         dword ptr [id],eax
case (crc32("Identifier_B", c_salt) % c_numBuckets): ++id;
00007FF79E481D67  mov         eax,dword ptr [id]
00007FF79E481D6A  inc         eax
00007FF79E481D6C  mov         dword ptr [id],eax
case (crc32("Identifier_C", c_salt) % c_numBuckets): ++id;
00007FF79E481D6F  mov         eax,dword ptr [id]
00007FF79E481D72  inc         eax
00007FF79E481D74  mov         dword ptr [id],eax
case (crc32("Identifier_D", c_salt) % c_numBuckets): ++id;
00007FF79E481D77  mov         eax,dword ptr [id]
00007FF79E481D7A  inc         eax
00007FF79E481D7C  mov         dword ptr [id],eax
case (crc32("Identifier_E", c_salt) % c_numBuckets): ++id;
00007FF79E481D7F  mov         eax,dword ptr [id]
00007FF79E481D82  inc         eax
00007FF79E481D84  mov         dword ptr [id],eax
case (crc32("Identifier_F", c_salt) % c_numBuckets): ++id;
00007FF79E481D87  mov         eax,dword ptr [id]
00007FF79E481D8A  inc         eax
00007FF79E481D8C  mov         dword ptr [id],eax
case (crc32("Identifier_G", c_salt) % c_numBuckets): ++id;
00007FF79E481D8F  mov         eax,dword ptr [id]
00007FF79E481D92  inc         eax
00007FF79E481D94  mov         dword ptr [id],eax
case (crc32("Identifier_H", c_salt) % c_numBuckets): ++id;
00007FF79E481D97  mov         eax,dword ptr [id]
00007FF79E481D9A  inc         eax
00007FF79E481D9C  mov         dword ptr [id],eax
// the two lines below are implicit behavior of how this code works
// break;
// default: id = c_invalidID; break;
}

printf("id = %i\n", id);
00007FF79E481D9F  mov         edx,dword ptr [id]
00007FF79E481DA2  lea         rcx,[string "id = %i\n" (07FF79E48B238h)]
00007FF79E481DA9  call        printf (07FF79E4811FEh)


As you can see, the jump table is still intact, which is good, but it does a lot of repeated increments to get the right ID values. I wish the compiler were smart enough to “flatten” this and just give each case its proper ID value.

As is, this could be a performance issue if you had a very large number of inputs.

You could always just hard code a number there instead of relying on the fallthrough and increment, but then there is a lot of copy pasting. Maybe you could do something clever with macros or templates to help that though.

## Compile Time Assisted String To Enum

Another interesting thing to think about is that we could actually use compile time hashing to convert a string to an enum.

In this case, let’s say that we don’t know if our input is valid or not. Since we don’t know that, we have to switch on the hash of our input string, but then do a string compare against whatever string has that hash, to make sure it matches. If it does match, it should take that enum value, else it should be invalid.

Since that would be a lot of error prone copy/pasting, I simplified things a bit by using a macro list:

    // Debug / Release:
//   Runs crc32 at runtime only for "i".  The cases are compile time constants as per usual.
//   Does a jumptable type setup for the switch and does a string comparison against the correct string.
//   If strings are equal, sets the enum value.

static const unsigned int c_numBuckets = 16;
static const unsigned int c_salt = 1337;

const char* testString = "Identifier_F";
unsigned int i = crc32(testString, c_salt) % c_numBuckets;

// This macro list is used for:
//  * making the enum
//  * making the cases in the switch statement
// D.R.Y. - Don't Repeat Yourself.
// Fewer moving parts = fewer errors, but admittedly is harder to understand vs redundant code.
#define ENUM_VALUE_LIST \
VALUE(Identifier_A) \
VALUE(Identifier_B) \
VALUE(Identifier_C) \
VALUE(Identifier_D) \
VALUE(Identifier_E) \
VALUE(Identifier_F) \
VALUE(Identifier_G) \
VALUE(Identifier_H)

// Make the enum values.
// Note these enum values are also usable as a contiguous ID if you needed one for an array index or similar.
// You could define an array with size EIdentifier::count for instance and use these IDs to index into it.
enum class EIdentifier : unsigned char {
#define VALUE(x) x,
ENUM_VALUE_LIST
#undef VALUE
count,
invalid = (unsigned char)-1
};

// do a compile time hash assisted string comparison to convert string to enum
EIdentifier identifier = EIdentifier::invalid;
switch (i) {
#define VALUE(x) case (crc32(#x, c_salt) % c_numBuckets) : if(!strcmp(testString, #x)) identifier = EIdentifier::x; else identifier = EIdentifier::invalid; break;
ENUM_VALUE_LIST
#undef VALUE
default: identifier = EIdentifier::invalid;
}

// undefine the enum value list
#undef ENUM_VALUE_LIST

printf("string translated to enum value %i", identifier);


Here’s the debug assembly showing it working like it’s supposed to:

    // Debug / Release:
//   Runs crc32 at runtime only for "i".  The cases are compile time constants as per usual.
//   Does a jumptable type setup for the switch and does a string comparison against the correct string.
//   If strings are equal, sets the enum value.

static const unsigned int c_numBuckets = 16;
static const unsigned int c_salt = 1337;

const char* testString = "Identifier_F";
00007FF6B188179E  lea         rax,[string "Identifier_F" (07FF6B188B1E0h)]
00007FF6B18817A5  mov         qword ptr [testString],rax
unsigned int i = crc32(testString, c_salt) % c_numBuckets;
00007FF6B18817A9  mov         edx,539h
00007FF6B18817AE  mov         rcx,qword ptr [testString]
00007FF6B18817B2  call        crc32 (07FF6B18810C3h)
00007FF6B18817B7  xor         edx,edx
00007FF6B18817B9  mov         ecx,10h
00007FF6B18817BE  div         eax,ecx
00007FF6B18817C0  mov         eax,edx
00007FF6B18817C2  mov         dword ptr [i],eax

// This macro list is used for:
//  * making the enum
//  * making the cases in the switch statement
// D.R.Y. - Don't Repeat Yourself.
// Fewer moving parts = fewer errors, but admittedly is harder to understand vs redundant code.
#define ENUM_VALUE_LIST \
VALUE(Identifier_A) \
VALUE(Identifier_B) \
VALUE(Identifier_C) \
VALUE(Identifier_D) \
VALUE(Identifier_E) \
VALUE(Identifier_F) \
VALUE(Identifier_G) \
VALUE(Identifier_H)

// Make the enum values.
// Note these enum values are also usable as a contiguous ID if you needed one for an array index or similar.
// You could define an array with size EIdentifier::count for instance and use these IDs to index into it.
enum class EIdentifier : unsigned char {
#define VALUE(x) x,
ENUM_VALUE_LIST
#undef VALUE
count,
invalid = (unsigned char)-1
};

// do a compile time hash assisted string comparison to convert string to enum
EIdentifier identifier = EIdentifier::invalid;
00007FF6B18817C5  mov         byte ptr [identifier],0FFh
switch (i) {
00007FF6B18817C9  mov         eax,dword ptr [i]
00007FF6B18817CC  mov         dword ptr [rbp+114h],eax
00007FF6B18817D2  cmp         dword ptr [rbp+114h],0Eh
00007FF6B18817D9  ja          $LN25+20h (07FF6B1881904h)
00007FF6B18817DF  mov         eax,dword ptr [rbp+114h]
00007FF6B18817E5  lea         rcx,[__ImageBase (07FF6B1870000h)]
00007FF6B18817EC  mov         eax,dword ptr [rcx+rax*4+11924h]
00007FF6B18817F3  add         rax,rcx
00007FF6B18817F6  jmp         rax
#define VALUE(x) case (crc32(#x, c_salt) % c_numBuckets) : if(!strcmp(testString, #x)) identifier = EIdentifier::x; else identifier = EIdentifier::invalid; break;
ENUM_VALUE_LIST
00007FF6B18817F8  lea         rdx,[string "Identifier_A" (07FF6B188B190h)]
00007FF6B18817FF  mov         rcx,qword ptr [testString]
00007FF6B1881803  call        strcmp (07FF6B18811CCh)
00007FF6B1881808  test        eax,eax
00007FF6B188180A  jne         Snippet_CompileTimeHashAssistedStringToEnum+92h (07FF6B1881812h)
00007FF6B188180C  mov         byte ptr [identifier],0
00007FF6B1881810  jmp         Snippet_CompileTimeHashAssistedStringToEnum+96h (07FF6B1881816h)
00007FF6B1881812  mov         byte ptr [identifier],0FFh
00007FF6B1881816  jmp         $LN25+24h (07FF6B1881908h)
$LN7:
00007FF6B188181B  lea         rdx,[string "Identifier_B" (07FF6B188B1A0h)]
00007FF6B1881822  mov         rcx,qword ptr [testString]
00007FF6B1881826  call        strcmp (07FF6B18811CCh)
00007FF6B188182B  test        eax,eax
00007FF6B188182D  jne         Snippet_CompileTimeHashAssistedStringToEnum+0B5h (07FF6B1881835h)
00007FF6B188182F  mov         byte ptr [identifier],1
00007FF6B1881833  jmp         Snippet_CompileTimeHashAssistedStringToEnum+0B9h (07FF6B1881839h)
00007FF6B1881835  mov         byte ptr [identifier],0FFh
00007FF6B1881839  jmp         $LN25+24h (07FF6B1881908h)
$LN10:
00007FF6B188183E  lea         rdx,[string "Identifier_C" (07FF6B188B1B0h)]
00007FF6B1881845  mov         rcx,qword ptr [testString]
00007FF6B1881849  call        strcmp (07FF6B18811CCh)
00007FF6B188184E  test        eax,eax
00007FF6B1881850  jne         Snippet_CompileTimeHashAssistedStringToEnum+0D8h (07FF6B1881858h)
00007FF6B1881852  mov         byte ptr [identifier],2
00007FF6B1881856  jmp         Snippet_CompileTimeHashAssistedStringToEnum+0DCh (07FF6B188185Ch)
00007FF6B1881858  mov         byte ptr [identifier],0FFh
00007FF6B188185C  jmp         $LN25+24h (07FF6B1881908h)
$LN13:
00007FF6B1881861  lea         rdx,[string "Identifier_D" (07FF6B188B1C0h)]
00007FF6B1881868  mov         rcx,qword ptr [testString]
00007FF6B188186C  call        strcmp (07FF6B18811CCh)
00007FF6B1881871  test        eax,eax
00007FF6B1881873  jne         Snippet_CompileTimeHashAssistedStringToEnum+0FBh (07FF6B188187Bh)
00007FF6B1881875  mov         byte ptr [identifier],3
00007FF6B1881879  jmp         Snippet_CompileTimeHashAssistedStringToEnum+0FFh (07FF6B188187Fh)
00007FF6B188187B  mov         byte ptr [identifier],0FFh
00007FF6B188187F  jmp         $LN25+24h (07FF6B1881908h)
$LN16:
00007FF6B1881884  lea         rdx,[string "Identifier_E" (07FF6B188B1D0h)]
00007FF6B188188B  mov         rcx,qword ptr [testString]
00007FF6B188188F  call        strcmp (07FF6B18811CCh)
00007FF6B1881894  test        eax,eax
00007FF6B1881896  jne         Snippet_CompileTimeHashAssistedStringToEnum+11Eh (07FF6B188189Eh)
00007FF6B1881898  mov         byte ptr [identifier],4
00007FF6B188189C  jmp         Snippet_CompileTimeHashAssistedStringToEnum+122h (07FF6B18818A2h)
00007FF6B188189E  mov         byte ptr [identifier],0FFh
00007FF6B18818A2  jmp         $LN25+24h (07FF6B1881908h)
$LN19:
00007FF6B18818A4  lea         rdx,[string "Identifier_F" (07FF6B188B1E0h)]
00007FF6B18818AB  mov         rcx,qword ptr [testString]
00007FF6B18818AF  call        strcmp (07FF6B18811CCh)
00007FF6B18818B4  test        eax,eax
00007FF6B18818B6  jne         Snippet_CompileTimeHashAssistedStringToEnum+13Eh (07FF6B18818BEh)
00007FF6B18818B8  mov         byte ptr [identifier],5
00007FF6B18818BC  jmp         Snippet_CompileTimeHashAssistedStringToEnum+142h (07FF6B18818C2h)
00007FF6B18818BE  mov         byte ptr [identifier],0FFh
00007FF6B18818C2  jmp         $LN25+24h (07FF6B1881908h)
$LN22:
00007FF6B18818C4  lea         rdx,[string "Identifier_G" (07FF6B188B1F0h)]
00007FF6B18818CB  mov         rcx,qword ptr [testString]
00007FF6B18818CF  call        strcmp (07FF6B18811CCh)
00007FF6B18818D4  test        eax,eax
00007FF6B18818D6  jne         Snippet_CompileTimeHashAssistedStringToEnum+15Eh (07FF6B18818DEh)
00007FF6B18818D8  mov         byte ptr [identifier],6
00007FF6B18818DC  jmp         Snippet_CompileTimeHashAssistedStringToEnum+162h (07FF6B18818E2h)
00007FF6B18818DE  mov         byte ptr [identifier],0FFh
00007FF6B18818E2  jmp         $LN25+24h (07FF6B1881908h)
$LN25:
00007FF6B18818E4  lea         rdx,[string "Identifier_H" (07FF6B188B200h)]
00007FF6B18818EB  mov         rcx,qword ptr [testString]
00007FF6B18818EF  call        strcmp (07FF6B18811CCh)
00007FF6B18818F4  test        eax,eax
00007FF6B18818F6  jne         $LN25+1Ah (07FF6B18818FEh)
00007FF6B18818F8  mov         byte ptr [identifier],7
00007FF6B18818FC  jmp         $LN25+1Eh (07FF6B1881902h)
00007FF6B18818FE  mov         byte ptr [identifier],0FFh
00007FF6B1881902  jmp         $LN25+24h (07FF6B1881908h)
#undef VALUE
default: identifier = EIdentifier::invalid;
00007FF6B1881904  mov         byte ptr [identifier],0FFh
}

// undefine the enum value list
#undef ENUM_VALUE_LIST

printf("string translated to enum value %i", identifier);
00007FF6B1881908  movzx       eax,byte ptr [identifier]
00007FF6B188190C  mov         edx,eax
00007FF6B188190E  lea         rcx,[string "string translated to enum value "... (07FF6B188B248h)]
00007FF6B1881915  call        printf (07FF6B18811FEh)


## Other Possibilities

Besides the above, I think there is a lot of other really great uses for constexpr functions.

For instance, I’d like to see how it’d work to have compile time data structures to do faster read access of constant data.

I want to see compile time trees, hash tables, sparse arrays, bloom filters, and more.

I believe they have the potential to be a lot more performant than even static data structures, since empty or unaccessed sections of the data structure could possibly melt away when the optimizer does its pass.

It may not turn out like that, but I think it’d be interesting to investigate it deeper and see how it goes.

I’d also like to see compilers get more aggressive about doing whatever it can at compile time. If there’s no reason for it to happen at runtime, why make it happen then? It is only going to make things slower.

You can find the source code for the code snippets in this post here: Github Atrix256/RandomCode/CompileTimeCRC/

Here’s a couple interesting discussions on constexpr on stack overflow:
detecting execution time of a constexpr function
How to ensure constexpr function never called at runtime?

# Path Tracing – Getting Started With Diffuse and Emissive

The images below were path traced using 100,000 samples per pixel, using the sample code provided in this post.

Path tracing is a specific type of ray tracing that aims to make photo realistic images by solving something called the rendering equation. Path tracing relies heavily on a method called Monte Carlo integration to get an approximate solution. In fact, it’s often called “Monte Carlo Path Tracing”, but I’ll refer to it as just “Path Tracing”.

In solving the rendering equation, path tracing automatically gets many graphical “features” which are cutting edge research topics of real time graphics, such as soft shadows, global illumination, color bleeding and ambient occlusion.

Unfortunately, path tracing can take a very long time to render – like minutes, hours, or days even, depending on scene complexity, image quality desired and specific algorithms used. Despite that, learning path tracing can still be very useful for people doing real time rendering, as it can give you a “ground truth” to compare your real time approximating algorithms to, and can also help you understand the underlying physical processes involved, to help you make better real time approximations. Sometimes even, data is created offline using path tracing, and is “baked out” so that it can be a quick and simple lookup during runtime.

There is a lot of really good information out there about path tracing, walking you through the rendering equation, Monte Carlo integration, and the dozen or so relevant statistical topics required to do Monte Carlo integration.

While understanding that stuff is important if you really want to get the most out of path tracing, this series of blog posts is meant to be more informal, explaining more what to do, and less deeply about why to do it.

When and if you are looking for resources that go deeper into the why, I highly recommend starting with these!

## Source Code

You can find the source code that goes along with this post here:

Github: Atrix256/RandomCode/PTBlogPost1/

PTBlogPost1.zip

## The Rendering Equation

The rendering equation might look a bit scary at first but stay with me, it is actually a lot simpler than it looks.

$L_o( \omega_o)= L_e(\omega_o)+\int_{\Omega}{f(\omega_i, \omega_o)L_i(\omega_i)(\omega_i \cdot n)\mathrm{d}\omega_i}$

Here’s a simplified version:

$LightOut(ViewDirection) = EmissiveLight(ViewDirection) + ReflectedLight(ViewDirection,AllDirections)$

In other words, the light you see when looking at an object is made up of how much it glows in your direction (like how a lightbulb or a hot coal in a fireplace glows), and also how much light is reflected in your direction, from light that is hitting that point on the object from all other directions.

It’s pretty easy to know how much an object is glowing in a particular direction. In the sample code, I just let a surface specify how much it glows (an emissive color) and use that for the object’s glow at any point on the surface, at any viewing angle.

The rest of the equation is where it gets harder. The rest of the equation is this:

$\int_{\Omega}{f(\omega_i, \omega_o)L_i(\omega_i)(\omega_i \cdot n)\mathrm{d}\omega_i}$

The integral (the $\int_{\Omega}$ at the front and the $\mathrm{d}\omega_i$ at the back) just means that we are going to take the result of everything between those two symbols, and add them up for every point in a hemisphere, multiplying each value by the fractional size of the point’s area for the hemisphere. The hemisphere we are talking about is the positive hemisphere surrounding the normal of the surface we are looking at.

One of the reasons things get harder at this point is that there are an infinite number of points on the hemisphere.

Let’s ignore the integral for a second and talk about the rest of the equation. In other words, lets consider only one of the points on the hemisphere for now.

• $f(\omega_i, \omega_o)$ – This is the “Bidirectional reflectance distribution function”, otherwise known as the BRDF. It describes how much light is reflected towards the view direction, of the light coming in from the point on the hemisphere we are considering.
• $L_i(\omega_i)$ – This is how much light is coming in from the point on the hemisphere we are considering.
• $(\omega_i \cdot n)$ – This is the cosine of the angle between the point on the hemisphere we are considering and the surface normal, gotten via a dot product. What this term means is that as the light direction gets more perpendicular to the normal, light is reflected less and less. This is based on the actual behavior of light and you can read more about it here if you want to: Wikipedia: Lambert’s Cosine Law. Here is another great link about it lambertian.pdf. (Thanks for the link Jay!)

So what this means is that for a specific point on the hemisphere, we find out how much light is coming in from that direction, multiply it by how much the BRDF says light should be reflected towards the view direction from that direction, and then apply Lambert’s cosine law to make light dimmer as the light direction gets more perpendicular with the surface normal (more parallel with the surface).

Hopefully that makes enough sense.

Bringing the integral back into the mix, we have to sum up the results of that process for each of the infinite points on the hemisphere, multiplying each value by the fractional size of the point’s area for the hemisphere. When we’ve done that, we have our answer, and have our final pixel color.

This is where Monte Carlo integration comes in. Since we can’t really sum the infinite points, multiplied by their area (which is infinitely small), we are instead going to take a lot of random samples of the hemisphere and average them together. The more samples we take, the closer we get to the actual correct value. Not too hard right?

Now that we have the problem of the infinite points on the hemisphere solved, we actually have a second infinity to deal with.

The light incoming from a specific direction (a point on the hemisphere) is just the amount of light leaving a different object in that direction. So, to find out how much light is coming in from that direction, you have to find what object is in that direction, and calculate how much light that object is emitting and reflecting along that direction. The problem is, the amount of light leaving that object in that direction is in fact calculated using the rendering equation, so it in turn is based on light leaving a different object, and so on. It continues like this, possibly infinitely, and even possibly in loops, where light bounces between objects over and over (like putting two mirrors facing each other). We have possibly infinite recursion!

The way this is dealt with in path tracers is to just limit the maximum amount of bounces that are considered. Higher numbers of bounces gives diminishing returns in general, so usually it’s just the first couple of bounces that really make a difference. For instance, the images at the top of this post were made using 5 bounces.

## The Algorithm

Now that we know how the rendering equation works, and what we need to do to solve it, let’s write up the algorithm that we perform for each pixel on the screen.

1. First, we calculate the camera ray for the pixel.
2. If the camera ray doesn’t hit an object, the pixel is black.
3. If the camera ray does hit an object, the pixel’s color is determined by how much light that object is emitting and reflecting down the camera ray.
4. To figure out how much light that is, we choose a random direction in the hemisphere of that object’s normal and recurse.
5. At each stage of the recursion, we return: EmittedLight + 2 * RecursiveLight * Dot(Normal, RandomHemisphereAngle) * SurfaceDiffuseColor.
6. If we ever reach the maximum number of bounces, we return black for the RecursiveLight value.

We do the above multiple times per pixel and average the results together. The more times we do the process (the more samples we have), the closer the result gets to the correct answer.

By the way, the multiplication by 2 in step five is a byproduct of some math that comes out of integrating the BRDF. Check the links I mentioned at the top of the post for more info, or you can at least verify that I’m not making it up by seeing that Wikipedia says the same thing. There is also some nice pseudo code there! (:
Wikipedia: Path Tracing: Algorithm

## Calculating Camera Rays

There are many ways to calculate camera rays, but here’s the method I used.

In this post we are going to path trace using a pin hole camera. In a future post we’ll switch to using a lens to get interesting lens effects like depth of field.

To generate rays for our pin hole camera, we’ll need an eye position, a target position that the eye is looking at, and an imaginary grid of pixels between the eye and the target.

This imaginary grid of pixels is what is going to be displayed on the screen, and may be thought of as the “near plane”. If anything gets between the eye and the near plane it won’t be visible.

To calculate a ray for a pixel, we get the direction from the eye to the pixel’s location on the grid and normalize that. That gives us the ray’s direction. The ray’s starting position is just the pixel’s location on that imaginary grid.

I’ll explain how to do this below, but keep in mind that the example code also does this process, in case reading the code is easier than reading a description of the math used.

First we need to figure out the orientation of the camera:

1. Calculate the camera’s forward direction by normalizing (Target – Eye).
2. To calculate the camera’s right vector, cross product the forward vector with (0,1,0).
3. To calculate the camera’s up vector, cross product the forward vector with the right vector.

Note that the above assumes that there is no roll (z axis rotation) on the camera, and that it isn’t looking directly up.

Next we need to figure out the size of our near plane on the camera’s x and y axis. To calculate this, we have to define both a near plane distance (I use 0.1 in the sample code) as well as a horizontal and vertical field of view (I use a FOV of 40 degrees both horizontally and vertically, and make a square image, in the sample code).

You can get the size of the window on each axis then like this:

1. WindowRight = tangent(HorizontalFOV / 2) * NearDistance
2. WindowTop = tangent(VerticalFOV / 2) * NearDistance

Note that we divide the field of view by 2 on each axis because we are going to treat the center of the near plane as (0,0). This centers the near plane on the camera.

Next we need to figure out where our pixel’s location is in world space, when it is at pixel location (x,y):

1. Starting at the camera’s position, move along the camera’s forward vector by whatever your near plane distance is (I use a value of 0.1). This gets us to the center of the imaginary pixel grid.
2. Next we need to figure out what percentage on the X and Y axis our pixel is. This will tell us what percentage on the near plane it will be. We divide x by ScreenWidth and y by ScreenHeight. We then put these percentages in a [-1,1] space by multiplying the percentages by 2 and subtracting 1. We will call these values u and v, which equate to the x and y axis of the screen.
3. Starting at the center of the pixel grid that we found, we are going to move along the camera’s right vector a distance of u * WindowRight, and along the camera’s up vector a distance of v * WindowTop.

We now have our pixel’s location in the world.

Lastly, this is how we calculate the ray’s position and direction:

1. RayDir = normalize(PixelWorld – Eye)
2. RayStart = PixelWorld

We now have a ray for our pixel and can start solving eg ray vs triangle equations to see what objects in the scene our ray intersects with.

That’s basically all there is to path tracing at a high level. Next up I want to talk about some practical details of path tracing.

## Rendering Parameters, Rendering Quality and Rendering Time

Below are a few scenes rendered at various samples per pixel (spp) and their corresponding rendering times. They are rendered at 128×128 with 5 bounces. I used 8 threads to utilize my 8 CPU cores. Exact machine specs aren’t really important here, I just want to show how sample count affects render quality and render times in a general way.

There are a couple of things worth noting.

First, render time grows roughly linearly with the number of samples per pixel. That gives you a pretty good way to estimate how long a render will take: if you know how long a 1,000 samples per pixel image took, then a 100,000 samples per pixel image will take roughly 100 times as long!

Combine that with the fact that you need 4 times as many samples to halve the amount of error (noise) in an image (Monte Carlo error shrinks in proportion to 1/sqrt(SampleCount)) and you can start to see why Monte Carlo path tracing takes so long to render nice looking images.

This also applies to the size of your render. The above were rendered at 128×128. If we instead rendered them at 256×256, the render times would actually be four times as long! This is because our image would have four times as many pixels, which means we’d be taking four times as many samples (at the same number of samples per pixel) to make an image of the same quality at the higher resolution.

You can affect rendering times by adjusting the maximum number of bounces allowed in your path tracer as well, but that is not as straightforward of a relationship to rendering time. The rendering time for a given number of bounces depends on the complexity and geometry of the scene, so is very scene dependent. One thing you can count on though is that decreasing the maximum number of bounces will give you the same or faster rendering times, while increasing the maximum number of bounces will give you the same or slower rendering times.

Something else that’s really important to note is that the first scene takes a lot more samples to start looking reasonable than the second scene does. This is because there is only one small light source in the first image, but a lot of ambient light from a blue sky in the second image. What this means is that in the first scene, many rays will hit darkness and only a few will hit light, while in the second scene every ray will end up hitting either the small orb of light or the blue sky, so all rays hit a light source.

The third scene also takes a lot more samples to start looking reasonable compared to the fourth scene. This is because in the third scene, there is a smaller, brighter light in the middle of the ceiling, while in the fourth scene there is a dimmer but larger light that covers the entire ceiling. Again, in the third scene, rays are less likely to hit a light source than in the fourth scene.

Why do these scenes converge at different rates?

Well it turns out that the more difference there is in the things your rays are likely to hit (this difference is called variance, and is the source of the “noisy look” in your path tracing), the longer it takes to converge (find the right answer).

This makes a bit of sense if you think of it just from the point of view of taking an average.

If you have a bunch of numbers with an average of N, but the individual numbers vary wildly from one to the next, it will take more numbers averaged together to get closer to the actual average.

If on the other hand, you have a bunch of numbers with an average of N that aren’t very different from each other (or very different from N), it will take fewer numbers to get closer to the actual average.

Taken to the extreme, if you have a bunch of numbers with an average of N, that are all exactly the value N, it only takes one sample to get to the actual true average of N!
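Here’s a tiny deterministic C++ sketch of that averaging intuition (using made-up stand-in “samples”, not real path tracing data): a zero-variance stream and a high-variance stream that both have a true average of 1.0.

```cpp
#include <cmath>

// Running average after n samples of a stream.
// lowVariance:  every sample is exactly 1.0 (converges immediately).
// highVariance: samples alternate between 0.0 and 2.0 (converges slowly).
float RunningAverage(int n, bool highVariance)
{
    float average = 0.0f;
    for (int i = 1; i <= n; ++i)
    {
        float sample = highVariance ? ((i % 2) ? 0.0f : 2.0f) : 1.0f;
        // incremental average: move toward the sample by 1/sampleCount
        average += (sample - average) / float(i);
    }
    return average;
}
```

The zero-variance stream is exactly 1.0 after a single sample, while the alternating stream is still off by about 1% after 101 samples.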

You can read a discussion on this here: Computer Graphics Stack Exchange: Is it expected that a naive path tracer takes many, many samples to converge?

There are lots of different ways of reducing variance of path tracing in both the sampling, as well as in the final image.

Some techniques actually just “de-noise” the final image rendered with lower sample counts. Some techniques use some extra information about each pixel to denoise the image in a smarter way (such as using a Z buffer type setup to do bilateral filtering).

Here’s such a technique that has really impressive results. Make sure to watch the video!

Nonlinearly Weighted First-order Regression for Denoising Monte Carlo Renderings

There is also a nice technique called importance sampling where instead of shooting the rays out in a uniform random way, you actually shoot your rays in directions that matter more and will get you to the actual average value faster. Importance sampling lets you get better results with fewer rays.

In the next post or two, I’ll show a very simple importance sampling technique (cosine weighted hemisphere sampling) and hope in the future to show some more sophisticated importance sampling methods.

## Some Other Practical Details

Firstly, I want to mention that this is called “naive” path tracing. There are ways to get better images in less time, and algorithms that are better suited for different scenes or different graphical effects (like fog or transparent objects), but this is the most basic, and most brute force way to do path tracing. We’ll talk about some ways to improve it and get some more features in future posts.

Hitting The Wrong Objects

I wanted to mention that when you hit an object and you calculate a random direction to cast a new ray in, there’s some very real danger that the new ray you cast out will hit the same object you just hit! This is due to the fact that you are starting the ray at the collision point of the object, and the floating point math may or may not consider the new ray to hit the object at time 0 or similar.

One way to deal with this is to move the ray’s starting point a small amount down the ray direction before testing the ray against the scene. If pushed out far enough (I use a distance of 0.001) it will make sure the ray doesn’t hit the same object again. It sounds like a dodgy thing to do, because if you don’t push it out enough (how far is enough?) it will give you the wrong result, and if you push it out too far to be safe, you can miss thin objects, or can miss objects that are very close together. In practice though, this is the usual solution to the problem and works pretty well without too much fuss. Note that this is a problem in all ray tracing, not just path tracing, and this solution of moving the ray by a small epsilon is common in all forms of ray tracing I’ve come across!
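A sketch of that push-out fix, assuming a simple `float3` struct (this is illustrative, not the sample code’s exact implementation):

```cpp
struct float3 { float x, y, z; };

// How far to push a ray out from the surface it just hit. This matches the
// 0.001 distance mentioned above; the right value depends on your scene scale.
static const float c_rayEpsilon = 0.001f;

// Nudge a ray's origin along its direction so the next scene test can't
// re-hit the surface we just bounced off of due to floating point error.
float3 PushRayOut(float3 origin, float3 dir)
{
    return { origin.x + dir.x * c_rayEpsilon,
             origin.y + dir.y * c_rayEpsilon,
             origin.z + dir.z * c_rayEpsilon };
}
```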

Another way to deal with the problem is to give each object a unique ID and then when you cast a ray, tell it to ignore the ID of the object you just hit. This works flawlessly in practice, so long as your objects are convex – which is usually the case for ray tracing since you often use triangles, quads, spheres, boxes and similar to make your scene. However, this falls apart when you are INSIDE of a shape (like how the images at the top of this post show objects INSIDE a box), and it also falls apart when you have transparent objects, since sometimes it is appropriate for an object to be intersected again by a new ray.

I used to be a big fan of object IDs, but am slowly coming to terms with just pushing the ray out a little bit. It’s not so bad, and handles transparents and being inside an object (:

Gamma Correction

After we generate our image, we need to apply gamma correction by taking each color channel to the power of 1/2.2. A decent approximation is to just take the square root of each color channel instead, since 1/2 is close to 1/2.2. You can read about why we do this here: Understanding Gamma Correction

HDR vs LDR

There is nothing in our path tracer that limits how bright something can be. We could have a bright green light with a color value of (0, 100000, 0)! Because of this, the final pixel color may not necessarily be less than one (but it will be non-negative). Our color will be a “High Dynamic Range” color aka HDR color. You can use various tone mapping methods to turn an HDR color into an LDR color – and we will be looking at that in a future post – but for now, I am just clamping the color value between 0 and 1. It’s not the best option, but it works fine for our needs right now.
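Combining the clamp with the gamma correction from the previous section, the per-channel logic might look like this (a minimal sketch, assuming linear HDR input):

```cpp
#include <algorithm>
#include <cmath>

// Clamp an HDR color channel to [0,1], then gamma correct it for display.
float HdrChannelToDisplay(float channel)
{
    float ldr = std::min(std::max(channel, 0.0f), 1.0f); // naive tone mapping: clamp
    return std::pow(ldr, 1.0f / 2.2f);                   // gamma correction
    // std::sqrt(ldr) would be a decent cheap approximation of the pow
}
```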

Divide by Pi?

When looking at explanations or implementations of path tracing, you’ll see that some of them divide colors by pi at various stages, and others don’t. Since proper path tracing is very much about making sure you have every little detail of your math correct, you might wonder whether you should be dividing by pi or not, and if so, where you should do that. The short version is that it actually doesn’t really matter, believe it or not!

Here is a great read on the subject for a more in depth answer:
PI or not to PI in game lighting equation

Random Point on Sphere and Bias

Correctly finding a uniformly random point on a sphere or hemisphere is actually a little bit more complicated than you might expect. If you get it wrong, you’ll introduce bias into your images, which will make for some strange looking results.

Here is a good read on some ways to do it correctly:
Wolfram Math World: Sphere Point Picking

Here’s an image where the random point in sphere function was just completely wrong:

Here’s an image where the random hemisphere function worked by picking a random point in a cube and normalizing the resulting vector (and multiplying it by -1 if it was in the wrong hemisphere). That gives too much weighting to the corners, which you can see manifest in the image on the left as weird “X” shaped lighting (look at the wall near the green light), instead of on the right where the lighting is circular. Apologies if it’s hard to distinguish. WordPress is converting all my images to 8BPP! :X

Primitive Types & Counts Matter

Here are some render time comparisons of the Cornell box scene rendered at 512×512, using 5 bounces and 100 samples per pixel.

There are 3 boxes which only have 5 sides – the two boxes inside don’t have a bottom, and the box containing the boxes doesn’t have a front.

I started out by making the scene with 30 triangles, since it takes 2 triangles to make a quad, and 5 quads to make a box, and there are 3 boxes.

Those 30 triangles rendered in 21.1 seconds.

I then changed it from 30 triangles to 15 quads.

It then took 6.2 seconds to render, cutting the time to less than a third!

This is not so surprising though. If you look at the code for ray vs triangle compared to ray vs quad, you see that ray vs quad is just a ray vs triangle test where we first test the “diagonal line” of the quad to see which of the 2 corners should be part of the ray vs triangle test. Because of this, testing a quad is just about as fast as testing a triangle, and since using quads means we have half the number of primitives, it makes sense that turning our triangles into quads makes rendering significantly faster.

Lastly, I tried turning the two boxes inside into oriented boxes that have a width, height, depth, an axis of rotation and an angle of rotation around that axis. The collision test code puts the ray into the local space of the oriented box and then just does a ray vs axis aligned box test.

Doing that, I ended up with 5 quads (the box that doesn’t have a front needed to stay quads, unless I did back face culling, which I didn’t want to do) and two oriented boxes.

The render time for that was 5.5 seconds, so it shaved off 0.7 seconds, which is a little over 11% of the rendering time. So, it was worthwhile.

For such a low number of primitives, I didn’t bother with any spatial acceleration structures, but people do have quite a bit of luck on more complex scenes with bounding volume hierarchies (BVHs).

For naive path tracing code, which object the first ray from the camera hits (if any) is entirely deterministic, so we could also cache that first intersection and re-use it for each subsequent sample. That ought to make a significant difference to rendering times, but the next path tracing post will remove the ability to do that, so I didn’t bother to try it and time it.

Debugging

As you are working on your path tracer, it can be useful to render an image at a low number of samples so that it’s quick to render and you can see whether things are turning out the way you want or not.

Another option is to have it show you the image as it’s rendering more and more samples per pixel, so that you can see it working.

If you make a “real time” visualizer like that, some real nice advice from Morgan McGuire (Computer graphics professor and the author of http://graphicscodex.com/) is to make a feature where if you click on a pixel, it re-renders just that pixel, and does so in a single thread so that you can step through the process of rendering that pixel to see what is going wrong.

I personally find a lot of value in visualizing per-pixel values across the image to spot problems. You can do this by setting the emissive lighting to be based on the value you want to visualize and setting the bounce count to 1, or other similar techniques.

Below are two debug images I made while writing the code for this post to try and understand how some things were going wrong. The first image shows the normals of the surfaces that the camera rays hit (I put the x,y,z of the normal into r,g,b, but multiply the values by 0.5 and add 0.5 to map them from [-1,1] to [0,1]). This image let me see that the normals coming out of my ray vs oriented box test were correct.
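The normal-to-color remap described above is just a scale and offset per channel; as a sketch (with an assumed `float3` struct):

```cpp
struct float3 { float x, y, z; };

// Remap a unit normal's components from [-1,1] to [0,1] so they can be
// shown directly as an RGB debug color.
float3 NormalToColor(float3 n)
{
    return { n.x * 0.5f + 0.5f,
             n.y * 0.5f + 0.5f,
             n.z * 0.5f + 0.5f };
}
```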

The second image shows the number of bounces done per pixel. I divided the bounce count by the maximum number of bounces and used that as a greyscale value for the pixel. This let me see that rays were able to bounce off of oriented boxes. A previous image that I don’t have anymore showed darker sides on the boxes, which meant that the ray bouncing wasn’t working correctly.

Immensely Parallelizable: CPU, GPU, Grid Computing

At an image size of 1024×1024, that is a little over 1 million pixels.

At 1000 samples per pixel, that means a little over 1 billion samples.

Every sample of every pixel only needs one thing to do its work: read-only access to the scene.

Since each of those samples are able to do their work independently, if you had 1 billion cores, path tracing could use them all!

The example code is multithreaded and uses as many threads as cores you have on your CPU.

Since GPUs are meant for massively parallelizable work, they can path trace much faster than CPUs.

I haven’t done a proper apples to apples test, but evidence indicates something like a 100x speed up on the GPU vs the CPU.

You can also distribute work across a grid of computers!

One way to do path tracing in grid computing would be to have each machine do a 100 sample image, and then you could average all of those 100 sample images together to get a much higher sample image.

The downside to that is that network bandwidth and disk storage pay the cost of a full size image for each computer you have helping.

A better way to do it would probably be to make each machine do all of the samples for a small portion of the image and then you can put the tiles together at the end.

While decreasing network bandwidth and disk space usage, this also allows you to do all of the pixel averaging in floating point, as opposed to doing it in 8 bit 0-255 values like you’d get from a bmp file made on another machine.

## Closing

In this post and the sample code that goes with it, we are only dealing with a purely diffuse (lambertian) surface, and emissive lighting. In future posts we’ll cover a lot more features like specular reflection (mirror type reflection), refraction (transparent objects), caustics, textures, bump mapping and more. We’ll also look at how to make better looking images with fewer samples and lots of other things too.

I actually have to confess that I did a bit of bait and switch. The images at the top of this post were rendered with an anti aliasing technique, as well as an importance sampling technique (cosine weighted hemisphere samples) to make the image look better faster. These things are going to be the topics of my next two path tracing posts, so be on the lookout! Here are the same scenes with the same number of samples, but with no AA, and uniform hemisphere sampling:

And the ones at the top for comparison:

While making this post I received a lot of good information from the twitter graphics community (see the people I’m following and follow them too! @Atrix256) as well as the Computer Graphics Stack Exchange.

Also, two specific individuals helped me out quite a bit:

@lh0xfb – She was also doing a lot of path tracing at the time and helped me understand where some of my stuff was going wrong, including replicating my scenes in her renderer to compare and contrast against. I was sure my renderer was wrong, because it was so noisy! It turned out that while I tended to have small light sources and high contrast scenes, Lauren did a better job of making well lit scenes that converged more quickly.

@Reedbeta – He was a wealth of information for helping me understand some details about why things worked the way they did and answered a handful of graphics stack exchange questions I posted.

Thanks a bunch to you both, and to everyone else that participated in the discussions (:

# Incremental Averaging

This is a super short post about something I want to be able to reference again in the future (:

Let’s say that you need to average N things, where you don’t know what the final N will be, but you want to keep an average as you go.

For instance, let’s say you are doing Monte Carlo path tracing, where the pixel you are showing is the average of however many samples you’ve taken so far, but you are continuing to get new samples and want to show the updated average as they come in.

The formula for doing this is super simple.

NewAverage = OldAverage + (NewValue - OldAverage) / NewSampleCount;

// Or:

Average += (NewValue - Average) / NewSampleCount;


One way of thinking of the above equations is that you are adjusting the average by how much the new value would adjust the average.

Another way of thinking of the above two equations is this:

First figure out how far the new sample is from the average, then move towards that new amount by an ever decreasing amount, as the number of samples grow.

Because of this, if you are in a language such as glsl or hlsl which has linear interpolation built in (mix in glsl, lerp in hlsl), you can use linear interpolation as well:

Average = mix(Average, NewValue, 1.0 / NewSampleCount);


To see how this works out, check out the formula for lerp:

$Lerp(a,b,t) = a + (b-a)*t$

Substituting Average for a, NewValue for b, and 1/NewSampleCount for t, we get this:

$Lerp(Average, NewValue, 1/NewSampleCount)$
$= Average + (NewValue-Average)/NewSampleCount$

Which is the exact same formula as above. So, using lerp in that way is mathematically equivalent.
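To convince yourself numerically, this sketch (using a plain C++ `Lerp` since we aren’t in a shader) runs the incremental formula, the lerp formula, and a plain sum-then-divide over the same samples; all three agree to within floating point error:

```cpp
#include <cmath>

float Lerp(float a, float b, float t) { return a + (b - a) * t; }

// Average `count` samples three ways: incrementally, via lerp, and directly.
void AverageThreeWays(const float* samples, int count,
                      float& incremental, float& lerped, float& direct)
{
    incremental = 0.0f;
    lerped = 0.0f;
    float sum = 0.0f;
    for (int i = 0; i < count; ++i)
    {
        int sampleCount = i + 1;
        incremental += (samples[i] - incremental) / float(sampleCount);
        lerped = Lerp(lerped, samples[i], 1.0f / float(sampleCount));
        sum += samples[i];
    }
    direct = sum / float(count);
}
```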

Here’s a link with more discussion on this:
Math Stack Exchange: Incremental averageing

Here’s an awesome post on this same subject, with a different resulting formula, by Christer Ericson (@ChristerEricson), author of the famous “Real Time Collision Detection” book, and VP of technology at Activision. His solution looks like it might be more numerically robust, but I haven’t analyzed it well enough yet to know for sure:
Robustly computing the centroid for a point set

A somewhat related topic, here’s a method for keeping a sum with floating point numbers, that is more accurate than normal summation. Useful when you need to sum numbers of different magnitudes, which would normally be a problem in floating point.
Wikipedia: Kahan Summation

Here’s an incremental (aka online) algorithm that gives you the average as well as the variance, which you can square root to get the standard deviation. This is useful for when you need to know how volatile data samples are – like if reporting results from profiling code, or maybe useful for path tracing for finding how much variance is in a pixel, so you know whether you need more samples or can get away with fewer.
Wikipedia: Standard deviation – Online algorithm