## G-Buffer Upsizing

The other day I had a thought:

Rendering smaller than full screen images is super helpful for performance, but upsizing an image results in pretty bad quality vs a full resolution render.

What if instead of upsizing the final rendered image, we upsized the values that were used to shade each pixel?

In other words, what if we rendered a scene from a less than full resolution g-buffer?

I was thinking that could be useful in doing ray based graphics, not having to trace or march quite so many rays, but also perhaps it could be useful for things like reflections where a user isn’t likely to notice a drop in resolution.

I’m sure I’m not the first to think of this, but I figured I’d explore it anyways and see what I could see.

I made an interactive shadertoy demo to explore this if you want to see it first hand:

## Result

In short, it does look better in a lot of ways because the normals, uv coordinates and similar parameters interpolate really well, but the edges of shapes are aliased really bad (jaggies).

Check out the images below to see what i mean. The first image is a full sized render. The second image is a 1/4 sized render (half x and half y resolution). The third image is a 1/16th sized render (quarter x and quarter y resolution)

For comparison, here’s a 1/4 sized and 1/16 sized render upsized using bicubic IMAGE interpolation instead of g-buffer data interpolation:

Despite the aliased results at 1/16th render size, this seems like it may be a reasonable technique at larger render sizes, depending on the level of quality you need. Doing half vertical or half horizontal resolution looks very close to the full sized image for instance. The edges are a tiny bit more aliased along one direction, but otherwise things seem decent:

Since the g-buffer has only limited space, you will probably want to bit pack multiple fields together in the same color channels. When you do that, you throw out the possibility of doing hardware interpolation unfortunately, because it interpolates the resulting bit packed value, not the individual fields that you packed in.

Even when doing the interpolation yourself in the pixel shader, for the most part you can really only store information that interpolates well. For instance, you could store a diffuse R,G,B color, but you wouldn’t want to store and then interpolate a material index. This is because you might have material index 10 (say it’s blue) next to material index 0 (say it’s green), and then when you interpolate you could end up with material index 5 which may be red. You’d get red between your blue and green which is very obviously wrong.

In my demo I did have a material index per pixel, but i just used nearest neighbor for that particular value always. To help the appearance of aliasing, I also stored an RGB diffuse color that i interpolated.

I stored the uvs in the g-buffer and sampled the textures themselves in the final shader though, to make sure and get the best texture information I could. This makes textures look great at virtually any resolution and is a lot of the reason why the result looks as good as it does IMO.

The fact that normals interpolate is a good thing, except when it comes to hard edges like the edge of the cube, or at the edge of any object really. In the case of the cube edge, it smooths the edge a little bit, making a surface that catches specular lighting and so highlights itself as being incorrect (!!). In the case of the edge of regular objects, a similar thing happens because it will interpolate between the normal at the edge of the object and the background, making a halo around the object which again catches specular lighting and highlights itself as being incorrect.

I think it could be interesting or fruitful to explore using edge detection to decide when to blend or not, to help the problem with normals, or maybe even just some edge detection based anti aliasing could be nice to make the resulting images better. The depth (z buffer value) could also maybe be used to help decide when to interpolate or not, to help the problem of halos around every object.

Interestingly, bicubic interpolation actually seems to enhance the problem areas compared to bilinear. It actually seems to highlight areas of change, where you would actually want it to sort of not point out the problems hehe. I think this is due to Runge’s phenomenon. Check out the depth information below to see what i mean. The first is bilinear, the second is bicubic:

One final side benefit of this I wanted to mention, is that if you are doing ray based rendering, where finding the geometry information per pixel can be time consuming, you could actually create your g-buffer once and re-shade it with different animated texture or lighting parameters, to give you a constant time (and very quick) render of any scene of any complexity, so long as the camera wasn’t moving, and there were no geometry changes happening. This is kind of along the same lines as the very first post I made to this blog about 4 years ago, which caches geometry in screen space tiles, allowing dirty rectangles to be used (MoriRT: Pixel and Geometry Caching to Aid Real Time Raytracing).

Anyone else go down this path and have some advice, or have any ideas on other things not mentioned? (:

Next up I think I want to look at temporal interpolation of g-buffers, to see what sort of characteristics that might have. (Quick update, the naive implementation of that is basically useless as far as i can tell: G-Buffer Temporal Interpolation).

## Related Stuff

On shadertoy, casty mentioned that if you have some full res information, and some less than full res information, you can actually do something called “Joint Bilateral Upsampling” to get a better result.

Joint Bilateral Upsampling

It turns out someone has already solved this challenge with great success. They use “the MSAA trick” to get more samples at the edges. Check out ~page 38:
GPU-Driven Rendering Pipelines

## Normalized Vector Interpolation TL;DR

My blog posts often serve as “external memory”, allowing me to go back and remember how specific things work months or years after I spent the time to learn about them.

So far it’s worked amazingly well! Instead of having a hazy memory of “oh um… i did bicubic interpolation once, how does that work again?” I can go back to my blog post, find the details with an explanation and simple working code, and can very rapidly come back up to speed. I seriously recommend keeping a blog if you are a programmer or similar. Plus, you know you really understand something when you can explain it to someone else, so it helps you learn to a deeper level than you would otherwise.

Anyways, this is going to be a very brief post on vector interpolation that I want to commit to my “external memory” for the future.

This is an answer to the question… “How do I interpolate between two normalized vectors?” or “How do i bilinearly or bicubically interpolate between normalized vectors?”

As an answer I found the three most common ways to do vector interpolation:

• Slerp – short for “spherical interpolation”, this is the most correct way, but is also the costliest. In practice you likely do not need the precision.
• lerp – short for “linear interpolation”, you just do a regular linear interpolation between the vectors and use that as a result.
• nlerp – short for “normalized linear interpolation” you just normalize the result of a lerp. Useful if you need your interpolated vector to be a normalized vector.

In practice, lerp/nlerp are pretty good at getting a pretty close interpolated direction so long as the angle they are interpolating between is sufficiently small (say, 90 degrees), and nlerp is of course good at keeping the right length, if you need a normalized vector. If you want to preserve the length while interpolating between non normalized vectors, you could always interpolate the length and direction separately.

Here is an example of the three interpolations on a large angle. Dark grey = start vector, light grey = end vector. Green = slerp, blue = lerp, orange = nlerp.

Here is an example of a medium sized angle (~90 degrees) interpolating the same time t between the angles.

Lastly, here’s a smaller angle (~35 degrees). You can see that the results of lerp / nlerp are more accurate as the angle between the interpolated vectors gets smaller.

If you do lerp or nlerp, you can definitely do both bilinear as well as bicubic interpolation since they are just regularly interpolated values (and then optionally normalized)

Using slerp, you can do bilinear interpolation, but I’m not sure how bicubic would translate.

## Code

Here’s some glsl code for slerp, lerp and nlerp. This code is for vec2’s specifically but the same code works for vectors of any dimension.

//============================================================
// https://keithmaggio.wordpress.com/2011/02/15/math-magician-lerp-slerp-and-nlerp/
vec2 slerp(vec2 start, vec2 end, float percent)
{
// Dot product - the cosine of the angle between 2 vectors.
float dot = dot(start, end);
// Clamp it to be in the range of Acos()
// This may be unnecessary, but floating point
// precision can be a fickle mistress.
dot = clamp(dot, -1.0, 1.0);
// Acos(dot) returns the angle between start and end,
// And multiplying that by percent returns the angle between
// start and the final result.
float theta = acos(dot)*percent;
vec2 RelativeVec = normalize(end - start*dot); // Orthonormal basis
// The final result.
return ((start*cos(theta)) + (RelativeVec*sin(theta)));
}

vec2 lerp(vec2 start, vec2 end, float percent)
{
return mix(start,end,percent);
}

vec2 nlerp(vec2 start, vec2 end, float percent)
{
return normalize(mix(start,end,percent));
}

An interactive shadertoy demo I made, that is also where the above images came from:

Further discussion on this topic may be present here:
Computer Graphics Stack Exchange: Interpolating vectors on a grid

Math Magician – Lerp, Slerp, and Nlerp
Understanding Slerp, Then Not Using It

## How and Why Cleaning Up Code or Processes Gives Multiplicative Benefits

The engineering manager of my team Paul Haban (@XpresoAdct) mentioned to me once in passing that when you fix a problem, you often get multiplicative returns beyond the initial problem you intended to fix.

This is an idea from Kanban, and while it was in my best interest to believe this to be true, since it allowed to refactor personally painful inherited systems and code, it felt like a sort of mysterious voodoo and I wasn’t really a believer.

I recently experienced it first hand though. I refactored something and the benefits started multiplying. People from distant sub teams came out of the woodwork very excited to hear about my changes. Of course, this happens from time to time, and it’s a lucky break to get benefits beyond what you were planning, but looking at it in hindsight, there are some really good reasons why this happened.

This applies to source code, processes, etc, but for simpler language, we’ll focus on this being about code.

Firstly, engineering is often about trade offs. You might see that solving a problem one way gives you certain benefits, while solving a problem a different way gives you other benefits. You weigh those things, talk to those affected to get their opinions in case you are missing information, and then you make the best decision you can with the information you have.

Sometimes you make a decision based on the current state of things, but then the situation changes, and the choices you make turn out to be bad choices for the new direction that things have taken. Now your code has turned bad.

Also of course, people sometimes people just make bad choices. We are human, we are learning, it’s how it goes. Sometimes people just make bad code to begin with.

A deeper chat of this sort of thing can be found here: No Bad Code, Creeping Normality and Social Structure Code Organization

But ultimately, here is what makes code “bad”: If it works less than ideally for someone or some thing that has to interact with it, it is on the spectrum of “bad code”, ranging from terrible code, to code that could be cleaned up, but doesn’t really matter enough to fix.

It also may be that code is bad for one set of interactions, while it is perfectly ideal for another set of interactions. This is the result of the trade offs weighed when solving the problem. That may just be a fact of life that you either can not really do anything about, or that in practical terms, you cannot do anything about due to cost versus reward analysis or whatever else.

Lastly, like in my case, you may have inherited some bad code from someone else. In this case, it could just be that you have different goals, or that you prefer a different trade off (pain flavor if you will) than the previous maintainer.

## How Does Bad Code Affect Others

By definition, bad code is code that is less than ideal for a person, or code that has to interact with it.

That means that when they interact with the code two things happen:

1. There are work arounds that have to be done to be able to get what is needed from the system.
2. There may be perfectly reasonable things that interactions with the system may want to do that are not possible, or not practically possible with real world constraints.

The more central this bad code is, and the more people that interact with it, the more that there is both workarounds, and desired functionality that can’t be realized.

## Refactoring

If you can legitimately refactor some code such that the result is decided to be better than where things are at now – say, there is less pain overall, or perhaps the pain is more concentrated on a group that nobody likes (hehe) – making that happen will make the code less bad.

Again, bad code is a spectrum, so it’s likely you’ll hit situations where the code will never be perfectly good code, but making it less bad is a good thing.

When you make code less bad, however you measure that, it means that the workarounds that needed to go up can start being taken down (simpler code, less maintenance, fewer things that can go wrong), and also, you open up the doorway for the improved functionality that was not previously practical.

Another way to think of it is that the optimist will say that fixing things gives multiplicative benefits. The cynic (realist?) on the other hand says that the less than ideal code has already incurred both a one time cost to the people that have interfaced with it, as well as a continual maintenance cost that is incurred by it existing, and that those costs are or were avoidable.

To me, this explains in a very down to earth way how the “voodoo” of multiplicative benefits actually comes about. It also shows a bit of how continual minor improvement really does add up (the main idea of Kanban), even when not taking into account things like the compound interest model (ie, saving you effort now allows you to save future effort sooner).

Go clean up some code, or fix a broken process. You will likely be surprised at how much benefit you get out of it!

I miss anything or you have a differing view? Let me know!

I was working on a Dust Game on shadertoy for a few weeks and got a decent amount of the way through it. Unfortunately I hit a snag.

There’s a sad fact that browsers compile WebGL shaders on the main thread. If ever WebGL blocks for too long, browsers consider WebGL hung and reports to the user that WebGL crashed.

The shaders for my game got complex enough that they were hitting the timeout while compiling, even though I wasn’t even close to being finished. So, I simplified a bit and published it unfinished, thinking that perhaps when WebGL 2.0 comes out this summer, I can revisit it.

Unfortunately, but totally understandably, different people have different machines in the wild, and some people have computers that take longer to compile shaders than my computer does, which means that plenty of people were still “crashing”. Because of this, the shadertoy people moved my shader back to draft/private until I simplified it more since crashes are no good, even if they are not realy real crashes in the normal sense. You can still see it by direct link here if you are interested: Shadertoy: Dust Sim Sandbox (Incomplete)

I’ve spent enough time on this thing so I’m going to leave it private for now, but I really think this shows a fundamental problem with WebGL.

Since the shaders compile on the main thread (apparently in many browsers if not all!), and detect a crash/hang by timing out, it means that when you write WebGL code, you really have to either keep things super simple in your shaders, or you have to be ok with having some amount of users experiencing “crashes” when the shader compiles time out on their machine.

In other words, you have to choose between innovation and stability.

That doesn’t really bode well for a platform like WebGL that basically is all about innovation! I’m sure this same problem plagues WebVR, which means that if you want to make a WebVR experience that works well on all machines, you better not do anything very complex 😛

Despite this failure, I did succeed in making a minesweeper game though, check it out here: Shadertoy: Minesweeper

Anyways, onto the next next shadertoy! I think I’m going to try to make one of those side view physics based car/motorcycle/bicycle hill climbing games.

## Update!

It turns out my understanding of the shader compiling process was a bit simplified. Read the link below for the full scoop, but in the end, the problem remains, because there doesn’t seem to be any way to query the status of the async work, so you seem to have to guess and hope you have enough work to do. In the case of shadertoy, there is no geometry to load, or things of that nature, so that makes it especially difficult to deal with.

Thanks to @tuan_kuranes for the better information (:

## Making a Ray Traced Snake Game in Shadertoy

Shadertoy.com recently added multipass rendering, and one of the founders – iq – introduced the feature to us by (ab)using an off screen texture to store state from previous frames. This was interesting because up until that point, you had to render each frame using only the current time, and current input from the user – you had no knowledge whatsoever of the past. He ended up making a brick breaking game, which is pretty awesome. You can see it here: Shadertoy: Bricks Game.

I decided to follow that path too and make a snake game in shadertoy. This is a high level overview of how the game works and various considerations that came up while creating the game.

## Simulation vs. Presentation

When you start making a shadertoy shader, you have a single pixel shader program to work in. In the past this is where you did all your work to display the image to the user.

Now with multipass rendering you have the option of creating an off screen buffer (up to four of them actually, but we only need one for this discussion), which you write a pixel shader program for as well. This program is able to write to the off screen buffer by emitting pixels (like usual), but interestingly, it can also read from the same off screen buffer. Since it can both read and write the image, this allows it to preserve state across frames.

Since this pixel shader has state (unlike the main on screen pixel shader), this is where all your “simulation code” has to go. This is where you do all your game logic. Strangely, even though this buffer is a full screen buffer, you only need to process as many pixels as you need for storage of your game state. You can use the “discard” keyword for all other pixels, to get a quick “early out” in the pixel processing.

The on screen pixel shader is able to read this off screen buffer which stores game state, but it isn’t able to write to it. Because of this, the on screen pixel shader is essentially the visualization of the simulation.

## Simulation Considerations

The simulation had quite a few considerations. Some of these were a little mind bending at first, but made sense after thinking about it for a while.

Perhaps the single strangest thing about the simulation pixel shader program is this: The pixel shader program can only write the value of the pixel that it represents. Let’s talk some details about what that means.

In the snake game, there is a grid of 16×16 pixels, where each pixel stores the state of a game cell. Besides this, there are four other pixels that store various pieces of game state – like what direction the snake is moving in, whether the snake is dead or not, and the location of the apple.

One complication of this is that to spawn an apple, we have to set the apple’s location variable, but we also have to mark that an apple is in the specific grid cell where it should be. What we essentially have to do is pick a random location for the apple to go to, and then we essentially have to say “if the pixel that this shader program is running for, is the specific grid cell where the apple should spawn at, then emit a pixel color that represents the necessary data to represent an apple”.

That is kind of weird. Even weirder perhaps, you have to remember that this pixel shader program is being run for several pixels per frame (once per variable you have in your state!), and they all have to be in agreement about the random position of the apple so that only one apple gets written to the board, so your random numbers have to be deterministic, yet still seemingly random to the player.

It’s odd to think that the entire game simulation is running per variable in your simulation, but that is what is happening.

There’s a trap that’s easy to fall in here too, where you might be tempted to write code like this: “If the pixel that we are running the shader program for is the new location that the snake wants to move to, and that grid cell contains a snake body part, then set the game over game state variable flag to true”

The problem there is that our if statement guarantees that we are running the pixel shader program for a grid cell, not for the game over game state variable, so when we change the game over state variable, it doesn’t get written out, since this pixel shader program instance isn’t running for that variable.

I actually wrote that exact code and it took me a little while to figure out why the game wouldn’t end when the snake crashed into itself, but it would end if it went out of bounds. The code looked very similar, and was very simple, and one worked, but the other didn’t.

The answer here is that instead of only running the logic for the grid cell that the snake head wants to enter, you have to make the logic run for all instances. So, unfortunately that means a texture read to read the grid cell that the snake head wants to enter into. That way, all instances have the information they need to run the same logic.

Technically, I guess it’s only the “game over” variable that needs to do the texture read, but it’s a lot easier and safer if you make all pixels run the same code deterministically. Also, having each pixel run the same code is likely a performance gain.

You might be able to squeeze some more performance out of being clever, but it’s real easy to bite you in the end too! In this case, the snake game runs well enough (60fps on my machine, even with the camera edge on which is the worst case) so I’m calling it good.

Numerical Problems

One gotcha to be wary of is that since you are storing game state variables in pixels, is that every variable is a vec4 of floats, and that they can only store floating point values between 0 and 1. So, if you are storing the position of a cell in a grid, you need to “normalize it” to be between 0 and 1.

Sadly, the floating point isn’t really guaranteed to be a full 32 bit floating point number either, and might be an 8 bit fixed point number or similar on some platforms. This makes it real easy to have numerical problems when you normalize your values into the 0 to 1 range.

Random Apple Spawning

When an apple is eaten (or the game is starting) an apple needs to spawn in a random location that isn’t yet taken by the snake’s body. This would mean generating random numbers in a loop until we found an empty spot. This could possibly be a perf problem for the pixel shader, possibly doing a large number of iterations, and possibly not finding a valid location by the time it ran out of it’s fixed size for loop indices.

I went with a different solution. When there is no apple spawned, the simulation generates a random position to put the apple at, and if it’s empty, it puts the apple there. If the position already was taken by a snake body part, it leaves the apple unspawned and tries again next frame. With a suitably high frame rate (it runs at 60fps for me!) it ought to be able to find an empty spot pretty quickly, even if you have a really large snake.

Frame Rate Independent Gameplay

When making games on any platform, it’s very easy to have problems with objects moving faster on machines that have higher FPS. The typical way to solve this is to keep track of how much time has elapsed between frames, keeping a total as the frames progress, and when the total gets above a certain point, you do a game update and reset the total back to zero (or the remainder after subtracting your “tick time” out of the total).

This was luckily very easy to do in shadertoy as well. There is a value you are provided called “iTimeDelta” which is how long the last frame took to render, in seconds. I divide this number by how many ticks i want per second so that the total will always be between 0 and 1, and then add it into a state variable (which is stored in a pixel). When doing the add, if the result is ever greater than 1, i do a tick, and reset it to 0.

Doing the tick just means moving the snake and handling whether the snake ate an apple, died, etc.

Low Latency Input

At first I had my input handling be handled inside of the tick. This was a problem because the tick happens 12 times a second, which means if you want the snake to change directions, you better have the key pressed down during one of those ticks. If you press it and then release it between ticks, the key press is lost and the snake doesn’t change direction and you die. This was extremely noticeable and made for really bad controls.

To solve this, I moved the key press handling to be OUTSIDE the tick, and had it be processed every frame. Instead of changing the snake’s direction right away though, I queued it up to be handled on the next tick. This way, you can quickly tap keys and they are registered as they should be on the next tick. The controls feel fine now and all is well.

## Visualization & Rendering

The rendering of the snake game had fewer considerations than the simulation, but there are a couple things worth mentioning.

There are just a few parts to the rendering:

1. If the ray hits the game board, make it look like the game board
2. If the ray enters the small box of “play area” above the game board, it raytraces the contents of the game board
3. If the ray misses everything, it does a lookup in a cube map texture to get the background
4. If the ray hits anything, it uses some variables (normal, diffuse color, shinyness) from the collision to do the shading. It also does a lookup in the cube map texture to do some environment mapping to give the impression that reflection is happening, even though it isn’t

The second bullet item is the most interesting. What the code does is find the grid cells that the ray enters and leaves the play area, and then it walks from the start to the end, checking each cell to see if it has anything in it, and if so, doing a raytrace test against it.

At first I used the Bresenham line algorithm to walk the grid cells, but when there were holes in some of my spheres in strange places, I realized my problem. Bresenham isn’t meant to draw to (visit) every grid cell that a line passes through – it is only meant to draw to the most important ones!

So, I replaced Bresenham with a grid traversal algorithm and all was well.

The game is rendered using real time raytracing, but it only does PRIMARY rays. It doesn’t do reflection, refraction or shadows. This is because doing any of those things means raytracing against the game board (grid cells) more than once. Since we have to do a texture lookup per grid cell we want to raytrace, this could equal a lot of texture lookups as you can imagine. There is a hard limit to the number you can do (I’m pretty sure, although I don’t know what it is, and it probably varies from machine to machine), so I wanted to try and not push things too hard, and just stayed with primary rays only.

## That’s All!

That’s basically it. Multipass rendering is an awesome feature in shadertoy. On one hand it makes me a bit sad getting it, because people don’t have to work within such tight constraints when doing the amazing things they do in shadertoy. On the other hand though, I think multipass rendering really ought to empower people to do some much more amazing things than in previous days – including but not limited to games.

I really look forward to learning from what other people are doing to learn more about how to leverage multipass rendering in interesting ways.

I also look forward to contributing my own game and non game multipass shaders. I have some ideas for both, keep an eye out (:

## Simplifying Boolean Expressions With Karnaugh Maps

Karnaugh maps are a formalized way of turning a truth table into a fairly minimal logical expression.

It’s fairly minimal in that it’s the minimal “sum of products” representation, but that might not be the minimal representation of the logic circuit.

The sum of products representation just means that you OR (sum) multiple terms that are ANDed (product) together. For instance the below is a sum of products expression:

$Out = \overline{A}B + AB$

With multiplication being AND, addition being OR, and a line over a letter being a NOT, the above could be written in C++ code like this:

bool out = (!A && B) || (A && B);

It would be real easy to write code like that especially if you were adding onto an existing condition, and you might not even notice that it isn’t the minimal Boolean expression to make the desired output.

Using Boolean algebra, you can do the following simplifications:
$Out = \overline{A}B + AB\\ = B*(\overline{A}+A)\\ = B*1\\ = B$

Which simplifies the C++ code to just this:

bool out = B;

Using Boolean algebra to simplify, you’d have to remember (or derive) the identity that $A+\overline{A}=1$, and all the other identities to help you simplify equations.

Karnaugh maps make this easier because you will be able to see visually what can be combined (simplified) and what can’t.

Again though, while they give you the smallest possible sum of products representation of the logic circuit, that may not be the smallest possible representation of the circuit.

Let’s get to it!

## Two Variable Karnaugh Map: Basics

Going with the example above, it takes two Boolean variables as input (A and B), and gives one Boolean variable as output. Having two input variables means we need a two variable Karnaugh map.

The first step to building the Karnaugh map is having a truth table for the input to output mappings. For our example we’ll use this truth table. This is one of many truth tables that satisfies our equation, so we are working backwards a bit, but hopefully it still makes sense. Usually you would start with the truth table and get a Boolean equation, not the other way around.
$\begin{array}{c|c|c} A & B & Out\\ \hline 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ \end{array}$

The next thing we do is make our karnaugh map by making a square 2×2 grid where one side of the square is the possible A values, the other side is the possible B values, and the contents of the grid cells are the values we want the formula to come out to be:

$\begin{array}{ccc} & \overline{B} & B \\ \overline{A} & 0 & 1 \\ A & 0 & 1 \\ \end{array}$

Next, what you do is circle all 1’s that are in groups of a power of two (one item or two items). Doing that, we get this:

$\begin{array}{ccc} & \overline{B} & B \\ \cline{3-3} \overline{A} & 0 & \multicolumn{1}{|c|}{1} \\ A & 0 & \multicolumn{1}{|c|}{1} \\ \cline{3-3} \end{array}$

Looking at what we circled, we can see that both values of $A$ are involved in our group, so $A$ doesn’t matter to the output. We can also see that $\overline{B}$ is not involved in any places where there is a 1, so we can ignore that too. All that leaves is $B$, which is our final, and most minimal answer.

That agrees with the Boolean algebra solution, but came without having to remember any identities. Hooray!

## Two Variable Karnaugh Map: Overlaps

If there were multiple groups, you would combine each group with OR to get the final answer. Groups are also allowed to overlap! For instance, let’s look at this truth table to start out:

$\begin{array}{c|c|c} A & B & Out\\ \hline 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \\ \end{array}$

Turning that into a Karnaugh map, we get this:

$\begin{array}{ccc} & \overline{B} & B \\ \overline{A} & 0 & 1 \\ A & 1 & 1 \\ \end{array}$

Next when it’s time to circle our groups, we have two groups, and they overlap! Here is the first group, which is the same as before:

$\begin{array}{ccc} & \overline{B} & B \\ \cline{3-3} \overline{A} & 0 & \multicolumn{1}{|c|}{1} \\ A & 1 & \multicolumn{1}{|c|}{1} \\ \cline{3-3} \end{array}$

That group can be expressed as just B.

The other group is this:

$\begin{array}{ccc} & \overline{B} & B \\ \overline{A} & 0 & 1 \\ \cline{2-3} A & \multicolumn{1}{|c}{1} & \multicolumn{1}{c|}{1} \\ \cline{2-3} \end{array}$

That group can be expressed as just A.

Lastly, we OR our groups together, aka we sum our products, and we get $A+B$ as an answer, which in other words is just A OR B. Check out the truth table and you can see that it is indeed a truth table for OR!

## Two Variable Karnaugh Map: Single Sized Groups

What if we don’t have groups of two though, what if we only have groups of one? Let’s explore that real quick with the following truth table:

$\begin{array}{c|c|c} A & B & Out\\ \hline 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ \end{array}$

That becomes this Karnaugh map:

$\begin{array}{ccc} & \overline{B} & B \\ \overline{A} & 1 & 0 \\ A & 0 & 1 \\ \end{array}$

We once again have two groups, but they are each of size one, which is totally ok!

The upper left group is expressed as $\overline{AB}$, while the lower right group is expressed as $AB$. We OR those two groups together to get the answer: $\overline{AB} + AB$. That’s all there is to it.

## Four Variable Karnaugh Map: Don’t Care Values

Let’s get a little more sophisticated and see how we would handle four input variables (We could go to three variables next but after learning two and four it’ll be easy to see how to do three). We will start with a truth table, but our truth table will only contain the input values we care about. We’ll omit the ones we don’t care about.

$\begin{array}{c|c} ABCD & Out\\ \hline 0001 & 0 \\ 0011 & 1 \\ 0111 & 0 \\ 1111 & 1 \\ \end{array}$

We’ll put 1’s and 0’s into the Karnaugh map to match our truth table, but put x’s where the output wasn’t listed. These are “don’t care” values, where they could either be 0’s or 1’s, depending on whichever is more convinient for us when simplifying. We are also going to change how we label the map a bit.

$\begin{array}{cccccc} & & \multicolumn{4}{c}{CD} \\ & & 00 & 01 & 11 & 10 \\ & 00 & x & 0 & 1 & x \\ AB & 01 & x & x & 0 & x \\ & 11 & x & x & 1 & x \\ & 10 & x & x & x & x \\ \end{array}$

In this case, even with the wild card don’t care values, we still just have two 1 item groups, that we OR together to get the answer:

$\overline{AB}CD+ABCD$

Note that we could factor out the CD and make it into the below, but then it would no longer be in a sum of products form.

$CD(\overline{AB}+AB)$

You might also have noticed the strange ordering of the values in the table: 00, 01, 11, 10. Normally doesn’t 2 (10) come before 3 (11)? It does, except in this case, we only want to change one variable at a time between neighboring cells. Going from 01 to 11 means that the first bit changed, while going from 01 to 10 means that two bits changed, so isn’t useful for us finding groups. The order that the numbers are in is actually called Gray Code, named after Frank Grey (Wikipedia: Gray Code).

## Four Variable Karnaugh Map: Larger Groups

When dealing with karnaugh maps, like I said before, the groups have to be a size of a power of two, but interestingly it can be a power of two on each axis. So valid groups include 4×1, 2×1, 2×2, 4×2 and others. Let’s take a look at one where we encounter a 2×2 group.

$\begin{array}{c|c} ABCD & Out\\ \hline 0000 & 0 \\ 0001 & 0 \\ 0010 & 0 \\ 0011 & 0 \\ 0100 & 0 \\ 0101 & 1 \\ 0110 & 0 \\ 0111 & 1 \\ 1000 & 0 \\ 1001 & 0 \\ 1010 & 0 \\ 1011 & 1 \\ 1100 & 0 \\ 1101 & 1 \\ 1110 & 0 \\ 1111 & 1 \\ \end{array}$

That gives us the Karnaugh map:

$\begin{array}{cccccc} & & \multicolumn{4}{c}{CD} \\ & & 00 & 01 & 11 & 10 \\ & 00 & 0 & 0 & 0 & 0 \\ AB & 01 & 0 & 1 & 1 & 0 \\ & 11 & 0 & 1 & 1 & 0 \\ & 10 & 0 & 0 & 1 & 0 \\ \end{array}$

There are two groups there. The first is the 2×2 group below and is the intersection of where B is 1, and D is 1, so can be represented as BD.

$\begin{array}{cccccc} & & \multicolumn{4}{c}{CD} \\ & & 00 & 01 & 11 & 10 \\ & 00 & 0 & 0 & 0 & 0 \\ \cline{4-5} AB & 01 & 0 & \multicolumn{1}{|c}{1} & \multicolumn{1}{c|}{1} & 0 \\ & 11 & 0 & \multicolumn{1}{|c}{1} & \multicolumn{1}{c|}{1} & 0 \\ \cline{4-5} & 10 & 0 & 0 & 1 & 0 \\ \end{array}$

The second group is a 1×2 group that overlaps the first, and is where A,C and D are 1, but B can be either 0 or 1. That makes it able to be represented as ACD.

$\begin{array}{cccccc} & & \multicolumn{4}{c}{CD} \\ & & 00 & 01 & 11 & 10 \\ & 00 & 0 & 0 & 0 & 0 \\ AB & 01 & 0 & 1 & 1 & 0 \\ \cline{5-5} & 11 & 0 & 1 & \multicolumn{1}{|c|}{1} & 0 \\ & 10 & 0 & 0 & \multicolumn{1}{|c|}{1} & 0 \\ \cline{5-5} \end{array}$

We combine those groups with OR to get the answer:

$BD+ACD$

## Four Variable Karnaugh Map: Wrap Around

Interestingly, you can make groups by wrapping around the edges of the Karnaugh map, either horizontally or vertically. Let’s start with a truth table:

$\begin{array}{c|c} ABCD & Out\\ \hline 0000 & 0 \\ 0001 & 0 \\ 0010 & 0 \\ 0011 & 1 \\ 0100 & 0 \\ 0101 & 0 \\ 0110 & 0 \\ 0111 & 0 \\ 1000 & 0 \\ 1001 & 0 \\ 1010 & 0 \\ 1011 & 1 \\ 1100 & 0 \\ 1101 & 0 \\ 1110 & 0 \\ 1111 & 0 \\ \end{array}$

That gives us the Karnaugh map:

$\begin{array}{cccccc} & & \multicolumn{4}{c}{CD} \\ & & 00 & 01 & 11 & 10 \\ & 00 & 0 & 0 & 1 & 0 \\ AB & 01 & 0 & 0 & 0 & 0 \\ & 11 & 0 & 0 & 0 & 0 \\ & 10 & 0 & 0 & 1 & 0 \\ \end{array}$

Here is the group highlighted below, which is represented as $\overline{B}CD$, which is also the answer:

$\begin{array}{cccccc} & & \multicolumn{4}{c}{CD} \\ & & 00 & 01 & 11 & 10 \\ & 00 & 0 & 0 & \multicolumn{1}{|c|}{1} & 0 \\ \cline{5-5} AB & 01 & 0 & 0 & 0 & 0 \\ & 11 & 0 & 0 & 0 & 0 \\ \cline{5-5} & 10 & 0 & 0 & \multicolumn{1}{|c|}{1} & 0 \\ \end{array}$

## Two Variable Karnaugh Map: Handling Redundant Info

If you are like me, you might be wondering – If Karnaugh maps can give you the minimal sum of products expression for a truth table, how does it deal with redundant information or solutions that are of equal size, so it’s ambiguous which to choose?

For instance, Let’s go with the truth table table below. All other inputs not listed are “don’t care” values.

$\begin{array}{c|c} AB & Out\\ \hline 00 & 0 \\ 11 & 1 \\ \end{array}$

It’s obvious that the output bit corresponds exactly to both A and B separately. Which one does it choose, or does it make some more complex expression that involves both?

Here is the Karnaugh map:

$\begin{array}{ccc} & \overline{B} & B \\ \overline{A} & 0 & x \\ A & x & 1 \\ \end{array}$

Well, the disambiguation comes up now that you – the pattern finder in the Karnaugh map – chooses between the two possible groups.

One answer, which is perfectly valid is the below, which is just $A$.

$\begin{array}{ccc} & \overline{B} & B \\ \overline{A} & 0 & x \\ \cline{2-3} A & \multicolumn{1}{|c}{x} & \multicolumn{1}{c|}{1} \\ \cline{2-3} \end{array}$

The other answer, which is also perfectly valid, is the below, which is just $B$

$\begin{array}{ccc} & \overline{B} & B \\ \cline{3-3} \overline{A} & 0 & \multicolumn{1}{|c|}{x} \\ A & x & \multicolumn{1}{|c|}{1} \\ \cline{3-3} \end{array}$

So, the disambiguation / simplification is left up to the user to choose, but yes, it still comes up with a minimal sum of products answer, and doesn’t try to incorporate both bits into a more complex logic operation.

## Other Notes

The act of turning a truth table into a logical expression is called logical synthesis, if you want to read more along those lines.

You might be wondering if because there is a sum of products form, if there is also a product of sums form? There is in fact, and you can get that form from Karnaugh maps as well. It may result in a more optimal logical expression. More info on that here: Is Karnaugh Map possible for Maxterms?.

You might be tempted to bring a higher number of variables into the mix. Be warned… adding a 5th variable makes the Karnaugh map into a 4x4x2 3d shape. Adding a 6th variable makes it into a 4x4x4 cube. Adding a 7th variable makes it into a 4x4x4x2 hypercube, and the pattern continues.

For higher numbers of inputs, people will often use a different algorithm instead, that I hope to write a post on before too long. You can read about it here: Wikipedia: Quine–McCluskey algorithm

Lastly, you might be wondering, what do i do if i have M input bits, and N output bits? How can I make a circuit or a set of instructions to generate a minimal logical expression to encompass that?

Well, one simple way is to handle each bit separately and have N Karnaugh maps each having M variables. A problem there though is that computers do operations on multiple bits at the same time with most operations, so having each bit do it’s calculations without considering sharing instructions with another bit leaves some efficiency on the table.

I’m not sure of any better algorithms currently, but I’ve asked on stack exchange so there may be some more info there by the time you read this:
Algorithms for logical synthesis of multiple output bits?

## What Happens When you Mix Hash Tables and Binary Searching?

While not the most cache friendly operation, binary searching a sorted data set is a pretty good way to search a data set for a specific value because for N items, you have an average case and worst case of O(log N) (Wikipedia: Binary Search Algorithm).

Hashing is also a decent way to do data searching, because it has a best case for search of O(1) with the average case being able to be near that when tuned well, but in practice due to collisions it can get get up to O(n) (Wikipedia: Hash Table).

Note that if you know the keys in advance, you can ALWAYS get O(1) lookup by using the information from the last post: Minimal Perfect Hashing.

One way to deal with hash collisions is to store all the collisions for a specific hash value in a list or array.

For instance, if you had a 4 bit hash, you could have 16 arrays, where the [0]th array stored all the items that hashed to 0, the [1]th array stored all items that hashed to 1 and so on, going up to the [15]th array which stored all items that hashed to 15.

What would happen if we stored the arrays in sorted order and then did a binary search within the buckets? What would that mean to search times?

Interestingly, assuming a good hash function, the answer is that every bit of hash you use subtracts one from the number of tests you’d have to do with binary searching (See footnote 1).

## Examples

For instance, let’s say you had 1024 items sorted in an array, you would have to do up to 10 tests to search for a value in that array using a binary search since log2(1024)=10 (See footnote 2).

If we use a 1 bit hash, assuming a good hash function, that would split the array into two arrays each having 512 items in it. Those each can take up to 9 tests to search using a binary search since log2(512)=9. Doing the hash to choose which of the two lists to search cuts our search from up to 10 tests, down to 9 tests.

If instead we used a 2 bit hash, we would divide the 1024 item list into four lists each having 256 items in them. Each of these lists can be searched with up to 8 tests since log2(256) = 8. So using a 2 bit hash, we can cut our search down to 8 tests max.

Let’s say that we used an 8 bit hash. That would cut our list up into 256 lists, each of which only has 4 items in it. Each list can be searched with up to 2 tests since log2(4) = 2. Using an 8 bit hash cuts our max number of searches from 10 down to 2!

Let’s say we used a 16 bit hash, what would happen then?

Well, that would mean we had 65536 hash buckets, but only 1024 items to store in the buckets. If we had a best case collision free hash, that means only 1024 buckets had an item in them, and the rest were empty.

You’d have to hash the value you were searching for to get the hash bucket index, and if there was an item there, you’d have to test that item. If there was no item there, you could return that the value wasn’t found.

The hash isn’t free though, so this is basically O(1) or doing 1 test.

So, while each bit of hash subtracts one test from the binary search, it of course can’t go negative, or become free, so it basically has a minimum of 1.

## Footnotes

1. Technically it only approximately subtracts one from the number of tests you’d have to do, even if the hash function divides the list as evenly as possible, due to footnote 2 and not being able to split an odd number of items exactly in half.
2. technically 1024 items in an array could take up to 11 tests! You can see why with 4 items with indices 0,1,2,3. First you’d test index 2. If the value we were looking for was greater, we’d test 3 and then be done with either a found or not found result. That’s just 2 searches total. But, if the value we were looking for was less than index 2, we’d have to test index 1, and then optionally test index 0 depending on how the search value compared to index 1. With only 3 items, indicies 0,1,2, we test index 1, then either index 0 or 2 and be done, and so only have to test twice. A binary search takes log2(N+1) tests, where N is the number of items you are looking through.

## Quick Blurb

The last post i wrote on minimal perfect hashing was insanely popular (by my standards) on both reddit and hacker news. The previous highest number of visitors I had ever had to my blog in one day was about 230, but on the 16th I got 4187 visitors, and on the 17th I got 7277 visitors. That means there were over 10,000 visitors over those two days.

That is nuts!!

As a data point for those who might find it interesting, the bulk of the traffic from the first day was reddit, and the bulk of the traffic from the second day was hacker news.

I also found it interesting to see what all those people looked at after the minimal perfect hashing algorithm.

I’ve learned as the years have gone on so some of my older stuff probably contains more errors than my newer stuff (which is also not error free I’m sure).

Anyways, thanks for reading, and I hope you guys find my writings interesting and useful. Please feel free to comment and correct any misinformation, or let us know about better alternatives (:

## O(1) Data Lookups With Minimal Perfect Hashing

Hash tables are great in that you can hash a key and then use that hash as an index into an array to get the information associated with that key. That is very fast, so long as you use a fast hash function.

The story doesn’t end there though because hash functions can have collisions – multiple things can hash to the same value even though they are different. This means that in practice you have to do things like store a list of objects per hash value, and when doing a lookup, you need to do a slower full comparison against each item in the bucket to look for a match, instead of being able to only rely on the hash value comparisons.

What if our hash function didn’t have collisions though? What if we could take in N items, hash them, and get 0 to N-1 as output, with no collisions? This post will talk about how to make that very thing happen, with simple sample C++ code as well, believe it or not!

## Minimal Perfect Hashing

Perfect hashing is a hash function which has no collisions. You can hash N items, and you get out N different hash values with no collisions. The number of items being hashed has to be smaller than or equal to the possible values your hash can give as output though. For instance if you have a 4 bit hash which gives 16 possible different hash values, and you hash 17 different items, you are going to have at least one collision, it’s unavoidable. But, if you hash 16 items or fewer, it’s possible in that situation that you could have a hash which had no collisions for the items you hashed.

Minimal perfect hashing takes this a step further. Not only are there no collisions, but when you hash N items, you get 0 to N-1 as output.

You might imagine that this is possible – because you could craft limitless numbers of hashing algorithms, and could pass any different salt value to it to change the output values it gives for inputs – but finding the specific hashing algorithm, and the specific salt value(s) to use sounds like a super hard brute force operation.

The method we are going to talk about today is indeed brute force, but it cleverly breaks the problem apart into smaller, easier to solve problems, which is pretty awesome. It’s actually a pretty simple algorithm too.

Here is how you create a minimal perfect hash table:

1. Hash the items into buckets – there will be collisions at this point.
2. Sort the buckets from most items to fewest items.
3. Find a salt value for each bucket such that when all items in that bucket are hashed, they claim only unclaimed slots. Store this array of salt values for later. it will be needed when doing a data lookup.
4. If a bucket only has one item, instead of searching for a salt value, you can just find an unclaimed index and store -(index+1) into the salt array.

Once you have your minimal perfect hash calculated, here is how you do a lookup:

1. Hash the key, and use that hash to find what salt to use.
2. If the salt is positive (or zero), hash the key again using the salt. The result of that hash will be an index into the data table.
3. If the salt is negative, take the absolute value and subtract one to get the index in the data table.
4. Since it’s possible the key being looked for isn’t in the table, compare the key with the key stored at that index in the table. If they match, return the data at that index. If they don’t match, the key was not in the table.

Pretty simple isn’t it?

## Algorithm Characteristics

This perfect minimal hash algorithm is set up to be slow to generate, but fast to query. This makes it great for situations where the keys are static and unchanging, but you want fast lookup times – like for instance loading a data file for use in a game.

Interestingly though, while you can’t add new keys, you could make it so you could delete keys. You would just remove the key / value pair from the data set, and then when doing a lookup you’d find an empty slot.

Also, there is nothing about this algorithm that says you can’t modify the data associated with the keys, at runtime. Modifying the data associated with a key doesn’t affect where the key/value pair is stored in the table, so you can modify the data all you want.

If you wanted to be able to visit the items in a sorted order, when searching for the perfect minimal hash, you could also make the constraint that when looking for the salt values, that not only did the items in the bucket map to an unclaimed slot, you could make sure they mapped to the correct slot that they should be in to be in sorted order. That would increase the time it took to generate the table, and increase the chances that there was no valid solution for any salt values used, but it is possible if you desire being able to know the items in some sorted order.

Interestingly, the generation time of the minimal perfect hash apparently grows linearly with the number of items it acts on. That makes it scale well. On my own computer for instance, I am able to generate the table for 100,000 items in about 4.5 seconds.

Also, in my implementation, if you have N items, it has the next lower power of two number of salt values. You could decrease the number of salt values used if you wanted to use less memory, but that would again come at the cost of increased time to generate the table, as well as increase the chances that there was no valid solution for any salt values used.

## Example Code

Below is a simple implementation of the algorithm described.

The main point of the code (besides functionality) is readability so it isn’t optimized as well as it could be, but still runs very fast (100,000 items processed in about 4.5 seconds on my machine). Debug is quite a bit slower than release for me though – I gave up on those same 100,000 items after a few minutes running in debug.

The code below uses MurmurHash2, but you could drop in whatever hashing algorithm you wanted.

The data file for this code is words.txt and comes to us courtesy of English Wordlists.

#include <vector>
#include <algorithm>
#include <assert.h>
#include <fstream>
#include <string>

unsigned int MurmurHash2(const void * key, int len, unsigned int seed);

//=================================================================================
template <typename VALUE>
class CPerfectHashTable {
public:

typedef std::pair<std::string, VALUE> TKeyValuePair;
typedef std::vector<TKeyValuePair> TKeyValueVector;
struct SKeyValueVectorBucket {
TKeyValueVector m_vector;
size_t          m_bucketIndex;
};
typedef std::vector<struct SKeyValueVectorBucket> TKeyValueVectorBuckets;

// Create the perfect hash table for the specified data items
void Calculate (const TKeyValueVector& data) {

// ----- Step 1: hash each data item and collect them into buckets.
m_numItems = data.size();
m_numBuckets = NumBucketsForSize(m_numItems);
m_salts.resize(m_numBuckets);
m_data.resize(m_numItems);
TKeyValueVectorBuckets buckets;
buckets.resize(m_numBuckets);

for (size_t i = 0; i < m_numBuckets; ++i)
buckets[i].m_bucketIndex = i;

for (const TKeyValuePair& pair : data) {
size_t bucket = FirstHash(pair.first.c_str());
buckets[bucket].m_vector.push_back(pair);
}

// ----- Step 2: sort buckets from most items to fewest items
std::sort(
buckets.begin(),
buckets.end(),
[](const SKeyValueVectorBucket& a, const SKeyValueVectorBucket& b) {
return a.m_vector.size() > b.m_vector.size();
}
);

// ----- Step 3: find salt values for each bucket such that when all items
// are hashed with their bucket's salt value, that there are no collisions.
// Note that we can stop when we hit a zero sized bucket since they are sorted
// by length descending.
std::vector<bool> slotsClaimed;
slotsClaimed.resize(m_numItems);
for (size_t bucketIndex = 0, bucketCount = buckets.size(); bucketIndex < bucketCount; ++bucketIndex) {
if (buckets[bucketIndex].m_vector.size() == 0)
break;
FindSaltForBucket(buckets[bucketIndex], slotsClaimed);
}
}

// Look up a value by key.  Get a pointer back.  null means not found.
// You can modify data if you want, but you can't add/remove/modify keys without recalculating.
VALUE* GetValue (const char* key) {

// do the first hash lookup and get the salt to use
size_t bucket = FirstHash(key);
int salt = m_salts[bucket];

// if the salt is negative, it's absolute value minus 1 is the index to use.
size_t dataIndex;
if (salt < 0)
dataIndex = (size_t)((salt * -1) - 1);
// else do the second hash lookup to get where the key value pair should be
else
dataIndex = MurmurHash2(key, strlen(key), (unsigned int)salt) % m_data.size();

// if the keys match, we found it, else it doesn't exist in the table
if (m_data[dataIndex].first.compare(key) == 0)
return &m_data[dataIndex].second;
return nullptr;
}

private:

unsigned int FirstHash (const char* key) {
return MurmurHash2(key, strlen(key), 435) & m_bucketMask;
}

void FindSaltForBucket (const SKeyValueVectorBucket& bucket, std::vector<bool>& slotsClaimed) {

// if the bucket size is 1, instead of looking for a salt, just encode the index to use in the salt.
// store it as (index+1)*-1 so that we can use index 0 too.
if (bucket.m_vector.size() == 1) {
for (size_t i = 0, c = slotsClaimed.size(); i < c; ++i)
{
if (!slotsClaimed[i])
{
slotsClaimed[i] = true;
m_salts[bucket.m_bucketIndex] = (i + 1)*-1;
m_data[i] = bucket.m_vector[0];
return;
}
}
// we shouldn't ever get here
assert(false);
}

// find the salt value for the items in this bucket that cause these items to take
// only unclaimed slots
for (int salt = 0; ; ++salt) {
// if salt ever overflows to a negative number, that's a problem.
assert(salt >= 0);
std::vector<size_t> slotsThisBucket;
bool success = std::all_of(
bucket.m_vector.begin(),
bucket.m_vector.end(),
[this, &slotsThisBucket, salt, &slotsClaimed](const TKeyValuePair& keyValuePair) -> bool {
const char* key = keyValuePair.first.c_str();
unsigned int slotWanted = MurmurHash2(key, strlen(key), (unsigned int)salt) % m_numItems;
if (slotsClaimed[slotWanted])
return false;
if (std::find(slotsThisBucket.begin(), slotsThisBucket.end(), slotWanted) != slotsThisBucket.end())
return false;
slotsThisBucket.push_back(slotWanted);
return true;
}
);

// When we find a salt value that fits our constraints, remember the salt
// value and also claim all the buckets.
if (success)
{
m_salts[bucket.m_bucketIndex] = salt;
for (size_t i = 0, c = bucket.m_vector.size(); i < c; ++i)
{
m_data[slotsThisBucket[i]] = bucket.m_vector[i];
slotsClaimed[slotsThisBucket[i]] = true;
}
return;
}
}
}

static size_t NumBucketsForSize (size_t size) {
// returns how many buckets should be used for a specific number of elements.
// Just uses the power of 2 lower than the size, or 1, whichever is bigger.
if (!size)
return 1;

size_t ret = 1;
size = size >> 1;
while (size) {
ret = ret << 1;
size = size >> 1;
}
return ret;
}

// When doing a lookup, a first hash is done to find what salt to use
// for the second hash.  This table stores the salt values for that second
// hash.
std::vector<int> m_salts;

// NOTE: this stores both the key and the value.  This is to handle the
// situation where a key is searched for which doesn't exist in the table.
// That key will still hash to some valid index, so we need to detect that
// it isn't the right key for that index.  If you are never going to look for
// nonexistant keys, then you can "throw away the keys" and only store the
// values.  That can be a nice memory savings.
std::vector<TKeyValuePair> m_data;

// useful values
size_t m_numItems;
size_t m_numBuckets;
};

// MurmurHash code was taken from https://sites.google.com/site/murmurhash/
//=================================================================================
// MurmurHash2, by Austin Appleby

// Note - This code makes a few assumptions about how your machine behaves -

// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4

// And it has a few limitations -

// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
// 'm' and 'r' are mixing constants generated offline.
// They're not really 'magic', they just happen to work well.

const unsigned int m = 0x5bd1e995;
const int r = 24;

// Initialize the hash to a 'random' value

unsigned int h = seed ^ len;

// Mix 4 bytes at a time into the hash

const unsigned char * data = (const unsigned char *)key;

while(len >= 4)
{
unsigned int k = *(unsigned int *)data;

k *= m;
k ^= k >> r;
k *= m;

h *= m;
h ^= k;

data += 4;
len -= 4;
}

// Handle the last few bytes of the input array

switch(len)
{
case 3: h ^= data[2] << 16;
case 2: h ^= data[1] << 8;
case 1: h ^= data[0];
h *= m;
};

// Do a few final mixes of the hash to ensure the last few
// bytes are well-incorporated.

h ^= h >> 13;
h *= m;
h ^= h >> 15;

return h;
}

//=================================================================================
void WaitForEnter ()
{
printf("Press Enter to quit");
fflush(stdin);
getchar();
}

//=================================================================================
int main (int argc, char **argv)
{
// Read the data entries from a file.  Use the line as the key, and the line
// number as the data.  Limit it to 100,000 entries.
CPerfectHashTable<int> table;
decltype(table)::TKeyValueVector data;
std::ifstream file("words.txt");
std::string str;
int i = 0;
while (std::getline(file, str) && i < 100000)
{
data.push_back(std::make_pair(str, i));
++i;
}
printf("Done\n\n");

// make the perfect hash table
printf("Generating Minimal Perfect Hash Table...");
table.Calculate(data);
printf("Done\n\n");

// Verify results
printf("Verifying results...");
for (decltype(table)::TKeyValuePair& keyValuePair : data) {
int* value = table.GetValue(keyValuePair.first.c_str());
assert(value != nullptr);
if (value == nullptr)
{
printf("Error, could not find data for key \"%s\"\n", keyValuePair.first.c_str());
WaitForEnter();
return 0;
}
assert(*value == keyValuePair.second);
if (*value != keyValuePair.second)
{
printf("  table[\"%s\"] = %i\n", keyValuePair.first.c_str(), *value);
printf("Incorrect value detected, should have gotten %i!\n", keyValuePair.second);
WaitForEnter();
return 0;
}
}
printf("Done\n\n");

WaitForEnter();
return 0;
}

I learned the details of this algorithm from this page: Steve Hanov’s Blog: Throw away the keys: Easy, Minimal Perfect Hashing, after hearing about the technique mentioned occasionally by colleagues.

There are other ways to do minimal perfect hashing however. For instance give this a read: Minimal Perfect Hashing

One place that method is better than this one, is that in this one, when doing a lookup you have to hash the key twice. In the technique described by the last technique, you only have to hash the key once, and use that hash to combine the results of two lookups. The two lookups are not dependant so can be re-ordered or happen concurrently, which makes it faster on modern CPUs.

Apparently there are also some techniques for generating the minimal perfect hash on a large number of items by breaking them apart into smaller sets, which can then be paralelized across threads.

I also wanted to mention that a large part of the in memory representation of this data structure can come from storing the keys along with the data, to verify that you have indeed found the item you are looking for after doing the O(1) lookup. If you are in a situation where you are only ever going to be searching for keys that you know are in the table, you can actually forgo storing the keys, to bring the memory requirements down a significant amount.

Also, The example code implements this as a hash table, but you could also use this as a set, if you wanted fast membership tests. You could either store the keys, to be able to resolve when things were not part of the set, or, you could make a hash table of string to bool, where the bool specified whether something was in the set. Which one is better depends on your specific needs and whether you plan to search for unknown keys or not.

Lastly, as a byproduct of using a data structure like this, you can get a unique ID per key, which is the index that it appears in the data array. This can be super helpful if you have something like a list of filenames, where you would rather work with integers instead of specific file names, since integers are quicker and easier to compare, copy around, etc.

You could even make this data structure support lookups by either key or by unique id. This way, you could do a by key lookup the first time you needed to find something, and then could store off the ID to get it by ID from then on for a faster lookup. Heck, you could even do all your lookups “offline” (at tool time, not when the app is running) and for instance convert all file references in data files to be the each file’s unique ID instead. Then, you could toss out the keys of your data structure, only storing an array of data, and using the unique file ID to look into that table whenever you wanted to read or write meta data associated with that file. That would make lookups faster, and also decrease the memory requirements of memory resident data files.

It’s pretty cool stuff, and a pretty useful algorithm. Hopefully you find it useful as well (:

## Update – 12/22/15

Interestingly, this post has had something like 15,000 readers since it was posted. That is by far the most read post on this blog 😛

Here are three tools for doing minimal perfect hashing that are very likely to give you better results than the algorithm I describe above:

Here’s a conversation talking about gperf and the alternative applications, and pros and cons for each:
Stack Overflow: Making a perfect hash (all consecutive buckets full), gperf or alternatives?

Here’s a research paper on gperf by Doug Schmidt: GPERF – A Perfect Hash Function Generator

I had a thought that maybe there was some play here by using “logical synthesis” to come up with some algorithm to map the inputs (the keys of the hash table) to the outputs (collision free output indices).

I started looking into Karnaugh maps and then the Quine–McCluskey algorithm, and then espresso and espresso-exact (mincov). Where the first two things are decent at solving multi bit input to single bit output, the second two things are decent at solving multi bit input to multi bit output, allowing operations to be shared among bits.

While I haven’t found anyone using those specific algorithms to solve the problem, people have, and definitely are still, trying to also look into the ability to generate code without lookups. From what I’ve read so far, it sounds like finding such a function takes a lot longer to find and also that it runs more slowly in practice than a less perfect solution which has lookups.

Either way, this is still an active area of research, and plenty of folks are working on it so I’m going to leave it to them.

I also sort of have the feeling that if you are in need of minimal perfect hashing, you may be “doing it wrong”. For instance, if you are at all able to, you probably are likely to be better off having a pre-agreed on set of unique IDs per “thing” you want to look up. These unique IDs can be used directly as array indices for the magical always O(1) lookup that minimal perfect hashing is going for, and is actually a quicker lookup in practice since you don’t need to jump through special hoops to calculate the array index.

The only exceptions I can think of are:

1. Real world requirements and not being able to reach the ideal situation – this is the boat I’m in right now. Ideally, systems would be asking about things by an integer ID, but I can’t get there in a single step, so the perfect hashing is a temporary bridge til I can get there.
2. Even with IDs, sparse arrays are still problematic. You may have an ID per file that could be asked about, but say that you have 1,000,000 files, but you want to make an array of data for only 10,000 of them. How do you take the unique ID of a file and do a lookup into that smaller table? Minimal perfect hashing seems to be useful in this situation, but there may be other decent or comparable alternatives if you are already using integer IDs.

## Hiding a Lookup Table in a Modulus Operation

Lookup tables are a tool found in every programmer’s tool belt.

Lookup tables let you pre-calculate a complex calculation in advance, store the results in a table (an array), and then during performance critical parts of your program, you access that table to get quick answers to the calculations, without having to do the complex calculation on the fly.

In this post I’ll show a way to embed a lookup table inside of a single (large) number, where you extract values from that lookup table by taking a modulus of that number with different, specific values.

This technique is slower and takes more memory than an actual lookup table, but it’s conceptually interesting, so I wanted to share.

Also, I stumbled on this known technique while working on my current paper. The paper will make this technique a bit more practical, and I’ll share more info as soon as I am able, but for now you can regard this as a curiosity 😛

Onto the details!

## 1 Bit Input, 1 Bit Output: Pass Through

Let’s learn by example and start with a calculation that takes in an input bit, and gives that same value for an output bit. It’s just a 1 bit pass through lookup table.

$\begin{array}{c|c} \text{Input} & \text{Output} \\ \hline 0 & 0 \\ 1 & 1 \\ \end{array}$

To be able to convert that to something we can decode with modulus we have to solve the following equations:

$x \% k_0 = 0 \\ x \% k_1 = 1$

$x$ is the number that represents our lookup table. $k_0$ and $k_1$ are the values that we modulus x against to get our desired outputs out.

It looks as if we have two equations and three unknowns – which would be unsolvable – but in reality, x is the only unknown. The k values can be whatever values it takes to make the equations true.

I wrote a blog post on how to solve equations like these in a previous post: Solving Simultaneous Congruences (Chinese Remainder Theorem).

You can also use this chinese remainder theorem calculator, which is handy: Chinese Remainder Theorem Calculator

The short answer here is that the k values can be ANY numbers, so long as they are pairwise co-prime to each other – AKA they have a greatest common divisor of 1.

If we pick 3 and 4 for k0 and k1, then using the chinese remainder theorem we find that x can equal 9 and the equations are true. Technically the answer is 9 mod 12, so 21, 33, 45 and many other numbers are also valid values of x, but we are going to use the smallest answer to keep things smaller, and more manageable.

So, in this case, the value representing the lookup table would be 9. If you wanted to know what value it gave as output when you plugged in the value 0, you would modulus the lookup table (9) against k0 (3) to get the output. If you wanted to know what value it gave as output when you plugged in the value 1, you would modulus the lookup table (9) against k1 (4) to get the output. The table below shows that it passes through the value in both cases like it should:

$\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 0 & x \% k_0 & 9 \% 3 & 0\\ 1 & x \% k_1 & 9 \% 4 & 1\\ \end{array}$

## 1 Bit Input, 1 Bit Output: Not Gate

Let’s do something a little more interesting. Let’s make the output bit be the reverse of the input bit. The equations we’ll want to solve are this:

$x \% k_0 = 1 \\ x \% k_1 = 0$

We can use 3 and 4 for k0 and k1 again if we want to. Using the Chinese remainder theorem to solve the equations gives us a value of 4 for x. Check the truth table below to see how this works:

$\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 0 & x \% k_0 & 4 \% 3 & 1\\ 1 & x \% k_1 & 4 \% 4 & 0\\ \end{array}$

## 1 Bit Input, 1 Bit Output: Output Always 1

What if we wanted the output bit to always be 1 regardless of input bit?

$x \% k_0 = 1 \\ x \% k_1 = 1$

Using 3 and 4 for our k values again, we solve and get a value of 1 for x. Check the truth table to see it working below:

$\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 0 & x \% k_0 & 1 \% 3 & 1\\ 1 & x \% k_1 & 1 \% 4 & 1\\ \end{array}$

Hopefully one bit input to one bit output makes sense now. Let’s move on (:

## 2 Bit Input, 1 Bit Output: XOR Gate

Things get a little more interesting when we bump the number of input bits up to 2. If we want to make a number which represents XOR, we now have 4 equations to solve.

$x \% k_{00} = 0 \\ x \% k_{01} = 1 \\ x \% k_{10} = 1 \\ x \% k_{11} = 0$

In general we will have $2^N$ equations, where N is the number of input bits.

You might have noticed that I use subscripts for k corresponding to the input bits that the key represents. This is a convention I’ve found useful when working with this stuff. Makes it much easier to see what’s going on.

Now with four equations, we need 4 pairwise coprime numbers – no number has a common factor with another number besides 1.

Let’s pull them out of the air. Umm… 3, 4, 5, 7

Not too hard with only two bits of input, but you can see how adding input bits makes things a bit more complex. If you wanted to make something that took in two 16 bit numbers as input for example, you would need 2^32 co-prime numbers, since there was a total of 32 bits of input!

When we solve those four equations, we get a value of 21 for x.

Notice how x is larger now that we have more input bits? That is another added complexity as you add more input bits. The number representing your program can get very, very large, and require you to use “multi precision integer” math libraries to store and decode the programs, when the numbers get larger than what can be held in a 64 bit int.

Boost has a decent library for this, check out boost::multiprecision::cpp_int, it’s what I use. You can download boost from here: http://www.boost.org/doc/libs/1_59_0/more/getting_started/windows.html

Anyhow, let’s check the truth table to see if our values work:

$\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 00 & x \% k_{00} & 21 \% 3 & 0 \\ 01 & x \% k_{01} & 21 \% 4 & 1 \\ 10 & x \% k_{10} & 21 \% 5 & 1 \\ 11 & x \% k_{11} & 21 \% 7 & 0 \end{array}$

Woot, it worked.

## 2 Bit Input, 2 Bit Output: OR, AND

What happens when we add another bit of output? Basically we just treat each output bit as it’s own lookup table. This means that if we have two output bits, we will have two numbers representing our program (one for each bit), and that this is true regardless of how many input bits we have.

Let’s make the left output bit ($x_0$) be the OR of the input bits and the right output bit ($x_1$) be the AND of the input bits.

That give us these two sets of equations to solve:

$x_0 \% k_{00} = 0 \\ x_0 \% k_{01} = 1 \\ x_0 \% k_{10} = 1 \\ x_0 \% k_{11} = 1 \\ \\ x_1 \% k_{00} = 0 \\ x_1 \% k_{01} = 0 \\ x_1 \% k_{10} = 0 \\ x_1 \% k_{11} = 1 \\$

We can use the same coprime numbers for our k values as we used in the last section (3,4,5,7). Note that we use the same k values in each set of equations. This is intentional and required for things to work out!

If we solve each set of equations we get 141 for x0, and 120 for x1.

Let’s see if that worked:

$\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 00 & x_0 \% k_{00}, x_1 \% k_{00} & 141 \% 3, 120 \% 3 & 00 \\ 01 & x_0 \% k_{01}, x_1 \% k_{01} & 141 \% 4, 120 \% 4 & 10 \\ 10 & x_0 \% k_{10}, x_1 \% k_{10} & 141 \% 5, 120 \% 5 & 10 \\ 11 & x_0 \% k_{11}, x_1 \% k_{11} & 141 \% 7, 120 \% 7 & 11 \end{array}$

Hey, it worked again. Neat!

## Example Code

Now that we have the basics worked out, here is some sample code.

The lookup table takes in 8 bits as input, mapping 0..255 to 0…2pi and gives the sine of that value as output in a float. So it has 8 bits of input and 32 bits of output.

#include <vector>
#include <boost/multiprecision/cpp_int.hpp>
#include <stdint.h>
#include <string.h>
#include <memory>

typedef boost::multiprecision::cpp_int TINT;
typedef std::vector<TINT> TINTVec;

const float c_pi = 3.14159265359f;

//=================================================================================
void WaitForEnter ()
{
printf("\nPress Enter to quit");
fflush(stdin);
getchar();
}

//=================================================================================
static TINT ExtendedEuclidianAlgorithm (TINT smaller, TINT larger, TINT &s, TINT &t)
{
// make sure A <= B before starting
bool swapped = false;
if (larger < smaller)
{
swapped = true;
std::swap(smaller, larger);
}

// set up our storage for the loop.  We only need the last two values so will
// just use a 2 entry circular buffer for each data item
std::array<TINT, 2> remainders = { larger, smaller };
std::array<TINT, 2> ss = { 1, 0 };
std::array<TINT, 2> ts = { 0, 1 };
size_t indexNeg2 = 0;
size_t indexNeg1 = 1;

// loop
while (1)
{
// calculate our new quotient and remainder
TINT newQuotient = remainders[indexNeg2] / remainders[indexNeg1];
TINT newRemainder = remainders[indexNeg2] - newQuotient * remainders[indexNeg1];

// if our remainder is zero we are done.
if (newRemainder == 0)
{
// return our s and t values as well as the quotient as the GCD
s = ss[indexNeg1];
t = ts[indexNeg1];
if (swapped)
std::swap(s, t);

// if t < 0, add the modulus divisor to it, to make it positive
if (t < 0)
t += smaller;
return remainders[indexNeg1];
}

// calculate this round's s and t
TINT newS = ss[indexNeg2] - newQuotient * ss[indexNeg1];
TINT newT = ts[indexNeg2] - newQuotient * ts[indexNeg1];

// store our values for the next iteration
remainders[indexNeg2] = newRemainder;
ss[indexNeg2] = newS;
ts[indexNeg2] = newT;

// move to the next iteration
std::swap(indexNeg1, indexNeg2);
}
}

//=================================================================================
void MakeKey (TINTVec &keys, TINT &keysLCM, size_t index)
{
// if this is the first key, use 3
if (index == 0)
{
keys[index] = 3;
keysLCM = keys[index];
return;
}

// Else start at the last number and keep checking odd numbers beyond that
// until you find one that is co-prime.
TINT nextNumber = keys[index - 1];
while (1)
{
nextNumber += 2;
if (std::all_of(
keys.begin(),
keys.begin() + index,
[&nextNumber] (const TINT& v) -> bool
{
TINT s, t;
return ExtendedEuclidianAlgorithm(v, nextNumber, s, t) == 1;
}))
{
keys[index] = nextNumber;
keysLCM *= nextNumber;
return;
}
}
}

//=================================================================================
void CalculateLookupTable (
TINT &lut,
const std::vector<uint64_t> &output,
const TINTVec &keys,
const TINT &keysLCM,
const TINTVec &coefficients,
)
{
// figure out how much to multiply each coefficient by to make it have the specified modulus residue (remainder)
lut = 0;
for (size_t i = 0, c = keys.size(); i < c; ++i)
{
// we either want this term to be 0 or 1 mod the key.  if zero, we can multiply by zero, and
// not add anything into the bit value!
if ((output[i] & bitMask) == 0)
continue;

// if 1, use chinese remainder theorem
TINT s, t;
ExtendedEuclidianAlgorithm(coefficients[i], keys[i], s, t);
lut = (lut + ((coefficients[i] * t) % keysLCM)) % keysLCM;
}
}

//=================================================================================
template <typename TINPUT, typename TOUTPUT, typename LAMBDA>
void MakeModulus (TINTVec &luts, TINTVec &keys, LAMBDA &lambda)
{
// to keep things simple, input sizes are being constrained.
// Do this in x64 instead of win32 to make size_t 8 bytes instead of 4
static_assert(sizeof(TINPUT) < sizeof(size_t), "Input too large");
static_assert(sizeof(TOUTPUT) < sizeof(uint64_t), "Output too large");

// calculate some constants
const size_t c_numInputBits = sizeof(TINPUT) * 8;
const size_t c_numInputValues = 1 << c_numInputBits;
const size_t c_numOutputBits = sizeof(TOUTPUT) * 8;

// Generate the keys (coprimes)
TINT keysLCM;
keys.resize(c_numInputValues);
for (size_t index = 0; index < c_numInputValues; ++index)
MakeKey(keys, keysLCM, index);

// calculate co-efficients for use in the chinese remainder theorem
TINTVec coefficients;
coefficients.resize(c_numInputValues);
fill(coefficients.begin(), coefficients.end(), 1);
for (size_t i = 0; i < c_numInputValues; ++i)
{
for (size_t j = 0; j < c_numInputValues; ++j)
{
if (i != j)
coefficients[i] *= keys[j];
}
}

// gather all the input to output mappings by permuting the input space
// and storing the output for each input index
std::vector<uint64_t> output;
output.resize(c_numInputValues);
union
{
TINPUT value;
size_t index;
} input;
union
{
TOUTPUT value;
size_t index;
} outputConverter;

for (input.index = 0; input.index < c_numInputValues; ++input.index)
{
outputConverter.value = lambda(input.value);
output[input.index] = outputConverter.index;
}

// iterate through each possible output bit, since each bit is it's own lut
luts.resize(c_numOutputBits);
for (size_t i = 0; i < c_numOutputBits; ++i)
{
const size_t bitMask = 1 << i;
CalculateLookupTable(
luts[i],
output,
keys,
keysLCM,
coefficients,
);
}
}

//=================================================================================
int main (int argc, char **argv)
{
// Look up tables encodes each bit, keys is used to decode each bit for specific
// input values.
TINTVec luts;
TINTVec keys;

// this is the function that it turns into modulus work
typedef uint8_t TINPUT;
typedef float TOUTPUT;
auto lambda = [] (TINPUT input) -> TOUTPUT
{
return sin(((TOUTPUT)input) / 255.0f * 2.0f * c_pi);
};

MakeModulus<TINPUT, TOUTPUT>(luts, keys, lambda);

// show last lut and key to show what kind of numbers they are
std::cout << "Last Lut: " << *luts.rbegin() << "\n";
std::cout << "Last Key: " << *keys.rbegin() << "\n";

// Decode all input values
std::cout << "\n" << sizeof(TINPUT) << " bytes input, " << sizeof(TOUTPUT) << " bytes output\n";
for (size_t keyIndex = 0, keyCount = keys.size(); keyIndex < keyCount; ++keyIndex)
{
union
{
TOUTPUT value;
size_t index;
} result;

result.index = 0;

for (size_t lutIndex = 0, lutCount = luts.size(); lutIndex < lutCount; ++lutIndex)
{
TINT remainder = luts[lutIndex] % keys[keyIndex];
size_t remainderSizeT = size_t(remainder);
result.index += (remainderSizeT << lutIndex);
}

TINT remainder = luts[0] % keys[keyIndex];
std::cout << "i:" << keyIndex << " o:" << result.value << "\n";
}

WaitForEnter();
return 0;
}

Here is some output from the program. The first is to show what the last (largest) look up table and key look like. Notice how large the look up table number is!

Here it shows some sine values output from the program, using modulus against the large numbers calculated, to get the bits of the result out:

## How to Get Lots of Pairwise Co-Prime Numbers?

You can generate a list of pairwise coprimes using brute force. Have an integer that you increment, and check if it’s pairwise co-prime to the existing items in the list. If it is, add it to the list! Rinse and repeat until you have as many as you want.

That is the most practical way to do it, but there are two other interesting ways I wanted to mention.

The first way is using Fermat numbers. The Fermat numbers are an infinite list of pairwise co-prime numbers and are calculated as $2^{2^n}+1$ where n is an integer. Fermat numbers also have the benefit that you can get the nth item in the list without calculating the numbers that came before it. The only problem is that the numbers grow super huge very fast. The first 7 values are: 3, 5, 17, 257, 65537, 4294967297, 18446744073709551617. If Fermat numbers didn’t grow so quickly, they sure would be useful for things like this technique.

The second way is using something called Sylvester’s sequence. It too is an infinite list of pairwise co-prime numbers, and it too grows very large very quickly unfortunately. I also don’t believe there is a way to calculate the Nth item in the list directly. Every number is based on previous numbers, so you have to calculate them all from the beginning. No random access!

## Beyond Binary

In this post I showed how to work in binary digits, but there is no reason why you have to encode single bits in the lookup tables.

Instead of encoding 0 or 1 in each modulus “lookup table”, you could also perhaps store base 10 numbers in the tables and have 0-9. Or, maybe you encode a byte per lookup table.

Encoding more than one bit effectively makes both your input and your output smaller, which helps the algorithm do more with less.

Your keys will need to be larger though, since the keys have to be larger than the value you plan to store, and your resulting lookup table will be a larger number as well. It might make the technique more worth while though.

I’ll leave that as an exercise for you. If try it and find neat stuff, post a comment and let us know, or drop me an email or something. It’d be neat to hear if people find any practical usage cases of this technique 😛

## The End, For Now!

I want to point out that increasing the number of input bits in this technique is a pretty expensive thing to do, but increasing the number of output bits is a lot cheaper. It kind of makes sense in a way if you think about it. Input bits add information from the outside world that must be dealt with, while output bits are just fluff that can easily be diluted or concentrated by adding or removing bits that are associated with, and calculated from, the input bits.

Another problem you may have noticed with this technique is that if you have a really expensive calculation that you are trying to “flatten” into modulus math like this, that you have to run that calculation many, many times to know what values a lookup table would give you. You have to run it once per possible input to get every possible output. That is expected when making a lookup table, since you are paying a cost up front to make things faster later.

The paper I’m working on changes things a bit though. One of the things it does is it makes it so doing this technique only requires that you evaluate the function once, and it calculates all values simultaneously to give the end result that you can then do modulus against. It’s pretty cool IMO and I will share more details here as soon as I am able – and yes, i have actual working code that does that, believe it or not! I’m looking forward to being able to share it later on. Maybe someone will find some really cool usage case for it.

## Quantum Computing For Programmers Part 2: Multiple Qubits

In part 1 (Quantum Computing For Programmers Part I: One Qubit) we looked at the basics of quantum computing and how to deal with a single qubit. Here we’ll talk about the interesting things that happen when you involve multiple qubits!

## Multiple Qubit Probabilities & Possibilities

In the last post we used the analogy of a coin to describe a qubit. The coin can be either heads or tails, or flipping in the air, waiting to become either a heads or tails when it lands. The same is true of how a qubit works, where it can either be a 0 or a 1 or some superposition of both, only deciding which it is when observed. Furthermore, just like a real coin, there can be a bias towards becoming one value or the other.

We represented a coin by having a probability for each possible state (heads or tails), but represented a qubit by having an AMPLITUDE for each possible state (0 or 1), where you square an amplitude to get the probability.

Going back to the coin analogy, how would probabilities work if we had two coins?

The four possible outcomes with two coins are:
Tails/Tails

Each coin individually could either be sitting on the table heads or tails, or up in the air, waiting to become either a heads or a tails, with some probability of becoming a heads or a tails.

To figure out the probability of each possible outcome, you just multiply the probability of each coin’s outcome together.

For instance, let’s say that the first coin (coin A) is already sitting on the table as a heads, and the second coin (coin B) is flipping in the air, with a 1/3 chance of becoming heads and a 2/3 chance of becoming tails. Here is how you would calculate the probability of each possible state:

$\begin{array}{c|c|c|c} \text{Outcome A/B} & \text{Coin A Probability} & \text{Coin B Probability} & \text{Outcome Probability (A*B)}\\ \hline heads / heads & 100\% & 33\% & 33\% \\ heads / tails & 100\% & 67\% & 67\% \\ tails / heads & 0\% & 33\% & 0\% \\ tails / tails & 0\% & 67\% & 0\% \\ \end{array}$

Qubits actually work the same way! Using the same values as the coins, let’s say qubit A is a 0, and qubit B has a 1/3 chance of becoming a 0, and a 2/3 chance of becoming a 1. Converting those probabilities to amplitudes you could get $A=[1,0], B=[\frac{1}{\sqrt{3}}, \frac{\sqrt{2}}{\sqrt{3}}]$.

$\begin{array}{c|c|c|c|c} \text{Outcome AB} & \text{Qubit A Amplitude} & \text{Qubit B Amplitude} & \text{Outcome Amplitude(A*B)} & \text{Outcome Probability}\\ \hline 00 & 1 & 1/\sqrt{3} & 1/\sqrt{3} & 33\% \\ 01 & 1 & \sqrt{2}/\sqrt{3} & \sqrt{2}/\sqrt{3} & 67\% \\ 10 & 0 & 1/\sqrt{3} & 0 & 0\%\\ 11 & 0 & \sqrt{2}/\sqrt{3} & 0 & 0\% \\ \end{array}$

Note that in both cases, the probabilities add up to 100% like you’d expect.

In the case of qubits, the resulting vector is $[\frac{1}{\sqrt{3}}, \frac{\sqrt{2}}{\sqrt{3}}, 0, 0]$, which is still normalized, and represents the amplitudes for the 4 possible states of those two qubits: 00, 01, 10 and 11.

When working with one qubit (or coin), there are 2 possible outcomes 0 or 1 (heads or tails). When working with two qubits (or coins), there are four possible outcomes 00, 01, 10, 11 (heads/heads, heads/tails, tails/heads, tails/tails). When working with three qubits or coins, there are eight possible outcomes: 000,001,010 … 111.

With both coins and qubits there are $2^N$ possibilities, where $N$ is the number of qubits or coins you have.

If you had 8 qubits, you’d have to use a 256 dimensional vector to describe the possibilities, since $2^8$ is 256. When performing quantum computing with 8 qubits, you only have to deal with the 8 qubits. When simulating quantum computing on a regular, classical computer, you have to deal with the 256 amplitudes. This kind of gives a glimpse at how quantum computers can be faster than regular computers at certain things. There is an economy of scale working against us on regular computers simulating quantum computing.

The method we used to combine the probabilities of single qubits into an amplitude vector representing multiple qubits is called the Kronecker product. That’s just a fancy way of saying we have to multiply everything from the first vector by everything from the second vector, to get a third vector that is bigger than the first two. You’ll see it represented like this: $A \otimes B$ and while it’s similar to the “outer product” and even uses the same symbol (!), it gives a slightly different result versus if you did an outer product and then vectorized the matrix.

The Kronecker product of vectors works like this:

$\begin{bmatrix} A_1 \\ A_2 \\ ... \\ A_M \\ \end{bmatrix} \otimes \begin{bmatrix} B_1 \\ B_2 \\ ... \\ B_N \\ \end{bmatrix} = \begin{bmatrix} A_1 B_1 \\ A_1 B_2 \\ ... \\ A_1 B_N \\ A_2 B_1 \\ A_2 B_2 \\ ... \\ A_2 B_N \\ ... \\ A_M B_1 \\ A_M B_2 \\ ... \\ A_M B_N \\ \end{bmatrix}$

Let’s show an example involving two qubits.

The first qubit has a 75% chance of being heads so it’s amplitude is $[\sqrt{3}/2,1/2]$. The second qubit has a 2/3 chance of being heads so has an amplitude of $[\sqrt{2}/\sqrt{3}, 1/\sqrt{3}]$.

To calculate the amplitude vector representing these two qubits, we do a kronecker product:
$\begin{bmatrix} \cfrac{\sqrt{3}}{2} \\ \cfrac{1}{2} \\ \end{bmatrix} \otimes \begin{bmatrix} \cfrac{\sqrt{2}}{\sqrt{3}} \\ \cfrac{1}{\sqrt{3}} \\ \end{bmatrix} = \begin{bmatrix} \cfrac{\sqrt{3}}{2}\cfrac{\sqrt{2}}{\sqrt{3}} \\ \cfrac{\sqrt{3}}{2}\cfrac{1}{\sqrt{3}} \\ \cfrac{1}{2}\cfrac{\sqrt{2}}{\sqrt{3}} \\ \cfrac{1}{2}\cfrac{1}{\sqrt{3}} \\ \end{bmatrix}$

If you simplify that, you get:
$\begin{bmatrix} \cfrac{1}{\sqrt{2}} \\ \cfrac{1}{2} \\ \cfrac{1}{\sqrt{6}} \\ \cfrac{1}{2\sqrt{3}} \\ \end{bmatrix}$

$\begin{bmatrix} \cfrac{1}{\sqrt{2}} & \cfrac{1}{2} & \cfrac{1}{\sqrt{6}} & \cfrac{1}{2\sqrt{3}} & \end{bmatrix}$

Squaring those values to get the probabilities of each state, we get the below, which you might notice adds up to 100%, meaning the vector is still normalized!
$\begin{bmatrix} 50\% & 25\% & 17\% & 8\% \end{bmatrix}$

This generalizes for larger numbers of qubits, so you could use the Kronecker product to combine a vector describing 4 qubits with a vector describing 3 qubits, to come up with a vector that describes 7 qubits (which would have 128 amplitudes in it, since 2^7 = 128!).

The kronecker product also generalizes to matrices, which we will talk about shortly.

To be able to work with multiple qubits in a quantum circuit, you need to represent all qubits involved in a single vector. Doing this, while being the correct thing to do, also allows entanglement to happen, which we will talk about next!

## Entanglement – Simpler Than You Might Think!

Time to demystify quantum entanglement!

Quantum entanglement is the property whereby two qubits – which can be separated by any distance, such as several light years – can be observed and come up with the same value (0 or 1). Whether they are 0s or 1s is random, but they will both agree, when entangled in the way that causes them to agree (you could alternately entangle them to always disagree).

Interestingly, in 1965 John Bell proved that the mechanism that makes this happen is NOT that the qubits share information in advance. You can read more about that here – it’s near the end: Ars Technica – A tale of two qubits: how quantum computers work.

A common misconception is that this means that we can have faster than light (instantaneous) communication across any distance. That turns out not to be true, due to the No-Communication-Theorem (Wikipedia).

Interestingly though, you can use separated entangled qubits in separated quantum circuits to do something called Quantum Pseudo Telepathy (Wikipedia), which lets you do some things that would otherwise be impossible – like reliably winning a specially designed game that would otherwise be a game of chance.

I don’t yet understand enough about Quantum Pseudo Telepathy to see why it isn’t considered communication. I also have no idea how entanglement is actually “implemented” in the universe, but nobody seems to (or if they do, they aren’t sharing!). How can it be that two coins flipped on opposite sides of the galaxy can be guaranteed to both land with the same side facing up?

Despite those mysteries, the math is DEAD SIMPLE, so let me share it with you, it’ll take like one short paragraph. Are you ready?

Quantum computing works by manipulating the probabilities of the states of qubits. If you have two qubits, there are four possible sets of values when you observe them: 00, 01, 10, 11. If you use quantum computing to set the probabilities of the 01 and 10 states to a 0% chance, and set the probabilities of the 00 and 11 states to a 50% chance each, you’ve now entangled the qubits such that when you observe them, they will always agree, because you’ve gotten rid of the chances that they could ever DISAGREE. Similarly, if instead of setting the 01 and 10 states to 0%, you set the probability of the 00 and 11 states to 0%, you’d have entangled qubits which would always disagree when you observed them, because you’ve gotten rid of the chances that they could ever AGREE.

That’s all there is to it. Strange how simple it is, isn’t it? The table below shows how the only possible outcomes are that the qubits agree, but it is a 50/50 chance whether they are a 0 or a 1:

$\begin{array}{c|c|c} \text{Outcome} & \text{Amplitude} & \text{Probability}\\ \hline 00 & 1/\sqrt{2} & 50\% \\ 01 & 0 & 0\% \\ 10 & 0 & 0\% \\ 11 & 1/\sqrt{2} & 50\% \\ \end{array}$

Entanglement isn’t limited to just two qubits, you can entangle any number of qubits together.

Entanglement has a special mathematical meaning. If you can represent a state of a group of qubits by a kronecker product, they are not entangled. If you CAN’T represent a state of a group of qubits by a kronecker product, they ARE entangled. These two things – entanglement and lack of kronecker product factorability (made that term up) – are the same thing.

As an example, what two vectors could you use the kronecker product on to get the entangled two qubit state $1/\sqrt{2}(|00\rangle+|11\rangle)$ (or in vector form $[1/\sqrt{2}, 0, 0, 1/\sqrt{2}]$)? You’ll find there aren’t any! That state is the entangled state where the two qubits will always have the same value when you observe them.

Entangled qubits are nothing special in quantum circuits. You don’t have to take any special precautions when working with them. They are an interesting byproduct of quantum computing, and so basically are something neat that comes out of your quantum circuit, but they don’t require any extra thought when using them within your circuits. Don’t worry about them too much (:

## Multi Qubit Gates

Let’s have a look at some multi qubit gates! (These are again from Wikipedia: Quantum Gate)

### Swap Gate

Given two qubits, with possible states $|00\rangle, |01\rangle, |10\rangle, |11\rangle$, this gate swaps the amplitudes (probabilities) of $|01\rangle, |10\rangle$ and leaves the probabilities of the states $|00\rangle$ and $|11\rangle$ alone.

That might seem weird, but the probability of the $|00\rangle$ state is the probability of the first qubit being 0 added to the probability of the second qubit being 0. If those probabilities swap, they still add to the same value, so this probability is unaffected. It’s the same situation for the $|11\rangle$ state.

Here’s the matrix for the swap gate:
$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}$

### Square Root of Swap Gate

I don’t really understand a usage case for this gate myself, but it apparently does a half way swap between two qubits. That once again modifies the probabilities of the $|01\rangle, |10\rangle$ states, but leaves state $|00\rangle$ and $|11\rangle$ alone again, for the same reason as the swap gate.

Wikipedia says that if you have this gate, and the single qubit gates, that you can do universal quantum computation. In other words, you can build a fully functional quantum computer using just this and the single qubit gates.

Here’s the matrix:
$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \frac{1}{2}(1+i) & \frac{1}{2}(1-i) & 0 \\ 0 & \frac{1}{2}(1-i) & \frac{1}{2}(1+i) & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}$

### Controlled Not Gate

The controlled not (CNOT) gate flips the value of the second qubit if the first qubit is true. This is a logic / flow control type of gate.

Interestingly, this gate is also useful for creating entangled qubits, which you’ll be able to see lower down!

Here is the matrix:
$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ \end{bmatrix}$

The controlled not gate is also referred to as the quantum XOR gate. To see why, check out the truth table below for this gate:
$\begin{array}{c|c} \text{Input} & \text{Output} \\ \hline 00 & 00 \\ 01 & 01 \\ 10 & 11 \\ 11 & 10 \\ \end{array}$

If you look at the right most bit of the outputs, you’ll notice that it’s the XOR of the two input bits!

All quantum gates need to be reversible, and having these two qubits be the output of a hypothetical quantum XOR gate allows that to happen. If you look at the right most bit of the inputs, you’ll notice that it is also the XOR of the two output bits. It’s bidirectional, which is kind of weird and kind of interesting 😛

### Generalized Control Gate

You can actually convert any single qubit gate into a controlled gate. That makes a 2 qubit gate which only does work on the second qubit if the first qubit is true.

How you do that is you make a 4×4 identity matrix, and make the lower 2×2 sub-matrix into the single qubit matrix you want to use.

In other words, if your single qubit matrix is this:
$\begin{bmatrix} U_{00} & U_{01} \\ U_{10} & U_{11} \\ \end{bmatrix}$

Then the controlled version of the matrix would be this:
$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & U_{00} & U_{01} \\ 0 & 0 & U_{10} & U_{11} \\ \end{bmatrix}$

Pretty simple right?

### Toffoli Gate

The Tofolli gate is a 3 qubit gate that is also known as the CCNOT gate or controlled controlled not gate. It flips the third qubit if the first two qubits are true.

It’s matrix looks pretty uninteresting:
$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ \end{bmatrix}$

This is also known as the quantum AND gate. If you look at the truth table below, you’ll see that when the input has the right most qubit of 0, that in the output, the right most qubit will be the AND value of the two left qubits. We need the three bits of input mapping to three bits of output like this so that the gate is reversible.

$\begin{array}{c|c} \text{Input} & \text{Output} \\ \hline 000 & 000 \\ 001 & 001 \\ 010 & 010 \\ 011 & 011 \\ 100 & 100 \\ 101 & 101 \\ 110 & 111 \\ 111 & 110 \\ \end{array}$

### Fredkin Gate

The Fredkin gate is a controlled swap gate. Here’s the matrix:
$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \end{bmatrix}$

Just like we talked about how to make a generalized controlled gate acting on two qubits, you should be able to notice that this gate is just a specific version of a generalized controlled 3 qubit gate. You take an 8×8 identity matrix, and make the lower right 4×4 matrix be the 2 qubit gate that you want to add a control on. In this case, the lower right 4×4 matrix is just the swap gate we mentioned earlier in the post (:

## Circuit To Entangle Qubits

The circuit to entangle qubits is pretty simple, so let’s start with that as the first quantum circuit we look at. It takes in two qubits. The first qubit is put through a Hadamard gate, and then both qubits are put through a controlled not gate. That circuit looks like this (image courtesy of Quantum Circuit Simulator):

The value you plug in for the second qubit determines what type of entanglement you get out. Setting the second qubit to 0, you will get entangled qubits which always agree when they are observed. Setting it to 1, you will get entangled qubits which always disagree when they are observed. The first qubit controls whether the phase of each state matches or mismatches. Note that these are the four Bell States (Wikipedia: Bell State).

$\begin{array}{c|c|c} \text{Input} & \text{Output In Ket Notation} & \text{Output As Vector} \\ \hline 00 & \frac{1}{\sqrt{2}}(|00\rangle+|11\rangle) & [1/\sqrt{2},0,0,1/\sqrt{2}] \\ 01 & \frac{1}{\sqrt{2}}(|01\rangle+|10\rangle) & [0,1/\sqrt{2},1/\sqrt{2},0] \\ 10 & \frac{1}{\sqrt{2}}(|00\rangle-|11\rangle) & [1/\sqrt{2},0,0,-1/\sqrt{2}] \\ 11 & \frac{1}{\sqrt{2}}(|01\rangle-|10\rangle) & [0,1/\sqrt{2},-1/\sqrt{2},0]\\ \end{array}$

In a quantum circuit, you can’t just apply a gate to an individual quabit at a time though. You have to make the matrix of your gate such that it applies to all qubits, applying the “identity” matrix to the qubits you don’t want to be affected by the gate.

So, how do we make a matrix that applies the Hadamard gate to qubit 1, and identity to qubit 2? You use the kronecker product!

Since we want to apply the Hadamard matrix to qubit 1 and identity to qubit 2, we are going to calculate $H \otimes I$ (If we wanted to apply the Hadamard gate to qubit 2 and identity to qubit 1 we would calculate $I \otimes H$ instead).

$H \otimes I = 1/\sqrt{2}* \begin{bmatrix} 1 & 1 \\ 1 & -1 \\ \end{bmatrix} \otimes \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} = 1/\sqrt{2}* \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ \end{bmatrix}$

If you notice, the result of the kronecker product is just every value in the left matrix, multiplied by every value in the right matrix. Basically, the result is a 2×2 grid of identity matrices, where each of those identity matrices is multiplied by the corresponding value from the same cell in the left matrix. Since the left matrix has a 1 in all cells except the lower right, the same is true of the result… it’s a positive identity matrix in each cell, except the lower right one, which is a negative identity matrix. Hopefully that makes sense, it’s a lot easier than it sounds…

The second gate in the quantum circuit is the CNOT gate. We can actually multiply the gate we just made by the CNOT gate to represent the full quantum circuit as a single matrix. This uses regular matrix multiplication.

$1/\sqrt{2}* \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ \end{bmatrix} * \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ \end{bmatrix} = 1/\sqrt{2}* \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \\ \end{bmatrix} = \begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 1/\sqrt{2} & 0 & 0 & -1/\sqrt{2} \\ 0 & 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ \end{bmatrix}$

Lets plug some values into the circuit and see what comes out!

Lets start by plugging in 00. Our input qubits are $[1, 0]$ and $[1, 0]$. The kronecker product of those two qubit vectors is $[1, 0, 0, 0]$. Now let’s multiply that vector by the matrix of our quantum circuit.

$\begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix} * \begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 1/\sqrt{2} & 0 & 0 & -1/\sqrt{2} \\ 0 & 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ \end{bmatrix} = \begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \end{bmatrix}$

Comparing this to the table showing how our input maps to output, we got the right answer! If you plugged in the other values listed, you would get the rest of the bell state entanglements.

## Unentangling Qubits

All gates are reversible, so all circuits are reversible. To get the reverse circuit for a given matrix, you just get the inverse of the matrix. Since quantum gates (and circuits) are unitary matrices, taking the inverse of one of these matrices just means taking the conjugate transpose of the matrix. In other words, you take the transpose of the matrix, and then just negate the imaginary component of any complex numbers. In this example, there are no imaginary numbers, so you just take the transpose of the matrix.

Since our circuit matrix is this:

$\begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 1/\sqrt{2} & 0 & 0 & -1/\sqrt{2} \\ 0 & 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ \end{bmatrix}$

That means that the inverse must be this, since this is the (conjugate) transpose.

$\begin{bmatrix} 1/\sqrt{2} & 0 & 1/\sqrt{2} & 0 \\ 0 & 1/\sqrt{2} & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 0 & -1/\sqrt{2} \\ 1/\sqrt{2} & 0 & -1/\sqrt{2} & 0 \\ \end{bmatrix}$

Let’s try it out by putting the output from last section in and seeing what comes out the other side:

$\begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \end{bmatrix} * \begin{bmatrix} 1/\sqrt{2} & 0 & 1/\sqrt{2} & 0 \\ 0 & 1/\sqrt{2} & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 0 & -1/\sqrt{2} \\ 1/\sqrt{2} & 0 & -1/\sqrt{2} & 0 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}$

It worked! We got our original input back.

## Making Matrices for More Complex Circuits

We talked about how to make single qubit gates apply to specific qubits by using the kronecker product with identity to move the gate to the right location.

If you have 4 qubits and you want to apply some single qubit gate (say Hadamard) to the 3rd qubit, you would just do this to get the matrix:
$I \otimes I \otimes H \otimes I$

What if you want to do complex things with multi qubit gates though? Like what if you want to do a controlled not gate where you want the 2nd qubit to flip it’s value if the 4th qubit was true?

The answer is actually pretty simple… you use swaps gates to flip the values of the qubits so that your qubit values line up with the inputs of the gate, then you do the gate, and finally undo all the swaps to get the qubit values back to the right positions.

We need to swap qubits so that the 4th qubit value is in the 1st qubit slot, and the 2nd qubit value is in the 2nd qubit slot. We can do that with the following swaps:

Once we have done those swaps, we can do our controlled not, then do the swaps in reverse order to return the qubits to their original positions.

Here’s what the circuit looks like:

You’ll see the simpler version in circuit diagrams, but at least now you’ll know how to make things that look like this:

Below is the mathematical way that you would get the matrix representing this gate.

$I$ is the identity matrix, $S$ is the swap gate, and $C$ is the controlled not gate.

$M = \\ (I \otimes I \otimes S) * \\ (I \otimes S \otimes I) * \\ (S \otimes I \otimes I) * \\ (I \otimes S \otimes I) * \\ (C \otimes I \otimes I) * \\ (I \otimes S \otimes I) * \\ (S \otimes I \otimes I) * \\ (I \otimes S \otimes I) * \\ (I \otimes I \otimes S) \\$

## Code

Here is some simple C++ to show both the entanglement circuit and the more complicated controlled not gate we described.

#include <stdio.h>
#include <vector>
#include <complex>
#include <assert.h>

typedef std::complex<float> TComplex;

//=================================================================================
struct SComplexVector
{
public:
TComplex& Get (size_t i) { return v[i]; }

const TComplex& Get (size_t i) const { return v[i]; }

size_t Size() const
{
return v.size();
}

void Resize (size_t s)
{
v.resize(s);
}

void Print () const
{
printf("[");
for (size_t i = 0, c = v.size(); i < c; ++i)
{
if (i > 0)
printf(", ");
const TComplex& val = Get(i);
if (val.imag() == 0.0f)
{
if (val.real() == 0.0f)
printf("0");
else if (val.real() == 1.0f)
printf("1");
else
printf("%0.2f", val.real());
}
else
printf("%0.2f + %0.2fi", val.real(), val.imag());
}
printf("]\n");
}

std::vector<TComplex> v;
};

//=================================================================================
struct SComplexMatrix
{
public:
TComplex& Get (size_t x, size_t y)
{
return v[y*Size() + x];
}

const TComplex& Get (size_t x, size_t y) const
{
return v[y*Size() + x];
}

// For an MxM matrix, this returns M
size_t Size () const
{
size_t ret = (size_t)sqrt(v.size());
assert(ret*ret == v.size());
return ret;
}

// For an MxM matrix, this sets M
void Resize(size_t s)
{
v.resize(s*s);
}

void Print() const
{
const size_t size = Size();

for (size_t y = 0; y < size; ++y)
{
printf("[");
for (size_t x = 0; x < size; ++x)
{
if (x > 0)
printf(", ");

const TComplex& val = Get(x, y);
if (val.imag() == 0.0f)
{
if (val.real() == 0.0f)
printf("0");
else if (val.real() == 1.0f)
printf("1");
else
printf("%0.2f", val.real());
}
else
printf("%0.2f + %0.2fi", val.real(), val.imag());
}
printf("]\n");
}
}

std::vector<TComplex> v;
};

//=================================================================================
static const SComplexVector c_qubit0 = { { 1.0f, 0.0f } };  // false aka |0>
static const SComplexVector c_qubit1 = { { 0.0f, 1.0f } };  // true aka |1>

// 2x2 identity matrix
static const SComplexMatrix c_identity2x2 =
{
{
1.0f, 0.0f,
0.0f, 1.0f,
}
};

// Given the states |00>, |01>, |10>, |11>, swaps the |01> and |10> state
// If swapping the probabilities of two qubits, it won't affect the probabilities
// of them both being on or off since those add together.  It will swap the odds of
// only one of them being on.
static const SComplexMatrix c_swapGate =
{
{
1.0f, 0.0f, 0.0f, 0.0f,
0.0f, 0.0f, 1.0f, 0.0f,
0.0f, 1.0f, 0.0f, 0.0f,
0.0f, 0.0f, 0.0f, 1.0f
}
};

// Controlled not gate
// If the first qubit is true, flips the value of the second qubit
static const SComplexMatrix c_controlledNotGate =
{
{
1.0f, 0.0f, 0.0f, 0.0f,
0.0f, 1.0f, 0.0f, 0.0f,
0.0f, 0.0f, 0.0f, 1.0f,
0.0f, 0.0f, 1.0f, 0.0f
}
};

// Takes a pure |0> or |1> state and makes a 50/50 superposition between |0> and |1>.
// Put a 50/50 superposition through and get the pure |0> or |1> back.
// Encodes the origional value in the phase information as either matching or
// mismatching phase.
{
{
1.0f / std::sqrt(2.0f), 1.0f / std::sqrt(2.0f),
1.0f / std::sqrt(2.0f), 1.0f / -std::sqrt(2.0f)
}
};

//=================================================================================
void WaitForEnter ()
{
printf("\nPress Enter to quit");
fflush(stdin);
getchar();
}

//=================================================================================
SComplexVector KroneckerProduct (const SComplexVector& a, const SComplexVector& b)
{
const size_t aSize = a.Size();
const size_t bSize = b.Size();

SComplexVector ret;
ret.Resize(aSize*bSize);

for (size_t i = 0, ic = aSize; i < ic; ++i)
{
for (size_t j = 0, jc = bSize; j < jc; ++j)
{
size_t n = i * bSize + j;
ret.Get(n) = a.Get(i)*b.Get(j);
}
}
return ret;
}

//=================================================================================
SComplexMatrix KroneckerProduct (const SComplexMatrix& a, const SComplexMatrix& b)
{
const size_t aSize = a.Size();
const size_t bSize = b.Size();

SComplexMatrix ret;
ret.Resize(aSize*bSize);

for (size_t ax = 0; ax < aSize; ++ax)
{
for (size_t ay = 0; ay < aSize; ++ay)
{
const TComplex& aValue = a.Get(ax, ay);

for (size_t bx = 0; bx < bSize; ++bx)
{
for (size_t by = 0; by < bSize; ++by)
{
const TComplex& bValue = b.Get(bx, by);

size_t nx = ax*bSize + bx;
size_t ny = ay*bSize + by;

ret.Get(nx,ny) = aValue * bValue;
}
}
}
}

return ret;
}

//=================================================================================
SComplexMatrix operator* (const SComplexMatrix& a, const SComplexMatrix& b)
{
assert(a.Size() == b.Size());
const size_t size = a.Size();

SComplexMatrix ret;
ret.Resize(size);

for (size_t nx = 0; nx < size; ++nx)
{
for (size_t ny = 0; ny < size; ++ny)
{
TComplex& val = ret.Get(nx, ny);
val = 0.0f;
for (size_t i = 0; i < size; ++i)
val += a.Get(i, ny) * b.Get(nx, i);
}
}

return ret;
}

//=================================================================================
SComplexVector operator* (const SComplexVector& a, const SComplexMatrix& b)
{
assert(a.Size() == b.Size());
const size_t size = a.Size();

SComplexVector ret;
ret.Resize(size);

for (size_t i = 0; i < size; ++i)
{
TComplex& val = ret.Get(i);
val = 0;
for (size_t j = 0; j < size; ++j)
val += a.Get(j) * b.Get(i, j);
}

return ret;
}

//=================================================================================
int main (int argc, char **argv) {

// 2 qubit entanglement circuit demo
{
// make the circuit
const SComplexMatrix H1 = KroneckerProduct(c_hadamardGate, c_identity2x2);
const SComplexMatrix circuit = H1 * c_controlledNotGate;

// display the circuit
printf("Entanglement circuit:\n");
circuit.Print();

// permute the inputs and see what comes out when we pass them through the circuit!
SComplexVector input = KroneckerProduct(c_qubit0, c_qubit0);
SComplexVector output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(c_qubit0, c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(c_qubit1, c_qubit0);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(c_qubit1, c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();
}

// 4 qubit demo: flip the second qubit if the fourth qubit is true
{
// make the circuit
const SComplexMatrix cnot4qubit = KroneckerProduct(KroneckerProduct(c_controlledNotGate, c_identity2x2), c_identity2x2);
const SComplexMatrix swap12 = KroneckerProduct(KroneckerProduct(c_swapGate, c_identity2x2), c_identity2x2);
const SComplexMatrix swap23 = KroneckerProduct(KroneckerProduct(c_identity2x2, c_swapGate), c_identity2x2);
const SComplexMatrix swap34 = KroneckerProduct(KroneckerProduct(c_identity2x2, c_identity2x2), c_swapGate);
const SComplexMatrix circuit =
swap34 *
swap23 *
swap12 *
swap23 *
cnot4qubit *
swap23 *
swap12 *
swap23 *
swap34;

// display the circuit
printf("\nFlip 2nd qubit if 4th qubit true circuit:\n");
circuit.Print();

// permute the inputs and see what comes out when we pass them through the circuit!
SComplexVector input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit0), c_qubit0), c_qubit0);
SComplexVector output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit0), c_qubit0), c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit0), c_qubit1), c_qubit0);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit0), c_qubit1), c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit1), c_qubit0), c_qubit0);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit1), c_qubit0), c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit1), c_qubit1), c_qubit0);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit1), c_qubit1), c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit0), c_qubit0), c_qubit0);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit0), c_qubit0), c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit0), c_qubit1), c_qubit0);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit0), c_qubit1), c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit1), c_qubit0), c_qubit0);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit1), c_qubit0), c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit1), c_qubit1), c_qubit0);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();

input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit1), c_qubit1), c_qubit1);
output = input * circuit;
printf("\ninput:");
input.Print();
printf("output:");
output.Print();
}

WaitForEnter();

return 0;
}

Here is the first part of the output, which shows the results of the entangling circuit. You can see that the input gave the expected output Bell states:

Below is the second part of the output, which is the circuit that flips the 2nd qubit if the 4th qubit is true.

Each entry in the input and output vectors is the amplitude (probability) that the state it represents is true. The states start at the left with 0000, then 0001, then 0010 and continue until the very right most value which represents 1111.

If you look at the second input/output pair the input is [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0], which means it has a 100% chance of the 4 qubits being 0001 (aka the qubits represent the number 1). The output vector is [0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0], which means that it has a 100% chance of being 0101 (aka the qubits represent the number 5). Since the input did have the 4th bit set (4th from the left), it flipped the 2nd bit. So, you can see that it worked!

If you check all the input/output pairs, you’ll see that they all follow this rule.

We used only whole, real numbers, no using fractional probabilities, or imaginary amplitudes. What it does in those situations is a little bit harder to get intuition for, but rest assured that it does “the right thing” in those situations as well.

## Next Up

Now that we have the basics of quantum computing down pretty well, it’s time to analyze a couple quantum algorithms to see how they work!