Evaluating Polynomials with the GPU Texture Sampler

This is an extension of a paper I wrote which shows how to use the linear texture sampling capabilities of the GPU to calculate points on Bezier curves. You store the control points in the texture, then sample along the texture’s diagonal to get points on the curve:
GPU Texture Sampler Bezier Curve Evaluation

I’ve been thinking about the items in the “future work” section and found some interesting things regarding polynomials, logic gates, surfaces and volumes. This is the first post, which deals with evaluating polynomials.

Evaluating Polynomials

One of the main points of my paper was that N-linear interpolation (linear, bilinear, trilinear, etc) can be used to evaluate the De Casteljau algorithm since both things are just linear interpolations of linear interpolations. (Details on bilinear interpolation here: Bilinear Filtering & Bilinear Interpolation).

This meant that it was also able to calculate Bernstein Polynomials (aka the algebraic form of Bezier curves), since Bernstein polynomials are equivalent to the De Casteljau algorithm.

I started looking around to see what would happen if you messed around with the De Casteljau algorithm a bit, like interpolate at one level by
t^2 or t*0.5+0.5 or by a constant or by another variable completely. My hope was that I’d be able to make the technique more generic and open it up to a larger family of equations, so people weren’t limited to just Bernstein polynomials.

That opened up a pretty deep rabbit hole on polynomial blossoming and something called Symmetric Multiaffine Functions. There are some great links in the answer here:
Math Stack Exchange: Modifying and Generalizing the De Casteljau Algorithm

In the end, it turned out to be pretty simple though. It turns out that any polynomial can be converted back and forth from “Power Basis” (which looks like Ax^2+Bx+C) to “Bernstein Basis” (which looks like A(1-t)^2+B(1-t)t+Ct^2) so long as they are the same degree.

This isn’t the result I was expecting but it is a nice result because it’s simple. I think there is more to be explored by sampling off the diagonal, and using different t values at different stages of interpolation, but this result is worth sharing.

By the way, you could also use curve fitting to try and approximate a higher degree function with a lower degree one, but for this post, I’m only going to be talking about exact conversion from Bernstein polynomials to Power polynomials.

Since we can convert power basis polynomials to Bernstein polynomials, and the technique already works for Bernstein polynomials, that means that if we have some random polynomial, say y=2x^3+4x+2, that we can make this technique work for that too. The technique got a little closer to arbitrary equation evaluation. Neat!

Converting Power Basis to Bernstein Basis

I found the details of the conversion process at Polynomial Evaluation and Basis Conversion which was linked to by Math Stack Exchange: Convert polynomial curve to Bezier Curve control points.

This is best explained working through examples, so let’s start by converting a quadratic polynomial from power basis to Bernstein basis.

Quadratic Function

y=2x^2+8x+3

The first thing we do is write the coefficients vertically, starting with the x^0 coefficient, then the x^1 coefficient and continuing on to the highest value x^n:

\begin{array}{c} 3 \\ 8 \\ 2 \\ \end{array}

Next, we need to divide by the Binomial Coefficients (aka the row of Pascal’s Triangle which has the same number of items as we have coefficients). In this case we need to divide by: 1,2,1.

\begin{array}{c|c} 3 & 3 / 1 = 3 \\  8 & 8 / 2 = 4 \\ 2 & 2 / 1 = 2 \\ \end{array}

Now we generate a difference table backwards. it’s hard to explain what that is in words, but if you notice, each value is the sum of the value to the left of it, and the one below that.

\begin{array}{c|c|c|c} 3 & 3 / 1 = 3 & 7 & 13 \\  8 & 8 / 2 = 4 & 6 & \\ 2 & 2 / 1 = 2 &   & \\ \end{array}

We are all done. The control points for the Bezier curve are on the top row (ignoring the left most column). They are 3,7,13 which makes it so we have the following two equations being equal. The first is in power basis, the second is in Bernstein basis.

y=2x^2+8x+3
y=3(1-x)^2+14(1-x)x+13x^2

Note: don’t forget that Bezier curves multiply the control points by the appropriate row in Pascal’s triangle. That’s where the 14 comes from in the middle term of the Bernstein polynomial. We are multiplying the control points 3,7,13 by the row in Pascal’s triangle 1,2,1 to get the final coefficients of 3,14,13.

Let’s have Wolfram Alpha help us verify that they are equal.

Wolfram Alpha: graph y=2x^2+8x+3, y=3*(1-x)^2+14x*(1-x)+13x^2, from 0 to 1

Yep, they are equal! If you notice the legend of the graph, wolfram actually converted the Bernstein form back to power basis, and you can see that they are exactly equivalent.

You can also write the Bernstein form like the below, which i prefer, using t instead of x and also setting s=1-t.

y=3s^2+14st+13t^2

Cubic Function

A cubic function is not that much harder than a quadratic function. After this, you should see the pattern and be able to convert any degree easily.

y=5x^3+9x-4

Again, the first thing we do is write the coefficients vertically, starting with the constant term. Note that we don’t have an x^2 term, so it’s coefficient is 0.

\begin{array}{c} -4 \\  9 \\  0 \\  5 \\ \end{array}

We next divide by the Pascal’s triangle row 1,3,3,1.

\begin{array}{c|c} -4 & -4 / 1 = -4 \\  9 &  9 / 3 =  3 \\  0 &  0 / 3 =  0 \\  5 &  5 / 1 =  5 \\ \end{array}

Now, make the difference table going backwards again:

\begin{array}{c|c|c|c|c} -4 & -4 / 1 = -4 & -1 & 2 & 10 \\  9 &  9 / 3 =  3 &  3 & 8 & \\  0 &  0 / 3 =  0 &  5 &   & \\  5 &  5 / 1 =  5 &    &   & \\ \end{array}

Our Bezier control points are along the top: -4,-1,2,10. Keeping in mind that the coefficients for a cubic bezier curve are multiplied by 1,3,3,1 we can make the Bernstein form and put it next to our original formula:

y=5x^3+9x-4
y=-4(1-x)^3-3(1-x)^2x+6(1-x)x^2+10x^3

Let’s check in wolfram alpha again:
Wolfram Alpha: graph y=5x^3+9x-4, y=-4(1-x)^3-3x(1-x)^2+6x^2(1-x)+10x^3, from 0 to 1

And here it is in the cleaner form:

y=-4s^3-3s^2t+6st^2+10t^3

Some Notes On Calculating Polynomials with the Texture Sampler

You may notice that in the comparison graphs i only plotted the graphs from 0 to 1 on the x axis (aka the t axis). The equations are actually equivalent outside of that range as well, but the technique from my paper only works from the 0 to 1 range because it relies on built in hardware pixel interpolation. This may sound like a big limitation, but if you know the minimum and maximum value of x that you want to plug into your equation at runtime, you can convert your x into a percent between those values, get the resulting polynomial, convert it to Bernstein form, set up the texture, and then at runtime convert your input parameter into that percent when you do the lookup. In other words, you squeeze the parts of the function you care about into the 0 to 1 range.

Another issue you will probably hit is that standard RGBA8 textures have only 8 bits per channel and can only store values between 0 and 1. Since the texture is supposed to be storing your control points, that is bad news.

One way to get around this is to find the largest coefficient value and divide the others by this value. This will put the coefficients into the 0 to 1 range, which will be able to be stored in your texture. After sampling the texture, you multiply the result by that scaling value to get the correct answer.

Scaling won’t help having both negative and positive coefficients though. To handle negative coefficients, you could map the 0-1 space to be from -1 to 1, similar to how we often do it with normal maps and other signed data stored in textures. After doing the lookup you’d have to unmap it too of course.

You could also solve negative values and scaling problems by squishing the y axis into the 0 to 1 space by subtracting the minimum and dividing by the maximum minus the minimum, similarly to how we squished the x range into 0 to 1.

If you instead move to an RGBAF32 texture, you’ll have a full 32 bit float per color channel and won’t have problems with either large values or negative values. You will still have to deal with x only going from 0 to 1 though.

I also want to mention that the hardware texture interpolation works in a X.8 fixed point format. There are more details in my paper, but that means that you’ll get some jagged looking artifacts on your curve instead of a smoothly varying value. If that is a problem for you in practice, my paper talks about a few ways to mitigate that issue.

Before moving on, I wanted to mention that it’s easy to support rational polynomials using this method as well. A rational polynomial is when you divide one polynomial by another polynomial, and relates to rational Bezier curves, where you divide one curve by another curve (aka you give weights to control points). Rational curves are more powerful and in fact you can perfectly represent sine and cosine with a quadratic rational polynomial. More info on that in my paper.

To calculate rational polynomials, you just encode the numerator polynomial in one color channel, and the denominator polynomial in another color channel. After you sample the texture and get the result of your calculation, you divide the numerator value by the denominator value. It costs one division in your shader code, but that’s pretty cheap for the power it gives you!

Regarding the texture size requirements to store a polynomial of a specific degree…

Every dimension of the texture, and every color channel in that texture, adds a degree.

However, to get the benefit of the degree increase from the color channel, you need to do a little more math in the shader – check my paper for more details!

So, if you wanted to store a quadratic polynomial in a texture, you would need either a 2d texture with 1 color channel, or you could do it with a 1d texture that had 2 color channels.

If you wanted to store a cubic polynomial in a texture, you could use a 3d texture with 1 color channel, or a 2d texture with two color channels (there would be some waste here) or a 1d texture with three color channels.

For a polynomial that had a maximum degree term of 6, you could use a 3d volume texture that had 3 color channels: RGB.

If you need to evaluate a very high degree polynomial, you can actually take multiple texture samples and combine them.

For instance, if you had a 2d texture with a single color channel, you could do a single texture read to get a quadratic.

If you did two texture reads, you would have two quadratics.

If you linearly interpolated between those two quadratics, you would end up with a cubic.

That isn’t a very high degree curve but is easier to grasp how they combine.

Taking this up to RGBA 3d volume textures, a single texture read will get you a curve of degree 6. If you do another read, it will take it to degree 7. Another read gets you to 8, another to 9, etc.

With support for 4d textures, an RGBA texture read would give you a degree 7 curve. Another read would boost it to 8, another to 9, another to 10, etc.

Regarding the specific sizes of the textures, in all cases the texture size is “2” on each dimension because we are always just linearly interpolating within a hyper cube of pixel values. You can increase the size of the texture for piecewise curves, check out the paper for more details on that and other options.

Closing

Hopefully you found this useful or interesting!

There may not have been much new information in here for the more math inclined people, but I still think it’s worth while to explicitly show how the technique works for both Bernstein polynomials as well as the more common power basis polynomials.

I still think it would be interesting to look at what happens when you sample off of the diagonal, and also what happens if you use different values at different stages of the interpolation. As an example, instead of just looking up a texture at (t,t) for the (u,v) value to get a quadratic curve point, what if we look up by (t,t^2)? At first blush, it seems like by doing that we may be able to boost a curve to a higher degree, maybe at the cost of some reduced flexibility for the specific equations we can evaluate?

Next up I’ll be writing up some more extensions to the paper involving logic gates, surfaces, and volumes.

Have any feedback, questions or interesting ideas? Let me know!

Incremental Averaging

This is a super short post about something I want to be able to reference again in the future (:

Let’s say that you need to average N things, where you don’t know what the final N will be, but you want to keep an average as you go.

For instance, let’s say you are doing monte carlo path tracing rendering, where the pixel you are showing is the average of however many samples you’ve had so far, but you are continuing to get new samples and want to show the updated average as you get new samples.

The formula for doing this is super simple.

NewAverage = OldAverage + (NewValue - OldAverage) / NewSampleCount;

// Or:

Average += (NewValue - Average) / NewSampleCount;

One way of thinking of the above equations is that you are adjusting the average by how much the new value would adjust the average.

Another way of thinking of the above two equations is this:

First figure out how far the new sample is from the average, then move towards that new amount by an ever decreasing amount, as the number of samples grow.

Because of this, if you are in a language such as glsl or hlsl which has linear interpolation built in (mix in glsl, lerp in hlsl), you can use linear interpolation as well:

Average = mix(Average, NewValue, 1.0 / NewSampleCount);

To see how this works out, check out the formula for lerp:

Lerp(a,b,t) = a + (b-a)*t

Substituting Average for a, NewValue for b, and 1/NewSampleCount for t, we get this:

Lerp(Average, NewValue, 1/NewSampleCount)
= Average + (NewValue-Average)/NewSampleCount

Which is the exact same formula as above. So, using lerp in that way is mathematically equivalent.

Here’s a link with more discussion on this:
Math Stack Exchange: Incremental averageing

Here’s an awesome post on this same subject, with a different resulting formula, by Christer Ericson (@ChristerEricson), author of the famous “Real Time Collision Detection” book, and VP of technology at Activision. His solution looks like it might be more numerically robust, but I haven’t analyzed it well enough yet to know for sure:
Robustly computing the centroid for a point set

A somewhat related topic, here’s a method for keeping a sum with floating point numbers, that is more accurate than normal summation. Useful when you need to sum numbers of different magnitudes, which would normally be a problem in floating point.
Wikipedia: Kahan Summation

Here’s an incremental (aka online) algorithm that gives you the average as well as the variance, which you can square root to get the standard deviation. This is useful for when you need to know how volatile data samples are – like if reporting results from profiling code, or maybe useful for path tracing for finding how much variance is in a pixel, so you know whether you need more samples or can get away with fewer.
Wikipedia: Standard deviation – Online algorithm

Understanding The Discrete Fourier Transform

I’ve been working on getting a better understanding of the Discrete Fourier Transform. I’ve figured out some things which have really helped my intuition, and made it a lot simpler in my head, so I wanted to write these up for the benefit of other folks, as well as for my future self when I need a refresher.

The discrete Fourier transform takes in data and gives out the frequencies that the data contains. This is useful if you want to analyze data, but can also be useful if you want to modify the frequencies then use the inverse discrete Fourier transform to generate the frequency modified data.

Multiplying By Sinusoids (Sine / Cosine)

If you had a stream of data that had a single frequency in it at a constant amplitude, and you wanted to know the amplitude of that frequency, how could you figure that out?

One way would be to make a cosine wave of that frequency and multiply each sample of your wave by the corresponding samples.

For instance, let’s say that we have a data stream of four values that represent a 1hz cosine wave: 1, 0, -1, 0.

We could then multiply each point by a corresponding point on a cosine wave, add sum them together:

We get.. 1*1 + 0*0 + -1*-1 + 0*0 = 2.

The result we get is 2, which is twice the amplitude of the data points we had. We can divide by two to get the real amplitude.

To show that this works for any amplitude, here’s the same 1hz cosine wave data with an amplitude of five: 5, 0, -5, 0.

Multiplying by the 1hz cosine wave, we get… 5*1 + 0*0 + -5*-1 + 0*0 = 10.

The actual amplitude is 5, so we can see that our result was still twice the amplitude.

In general, you will need to divide by N / 2 where N is the number of samples you have.

What happens if our stream of data has something other than a cosine wave in it though?

Let’s try a sine wave: 0, 1, 0, -1

When we multiply those values by our cosine wave values we get: 0*1 + 1*0 + 0*-1 + -1*0 = 0.

We got zero. Our method broke!

In this case, if instead of multiplying by cosine, we multiply by sine, we get what we expect: 0*0 + 1*1 + 0*0 + -1*-1 = 2. We get results consistent with before, where our answer is the amplitude times two.

That’s not very useful if we have to know whether we are looking for a sine or cosine wave though. We might even have some other type of wave that is neither one!

The solution to this problem is actually to multiply the data points by both cosine and sine waves, and keep both results.

Let’s see how that works.

For this example We’ll take a cosine wave and shift it 0.25 radians to give us samples: 0.97, -0.25, -0.97, 0.25. (The formula is cos(x*2*pi/4+0.25) from 0 to 3)

When we multiply that by a cosine wave we get: 0.97*1 + -0.25*0 + -0.97*-1 + 0.25*0 = 1.94
When we multiply it by a sine wave we get: 0.97*0 + -0.25*1 + -0.97*0 + 0.25*-1 = -0.5

Since we know both of those numbers are twice the actual amplitude, we can divide them both by two to get:
cosine: 0.97
sine: -0.25

Those numbers kind of makes sense if you think about it. We took a cosine wave and shifted it over a little bit so our combined answer is that it’s mostly a cosine wave, but is heading towards being a sine wave (technically a sine wave that has an amplitude of -1, so is a negative sine wave, but is still a sine wave).

To get the amplitude of our phase shifted wave, we treat those values as a vector, and get the magnitude. sqrt((0.97*0.97)+(-0.25*-0.25)) = 1.00. That is correct! Our wave’s amplitude is 1.0.

To get the starting angle (phase) of the wave, we use inverse tangent of sine over cosine. We want to take atan2(sine, cosine), aka atan2(-0.25, 0.97) which gives us a result of -0.25 radians. That’s the amount that we shifted our wave, so it is also correct!

Formulas So Far

Let’s take a look at where we are at mathematically. For N samples of data, we multiply each sample of data by a sample from a wave with the frequency we are looking for and sum up the results (Note that this is really just a dot product between N dimensional vectors!). We have to do this both with a cosine wave and a sine wave, and keep the two resulting values to be able to get our true amplitude and phase information.

For now, let’s drop the “divide by two” part and look at the equations we have.

Value_{cosine} = \sum\limits_{n=0}^{N-1} Sample_n * Cos(2\pi*Frequency*n/N)

Value_{sine} = \sum\limits_{n=0}^{N-1} Sample_n * Sin(2\pi*Frequency*n/N)

Those equations above do exactly what we worked through in the samples.

The part that might be confusing is the part inside of the Cos() and Sin() so I’ll explain that real quick.

Looking at the graph cos(x) and sin(x) you’ll see that they both make one full cycle between the x values of 0 and 2*pi (aprox. 6.28):

If instead we change it to a graph of cos(2*pi*x) and sin(2*pi*x) you’ll notice that they both make one full cycle between the x values of 0 and 1:

This means that we can think of the wave forms in terms of percent (0 to 1) instead of in radians (0 to 2pi).

Next, we multiply by the frequency we are looking for. We do this because that multiplication makes the wave form repeat that many times between 0 and 1. It makes us a wave of the desired frequency, still remaining in “percent space” where we can go from 0 to 1 to get samples on our cosine and sine waves.

Here’s a graph of cos(2*pi*x*2) and sin(2*pi*x*2), to show some 2hz frequency waves:

Lastly, we multiply by n/N. In our sum, n is our index variable and it goes from 0 to N-1. This is very much like a for loop. n/N is our current percent done we are in the for loop, so when we multiply by n/N, we are just sampling at a different location (by percentage done) on our cosine and sine waves that are at the specified frequencies.

Not too difficult right?

Imaginary Numbers

Wouldn’t it be neat if instead of having to do two separate calculations for our cosine and sine values, we could just do a single calculation that would give us both values?

Well, interestingly, we can! This is where imaginary numbers come in. Don’t get scared, they are here to make things simpler and more convenient, not to make things more complicated or harder to understand (:

There is something called Euler’s Identity which states the below:

e^{ix} = cos(x) + i*sin(x)

That looks pretty useful for our usage case doesn’t it? If you notice that our equations both had the same parameters for cos and sin, that means that we can just use this identity instead!

We can now take our two equations and combine them into a single equation:

Value = \sum\limits_{n=0}^{N-1} Sample_n * e^{i *2\pi * Frequency*n/N}

When we do the above calculation, we will get a complex number out with a real and imaginary part. The real part of the complex number is just the cosine value, while the imaginary part of the complex number is the sine value. Nothing scary has happened, we’ve just used/abused complex numbers a bit to give us what we want in a more compact form.

Multiple Frequencies

Now we know how to look for a single, specific frequency.

What if you want to look for multiple frequencies though? Further more, what if you don’t even know what frequencies to look for?

When doing the discrete Fourier transform on a stream of samples, there are a specific set of frequencies that it checks for. If you have N samples, it only checks for N different frequencies. You could ask about other frequencies, but these frequencies are enough to reconstruct the original signal from only the frequency data.

The first N/2 frequencies are going to be the frequencies 0 (DC) through N/2. N/2 is the Nyquist frequency and is the highest frequency that your signal is able of holding.

After N/2 comes the negative frequencies, counting from a large negative frequency (N/2-1) up to the last frequency which is -1.

As an example, if you had 8 samples and ran the DFT, you’d get 8 complex numbers as outputs, which represent these frequencies:

[0hz, 1hz, 2hz, 3hz, 4hz, -3hz, -2hz, -1hz]

The important take away from this section is that if you have N samples, the DFT asks only about N very specific frequencies, which will give enough information to go from frequency domain back to time domain and get the signal data back.

That then leads to this equation below, where we just put a subscript on Value, to signify which frequency we are probing for.

Value_{Frequency} = \sum\limits_{n=0}^{N-1} Sample_n * e^{2\pi i*Frequency*n/N}

Frequency is an integer, and is between 0 and N-1. We can say that mathematically by listing this information along with the equation:

Frequency \in [0,N), Frequency \in \mathbb Z

Final Form

Math folk use letters and symbols instead of words in their equations.

The letter k is used instead of frequency.

Instead of Value_{Frequency} , the symbol is X_{k} .

Instead of Sample_n , the symbol is x_n.

This gives us a more formal version of the equation:

X_k= \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}

k\in [0,N), k \in \mathbb Z

We are close to the final form, but we aren’t quite done yet!

Remember the divide by two? I had us take out the divide by two so that I could re-introduce it now in it’s correct form. Basically, since we are querying for N frequencies, we need to divide each frequency’s result by N. I mentioned that we had to divide by N/2 to make the amplitude information correct, but since we are checking both positive AND negative frequencies, we have to divide by 2*(N/2), or just N. That makes our equation become this:

X_k= 1/N \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}

k\in [0,N), k \in \mathbb Z

Lastly, we actually want to make the waves go backwards, so we make the exponent negative. The reason for this is a deeper topic than I want to get into right now, but you can read more about why here if you are curious: DFT – Why are the definitions for inverse and forward commonly switched?

That brings us to our final form of the formula for the discrete Fourier transform:

X_k= 1/N \sum\limits_{n=0}^{N-1} x_n * e^{-2\pi ikn/N}

k\in [0,N), k \in \mathbb Z

Taking a Test Drive

If we use that formula to calculate the DFT of a simple, 4 sample, 1hz cosine wave (1, 0, -1, 0), we get as output [0, 0.5, 0, -0.5].

That means the following:

  • 0hz : 0 amplitude
  • 1hz : 0.5 amplitude
  • 2hz : 0 amplitude
  • -1hz : -0.5 amplitude

It is a bit strange that the simple 1hz, 1.0 amplitude cosine wave was split in half, and made into a 1hz cosine wave and -1hz cosine wave, each contributing half amplitude to the result, but that is “just how it works”. I don’t have a very good explanation for this though. My personal understanding just goes that if we are checking for both positive AND negative frequencies, that makes it ambiguous whether it’s a 1hz wave with 1 amplitude, or -1hz wave with -1 amplitude. Since it’s ambiguous, both must be true, and they get half the amplitude each. If I get a better understanding, I’ll update this paragraph, or make a new post on it.

Making The Inverse Discrete Fourier Transform

Making the formula for the Inverse DFT is really simple.

We start with our DFT formula, drop the 1/N from the front, and also make the exponent to e positive instead of negative. That is all! The intuition here for me is that we are just doing the reverse of the DFT process, to do the inverse DFT process.

x_k= \sum\limits_{n=0}^{N-1} X_n * e^{2\pi ikn/N}

k\in [0,N), k \in \mathbb Z

While the DFT takes in real valued signals and gives out complex valued frequency data, the IDFT takes in complex valued frequency data, and gives out real valued signals.

A fun thing to try is to take some data, DFT it, modify the frequency data, then IDFT it to see what comes out the other side.

Other Notes

Here are some other things of note about the discrete Fourier transform and it’s inverse.

Data Repeating Forever

When you run the DFT on a stream of data, the math is such that it assumes that the stream of data you gave it repeats forever both forwards and backwards in time. This is important because if you aren’t careful, you can add frequency content to your data that you didn’t intend to be there.

For instance, if you tile a 1hz sine wave, it’s continuous, and there is only the 1hz frequency present:

However, if you tile a 0.9hz sine wave, there is a discontinuity, which means that there will be other, unintended frequencies present when you do a DFT of the data, to be able to re-create the discontinuity:

Fast Fourier Transform

There are a group of algorithms called “Fast Fourier Transforms”. You might notice that if we have N samples, taking the DFT is an O(N^2) operation. Fast Fourier transforms can bring it down to O(N log N).

DFT / IDFT Formula Variations

The formula we came up with is one possible DFT formula, but there are a handful of variations that are acceptable, even though different variations come up with different values!

The first option is whether to make the e exponent negative or not. Check out these two formulas to see what I mean.

X_k= \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}
vs
X_k= \sum\limits_{n=0}^{N-1} x_n * e^{-2\pi ikn/N}

Either one is acceptable, but when providing DFT’d data, you should mention which one you did, and make sure and use the opposite one in your inverse DFT formula.

The next options is whether to divide by N or not, like the below:

X_k= \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}
vs
X_k= 1/N \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}

Again, either one is acceptable, but you need to make sure and let people know which one you did, and also make sure and use the opposite one in your inverse DFT formula.

Instead of dividing by N, some people actually divide by 1/sqrt(N) in both the DFT and inverse DFT. Wolfram alpha does this for instance!

You can read more about this stuff here: DFT – Why are the definitions for inverse and forward commonly switched?

One thing to note though is that if doing 1/N on the DFT (my personal preference), the 0hz (DC) frequency bin gives you the average value of the signal, and the amplitude data you get out of the other bins is actually correct (keeping in mind the amplitudes are split in half between positive and negative frequencies).

Why Calculate Negative Frequencies

The negative frequencies are able to be calculated on demand from the positive frequency information (eg complex conjugation), so why should we even bother calculating them? Sure it’d be more computationally efficient not to calculate them, especially when just doing frequency analysis, right?!

Here’s a discussion about that: Why calculate negative frequencies of DFT?

Higher Dimensions

It’s possible to DFT in 2d, 3d, and higher. The last blog post shows how to do this with 2d images, but I’d also like to write a blog post like this one specifically talking about the intuition behind multi dimensional DFTs.

Example Program Source Code

Here’s some simple C++ source code which calculates the DFT and inverse DFT, optionally showing work in case you want to try to work some out by hand to better understand this. Working a few out by hand really helped me get a better intuition for all this stuff.

Program Output:

Source Code:

#include <stdio.h>
#include <complex>
#include <vector>

#include <fcntl.h>
#include <io.h>
 
// set to 1 to have it show you the steps performed.  Set to 0 to hide the work.
// useful if checking work calculated by hand.
#define SHOW_WORK 1
 
#if SHOW_WORK
    #define PRINT_WORK(...) wprintf(__VA_ARGS__)
#else
    #define PRINT_WORK(...)
#endif

// Use UTF-16 encoding for Greek letters
static const wchar_t kPi = 0x03C0;
static const wchar_t kSigma = 0x03A3;
 
typedef float TRealType;
typedef std::complex<TRealType> TComplexType;
 
const TRealType c_pi = (TRealType)3.14159265359;
const TRealType c_twoPi = (TRealType)2.0 * c_pi;
 
//=================================================================================
TComplexType DFTSample (const std::vector<TRealType>& samples, int k)
{
    size_t N = samples.size();
    TComplexType ret;
    for (size_t n = 0; n < N; ++n)
    {
        TComplexType calc = TComplexType(samples[n], 0.0f) * std::polar<TRealType>(1.0f, -c_twoPi * TRealType(k) * TRealType(n) / TRealType(N));
        PRINT_WORK(L"    n = %i : (%f, %f)n", n, calc.real(), calc.imag());
        ret += calc;
    }
    ret /= TRealType(N);
    PRINT_WORK(L"    Sum the above and divide by %in", N);
    return ret;
}
 
//=================================================================================
std::vector<TComplexType> DFTSamples (const std::vector<TRealType>& samples)
{
    PRINT_WORK(L"DFT:  X_k = 1/N %cn[0,N) x_k * e^(-2%cikn/N)n", kSigma, kPi);
 
    size_t N = samples.size();
    std::vector<TComplexType> ret;
    ret.resize(N);
    for (size_t k = 0; k < N; ++k)
    {
        PRINT_WORK(L"  k = %in", k);
        ret[k] = DFTSample(samples, k);
        PRINT_WORK(L"  X_%i = (%f, %f)n", k, ret[k].real(), ret[k].imag());
    }
    PRINT_WORK(L"n");
    return ret;
}
 
//=================================================================================
TRealType IDFTSample (const std::vector<TComplexType>& samples, int k)
{
    size_t N = samples.size();
    TComplexType ret;
    for (size_t n = 0; n < N; ++n)
    {
        TComplexType calc = samples[n] * std::polar<TRealType>(1.0f, c_twoPi * TRealType(k) * TRealType(n) / TRealType(N));
        PRINT_WORK(L"    n = %i : (%f, %f)n", n, calc.real(), calc.imag());
        ret += calc;
    }
    PRINT_WORK(L"    Sum the above and take the real componentn");
    return ret.real();
}
 
//=================================================================================
std::vector<TRealType> IDFTSamples (const std::vector<TComplexType>& samples)
{
    PRINT_WORK(L"IDFT:  x_k = %cn[0,N) X_k * e^(2%cikn/N)n", kSigma, kPi);
 
    size_t N = samples.size();
    std::vector<TRealType> ret;
    ret.resize(N);
    for (size_t k = 0; k < N; ++k)
    {
        PRINT_WORK(L"  k = %in", k);
        ret[k] = IDFTSample(samples, k);
        PRINT_WORK(L"  x_%i = %fn", k, ret[k]);
    }
    PRINT_WORK(L"n");
    return ret;
}
 
//=================================================================================
template<typename LAMBDA>
std::vector<TRealType> GenerateSamples (int numSamples, LAMBDA lambda)
{
    std::vector<TRealType> ret;
    ret.resize(numSamples);
    for (int i = 0; i < numSamples; ++i)
    {
        TRealType percent = TRealType(i) / TRealType(numSamples);
        ret[i] = lambda(percent);
    }
    return ret;
}
 
//=================================================================================
int main (int argc, char **argv)
{
	// Enable Unicode UTF-16 output to console
	_setmode(_fileno(stdout), _O_U16TEXT);

    // You can test specific data samples like this:
    //std::vector<TRealType> sourceData = { 1, 0, 1, 0 }; 
    //std::vector<TRealType> sourceData = { 1, -1, 1, -1 };
 
    // Or you can generate data samples from a function like this
    std::vector<TRealType> sourceData = GenerateSamples(
        4,
        [] (TRealType percent)
        {
            const TRealType c_frequency = TRealType(1.0);
            return cos(percent * c_twoPi * c_frequency);
        }
    );
 
    // Show the source data
    wprintf(L"nSource = [ ");
    for (TRealType v : sourceData)
        wprintf(L"%f ",v);
    wprintf(L"]nn");
 
    // Do a dft and show the results
    std::vector<TComplexType> dft = DFTSamples(sourceData);
    wprintf(L"dft = [ ");
    for (TComplexType v : dft)
        wprintf(L"(%f, %f) ", v.real(), v.imag());
    wprintf(L"]nn");
 
    // Do an inverse dft of the dft data, and show the results
    std::vector<TRealType> idft = IDFTSamples(dft);
    wprintf(L"idft = [ ");
    for (TRealType v : idft)
        wprintf(L"%f ", v);
    wprintf(L"]n");
     
    return 0;
}

Links

Explaining how to calculate the frequencies represented by the bins of output of DFT:
How do I obtain the frequencies of each value in an FFT?

Another good explanation of the Fourier transform if it isn’t quite sinking in yet:
An Interactive Guide To The Fourier Transform

Some nice dft calculators that also have inverse dft equivelants:
DFT Calculator 1
IDFT Calculator 1
DFT Calculator 2
IDFT Calculator 2

Wolfram alpha can also do DFT and IDFT, but keep in mind that the formula used there is different and divides the results by 1/sqrt(N) in both DFT and IDFT so will be different values than you will get if you use a different formula.

Wolfram Alpha: Fourier[{1, 0, -1, 0}] = [0,1,0,1]

Wolfram Alpha: Inverse Fourier[{0, 1 , 0, 1}] = [1, 0, -1, 0]

A Sixth Way To Calculate Sine Without Trig

I have another item to add to the pile of ways to calculate sine without trig. Here are the previous ways before we start:

Four Ways to Calculate Sine Without Trig

A Fifth Way to Calculate Sine Without Trig

This method is called Bhaskara I’s sine approximation formula and it’s just a numerical way of approximating sine.

The below is some glsl code from @paniq that has been adapted to take 0 to 1 as input, which corresponds to 0 to 2*pi radians, or 0 to 360 degrees, and returns the normalized vector of that angle. The x component of the vector is the cosine of the angle and the y component of the vector is the sine of the angle. Useful for packing 2d normals into a color channel (;

// https://en.wikipedia.org/wiki/Bhaskara_I%27s_sine_approximation_formula
// x is 0..1 corresponding to 0..360 degrees
vec2 CosSin(float x) {
    vec2 si = fract(vec2(0.5,1.0) - x*2.0)*2.0 - 1.0;
   	vec2 so = sign(0.5-fract(vec2(0.25,0.5) - x));    
    return (20.0 / (si*si + 4.0) - 4.0) * so;
}

Here’s a shadertoy to compare/contrast this technique versus reality (or, reality as per the video card). Spoiler alert – the shadertoy is ridiculous, they are basically the same.
Shadertoy: Bhaskara Cos Sin Approximation

More from paniq:

i also wrote an approximate atan to go with it Shadertoy: Pseudo-Polar Mapping see the ALTMETHOD branch. also changed the sin/cos computation to ensure the sin/cos vector is perfectly normalized.

Matrix Form of Bezier Curves

Bezier curves are most often talked about either in terms of the De Casteljau algorithm, or in terms of a mathematical function (Bernstein Polynomials).

Every now and then though, you see people talking about Bezier curves being calculated via matrices. If you ever wondered what that was all about, this post should hopefully explain and demystify that a bit.

If you don’t know how to come up with the equation of a Bezier curve for any number of control points, you should give this a read first:
Easy Binomial Expansion & Bezier Curve Formulas

And if you are curious about the De Casteljau algorithm, you can learn about that here:
The De Casteljau Algorithm for Evaluating Bezier Curves

Ok, all read up on that stuff? Let’s get talking about Bezier curves in matrix form! There are shadertoy links at the end with working wegl glsl demos that include source code.

Making the Matrix Form of Bezier Curves

Coming up with the matrix for a Bezier curve is surprisingly easy. Keep in mind the matrix we are making is for glsl which is a column major matrix order, so you might have to adjust things if you are using a row major matrix order setup (mostly, just transpose the matrix).

The first step is to get the formula for a Bezier curve. We’ll work through the example using a quadratic Bezier curve with 3 control points A,B,C, so we start with the formula below:

f(t) = A*(1-t)^2 + B*2t(1-t) + C*t^2

The next step is to break the equation into one equation per term. Each term has a control point, so we are basically splitting the formula up so that we have one formula per control point.

A*(1-t)^2 \\ B*2t(1-t) \\ C*t^2

Next, we remove the control points and expand each term to get:

1-2t+t^2 \\ 2t-2t^2 \\ t^2

Now, explicitly values of all powers of t that are present:
1*t^0-2*t^1+1*t^2 \\ 0*t^0+2*t^1-2*t^2 \\ 0*t^0+0*t^1+1*t^2

Now the final step. Take the constants that multiply your powers of t and make a matrix out of them. You are done!

\begin{bmatrix} 1 & -2 & 1 \\ 0 & 2 & -2 \\ 0 & 0 & 1 \\ \end{bmatrix}

Using the Matrix Form

Using the matrix form of Bezier curves is also pretty simple.

First, we need to make a vector of the power series of our t value:

powerSeries = \begin{bmatrix} t^0 & t^1 & t^2 \\ \end{bmatrix}

Which can also be written as:

powerSeries = \begin{bmatrix} 1 & t & t^2 \\ \end{bmatrix}

You also need a vector of your control points:

controlPoints = \begin{bmatrix} A & B & C \\ \end{bmatrix}

You next perform this operation to get a result vector:

result = powerSeries * curveMatrix * controlPoints

Then, you add up all components of result to get the value of the curve at time t.

value = result[0] + result[1] + result[2]

All done!

Note that this is a one dimensional Bezier curve. You need to do this operation once per axis to get your final multi dimensional Bezier curve point.

If you are confused by that last line, check out this post: One Dimensional Bezier Curves

Multiplying the Control Points In

You might notice that if you are evaluating several points on the same curve that you are going to be multiplying the curveMatrix matrix by the controlPoints vector over and over. You can multiply the control points into the Bezier curve matrix to make the specific matrix for those control points if you want to. You multiply the columns of the matrix by the control points, and adjust the result calculation like the below.

// Multiply the control points into the curve matrix
curveMatrix[0] *= A;
curveMatrix[1] *= B;
curveMatrix[2] *= C;

// Use the curve matrix that has the control points baked in, to do less math to get the result vector.
// You would calculate the curve matrix once and re-use it multiple times of course!
vec3 result = powerSeries * curveMatrix;
float value = result.x + result.y + result.z;

Closing

You might wonder when you’d use the matrix form. One time to use the matrix form would be when you had fast matrix math support (like on the GPU). Another time to use the matrix form though is if you ever want to cut up a Bezier curve into multiple smaller sub curves. The matrix form can help make that easier, and you can read more about that here if you want: A Matrix Formulation of the Cubic Bezier Curve

Here are some shadertoys that show this all working in webgl/glsl pixel shaders, along with source code:

Shadertoy: 1D Linear Bezier Matrix Form
Shadertoy: 1D Quadratic Bezier Matrix Form
Shadertoy: 1D Cubic Bezier Matrix Form

Actually Making Signed Distance Field Textures With JFA

This post is an addendum to the last post where I say that you can make distance field textures with JFA but don’t fully explain how to make SIGNED distance field textures, which is what you really want.

If you want to go straight to a working demo with webgl pixel shader source code, here is the shadertoy: Shadertoy: JFA SDF Texture

If you naively use a distance transform to make a distance field texture, you’ll get an UNSIGNED distance field texture, where you only have the distance to the surface of the object from the outside, but won’t have the distance to the surface of the object from the inside.

This is important because signed distance field textures have both, and use bilinear interpolation of distance on each side of the shape surface to make a nice smooth line. Below is what happens when you try to use an unsigned distance field texture (aka the distance transform of the image, using JFA / Voronoi information), using the zero distance line as the surface of the object:

It looks ok (if not fairly pixelated), but you can really see it break down when you zoom in:

So you might say to yourself, maybe i need to keep the surface line at distance 0.5 instead of 0.0 so that there is distance information to interpolate? If you do that, the first thing you might notice is that the objects get fatter:

But it does look better when you zoom in, which is a plus:

The real issue is that you really just need the distance from each pixel to the surface of the object from both the inside and the outside. In our case, our Voronoi diagram we make with JFA only gives the distance from the outside. So what is the solution? At first I was thinking maybe you can get the gradient of this data at the point of each pixel and “push the zero line in” a little bit to give at least one pixel layer worth of inside data. However, a brilliant friend of mine came up with the actual solution: You invert your source data so empty space becomes seed, and seed becomes empty space, and you run JFA again to get the distance from the inside!

That actually works very well. It’s also very easy to combine them. You make a pixel shader that reads the data from the outside Voronoi diagram and the inside Voronoi diagram, calculate the output distance (0.5 + outsideDistance * 0.5 – insideDistance * 0.5), and output that 0 to 1 distance value in one or more of the color channels.

Here’s a glsl excerpt below, note that we divide the distance by 8 and clamp between 0 and 1 so that the data is suitable for a normalized color image (normalized as in the color channels can store values between 0 and 1):

// calculate distances from seed coordinates
float outsideDist = clamp(length(outsideSeedCoord-fragCoord) / 8.0, 0.0, 1.0);
float insideDist  = clamp(length(insideSeedCoord-fFragCoord)  / 8.0, 0.0, 1.0);
    
// calculate output distance
float signedDistance = 0.5 + outsideDist * 0.5 - insideDist * 0.5;
        
// set the color based on that distance
fragColor = vec4(signedDistance);

It actually looks a lot like the first image where we use the zero distance line of the unsigned distance field texture, so we still aren’t quite there:

When you zoom in, it looks a little better, but something still seems a bit off:

The final step to making this look good is to realize that the power of signed distance textures is in their ability to interpolate distance information well. When we have a full resolution texture, there is no interpolation going on. We actually need to decrease the size of our distance field texture to make it look better. If only all problems were solved by making textures smaller!

Here is the resulting image when making the distance field texture 1/4 as large on each axis (1/16th as big total):

And zooming in you can see that it scales very well. The zoom is a 20x magnification, on top of the magnification we already get from it being a larger texture:

And just to show the intermediary textures, here is the outside distance Voronoi diagram:

And the inside distance Voronoi diagram (The seed is in bright green, the dim green is the empty space that has distance information):

And here is the final distance field texture used to render the final result I showed above.

Zoomed in to show just how low resolution it is! This is the thing that looks like a + or a sword just left of middle.

Again, here is the shadertoy that does this technique, generating a signed distance field texture on the fly for randomly placed objects, and then using that signed distance field to render a larger image that you can further zoom in to:
Shadertoy: JFA SDF Texture

GPU Texture Sampler Bezier Curve Evaluation

Below is a paper I submitted to jcgt.org that unfortunately did not get accepted. Maybe next time!

The main idea of this paper is that bilinear interpolation can be equivalent to the De Casteljau algorithm, which means that if you set up a texture in a specific way, and sample from it at specific texture coordinates, that it will in fact give you Bezier curve points as output! It scales up for higher dimensional textures, as well as higher order curves.

The image below shows this in action for a cubic Bezier curve (3 control points) being stored and recalled from a 2×2 texture (there is actually a curve stored in each color channel).

This image is from an extension linked to lower down which applies the technique to surfaces and volumes:

The primary feedback from the reviewers and editor was that:

  • It was an interesting technique and they thought it was a paper worth reading.
  • The usage case was fairly limited though – basically only when your are compute bound in your shader program, and have some curve calculations to offload to the texture sampler. Or if you are already using a lookup texture and would benefit from fewer instructions and smaller lookup textures.
  • It could have been shorter due to the writing being shorter, but also it could have been less thorough. For instance, it didn’t need to show equivalence to both the De Casteljau’s algorithm as well as Bernstein polynomials, since it’s already known that those are equivalent.
  • They wanted some more performance details

I agree with the feedback, and don’t feel like taking the time to change and resubmit or submit else where, so I’m sharing it here on my blog. I hope you enjoy it and find it interesting (:

Here is the paper:
GPUBezier2016.pdf

Here is the supplemental materials (opengl and webgl source code):
SupplementalMaterials.zip

Here is the webgl demo from the supplemental materials, hosted on my site:
GPU Efficient Texture Based Bezier Curve Evaluation

Here are some working shadertoy demos of the technique:
Shadertoy: Mystery Curves – Quadratic
Shadertoy: Mystery Curves – Cubic
Shadertoy: Mystery Curves – Quartic
Shadertoy: Mystery Curves – Quintic

Extensions

Continuations of this work:

Failed Experiments

Continuations that didn’t work out:

What are your thoughts? Is this cool? Is it lame? Got some ideas to improve it? Leave a comment! (:

Normalized Vector Interpolation TL;DR

My blog posts often serve as “external memory”, allowing me to go back and remember how specific things work months or years after I spent the time to learn about them.

So far it’s worked amazingly well! Instead of having a hazy memory of “oh um… i did bicubic interpolation once, how does that work again?” I can go back to my blog post, find the details with an explanation and simple working code, and can very rapidly come back up to speed. I seriously recommend keeping a blog if you are a programmer or similar. Plus, you know you really understand something when you can explain it to someone else, so it helps you learn to a deeper level than you would otherwise.

Anyways, this is going to be a very brief post on vector interpolation that I want to commit to my “external memory” for the future.

This is an answer to the question… “How do I interpolate between two normalized vectors?” or “How do i bilinearly or bicubically interpolate between normalized vectors?”

As an answer I found the three most common ways to do vector interpolation:

  • Slerp – short for “spherical interpolation”, this is the most correct way, but is also the costliest. In practice you likely do not need the precision.
  • lerp – short for “linear interpolation”, you just do a regular linear interpolation between the vectors and use that as a result.
  • nlerp – short for “normalized linear interpolation” you just normalize the result of a lerp. Useful if you need your interpolated vector to be a normalized vector.

In practice, lerp/nlerp are pretty good at getting a pretty close interpolated direction so long as the angle they are interpolating between is sufficiently small (say, 90 degrees), and nlerp is of course good at keeping the right length, if you need a normalized vector. If you want to preserve the length while interpolating between non normalized vectors, you could always interpolate the length and direction separately.

Here is an example of the three interpolations on a large angle. Dark grey = start vector, light grey = end vector. Green = slerp, blue = lerp, orange = nlerp.

Here is an example of a medium sized angle (~90 degrees) interpolating the same time t between the angles.

Lastly, here’s a smaller angle (~35 degrees). You can see that the results of lerp / nlerp are more accurate as the angle between the interpolated vectors gets smaller.

If you do lerp or nlerp, you can definitely do both bilinear as well as bicubic interpolation since they are just regularly interpolated values (and then optionally normalized)

Using slerp, you can do bilinear interpolation, but I’m not sure how bicubic would translate.

I miss something important? Leave a comment and let me know!

Code

Here’s some glsl code for slerp, lerp and nlerp. This code is for vec2’s specifically but the same code works for vectors of any dimension.

//============================================================
// adapted from source at:
// https://keithmaggio.wordpress.com/2011/02/15/math-magician-lerp-slerp-and-nlerp/
vec2 slerp(vec2 start, vec2 end, float percent)
{
     // Dot product - the cosine of the angle between 2 vectors.
     float dot = dot(start, end);     
     // Clamp it to be in the range of Acos()
     // This may be unnecessary, but floating point
     // precision can be a fickle mistress.
     dot = clamp(dot, -1.0, 1.0);
     // Acos(dot) returns the angle between start and end,
     // And multiplying that by percent returns the angle between
     // start and the final result.
     float theta = acos(dot)*percent;
     vec2 RelativeVec = normalize(end - start*dot); // Orthonormal basis
     // The final result.
     return ((start*cos(theta)) + (RelativeVec*sin(theta)));
}

vec2 lerp(vec2 start, vec2 end, float percent)
{
     return mix(start,end,percent);    
}

vec2 nlerp(vec2 start, vec2 end, float percent)
{
     return normalize(mix(start,end,percent));    
}

Links

An interactive shadertoy demo I made, that is also where the above images came from:
Shadertoy: Vector Interpolation

Further discussion on this topic may be present here:
Computer Graphics Stack Exchange: Interpolating vectors on a grid

Good reads that go deeper:
Math Magician โ€“ Lerp, Slerp, and Nlerp
Understanding Slerp, Then Not Using It

Hiding a Lookup Table in a Modulus Operation

Lookup tables are a tool found in every programmer’s tool belt.

Lookup tables let you pre-calculate a complex calculation in advance, store the results in a table (an array), and then during performance critical parts of your program, you access that table to get quick answers to the calculations, without having to do the complex calculation on the fly.

In this post I’ll show a way to embed a lookup table inside of a single (large) number, where you extract values from that lookup table by taking a modulus of that number with different, specific values.

This technique is slower and takes more memory than an actual lookup table, but it’s conceptually interesting, so I wanted to share.

Also, I stumbled on this known technique while working on my current paper. The paper will make this technique a bit more practical, and I’ll share more info as soon as I am able, but for now you can regard this as a curiosity ๐Ÿ˜›

Onto the details!

1 Bit Input, 1 Bit Output: Pass Through

Let’s learn by example and start with a calculation that takes in an input bit, and gives that same value for an output bit. It’s just a 1 bit pass through lookup table.

\begin{array}{c|c} \text{Input} & \text{Output} \\ \hline 0 & 0 \\ 1 & 1 \\ \end{array}

To be able to convert that to something we can decode with modulus we have to solve the following equations:

x \% k_0 = 0 \\ x \% k_1 = 1

x is the number that represents our lookup table. k_0 and k_1 are the values that we modulus x against to get our desired outputs out.

It looks as if we have two equations and three unknowns – which would be unsolvable – but in reality, x is the only unknown. The k values can be whatever values it takes to make the equations true.

I wrote a blog post on how to solve equations like these in a previous post: Solving Simultaneous Congruences (Chinese Remainder Theorem).

You can also use this chinese remainder theorem calculator, which is handy: Chinese Remainder Theorem Calculator

The short answer here is that the k values can be ANY numbers, so long as they are pairwise co-prime to each other – AKA they have a greatest common divisor of 1.

If we pick 3 and 4 for k0 and k1, then using the chinese remainder theorem we find that x can equal 9 and the equations are true. Technically the answer is 9 mod 12, so 21, 33, 45 and many other numbers are also valid values of x, but we are going to use the smallest answer to keep things smaller, and more manageable.

So, in this case, the value representing the lookup table would be 9. If you wanted to know what value it gave as output when you plugged in the value 0, you would modulus the lookup table (9) against k0 (3) to get the output. If you wanted to know what value it gave as output when you plugged in the value 1, you would modulus the lookup table (9) against k1 (4) to get the output. The table below shows that it passes through the value in both cases like it should:

\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 0 & x \% k_0 & 9 \% 3 & 0\\ 1 & x \% k_1 & 9 \% 4 & 1\\ \end{array}

1 Bit Input, 1 Bit Output: Not Gate

Let’s do something a little more interesting. Let’s make the output bit be the reverse of the input bit. The equations we’ll want to solve are this:

x \% k_0 = 1 \\ x \% k_1 = 0

We can use 3 and 4 for k0 and k1 again if we want to. Using the Chinese remainder theorem to solve the equations gives us a value of 4 for x. Check the truth table below to see how this works:

\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 0 & x \% k_0 & 4 \% 3 & 1\\ 1 & x \% k_1 & 4 \% 4 & 0\\ \end{array}

1 Bit Input, 1 Bit Output: Output Always 1

What if we wanted the output bit to always be 1 regardless of input bit?

x \% k_0 = 1 \\ x \% k_1 = 1

Using 3 and 4 for our k values again, we solve and get a value of 1 for x. Check the truth table to see it working below:

\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 0 & x \% k_0 & 1 \% 3 & 1\\ 1 & x \% k_1 & 1 \% 4 & 1\\ \end{array}

Hopefully one bit input to one bit output makes sense now. Let’s move on (:

2 Bit Input, 1 Bit Output: XOR Gate

Things get a little more interesting when we bump the number of input bits up to 2. If we want to make a number which represents XOR, we now have 4 equations to solve.

x \% k_{00} = 0 \\ x \% k_{01} = 1 \\ x \% k_{10} = 1 \\ x \% k_{11} = 0

In general we will have 2^N equations, where N is the number of input bits.

You might have noticed that I use subscripts for k corresponding to the input bits that the key represents. This is a convention I’ve found useful when working with this stuff. Makes it much easier to see what’s going on.

Now with four equations, we need 4 pairwise coprime numbers – no number has a common factor with another number besides 1.

Let’s pull them out of the air. Umm… 3, 4, 5, 7

Not too hard with only two bits of input, but you can see how adding input bits makes things a bit more complex. If you wanted to make something that took in two 16 bit numbers as input for example, you would need 2^32 co-prime numbers, since there was a total of 32 bits of input!

When we solve those four equations, we get a value of 21 for x.

Notice how x is larger now that we have more input bits? That is another added complexity as you add more input bits. The number representing your program can get very, very large, and require you to use “multi precision integer” math libraries to store and decode the programs, when the numbers get larger than what can be held in a 64 bit int.

Boost has a decent library for this, check out boost::multiprecision::cpp_int, it’s what I use. You can download boost from here: http://www.boost.org/doc/libs/1_59_0/more/getting_started/windows.html

Anyhow, let’s check the truth table to see if our values work:

\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 00 & x \% k_{00} & 21 \% 3 & 0 \\ 01 & x \% k_{01} & 21 \% 4 & 1 \\ 10 & x \% k_{10} & 21 \% 5 & 1 \\ 11 & x \% k_{11} & 21 \% 7 & 0 \end{array}

Woot, it worked.

2 Bit Input, 2 Bit Output: OR, AND

What happens when we add another bit of output? Basically we just treat each output bit as it’s own lookup table. This means that if we have two output bits, we will have two numbers representing our program (one for each bit), and that this is true regardless of how many input bits we have.

Let’s make the left output bit (x_0 ) be the OR of the input bits and the right output bit (x_1 ) be the AND of the input bits.

That give us these two sets of equations to solve:

x_0 \% k_{00} = 0 \\ x_0 \% k_{01} = 1 \\ x_0 \% k_{10} = 1 \\ x_0 \% k_{11} = 1 \\ \\ x_1 \% k_{00} = 0 \\ x_1 \% k_{01} = 0 \\ x_1 \% k_{10} = 0 \\ x_1 \% k_{11} = 1 \\

We can use the same coprime numbers for our k values as we used in the last section (3,4,5,7). Note that we use the same k values in each set of equations. This is intentional and required for things to work out!

If we solve each set of equations we get 141 for x0, and 120 for x1.

Let’s see if that worked:

\begin{array}{c|c|c|c} \text{Input} & \text{Symbolic} & \text{Numeric} & \text{Output} \\ \hline 00 & x_0 \% k_{00}, x_1 \% k_{00} & 141 \% 3, 120 \% 3 & 00 \\ 01 & x_0 \% k_{01}, x_1 \% k_{01} & 141 \% 4, 120 \% 4 & 10 \\ 10 & x_0 \% k_{10}, x_1 \% k_{10} & 141 \% 5, 120 \% 5 & 10 \\ 11 & x_0 \% k_{11}, x_1 \% k_{11} & 141 \% 7, 120 \% 7 & 11 \end{array}

Hey, it worked again. Neat!

Example Code

Now that we have the basics worked out, here is some sample code.

The lookup table takes in 8 bits as input, mapping 0..255 to 0…2pi and gives the sine of that value as output in a float. So it has 8 bits of input and 32 bits of output.

#include <vector>
#include <boost/multiprecision/cpp_int.hpp>
#include <stdint.h>
#include <string.h>
#include <memory>

typedef boost::multiprecision::cpp_int TINT;
typedef std::vector<TINT> TINTVec;

const float c_pi = 3.14159265359f;

//=================================================================================
void WaitForEnter ()
{
    printf("nPress Enter to quit");
    fflush(stdin);
    getchar();
}

//=================================================================================
static TINT ExtendedEuclidianAlgorithm (TINT smaller, TINT larger, TINT &s, TINT &t)
{
    // make sure A <= B before starting
    bool swapped = false;
    if (larger < smaller)
    {
        swapped = true;
        std::swap(smaller, larger);
    }

    // set up our storage for the loop.  We only need the last two values so will
    // just use a 2 entry circular buffer for each data item
    std::array<TINT, 2> remainders = { larger, smaller };
    std::array<TINT, 2> ss = { 1, 0 };
    std::array<TINT, 2> ts = { 0, 1 };
    size_t indexNeg2 = 0;
    size_t indexNeg1 = 1;

    // loop
    while (1)
    {
        // calculate our new quotient and remainder
        TINT newQuotient = remainders[indexNeg2] / remainders[indexNeg1];
        TINT newRemainder = remainders[indexNeg2] - newQuotient * remainders[indexNeg1];

        // if our remainder is zero we are done.
        if (newRemainder == 0)
        {
            // return our s and t values as well as the quotient as the GCD
            s = ss[indexNeg1];
            t = ts[indexNeg1];
            if (swapped)
                std::swap(s, t);

            // if t < 0, add the modulus divisor to it, to make it positive
            if (t < 0)
                t += smaller;
            return remainders[indexNeg1];
        }

        // calculate this round's s and t
        TINT newS = ss[indexNeg2] - newQuotient * ss[indexNeg1];
        TINT newT = ts[indexNeg2] - newQuotient * ts[indexNeg1];

        // store our values for the next iteration
        remainders[indexNeg2] = newRemainder;
        ss[indexNeg2] = newS;
        ts[indexNeg2] = newT;

        // move to the next iteration
        std::swap(indexNeg1, indexNeg2);
    }
}

//=================================================================================
void MakeKey (TINTVec &keys, TINT &keysLCM, size_t index)
{
    // if this is the first key, use 3
    if (index == 0)
    {
        keys[index] = 3;
        keysLCM = keys[index];
        return;
    }

    // Else start at the last number and keep checking odd numbers beyond that
    // until you find one that is co-prime.
    TINT nextNumber = keys[index - 1];
    while (1)
    {
        nextNumber += 2;
        if (std::all_of(
            keys.begin(),
            keys.begin() + index,
            [&nextNumber] (const TINT& v) -> bool
            {
                TINT s, t;
                return ExtendedEuclidianAlgorithm(v, nextNumber, s, t) == 1;
            }))
        {
            keys[index] = nextNumber;
            keysLCM *= nextNumber;
            return;
        }
    }
}

//=================================================================================
void CalculateLookupTable (
    TINT &lut,
    const std::vector<uint64_t> &output,
    const TINTVec &keys,
    const TINT &keysLCM,
    const TINTVec &coefficients,
    size_t bitMask
)
{
    // figure out how much to multiply each coefficient by to make it have the specified modulus residue (remainder)
    lut = 0;
    for (size_t i = 0, c = keys.size(); i < c; ++i)
    {
        // we either want this term to be 0 or 1 mod the key.  if zero, we can multiply by zero, and
        // not add anything into the bit value!
        if ((output[i] & bitMask) == 0)
            continue;

        // if 1, use chinese remainder theorem
        TINT s, t;
        ExtendedEuclidianAlgorithm(coefficients[i], keys[i], s, t);
        lut = (lut + ((coefficients[i] * t) % keysLCM)) % keysLCM;
    }
}

//=================================================================================
template <typename TINPUT, typename TOUTPUT, typename LAMBDA>
void MakeModulus (TINTVec &luts, TINTVec &keys, LAMBDA &lambda)
{
    // to keep things simple, input sizes are being constrained.
    // Do this in x64 instead of win32 to make size_t 8 bytes instead of 4
    static_assert(sizeof(TINPUT) < sizeof(size_t), "Input too large");
    static_assert(sizeof(TOUTPUT) < sizeof(uint64_t), "Output too large");

    // calculate some constants
    const size_t c_numInputBits = sizeof(TINPUT) * 8;
    const size_t c_numInputValues = 1 << c_numInputBits;
    const size_t c_numOutputBits = sizeof(TOUTPUT) * 8;

    // Generate the keys (coprimes)
    TINT keysLCM;
    keys.resize(c_numInputValues);
    for (size_t index = 0; index < c_numInputValues; ++index)
        MakeKey(keys, keysLCM, index);

    // calculate co-efficients for use in the chinese remainder theorem
    TINTVec coefficients;
    coefficients.resize(c_numInputValues);
    fill(coefficients.begin(), coefficients.end(), 1);
    for (size_t i = 0; i < c_numInputValues; ++i)
    {
        for (size_t j = 0; j < c_numInputValues; ++j)
        {
            if (i != j)
                coefficients[i] *= keys[j];
        }
    }

    // gather all the input to output mappings by permuting the input space
    // and storing the output for each input index
    std::vector<uint64_t> output;
    output.resize(c_numInputValues);
    union
    {
        TINPUT value;
        size_t index;
    } input;
    union
    {
        TOUTPUT value;
        size_t index;
    } outputConverter;

    for (input.index = 0; input.index < c_numInputValues; ++input.index)
    {
        outputConverter.value = lambda(input.value);
        output[input.index] = outputConverter.index;
    }

    // iterate through each possible output bit, since each bit is it's own lut
    luts.resize(c_numOutputBits);
    for (size_t i = 0; i < c_numOutputBits; ++i)
    {
        const size_t bitMask = 1 << i;
        CalculateLookupTable(
            luts[i],
            output,
            keys,
            keysLCM,
            coefficients,
            bitMask
        );
    }
}

//=================================================================================
int main (int argc, char **argv)
{
    // Look up tables encodes each bit, keys is used to decode each bit for specific
    // input values.
    TINTVec luts;
    TINTVec keys;

    // this is the function that it turns into modulus work
    typedef uint8_t TINPUT;
    typedef float TOUTPUT;
    auto lambda = [] (TINPUT input) -> TOUTPUT
    {
        return sin(((TOUTPUT)input) / 255.0f * 2.0f * c_pi);
    };

    MakeModulus<TINPUT, TOUTPUT>(luts, keys, lambda);

    // show last lut and key to show what kind of numbers they are
    std::cout << "Last Lut: " << *luts.rbegin() << "n";
    std::cout << "Last Key: " << *keys.rbegin() << "n";

    // Decode all input values
    std::cout << "n" << sizeof(TINPUT) << " bytes input, " << sizeof(TOUTPUT) << " bytes outputn";
    for (size_t keyIndex = 0, keyCount = keys.size(); keyIndex < keyCount; ++keyIndex)
    {
        union
        {
            TOUTPUT value;
            size_t index;
        } result;

        result.index = 0;

        for (size_t lutIndex = 0, lutCount = luts.size(); lutIndex < lutCount; ++lutIndex)
        {
            TINT remainder = luts[lutIndex] % keys[keyIndex];
            size_t remainderSizeT = size_t(remainder);
            result.index += (remainderSizeT << lutIndex);
        }

        TINT remainder = luts[0] % keys[keyIndex];
        std::cout << "i:" << keyIndex << " o:" << result.value << "n";
    }

    WaitForEnter();
    return 0;
}

Here is some output from the program. The first is to show what the last (largest) look up table and key look like. Notice how large the look up table number is!

Here it shows some sine values output from the program, using modulus against the large numbers calculated, to get the bits of the result out:

How to Get Lots of Pairwise Co-Prime Numbers?

You can generate a list of pairwise coprimes using brute force. Have an integer that you increment, and check if it’s pairwise co-prime to the existing items in the list. If it is, add it to the list! Rinse and repeat until you have as many as you want.

That is the most practical way to do it, but there are two other interesting ways I wanted to mention.

The first way is using Fermat numbers. The Fermat numbers are an infinite list of pairwise co-prime numbers and are calculated as 2^{2^n}+1 where n is an integer. Fermat numbers also have the benefit that you can get the nth item in the list without calculating the numbers that came before it. The only problem is that the numbers grow super huge very fast. The first 7 values are: 3, 5, 17, 257, 65537, 4294967297, 18446744073709551617. If Fermat numbers didn’t grow so quickly, they sure would be useful for things like this technique.

The second way is using something called Sylvester’s sequence. It too is an infinite list of pairwise co-prime numbers, and it too grows very large very quickly unfortunately. I also don’t believe there is a way to calculate the Nth item in the list directly. Every number is based on previous numbers, so you have to calculate them all from the beginning. No random access!

Beyond Binary

In this post I showed how to work in binary digits, but there is no reason why you have to encode single bits in the lookup tables.

Instead of encoding 0 or 1 in each modulus “lookup table”, you could also perhaps store base 10 numbers in the tables and have 0-9. Or, maybe you encode a byte per lookup table.

Encoding more than one bit effectively makes both your input and your output smaller, which helps the algorithm do more with less.

Your keys will need to be larger though, since the keys have to be larger than the value you plan to store, and your resulting lookup table will be a larger number as well. It might make the technique more worth while though.

I’ll leave that as an exercise for you. If try it and find neat stuff, post a comment and let us know, or drop me an email or something. It’d be neat to hear if people find any practical usage cases of this technique ๐Ÿ˜›

The End, For Now!

I want to point out that increasing the number of input bits in this technique is a pretty expensive thing to do, but increasing the number of output bits is a lot cheaper. It kind of makes sense in a way if you think about it. Input bits add information from the outside world that must be dealt with, while output bits are just fluff that can easily be diluted or concentrated by adding or removing bits that are associated with, and calculated from, the input bits.

Another problem you may have noticed with this technique is that if you have a really expensive calculation that you are trying to “flatten” into modulus math like this, that you have to run that calculation many, many times to know what values a lookup table would give you. You have to run it once per possible input to get every possible output. That is expected when making a lookup table, since you are paying a cost up front to make things faster later.

The paper I’m working on changes things a bit though. One of the things it does is it makes it so doing this technique only requires that you evaluate the function once, and it calculates all values simultaneously to give the end result that you can then do modulus against. It’s pretty cool IMO and I will share more details here as soon as I am able – and yes, i have actual working code that does that, believe it or not! I’m looking forward to being able to share it later on. Maybe someone will find some really cool usage case for it.

Quantum Computing For Programmers Part 2: Multiple Qubits

In part 1 (Quantum Computing For Programmers Part I: One Qubit) we looked at the basics of quantum computing and how to deal with a single qubit. Here we’ll talk about the interesting things that happen when you involve multiple qubits!

Multiple Qubit Probabilities & Possibilities

In the last post we used the analogy of a coin to describe a qubit. The coin can be either heads or tails, or flipping in the air, waiting to become either a heads or tails when it lands. The same is true of how a qubit works, where it can either be a 0 or a 1 or some superposition of both, only deciding which it is when observed. Furthermore, just like a real coin, there can be a bias towards becoming one value or the other.

We represented a coin by having a probability for each possible state (heads or tails), but represented a qubit by having an AMPLITUDE for each possible state (0 or 1), where you square an amplitude to get the probability.

Going back to the coin analogy, how would probabilities work if we had two coins?

The four possible outcomes with two coins are:
Heads/Heads
Heads/Tails
Tails/Heads
Tails/Tails

Each coin individually could either be sitting on the table heads or tails, or up in the air, waiting to become either a heads or a tails, with some probability of becoming a heads or a tails.

To figure out the probability of each possible outcome, you just multiply the probability of each coin’s outcome together.

For instance, let’s say that the first coin (coin A) is already sitting on the table as a heads, and the second coin (coin B) is flipping in the air, with a 1/3 chance of becoming heads and a 2/3 chance of becoming tails. Here is how you would calculate the probability of each possible state:

\begin{array}{c|c|c|c} \text{Outcome A/B} & \text{Coin A Probability} & \text{Coin B Probability} & \text{Outcome Probability (A*B)}\\ \hline heads / heads & 100\% & 33\% & 33\% \\ heads / tails & 100\% & 67\% & 67\% \\ tails / heads & 0\% & 33\% & 0\% \\ tails / tails & 0\% & 67\% & 0\% \\ \end{array}

Qubits actually work the same way! Using the same values as the coins, let’s say qubit A is a 0, and qubit B has a 1/3 chance of becoming a 0, and a 2/3 chance of becoming a 1. Converting those probabilities to amplitudes you could get A=[1,0], B=[\frac{1}{\sqrt{3}}, \frac{\sqrt{2}}{\sqrt{3}}].

\begin{array}{c|c|c|c|c} \text{Outcome AB} & \text{Qubit A Amplitude} & \text{Qubit B Amplitude} & \text{Outcome Amplitude(A*B)} & \text{Outcome Probability}\\ \hline 00 & 1 & 1/\sqrt{3} & 1/\sqrt{3} & 33\% \\ 01 & 1 & \sqrt{2}/\sqrt{3} & \sqrt{2}/\sqrt{3} & 67\% \\ 10 & 0 & 1/\sqrt{3} & 0 & 0\%\\ 11 & 0 & \sqrt{2}/\sqrt{3} & 0 & 0\% \\ \end{array}

Note that in both cases, the probabilities add up to 100% like you’d expect.

In the case of qubits, the resulting vector is [\frac{1}{\sqrt{3}}, \frac{\sqrt{2}}{\sqrt{3}}, 0, 0] , which is still normalized, and represents the amplitudes for the 4 possible states of those two qubits: 00, 01, 10 and 11.

When working with one qubit (or coin), there are 2 possible outcomes 0 or 1 (heads or tails). When working with two qubits (or coins), there are four possible outcomes 00, 01, 10, 11 (heads/heads, heads/tails, tails/heads, tails/tails). When working with three qubits or coins, there are eight possible outcomes: 000,001,010 … 111.

With both coins and qubits there are 2^N possibilities, where N is the number of qubits or coins you have.

If you had 8 qubits, you’d have to use a 256 dimensional vector to describe the possibilities, since 2^8 is 256. When performing quantum computing with 8 qubits, you only have to deal with the 8 qubits. When simulating quantum computing on a regular, classical computer, you have to deal with the 256 amplitudes. This kind of gives a glimpse at how quantum computers can be faster than regular computers at certain things. There is an economy of scale working against us on regular computers simulating quantum computing.

The method we used to combine the probabilities of single qubits into an amplitude vector representing multiple qubits is called the Kronecker product. That’s just a fancy way of saying we have to multiply everything from the first vector by everything from the second vector, to get a third vector that is bigger than the first two. You’ll see it represented like this: A \otimes B and while it’s similar to the “outer product” and even uses the same symbol (!), it gives a slightly different result versus if you did an outer product and then vectorized the matrix.

The Kronecker product of vectors works like this:

\begin{bmatrix} A_1 \\ A_2 \\ ... \\ A_M \\ \end{bmatrix} \otimes \begin{bmatrix} B_1 \\ B_2 \\ ... \\ B_N \\ \end{bmatrix} = \begin{bmatrix} A_1 B_1 \\ A_1 B_2 \\ ... \\ A_1 B_N \\ A_2 B_1 \\ A_2 B_2 \\ ... \\ A_2 B_N \\ ... \\ A_M B_1 \\ A_M B_2 \\ ... \\ A_M B_N \\ \end{bmatrix}

Let’s show an example involving two qubits.

The first qubit has a 75% chance of being heads so it’s amplitude is [\sqrt{3}/2,1/2]. The second qubit has a 2/3 chance of being heads so has an amplitude of [\sqrt{2}/\sqrt{3}, 1/\sqrt{3}].

To calculate the amplitude vector representing these two qubits, we do a kronecker product:
\begin{bmatrix} \cfrac{\sqrt{3}}{2} \\ \cfrac{1}{2} \\ \end{bmatrix} \otimes \begin{bmatrix} \cfrac{\sqrt{2}}{\sqrt{3}} \\ \cfrac{1}{\sqrt{3}} \\ \end{bmatrix} = \begin{bmatrix} \cfrac{\sqrt{3}}{2}\cfrac{\sqrt{2}}{\sqrt{3}} \\ \cfrac{\sqrt{3}}{2}\cfrac{1}{\sqrt{3}} \\ \cfrac{1}{2}\cfrac{\sqrt{2}}{\sqrt{3}} \\ \cfrac{1}{2}\cfrac{1}{\sqrt{3}} \\ \end{bmatrix}

If you simplify that, you get:
\begin{bmatrix} \cfrac{1}{\sqrt{2}} \\ \cfrac{1}{2} \\ \cfrac{1}{\sqrt{6}} \\ \cfrac{1}{2\sqrt{3}} \\ \end{bmatrix}

or horizontally for easier reading:
\begin{bmatrix} \cfrac{1}{\sqrt{2}} & \cfrac{1}{2} & \cfrac{1}{\sqrt{6}} & \cfrac{1}{2\sqrt{3}} & \end{bmatrix}

Squaring those values to get the probabilities of each state, we get the below, which you might notice adds up to 100%, meaning the vector is still normalized!
\begin{bmatrix} 50\% & 25\% & 17\% & 8\% \end{bmatrix}

This generalizes for larger numbers of qubits, so you could use the Kronecker product to combine a vector describing 4 qubits with a vector describing 3 qubits, to come up with a vector that describes 7 qubits (which would have 128 amplitudes in it, since 2^7 = 128!).

The kronecker product also generalizes to matrices, which we will talk about shortly.

To be able to work with multiple qubits in a quantum circuit, you need to represent all qubits involved in a single vector. Doing this, while being the correct thing to do, also allows entanglement to happen, which we will talk about next!

Entanglement – Simpler Than You Might Think!

Time to demystify quantum entanglement!

Quantum entanglement is the property whereby two qubits – which can be separated by any distance, such as several light years – can be observed and come up with the same value (0 or 1). Whether they are 0s or 1s is random, but they will both agree, when entangled in the way that causes them to agree (you could alternately entangle them to always disagree).

Interestingly, in 1965 John Bell proved that the mechanism that makes this happen is NOT that the qubits share information in advance. You can read more about that here – it’s near the end: Ars Technica โ€“ A tale of two qubits: how quantum computers work.

A common misconception is that this means that we can have faster than light (instantaneous) communication across any distance. That turns out not to be true, due to the No-Communication-Theorem (Wikipedia).

Interestingly though, you can use separated entangled qubits in separated quantum circuits to do something called Quantum Pseudo Telepathy (Wikipedia), which lets you do some things that would otherwise be impossible – like reliably winning a specially designed game that would otherwise be a game of chance.

I don’t yet understand enough about Quantum Pseudo Telepathy to see why it isn’t considered communication. I also have no idea how entanglement is actually “implemented” in the universe, but nobody seems to (or if they do, they aren’t sharing!). How can it be that two coins flipped on opposite sides of the galaxy can be guaranteed to both land with the same side facing up?

Despite those mysteries, the math is DEAD SIMPLE, so let me share it with you, it’ll take like one short paragraph. Are you ready?

Quantum computing works by manipulating the probabilities of the states of qubits. If you have two qubits, there are four possible sets of values when you observe them: 00, 01, 10, 11. If you use quantum computing to set the probabilities of the 01 and 10 states to a 0% chance, and set the probabilities of the 00 and 11 states to a 50% chance each, you’ve now entangled the qubits such that when you observe them, they will always agree, because you’ve gotten rid of the chances that they could ever DISAGREE. Similarly, if instead of setting the 01 and 10 states to 0%, you set the probability of the 00 and 11 states to 0%, you’d have entangled qubits which would always disagree when you observed them, because you’ve gotten rid of the chances that they could ever AGREE.

That’s all there is to it. Strange how simple it is, isn’t it? The table below shows how the only possible outcomes are that the qubits agree, but it is a 50/50 chance whether they are a 0 or a 1:

\begin{array}{c|c|c} \text{Outcome} & \text{Amplitude} & \text{Probability}\\ \hline 00 & 1/\sqrt{2} & 50\% \\ 01 & 0 & 0\% \\ 10 & 0 & 0\% \\ 11 & 1/\sqrt{2} & 50\% \\ \end{array}

Entanglement isn’t limited to just two qubits, you can entangle any number of qubits together.

Entanglement has a special mathematical meaning. If you can represent a state of a group of qubits by a kronecker product, they are not entangled. If you CAN’T represent a state of a group of qubits by a kronecker product, they ARE entangled. These two things – entanglement and lack of kronecker product factorability (made that term up) – are the same thing.

As an example, what two vectors could you use the kronecker product on to get the entangled two qubit state 1/\sqrt{2}(|00\rangle+|11\rangle) (or in vector form [1/\sqrt{2}, 0, 0, 1/\sqrt{2}])? You’ll find there aren’t any! That state is the entangled state where the two qubits will always have the same value when you observe them.

Entangled qubits are nothing special in quantum circuits. You don’t have to take any special precautions when working with them. They are an interesting byproduct of quantum computing, and so basically are something neat that comes out of your quantum circuit, but they don’t require any extra thought when using them within your circuits. Don’t worry about them too much (:

Multi Qubit Gates

Let’s have a look at some multi qubit gates! (These are again from Wikipedia: Quantum Gate)

Swap Gate

Given two qubits, with possible states |00\rangle, |01\rangle, |10\rangle, |11\rangle, this gate swaps the amplitudes (probabilities) of |01\rangle,  |10\rangle and leaves the probabilities of the states |00\rangle and |11\rangle alone.

That might seem weird, but the probability of the |00\rangle state is the probability of the first qubit being 0 added to the probability of the second qubit being 0. If those probabilities swap, they still add to the same value, so this probability is unaffected. It’s the same situation for the |11\rangle state.

Here’s the matrix for the swap gate:
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

Square Root of Swap Gate

I don’t really understand a usage case for this gate myself, but it apparently does a half way swap between two qubits. That once again modifies the probabilities of the |01\rangle,  |10\rangle states, but leaves state |00\rangle and |11\rangle alone again, for the same reason as the swap gate.

Wikipedia says that if you have this gate, and the single qubit gates, that you can do universal quantum computation. In other words, you can build a fully functional quantum computer using just this and the single qubit gates.

Here’s the matrix:
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \frac{1}{2}(1+i) & \frac{1}{2}(1-i) & 0 \\ 0 & \frac{1}{2}(1-i) & \frac{1}{2}(1+i) & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

Controlled Not Gate

The controlled not (CNOT) gate flips the value of the second qubit if the first qubit is true. This is a logic / flow control type of gate.

Interestingly, this gate is also useful for creating entangled qubits, which you’ll be able to see lower down!

Here is the matrix:
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ \end{bmatrix}

The controlled not gate is also referred to as the quantum XOR gate. To see why, check out the truth table below for this gate:
\begin{array}{c|c} \text{Input} & \text{Output} \\ \hline 00 & 00 \\ 01 & 01 \\ 10 & 11 \\ 11 & 10 \\ \end{array}

If you look at the right most bit of the outputs, you’ll notice that it’s the XOR of the two input bits!

All quantum gates need to be reversible, and having these two qubits be the output of a hypothetical quantum XOR gate allows that to happen. If you look at the right most bit of the inputs, you’ll notice that it is also the XOR of the two output bits. It’s bidirectional, which is kind of weird and kind of interesting ๐Ÿ˜›

Generalized Control Gate

You can actually convert any single qubit gate into a controlled gate. That makes a 2 qubit gate which only does work on the second qubit if the first qubit is true.

How you do that is you make a 4×4 identity matrix, and make the lower 2×2 sub-matrix into the single qubit matrix you want to use.

In other words, if your single qubit matrix is this:
\begin{bmatrix} U_{00} & U_{01} \\ U_{10} & U_{11} \\ \end{bmatrix}

Then the controlled version of the matrix would be this:
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & U_{00} & U_{01} \\ 0 & 0 & U_{10} & U_{11} \\ \end{bmatrix}

Pretty simple right?

Toffoli Gate

The Tofolli gate is a 3 qubit gate that is also known as the CCNOT gate or controlled controlled not gate. It flips the third qubit if the first two qubits are true.

It’s matrix looks pretty uninteresting:
\begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ \end{bmatrix}

This is also known as the quantum AND gate. If you look at the truth table below, you’ll see that when the input has the right most qubit of 0, that in the output, the right most qubit will be the AND value of the two left qubits. We need the three bits of input mapping to three bits of output like this so that the gate is reversible.

\begin{array}{c|c} \text{Input} & \text{Output} \\ \hline 000 & 000 \\ 001 & 001 \\ 010 & 010 \\ 011 & 011 \\ 100 & 100 \\ 101 & 101 \\ 110 & 111 \\ 111 & 110 \\ \end{array}

Fredkin Gate

The Fredkin gate is a controlled swap gate. Here’s the matrix:
\begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \end{bmatrix}

Just like we talked about how to make a generalized controlled gate acting on two qubits, you should be able to notice that this gate is just a specific version of a generalized controlled 3 qubit gate. You take an 8×8 identity matrix, and make the lower right 4×4 matrix be the 2 qubit gate that you want to add a control on. In this case, the lower right 4×4 matrix is just the swap gate we mentioned earlier in the post (:

Circuit To Entangle Qubits

The circuit to entangle qubits is pretty simple, so let’s start with that as the first quantum circuit we look at. It takes in two qubits. The first qubit is put through a Hadamard gate, and then both qubits are put through a controlled not gate. That circuit looks like this (image courtesy of Quantum Circuit Simulator):

The value you plug in for the second qubit determines what type of entanglement you get out. Setting the second qubit to 0, you will get entangled qubits which always agree when they are observed. Setting it to 1, you will get entangled qubits which always disagree when they are observed. The first qubit controls whether the phase of each state matches or mismatches. Note that these are the four Bell States (Wikipedia: Bell State).

\begin{array}{c|c|c} \text{Input} & \text{Output In Ket Notation} & \text{Output As Vector} \\ \hline 00 & \frac{1}{\sqrt{2}}(|00\rangle+|11\rangle) & [1/\sqrt{2},0,0,1/\sqrt{2}] \\ 01 & \frac{1}{\sqrt{2}}(|01\rangle+|10\rangle) & [0,1/\sqrt{2},1/\sqrt{2},0] \\ 10 & \frac{1}{\sqrt{2}}(|00\rangle-|11\rangle) & [1/\sqrt{2},0,0,-1/\sqrt{2}] \\ 11 & \frac{1}{\sqrt{2}}(|01\rangle-|10\rangle) & [0,1/\sqrt{2},-1/\sqrt{2},0]\\ \end{array}

In a quantum circuit, you can’t just apply a gate to an individual quabit at a time though. You have to make the matrix of your gate such that it applies to all qubits, applying the “identity” matrix to the qubits you don’t want to be affected by the gate.

So, how do we make a matrix that applies the Hadamard gate to qubit 1, and identity to qubit 2? You use the kronecker product!

Since we want to apply the Hadamard matrix to qubit 1 and identity to qubit 2, we are going to calculate H \otimes I (If we wanted to apply the Hadamard gate to qubit 2 and identity to qubit 1 we would calculate I \otimes H instead).

H \otimes I =  1/\sqrt{2}* \begin{bmatrix} 1 & 1 \\ 1 & -1 \\ \end{bmatrix} \otimes \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} = 1/\sqrt{2}* \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ \end{bmatrix}

If you notice, the result of the kronecker product is just every value in the left matrix, multiplied by every value in the right matrix. Basically, the result is a 2×2 grid of identity matrices, where each of those identity matrices is multiplied by the corresponding value from the same cell in the left matrix. Since the left matrix has a 1 in all cells except the lower right, the same is true of the result… it’s a positive identity matrix in each cell, except the lower right one, which is a negative identity matrix. Hopefully that makes sense, it’s a lot easier than it sounds…

The second gate in the quantum circuit is the CNOT gate. We can actually multiply the gate we just made by the CNOT gate to represent the full quantum circuit as a single matrix. This uses regular matrix multiplication.

1/\sqrt{2}* \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ \end{bmatrix} * \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ \end{bmatrix} = 1/\sqrt{2}* \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \\ \end{bmatrix} = \begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 1/\sqrt{2} & 0 & 0 & -1/\sqrt{2} \\ 0 & 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ \end{bmatrix}

Lets plug some values into the circuit and see what comes out!

Lets start by plugging in 00. Our input qubits are [1, 0] and [1, 0]. The kronecker product of those two qubit vectors is [1, 0, 0, 0]. Now let’s multiply that vector by the matrix of our quantum circuit.

\begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix} * \begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 1/\sqrt{2} & 0 & 0 & -1/\sqrt{2} \\ 0 & 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ \end{bmatrix} = \begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \end{bmatrix}

Comparing this to the table showing how our input maps to output, we got the right answer! If you plugged in the other values listed, you would get the rest of the bell state entanglements.

Unentangling Qubits

All gates are reversible, so all circuits are reversible. To get the reverse circuit for a given matrix, you just get the inverse of the matrix. Since quantum gates (and circuits) are unitary matrices, taking the inverse of one of these matrices just means taking the conjugate transpose of the matrix. In other words, you take the transpose of the matrix, and then just negate the imaginary component of any complex numbers. In this example, there are no imaginary numbers, so you just take the transpose of the matrix.

Since our circuit matrix is this:

\begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 1/\sqrt{2} & 0 & 0 & -1/\sqrt{2} \\ 0 & 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ \end{bmatrix}

That means that the inverse must be this, since this is the (conjugate) transpose.

\begin{bmatrix} 1/\sqrt{2} & 0 & 1/\sqrt{2} & 0 \\ 0 & 1/\sqrt{2} & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 0 & -1/\sqrt{2} \\ 1/\sqrt{2} & 0 & -1/\sqrt{2} & 0 \\ \end{bmatrix}

Let’s try it out by putting the output from last section in and seeing what comes out the other side:

\begin{bmatrix} 1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} \end{bmatrix} * \begin{bmatrix} 1/\sqrt{2} & 0 & 1/\sqrt{2} & 0 \\ 0 & 1/\sqrt{2} & 0 & 1/\sqrt{2} \\ 0 & 1/\sqrt{2} & 0 & -1/\sqrt{2} \\ 1/\sqrt{2} & 0 & -1/\sqrt{2} & 0 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}

It worked! We got our original input back.

Making Matrices for More Complex Circuits

We talked about how to make single qubit gates apply to specific qubits by using the kronecker product with identity to move the gate to the right location.

If you have 4 qubits and you want to apply some single qubit gate (say Hadamard) to the 3rd qubit, you would just do this to get the matrix:
I \otimes I \otimes H \otimes I

What if you want to do complex things with multi qubit gates though? Like what if you want to do a controlled not gate where you want the 2nd qubit to flip it’s value if the 4th qubit was true?

The answer is actually pretty simple… you use swaps gates to flip the values of the qubits so that your qubit values line up with the inputs of the gate, then you do the gate, and finally undo all the swaps to get the qubit values back to the right positions.

We need to swap qubits so that the 4th qubit value is in the 1st qubit slot, and the 2nd qubit value is in the 2nd qubit slot. We can do that with the following swaps:

Once we have done those swaps, we can do our controlled not, then do the swaps in reverse order to return the qubits to their original positions.

Here’s what the circuit looks like:

You’ll see the simpler version in circuit diagrams, but at least now you’ll know how to make things that look like this:

Below is the mathematical way that you would get the matrix representing this gate.

I is the identity matrix, S is the swap gate, and C is the controlled not gate.

M = \\ (I \otimes I \otimes S) * \\ (I \otimes S \otimes I) * \\ (S \otimes I \otimes I) * \\ (I \otimes S \otimes I) * \\ (C \otimes I \otimes I) * \\ (I \otimes S \otimes I) * \\ (S \otimes I \otimes I) * \\ (I \otimes S \otimes I) * \\ (I \otimes I \otimes S) \\

Code

Here is some simple C++ to show both the entanglement circuit and the more complicated controlled not gate we described.

#include <stdio.h>
#include <vector>
#include <complex>
#include <assert.h>

typedef std::complex<float> TComplex;

//=================================================================================
struct SComplexVector
{
public:
    TComplex& Get (size_t i) { return v[i]; }

    const TComplex& Get (size_t i) const { return v[i]; }

    size_t Size() const
    {
        return v.size();
    }

    void Resize (size_t s)
    {
        v.resize(s);
    }

    void Print () const
    {
        printf("[");
        for (size_t i = 0, c = v.size(); i < c; ++i)
        {
            if (i > 0)
                printf(", ");
            const TComplex& val = Get(i);
            if (val.imag() == 0.0f)
            {
                if (val.real() == 0.0f)
                    printf("0");
                else if (val.real() == 1.0f)
                    printf("1");
                else
                    printf("%0.2f", val.real());
            }
            else
                printf("%0.2f + %0.2fi", val.real(), val.imag());
        }
        printf("]n");
    }

    std::vector<TComplex> v;
};

//=================================================================================
struct SComplexMatrix
{
public:
    TComplex& Get (size_t x, size_t y)
    {
        return v[y*Size() + x];
    }

    const TComplex& Get (size_t x, size_t y) const
    {
        return v[y*Size() + x];
    }

    // For an MxM matrix, this returns M
    size_t Size () const
    {
        size_t ret = (size_t)sqrt(v.size());
        assert(ret*ret == v.size());
        return ret;
    }

    // For an MxM matrix, this sets M
    void Resize(size_t s)
    {
        v.resize(s*s);
    }

    void Print() const
    {
        const size_t size = Size();

        for (size_t y = 0; y < size; ++y)
        {
            printf("[");
            for (size_t x = 0; x < size; ++x)
            {
                if (x > 0)
                    printf(", ");
                
                const TComplex& val = Get(x, y);
                if (val.imag() == 0.0f)
                {
                    if (val.real() == 0.0f)
                        printf("0");
                    else if (val.real() == 1.0f)
                        printf("1");
                    else
                        printf("%0.2f", val.real());
                }
                else
                    printf("%0.2f + %0.2fi", val.real(), val.imag());
            }
            printf("]n");
        }
    }

    std::vector<TComplex> v;
};

//=================================================================================
static const SComplexVector c_qubit0 = { { 1.0f, 0.0f } };  // false aka |0>
static const SComplexVector c_qubit1 = { { 0.0f, 1.0f } };  // true aka |1>

// 2x2 identity matrix
static const SComplexMatrix c_identity2x2 =
{
    {
        1.0f, 0.0f,
        0.0f, 1.0f,
    }
};

// Given the states |00>, |01>, |10>, |11>, swaps the |01> and |10> state
// If swapping the probabilities of two qubits, it won't affect the probabilities
// of them both being on or off since those add together.  It will swap the odds of
// only one of them being on.
static const SComplexMatrix c_swapGate =
{
    {
        1.0f, 0.0f, 0.0f, 0.0f,
        0.0f, 0.0f, 1.0f, 0.0f,
        0.0f, 1.0f, 0.0f, 0.0f,
        0.0f, 0.0f, 0.0f, 1.0f
    }
};

// Controlled not gate
// If the first qubit is true, flips the value of the second qubit
static const SComplexMatrix c_controlledNotGate =
{
    {
        1.0f, 0.0f, 0.0f, 0.0f,
        0.0f, 1.0f, 0.0f, 0.0f,
        0.0f, 0.0f, 0.0f, 1.0f,
        0.0f, 0.0f, 1.0f, 0.0f
    }
};

// Hadamard gate
// Takes a pure |0> or |1> state and makes a 50/50 superposition between |0> and |1>.
// Put a 50/50 superposition through and get the pure |0> or |1> back.
// Encodes the origional value in the phase information as either matching or
// mismatching phase.
static const SComplexMatrix c_hadamardGate =
{
    {
        1.0f / std::sqrt(2.0f), 1.0f / std::sqrt(2.0f),
        1.0f / std::sqrt(2.0f), 1.0f / -std::sqrt(2.0f)
    }
};

//=================================================================================
void WaitForEnter ()
{
    printf("nPress Enter to quit");
    fflush(stdin);
    getchar();
}

//=================================================================================
SComplexVector KroneckerProduct (const SComplexVector& a, const SComplexVector& b)
{
    const size_t aSize = a.Size();
    const size_t bSize = b.Size();

    SComplexVector ret;
    ret.Resize(aSize*bSize);

    for (size_t i = 0, ic = aSize; i < ic; ++i)
    {
        for (size_t j = 0, jc = bSize; j < jc; ++j)
        {
            size_t n = i * bSize + j;
            ret.Get(n) = a.Get(i)*b.Get(j);
        }
    }
    return ret;
}

//=================================================================================
SComplexMatrix KroneckerProduct (const SComplexMatrix& a, const SComplexMatrix& b)
{
    const size_t aSize = a.Size();
    const size_t bSize = b.Size();

    SComplexMatrix ret;
    ret.Resize(aSize*bSize);

    for (size_t ax = 0; ax < aSize; ++ax)
    {
        for (size_t ay = 0; ay < aSize; ++ay)
        {
            const TComplex& aValue = a.Get(ax, ay);

            for (size_t bx = 0; bx < bSize; ++bx)
            {
                for (size_t by = 0; by < bSize; ++by)
                {
                    const TComplex& bValue = b.Get(bx, by);

                    size_t nx = ax*bSize + bx;
                    size_t ny = ay*bSize + by;

                    ret.Get(nx,ny) = aValue * bValue;
                }
            }
        }
    }

    return ret;
}

//=================================================================================
SComplexMatrix operator* (const SComplexMatrix& a, const SComplexMatrix& b)
{
    assert(a.Size() == b.Size());
    const size_t size = a.Size();

    SComplexMatrix ret;
    ret.Resize(size);

    for (size_t nx = 0; nx < size; ++nx)
    {
        for (size_t ny = 0; ny < size; ++ny)
        {
            TComplex& val = ret.Get(nx, ny);
            val = 0.0f;
            for (size_t i = 0; i < size; ++i)
                val += a.Get(i, ny) * b.Get(nx, i);
        }
    }

    return ret;
}

//=================================================================================
SComplexVector operator* (const SComplexVector& a, const SComplexMatrix& b)
{
    assert(a.Size() == b.Size());
    const size_t size = a.Size();

    SComplexVector ret;
    ret.Resize(size);

    for (size_t i = 0; i < size; ++i)
    {
        TComplex& val = ret.Get(i);
        val = 0;
        for (size_t j = 0; j < size; ++j)
            val += a.Get(j) * b.Get(i, j);
    }

    return ret;
}

//=================================================================================
int main (int argc, char **argv) {

    // 2 qubit entanglement circuit demo
    {
        // make the circuit
        const SComplexMatrix H1 = KroneckerProduct(c_hadamardGate, c_identity2x2);
        const SComplexMatrix circuit = H1 * c_controlledNotGate;

        // display the circuit
        printf("Entanglement circuit:n");
        circuit.Print();

        // permute the inputs and see what comes out when we pass them through the circuit!
        SComplexVector input = KroneckerProduct(c_qubit0, c_qubit0);
        SComplexVector output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(c_qubit0, c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(c_qubit1, c_qubit0);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(c_qubit1, c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();
    }

    // 4 qubit demo: flip the second qubit if the fourth qubit is true
    {
        // make the circuit
        const SComplexMatrix cnot4qubit = KroneckerProduct(KroneckerProduct(c_controlledNotGate, c_identity2x2), c_identity2x2);
        const SComplexMatrix swap12 = KroneckerProduct(KroneckerProduct(c_swapGate, c_identity2x2), c_identity2x2);
        const SComplexMatrix swap23 = KroneckerProduct(KroneckerProduct(c_identity2x2, c_swapGate), c_identity2x2);
        const SComplexMatrix swap34 = KroneckerProduct(KroneckerProduct(c_identity2x2, c_identity2x2), c_swapGate);
        const SComplexMatrix circuit = 
            swap34 *
            swap23 *
            swap12 *
            swap23 *
            cnot4qubit *
            swap23 *
            swap12 *
            swap23 *
            swap34;

        // display the circuit
        printf("nFlip 2nd qubit if 4th qubit true circuit:n");
        circuit.Print();

        // permute the inputs and see what comes out when we pass them through the circuit!
        SComplexVector input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit0), c_qubit0), c_qubit0);
        SComplexVector output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit0), c_qubit0), c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit0), c_qubit1), c_qubit0);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit0), c_qubit1), c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit1), c_qubit0), c_qubit0);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit1), c_qubit0), c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit1), c_qubit1), c_qubit0);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit0, c_qubit1), c_qubit1), c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit0), c_qubit0), c_qubit0);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit0), c_qubit0), c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit0), c_qubit1), c_qubit0);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit0), c_qubit1), c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit1), c_qubit0), c_qubit0);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit1), c_qubit0), c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit1), c_qubit1), c_qubit0);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();

        input = KroneckerProduct(KroneckerProduct(KroneckerProduct(c_qubit1, c_qubit1), c_qubit1), c_qubit1);
        output = input * circuit;
        printf("ninput:");
        input.Print();
        printf("output:");
        output.Print();
    }


    WaitForEnter();

    return 0;
}

Here is the first part of the output, which shows the results of the entangling circuit. You can see that the input gave the expected output Bell states:

Below is the second part of the output, which is the circuit that flips the 2nd qubit if the 4th qubit is true.

Each entry in the input and output vectors is the amplitude (probability) that the state it represents is true. The states start at the left with 0000, then 0001, then 0010 and continue until the very right most value which represents 1111.

If you look at the second input/output pair the input is [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0], which means it has a 100% chance of the 4 qubits being 0001 (aka the qubits represent the number 1). The output vector is [0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0], which means that it has a 100% chance of being 0101 (aka the qubits represent the number 5). Since the input did have the 4th bit set (4th from the left), it flipped the 2nd bit. So, you can see that it worked!

If you check all the input/output pairs, you’ll see that they all follow this rule.

We used only whole, real numbers, no using fractional probabilities, or imaginary amplitudes. What it does in those situations is a little bit harder to get intuition for, but rest assured that it does “the right thing” in those situations as well.

Next Up

Now that we have the basics of quantum computing down pretty well, it’s time to analyze a couple quantum algorithms to see how they work!