# Prefix Sums and Summed Area Tables

Prefix sums and summed area tables let you sum up regions of arrays or grids in constant time.

If that sounds like it might not have many uses, that is another way of saying that it does discrete integration in constant time, and can also be made to do some kinds of convolution.

These things come up quite a bit in game development and graphics so is pretty interesting for things like depth of field, glossy reflections, and maybe image based lighting. Check the links at the end of the post to see these things in action in some pretty interesting ways.

# One Dimension – Prefix Sums

Say that you have 10 numbers:

$\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{index} & \textbf{0} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} & \textbf{6} & \textbf{7} & \textbf{8} & \textbf{9} \\ \hline \textbf{value} & 8 & 3 & 7 & 4 & 12 & 6 & 4 & 10 & 1 & 2 \\ \hline \end{array}$

To sum up numbers in a given range you have to manually add up the numbers in that range.

Summing the numbers at index 2 through 5 inclusively takes 3 adds and gives you the answer 29. (index 2 + index 3 + index 4 + index 5)

Summing the numbers at index 0 through index 9 inclusively (the whole table) takes 9 adds to get the answer 57.

Interestingly there is a way to preprocess this data such that summing any range takes only a single subtraction. The technique is called a prefix sum table and you make the table by having the number at each index be the sum from index 0 to that index inclusively.

Here is the prefix sum table for the numbers above:

$\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{index} & \textbf{0} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} & \textbf{6} & \textbf{7} & \textbf{8} & \textbf{9} \\ \hline \textbf{value} & 8 & 11 & 18 & 22 & 34 & 40 & 44 & 54 & 55 & 57 \\ \hline \end{array}$

Now, to find the sum of range a to b inclusively, you start with the value at index b, and subtract the value at index (a-1).

So, to sum the numbers at index 2 through 5 like we did before, we’d start with the value at index 5 which is 40, and we subtract the value at index (2-1) aka index 1, which is 11. That gives us a result of 29 like our manual summing did before.

To sum the numbers at index 0 through index 9, we’d start with the value at index 9, which is 57, and subtract the value at index -1. Since we don’t have anything before index 0, the sum for anything before index 0 is 0. That makes our result be 57-0 or 57, which we calculated before.

Let’s move on to 2D!

## Two Dimensions – Making a Summed Area Table

In two dimensions, the same technique is called a summed area table, and things get only a little more complicated.

$i = \begin{array}{|c|c|c|c|} \hline 3 & 2 & 1 & 8 \\ \hline 9 & 11 & 15 & 0 \\ \hline 8 & 4 & 7 & 6 \\ \hline 12 & 7 & 8 & 3 \\ \hline \end{array}$

Then you make a grid of the same size, where the value at a location is the sum of all the values in the rectangle going from (0,0) to (x,y) inclusive. Assuming that (0,0) is in the top left, that would give us this summed area table:

$I = \begin{array}{|c|c|c|c|} \hline 3 & 5 & 6 & 14 \\ \hline 12 & 25 & 41 & 49 \\ \hline 20 & 37 & 60 & 74 \\ \hline 32 & 56 & 87 & 104 \\ \hline \end{array}$

You can literally sum up all the values for each index to make the table if you want to, but you can also use this formula which lets you iteratively create the table by starting at (0,0) and expand outwards from there. As before, when reading out of bounds values, just use zero.

$I(x,y)=i(x,y)+I(x,y-1)+I(x-1,y)-I(x-1,y-1)$

## Two Dimensions – Using a Summed Area Table

So we know that $I(x,y)$ is the sum of all the values in the rectangle from $(0,0)$ to $(x,y)$ inclusively, but what if we want to find the sum of a different rectangle? What if we have 4 points A,B,C,D and we want to know the sum of the numbers within that sub-rectangle?

With some cleverness we can calculate the sum inside this exact region.

First we get the value at point D, which gives us the sum of this rectangle:

Next, we subtract the value at point B, which gives us the sum of this rectangle:

The next step is to subtract the value at point C. The red area is a problem though as it has been subtracted out twice.

This is a problem that’s easily solved by adding the value at point A in, to give us our final result:

So, to summarize, using a summed area table to get the sum of all values in the rectangle defined by the points A,B,C,D is done by reading the values at points A,B,C,D and calculating: A+D-B-C

## Storage Costs

When you want to store numbers added together, you are going to need storage larger than what you are storing the numbers in.

For instance, if you have the table below using 3 bits per value:
$I = \begin{array}{|c|c|} \hline 7 & 7 \\ \hline 7 & 7 \\ \hline \end{array}$

Turning that into a summed area table, you are going to hit overflow problems:
$I = \begin{array}{|c|c|} \hline 7 & 6 \\ \hline 6 & 4 \\ \hline \end{array}$

For summing up N items, you need $log_2{(N)}$ more bits of storage which means we would need 2 more bits of storage in this case for the 2×2 grid (4 samples), making it be 5 bits total per value (3 bits of storage + 2 extra bits to hold the sum of 4 values). That would let us store the proper table:

$I = \begin{array}{|c|c|} \hline 7 & 14 \\ \hline 14 & 28 \\ \hline \end{array}$

Using the previously shown 2×2 table of 3bit 7’s as an example, what this means is that if you are only ever going to want to ask about 1×1 ranges (which is pointless to use summed area tables for, but makes a nice simple example), you don’t need 2 extra bits, and in fact don’t need any extra bits in this case since a 1×1 range is just 1 sample, and $log_2{(1)}$ is 0.

Looking back at the summed area table that had roll over problems:
$I = \begin{array}{|c|c|} \hline 7 & 6 \\ \hline 6 & 4 \\ \hline \end{array}$

Let’s ask about the range (1,1) to (1,1). So we start with the value at index (1,1) which is 4. Next we add in the value at index (0,0) which is 7 and get 11. Keeping that in 3 bits (eg mod 8), that gives us a value of 3. Next we subtract the value at index (0,1) aka 6, which keeping it in 3 bits gives us 5. Subtracting index (1,0) from that (6 again) and keeping it in 3 bits gives us 7.

So, the sum of the numbers from (1,1) to (1,1) – aka the VALUE in the original table at (1,1) – is 7. Since we made the table, we know this is correct.

It works interestingly!

If we did a 2×2 lookup instead, it would fall apart. we’d need those 2 extra bits since we’d be summing 4 samples, and $log_2{(4)}$ is 2.

So, just to re-iterate… summed area tables do need increased storage per data item to store the sums. However, while most descriptions base that increased storage on the size of the image being made into a summed area table, it is actually based on the largest range you want to sum from that table, which may be smaller than the total size.

I have an idea I’d like to try (next blog post?) where instead of storing the sum of the rectangle at each position, you store the sum divided by the area. In other words, you store the average value for the rectangle.

Calculating the sum for a specific rectangle then becomes getting the 4 values, multiplying by their area, and then doing the usual math.

Apparently this is similar to an idea of using floating point numbers in SAT, which also sounds interesting! Thread from Bart Wronski (https://twitter.com/BartWronsk):

While my idea is similar to using floating point, a handful of people (especially Tom Forsyth! https://twitter.com/tom_forsyth) have made sure I know that using floating point with large textures (~screen sized and above) is not a good idea.

Tom says:
“The entries in the bottom-right of the table start having very similar magnitudes, so the difference between them is very noisy. This is super obvious with float16s where you only have 10 bits of precision, which is less than most current screen widths.”

## Other Stuff

Bilinear Interpolation
If you are wondering whether you should use bilinear interpolation when using this technique (sample between pixels) or not, the answer is that you should. Bilinear interpolation is compatible with this technique and gives you the correct values for sub pixel sample points.

Higher Dimensions
This technique extends to 3 dimensions and beyond. The table still contains the sum of the numbers for the (hyper)rectangle from the origin to that specific index. The way you calculate the sum of a specific range is different in each dimension, but it’s similar, and you should be able to figure it out using the logic described in the 2d case!

Integrating / Summing Over Other Shapes

I had a thought on this that might not be so bad.

My thought was that if you had some shape you wanted to sum values over (aka integrate values over), that you could sum over the bounding box of the shape, divide by the area of the bounding box to get an average sum per unit for that area, and then multiply by the area of the shape you want to sum over.

This makes the assumption that the bounding box is representative of the data inside of the shape, so that makes this an approximation, but it might be good enough depending on your needs.

You might even try having a couple different summed area tables made from rotated versions of the image. That would allow you to get a tighter fitting bounding box in some situations.

I’m definitely not the first to think about how to do this though, and this is not the only way to do it. There is a link in the next section that talks about a different way to do it “Fast and Exact Convolution With Polygonal Filters” that also references a few other ways to do it.

Uses in Graphics / Other Links

Here is the paper from Franklin Crow in 1984 that introduces summed area tables as a way to get box filtered mipmapping on the fly without having to generate mipmaps in advance:
http://www.florian-oeser.de/wordpress/wp-content/2012/10/crow-1984.pdf

Here is a neat paper that talks about how to generate summed area tables efficiently on the GPU, and some interesting ways to use them for things like depth of field, glossy reflections, and refraction through frosted glass:

Here are some great reads from Fabien Giesen (https://twitter.com/rygorous) on doing fast blurs when the radius is very large. The second post also shows you how to do repeated box blurs to get tent filters, quadratic filters, cubic, etc and how they tend towards Gaussian. I’m sure there is some way to mix this concept with summed area tables to get higher order filters, but I haven’t found or worked out the details yet.
https://fgiesen.wordpress.com/2012/07/30/fast-blurs-1/
https://fgiesen.wordpress.com/2012/08/01/fast-blurs-2/

Here are some blog posts I made up explaining and demonstrating box blurs and Gaussian blurs:
https://blog.demofox.org/2015/08/18/box-blur/
https://blog.demofox.org/2015/08/19/gaussian-blur/

Bart also shared these really interesting links

“Fast and Exact Convolution With Polygonal Filters”
https://www.researchgate.net/publication/269699690_Fast_and_Exact_Convolution_with_Polygonal_Filters

“Fast Filter Spreading and its Applications”
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-54.pdf

“Filtering by repeated integration”
https://www.researchgate.net/publication/220721661_Filtering_by_repeated_integration

“Cinematic Depth Of Field: How to make big filters cheap”

# C++ Differentiable Programming: Searching For An Optimal Dither Pattern

The simple standalone C++ source code that implements this blog post and replicates the results shown is on github at: https://github.com/Atrix256/DitherFindGradientDescent

Neural networks are a hot topic right now. There is a lot of mystery and mystique surrounding them, but at their core, they are just simple programs where parameters are tuned using gradient descent.

(If curious about neural networks, you might find this interesting: How to Train Neural Networks With Backpropagation)

Gradient descent can be used in a lot of other situations though, and in fact, you can even generalize the core functionality of neural networks to work on other types of programs. That is exactly what we are doing in this post.

To be able to use gradient descent to optimize parameters of a program, your program has to be roughly of the form of:

1. It has parameters that specify how it processes some other data
2. There is some way for you to give a score to how well it did

Beyond those two points, much like as a shader program or a SIMD program, you want your program to be as branchless as possible. The reason for this is because ideally your entire program should be made up of differentiable operations. Branches (if statements) cause discontinuities and are not differentiable. There are ways to deal with branches, and some branches don’t actually impact the result, but it’s a good guideline to keep in mind. Because of this, you also want to stay away from non differentiable functions – such as a “step” function which you might be tempted to use instead of an if statement.

This post is going to go into detail about using differentiable programming in C++ for a specific goal. Results are shown, and the simple / no external dependency C++ code that generated them are at https://github.com/Atrix256/DitherFindGradientDescent.

First, let’s have a short introduction to gradient descent.

If you have a function of the form $f(x)$, it takes one input so is one dimensional.

You can think of a function like this as having a value for every point on the number line.

You can visualize those values as a height, which gives you a function of the form $y=f(x)$ which we are still going to call one dimensional, despite it now having two dimensions.

Let’s look at a function $y=3x+1$

You might remember that the equation of a line is $y=mx+b$ where m is the slope of the line ($\frac{\text{rise}}{\text{run}}$ or $\frac{y}{x}$) and b is where the line crosses the y axis.

In calculus, you learn that the slope m is also the derivative of the function: $\frac{dy}{dx}$

The slope / derivative tells you how much is added to y for every 1 you add to x.

Let’s say that you were on this graph at the point $x=1$ (which puts you at $y=4$), and let’s say that you want to go downhill from where you were at. You could do that by looking at the slope / derivative at that point, which is 3 (it’s 3 for every point on the line). Since the derivative is positive, that means going to the right will make the y value larger (you’ll go up hill) and going to the left will make the y value smaller (you’ll go down hill).

So, if you want to go downhill to a smaller y value, you know that you need to subtract values from x.

A simpler way to think of this is that you need to subtract the derivative from your x value to make your y value smaller.

That is a core fact that will help guide you through things as they get more difficult: subtract the derivative (later, subtract the gradient) to make your value smaller. The value subtracted is often multiplied by some scalar value to make it move faster or slower.

What happens if you have a more complex function, such as $y=(x-2)^2$?

Let’s say that you are on this graph at the point $x=1$, which puts you at $y=1$. Now, which way do you move to go downhill?

The derivative of this function is $y=2x-4$, which you can plug your x value into to get the slope / derivative at that point: -2.

Remembering that we subtract that derivative to go down hill, that means we need to subtract a negative value from our x; aka we need to ADD a value to our x.

As you can see, adding a value to x and making it move to the right does in fact make us go down hill.

The rule works, hooray!

Things do get a little more complex when there’s more than one dimension, but not really that much more complex, so hang in there!

Let’s look at the function $z=xy$

Let’s say that we are at the (x,y) point (1,1) – in the upper right corner – which puts us at $z=1$, and let’s say that we want to go down hill. Instead of just having one variable to take the derivative of (x), we now have two variables (x and y). How are we going to deal with this?

First up, we are going to pretend that y is a constant value, and not actually a variable. This will give us the partial derivative for x: $\frac{\partial z}{\partial x}$. That tells us how much we would add to z if we added one to x. It’s a slope that is specifically down the x axis.

In this case, the partial derivative of z with respect to x is just: y.

Doing the same thing for the other variable, the partial derivative of z with respect to y is just: x.

Now that we have partial derivatives for each variable, we put them into a vector. This vector is called the gradient, and has some intimidating notation that looks like this:

$\nabla z = \nabla f(x,y) = (\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y})$

For this function, the gradient is:

$\nabla z = \nabla f(x,y) = (y,x)$

That makes the gradient at our specific point:

$\nabla z = \nabla f(1,1) = (1,1)$

In the last section we saw that the derivative / slope pointed to where the function got larger. The same thing is true of gradients, they point in the direction where the function gets larger too!

So, if we want to go downhill, we need to subtract values from our x and our y to go there. In fact, we know that the steepest way down from our current point is when we subtract the same value from both x and y. This is because the gradient doesn’t just point to where it gets larger, it points to where it gets larger the FASTEST. So, the reverse of the gradient also points to where it gets smaller the fastest.

Pretty cool huh?

You can confirm this visually by looking at the graph of the function.

One last things about slopes, derivatives and gradients before moving on. While they do point in the direction of greatest increase, they are only valid for an infinitely small point on the graph for functions that are non linear. This will be important later when we move in the opposite direction of the gradients, but do so with very small steps to help make sure we find the lowest points on the graph.

Why do we want to use gradient descent? Imagine that we have a function:

$w=f(x,y,z)$

Sure, we can pick some random starting values for x,y and z, and then use gradient descent to find the smallest w, but who cares?

Let’s give some other names to these variables and see if the value becomes a little more apparent:

$DamageTakenMultiplier = CalculateDamageTakenMultiplier(Armor, Dodge, Resist)$

Now, by only changing the names of the variables, we can see that we could use gradient descent to find what amount of Armor, Dodge and Resist would make it so our character takes the least amount of damage. This can now tell you how to distribute stat points to a character to get the best results 😛

Note that if you are ever trying to find the highest number possible, instead of the lowest, you can just multiply your function by -1 and do everything else the same way. You could also do gradient ASCENT, but it’s equivalent to multiplying by -1 and doing gradient descent.

## Problems

Here are a few common problems you can encounter when doing gradient descent.

• Local minima – when you get to the bottom of a bowl, but it isn’t the deepest bowl.
• Flat derivatives – these make it hard to escape a local area because the derivatives are very small, which will make each movement also very small.
• Discontinuities – The problem space (graph) changes abruptly without warning, making gradient descent do the wrong thing

Here’s an example of a local minima versus a global minima. You can see that depending on where you start on this graph, you might end up in the deeper bowl, or the shallower bowl if your only rule is “move downhill”.

(Image from wikipedia By KSmrq – http://commons.wikimedia.org/wiki/File:Extrema_example.svg, GFDL 1.2, https://commons.wikimedia.org/w/index.php?curid=6870865)

Here’s an example of a flat derivative. You can imagine that if you were at x=1, that you could see that the derivative would tell you to go to the left to decrease the y value, but it’s a very, very small number. This is a problem because it’s common to multiply the derivative or gradient by a multiplier before subtracting it, so you’d only take a very small step towards the goal.

It’s also possible to hit a perfectly flat derivative, which will be exactly 0. In this case, no matter how big or small of a number you multiply the derivative by, you won’t move AT ALL!

Below is a discontinuous function where if x is less than 0.5, the value is 1, otherwise the value is x. This essentially shows you what happens when you use if statements in differentiable programming. If you start on the right side, it’s going to correctly tell you that you should move left to improve your score. However, it’ll keep telling you to move left, until you get to x being less than 0.5, at which point your score will suddenly get a lot worse and your derivative will become 0. You will now be stuck!

There are ways to deal with these problems, but they are deep topics. If nothing else, you should know these problems exist, so you can know when they are affecting you, and/or why you should avoid them if you have a choice.

## What If I Want to Avoid Calculus?

Let’s say that you don’t get a kick out of calculating all these partial derivatives. Or, more pragmatically, you don’t want to sit down and manually calculate the gradient function of some generic C++ code!

I have some great news for you.

While we do need partial derivatives for our gradients, we aren’t going to have to do all this calculus to get them!

Here are a few other ways to get partial derivatives:

• Finite Differences – Conceptually super simple, but slow to calculate and not always very precise. More info: Finite Differences
• Backpropagation – What neural networks use. Also called backwards mode automatic differentiation. Fast but a bit complex mentally. I linked this already but for more info: How to Train Neural Networks With Backpropagation
• Dual Numbers – Also called forward mode automatic differentiation. Not as fast as backwards mode, but in the same neighborhood for speed. Super, super convinient and awesome for programmers. I love these. More info: Dual Numbers & Automatic Differentiation

Care to guess which one we are going to use? Yep, Dual Numbers!

In a nutshell, if you have code that uses floats, you can change it to use a templated type instead. Then, you put dual numbers through your code instead of floats. The output you get will be the specific value output from your code, but also the GRADIENT of your code at that value. Even better, this isn’t a numerical method (it’s not an approximation), it’s analytical (it’s exact).

That is seriously all there is to it. Dual numbers are amazing!

Since you made the code templated, you can still use it for floats when you don’t want or need the gradient.

## Differentiable Programming / Gradient Descent Skeleton

Here’s the general skeleton we are going to be following for using gradient descent with our differentiable program.

1. Initialize the parameters to random (but valid) values, storing them in dual numbers.
2. Run the code that does our work, taking dual numbers as input for the parameters of how it does the work.
3. Put the result (which is dual numbers) into a scoring function to give us a score. Usually the score is such that smaller numbers are better. If not, just multiply the score by -1 so it is.
4. Since we did the work and calculated the score using dual numbers, we now have a gradient which describes how we need to adjust the parameters to make our score better.
5. Adjust our parameters using the gradient and go back to step 2. Repeating until whatever exit condition we want is hit: maybe when a certain number of iterations happen, or maybe when our score gets below a certain value.

That’s our game plan. Let’s dive into the specific problem we are going to be attacking.

## Searching For an Ideal Dithering Pattern

Here is the problem we want to tackle:

We want to find a 3×3 dithering pattern such that when we use it to dither an image (by repeating the 3×3 pattern over and over across the image), and then blur the result by a specific amount, that it’s as close as possible to the original image being blurred by that same amount.

That sounds a bit challenging right? It’s not actually that bad, don’t worry (:

The steps the code has to do (differentiably) are:

1. Dither the source image
2. Blur the results
3. Blur the source image
4. Calculate a score for how similar they are
5. Use all this with Gradient Descent to optimize the dither pattern

Once again, we need to do this stuff differentiably, using dual numbers, so that we get a gradient for how to modify the dither pattern to better our score.

### Step 1 – Dither Source Image

Dithering an image is a pretty simple process.

We are going to be dithering it such that we take a greyscale image as input and convert it to a black and white image using the dither pattern.

(If you are starting with a color image, this shows how to convert it to greyscale: Converting RGB to Grayscale)

For every pixel (x,y) in the source image, you look at pixel (x%3, y%3) in the dither pattern, and if the dither pattern pixel is less than the source, you write a black pixel out, else you write a white pixel out.

if (sourcePixel(x,y) < ditherPixel(x%3, y%3))
pixelOut(x,y) = 0.0;
else
pixelOut(x,y) = 1.0;


There’s a problem though… this is a branch, which makes a discontinuity, which will make it so we can’t have good derivatives to help us get to the goal.

Another way to write the dithering operation above is to write it like this:

difference = ditherPixel(x%3, y%3) - sourcePixel(x,y);
pixelOut(x,y) = step(difference);


Where “step” is the “heaviside step function”, which is 1 if x >= 0, otherwise is 0.

(Image from Wikipedia By Omegatron (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)%5D, via Wikimedia Commons)

That got rid of the branch (if statement), but we still have a discontinuous function.

Luckily we can approximate a step function with other functions. I decided to use the formula $0.5+atan(100*x)/pi$ which looks like this:

Unfortunately, I found that my results weren’t that good, so i switched it to $0.5+atan(10000*x)/pi$ which ended up working better for me:

This function does have the problem of having flat derivatives, but I found that it worked pretty well anyways. The flat derivatives don’t seem to be a big problem in this case luckily.

To put it all together, the differentiable version of dithering a pixel that I use looks like this:

difference = ditherPixel(x%3, y%3) - sourcePixel(x,y);
pixelOut(x,y) = 0.5+atan(10000.0f * difference) / pi;


As input to this dithering process, we take:

• The source image
• a 3×3 dither pattern, where each pixel is a dual number

As output this dithering process gives us:

• A dithered image that is converted to black and white (either a 1.0 or 0.0 value per pixel)
• It’s the same size as the source image
• Each pixel is a dual number with 9 derivatives. There is one derivative per dither pixel.

### Step 2 – Blur the Results

Blurring the results of the dither wasn’t that difficult. I used a Gaussian blur, but other blurs could be used easily.

I had some Gaussian blur code laying around (from this blog post: Gaussian Blur) and I converted it to using a templated type instead of floats/pixels where appropriate, also making sure there were no branches or anything discontinuous.

It turned out there wasn’t a whole lot to fix up here luckily so wasn’t too difficult.

This allowed me to take the dithered results which are a dual number per pixel, and do a Gaussian blur on them, preserving and correctly modifying the gradient (derivatives) as it did the Blur.

### Step 3 – Blur the Source Image

Blurring the source image was easy since the last step made a generic gaussian blur function. I used the generic Gaussian blur function to blur the image. This doesn’t need to be done as dual numbers, so it was regular pixels in and regular pixels out.

You might wonder why this part doesn’t need to be done as dual numbers.

The simple answer is because these values are in no way dependant on the dither pattern (which are what we are tracking with the derivatives).

The more mathematical explanation is that you could in fact consider these dual numbers, which just have a gradient of zero because they are essentially constants that have nothing to do (yet) with the parameters of the function. The gradient would just implicitly be zero, like any other constant value you might introduce to the function.

### Step 4 – Calculating a Similarity Score

Next up I needed to calculate a similarity score between the dithered then blurred results (which is made up of dual numbers), and the source image which was blurred (and is made up of regular pixels).

The similarity score I went with is just MSE or “Mean Squared Error”.

To calculate MSE, for every pixel you do this:

error = ditheredBlurredImage(x,y) - blurredImage(x,y);
errorSquared = error * error;


After you have the squared error for every pixel, you just take the average of them to get the MSE.

An interesting thing about MSE is that because errors are squared, it will favor smaller errors much more than larger errors, which is a nice property.

A not so nice property about MSE is that it might decide something is a small difference mathematically even though a human would decide that it was a huge difference perceptually. The reverse is also true. Despite this, I chose it because it is simple and I ended up getting decent results with it.

If you want to go down the rabbit hole of looking at “perceptual similarity scores of images” check out these links:

After this step, we have an MSE value which says how similar the images are. A lower value means lower average squared error, so lower numbers are indeed better.

What else is nice is that the MSE value is a dual number with a gradient that has the 9 partial derivatives that describe how much the MSE changes as you adjust each parameter.

That gradient tells us how to adjust the parameters (the 3×3 dither pixels!) to lower the MSE!

### Step 5 – Putting it All Together

Now it’s time to put all of this together and use gradient descent to make our dither pattern better.

Here’s how the program operates:

1. Initialize the 3×3 dither pattern to random values, setting the derivatives to 1.0 in the gradient, for the variable that they represent.
2. do 1000 iterations of this loop:
1. Dither and blur the source image
2. Calculate MSE of this result compared to the source image blurred
3. Using the gradient from the MSE value, subtract the respective partial derivative from each of the pixels in the dither pattern, but scaling the partial derivative by a “learning rate”.
3. Output the best result found

The learning rate starts at 3.0 at loop iteration 0, but decays with each iteration, down to 0.1 at iteration 999. It starts above 1 to help escape local minima, and uses a very small rate at the end to try and get deeper into whatever minimum it has found.

After adjusting the dither pattern pixels, I clamp them to be between 0 and 1.

Something else I ought to mention is that while I’m doing the gradient descent, I keep track of the best scoring dither pattern seen.

This way, after the 1000 iterations are up, if we ever saw anything better than where we are at currently, we just use that instead of the final result.

Presumably, if you tune your parameters (learning rate, iterations, etc!) correctly, this won’t come up often, but it’s always a possibility that your final state is not the best state encountered, so this is a nice way to get better results more often.

## Results

Did you notice that I called this post “searching for an ideal dither pattern” instead of “finding an ideal dither pattern”? (:

The results are decent, but I know they could be better. Even so, I think the techniques talked about here are a good start going down the path of differentiable programming, and similar topics.

Here are some results I was able to get with the code. Click to see the full size images. The shrunken down images have aliasing issues.

The images left to right are: The original, the dither pattern used (repeated), the dithered image, the blurred dither image, and lastly the blurred original image. The program aims to make the last two images look as close as possible as it can, using MSE as the metric for how close they are.

Here is the starting state of using a Gaussian blur with a sigma of 10:

Here it is after the 1000 iterations of gradient descent. Notice the black blob at the top is gone compared to where it started.

Here’s the starting state when using a Gaussian blur sigma of 1:

And here it is after 1000 iterations, which is pretty decent results:

Lastly, here it is with no blurring whatsoever:

And after 1000 iterations, I think it actually looks worse!

Using no blur at all makes for some really awful results. The blur gives the algorithm more freedom on how it can succeed, whereas with no blur, there is a lot less wiggle room for finding a solution.

Another benefit of using the blur before MSE calculation is that a blur is a low pass filter. That means that higher frequencies are gone before the MSE calculation. The result of this is that the algorithm will favor results which are closer to blue noise dithering. Pretty neat right?!

## Closing

I hope you enjoyed this journey through differentiable programming and gradient descent, and I hope you were able to follow along.

Here are some potentially interesting things to do beyond what we talked about here:

• Have it learn from a set of images, instead of only this single image. This should help prevent “over fitting” and let it find a dither pattern which works well for all images instead of just this one specific image.
• Use a separate set of images to gauge the accuracy of the result that weren’t used as part of the training, to help prove that it really hasn’t overfit the training data.
• Try applying “small corruption” in the learning to help prevent overfitting or getting stuck in local minima – one idea would be to have some percentage chance per derivative that you don’t apply the change to the dither pattern pixel. This would add some randomness to the gradient descent instead of it only being down the steepest direction all of the time.
• Instead of optimizing the dithering patterns, you could make a formula that generated the dithering patterns, and instead optimize the coefficients / terms of that formula. If you get good results, you’ll end up with a formula you can use for dithering instead of a pattern, which might be nice for the case of avoiding a texture read in a pixel shader to do the dithering.

I’m not a data scientist or machine learning expert by any means, so there are plenty of improvements to be made. There is a lot of overlap with what is being done here and other algorithms – both in the machine learning realm and outside of the machine learning realm.

For one, you can use Newton’s method for gradient descent. It can find minima faster by using the second derivative in the calculations as well.

However, this algorithm is almost purely “exploitative” in that wherever you start with your parameters, it will try to go from there to the deepest point in whatever valley it’s already in. Some other types of algorithms differ from this in that they are more “explorative” and try to find other valleys, but aren’t always as good at finding the deepest part of the valleys that they do find. More explorative algorithms include simulated annealing, differential evolution, and genetic algorithms.

If you enjoyed this post, check out this book for deeper details on algorithms relating to gradient descent (simulated annealing, genetic algorithms, etc!). It’s a very good book and very easy to read!
Essentials of Metaheuristics

Any corrections to what i’ve said, the code, or suggestions for improvements, please let me know by leaving a comment here, or hitting me up on twitter: https://twitter.com/Atrix256

# Counting Bits & The Normal Distribution

I recently saw some interesting posts on twitter about the normal distribution:

I’m not really a statistics kind of guy, but knowing that probability distributions come up in graphics (Like in PBR & Path Tracing), it seemed like a good time to upgrade knowledge in this area while sharing an interesting technique for generating normal distribution random numbers.

## Basics

Below is an image showing a few normal (aka Gaussian) distributions (from wikipedia).

Normal distributions are defined by these parameters:

• $\mu$ – “mu” is the mean. This is the average value of the distribution. This is where the center (peak) of the curve is on the x axis.
• $\sigma^2$ – “sigma squared” is the variance, and is just the standard deviation squared. I find standard deviation more intuitive to think about.
• $\sigma$ – “sigma” is the standard deviation, which (surprise surprise!) is the square root of the variance. This controls the “width” of the graph. The area under the cover is 1.0, so as you increase standard deviation and make the graph wider, it also gets shorter.

Here’s a diagram of standard deviations to help understand them (also from wikipedia):

I find the standard deviation intuitive because 68.2% of the data is within one standard deviation from the mean (on the plus and minus side of the mean). 95.4% of the data is within two standard deviations of the mean.

Standard deviation is given in the same units as the data itself, so if a bell curve described scores on a test, with a mean of 80 and a standard deviation of 5, it means that 68.2% of the students got between 75 and 85 points on the test, and that 95.4% of the students got between 70 and 90 points on the test.

The normal distribution is what’s called a “probability density function” or pdf, which means that the y axis of the graph describes the likelyhood of the number on the x axis being chosen at random.

This means that if you have a normal distribution that has a specific mean and variance (standard deviation), that numbers closer to the mean are more likely to be chosen randomly, while numbers farther away are less likely. The variance controls how the probability drops off as you get farther away from the mean.

Thinking about standard deviation again, 68.2% of the random numbers generated will be within 1 standard deviation of the mean (+1 std dev or -1 std dev). 95.4% will be within 2 standard deviations.

## Generating Normal Distribution Random Numbers – Coin Flips

Generating uniform random numbers, where every number is as likely as every other number, is pretty simple. In the physical world, you can roll some dice or flip some coins. In the software world, you can use PRNGs.

How would you generate random numbers that follow a normal distribution though?

In C++, there is std::normal_distribution that can do this for you. There is also something called the Box-Muller transform that can turn uniformly distributed random numbers into normal distribution random numbers (info here: Generating Gaussian Random Numbers).

I want to talk about something else though and hopefully build some better intuition.

First let’s look at coin flips.

If you flip a fair coin a million times and keep a count of how many heads and tails you saw, you might get 500014 heads and 499986 tails (I got this with a PRNG – std::mt19937). That is a pretty uniform distribution of values in the range of [0,1]. (breadcrumb: pascal’s triangle row 2 is 1,1)

Let’s flip two coins at a time though and add our values together (say that heads is 0 and tails is 1). Here’s what that graph looks like:

Out of 1 million flips, 250639 had no tails, 500308 had one tail, and 249053 had two tails. It might seem weird that they aren’t all even, but it makes more sense when you look at the outcome of flipping two coins: we can get heads/heads (00), heads/tails (01), tails/heads (10) or tails/tails (11). Two of the four possibilities have a single tails, so it makes sense that flipping two coins and getting one coin being a tail would be twice as likely as getting no tails or two tails. (breadcrumb: pascal’s triangle row 3 is 1,2,1)

What happens when we sum 3 coins? With a million flips I got 125113 0’s, 375763 1’s, 373905 2’s and 125219 3’s.

If you work out the possible combinations, there is 1 way to get 0, 3 ways to get 1, 3 ways to get 2 and 1 way to get 3. Those numbers almost exactly follow that 1, 3, 3, 1 probability. (breadcrumb: pascal’s triangle row 4 is 1,3,3,1)

If we flip 100 coins and sum them, we get this:

That looks a bit like the normal distribution graphs at the beginning of this post doesn’t it?

Flipping and summing coins will get you something called the “Binomial Distribution”, and the interesting thing there is that the binomial distribution approaches the normal distribution the more coins you are summing together. At an infinite number of coins, it is the normal distribution.

## Generating Normal Distribution Random Numbers – Dice Rolls

What if instead of flipping coins, we roll dice?

Well, rolling a 4 sided die a million times, you get each number roughly the same percentage of the time as you’d expect; roughly 25% each. 250125 0’s, 250103 1’s, 249700 2’s, 250072 3’s.

If we sum two 4 sided dice rolls we get this:

If we sum three 4 sided dice rolls we get this:

And if we sum one hundred we get this, which sure looks like a normal distribution:

This isn’t limited to four sided dice though, here’s one hundred 6 sided dice being summed:

With dice, instead of being a “binomial distribution”, it’s called a “multinomial distribution”, but as the number of dice goes to infinity, it also approaches the normal distribution.

This means you can get a normal distribution with not only coins, but any sided dice in general.

An even stronger statement than that is the Central Limit Theorem which says that if you have random numbers from ANY distribution, if you add enough of em together, you’ll often approach a normal distribution.

Strange huh?

## Generating Normal Distribution Random Numbers – Counting Bits

Now comes a fun way of generating random numbers which follow a normal distribution. Are you ready for it?

Simply generate an N bit random number and return how many 1 bits are set.

That gives you a random number that follows a normal distribution!

One problem with this is that you have very low “resolution” random numbers. Counting the bits of a 64 bit random number for instance, you can only return 0 through 64 so there are only 65 possible random numbers.

That is a pretty big limitation, but if you need normal distribution numbers calculated quickly and don’t mind if they are low resolution (like in a pixel shader?), this technique could work well for you.

Another problem though is that you don’t have control over the variance or the mean of the distribution.

That isn’t a super huge deal though because you can easily convert numbers from one normal distribution into another normal distribution.

To do so, you get your normal distribution random number. First you subtract the mean of the distribution to make it centered on 0 (have a mean of 0). You then divide it by the standard deviation to make it be part of a distribution which has a standard deviation of 1.

At this point you have a random number from a normal distribution which has a mean of 0 and a standard deviation of 1.

Next, you multiply the number by the standard deviation of the distribution you want, and lastly you add the mean of the distribution you want.

That’s pretty simple (and is implemented in the source code at the bottom of this post), but to do this you need to know what standard deviation (variance) and mean you are starting with.

If you have some way to generate random numbers in [0, N) and you are summing M of those numbers together, the mean is $M*(N-1)/2$. Note that if you instead are generating random numbers in [1,N], the mean instead is $M*(N+1)/2$.

The variance in either case is $M*(N^2-1)/12$. The standard deviation is the square root of that.

Using that information you have everything you need to generate normal distribution random numbers of a specified mean and variance.

Thanks to @fahickman for the help on calculating mean and variance of dice roll sums.

## Code

Here is the source code I used to generate the data which was used to generate the graphs in this post. There is also an implementation of the bit counting algorithm i mentioned, which converts to the desired mean and variance.

#define _CRT_SECURE_NO_WARNINGS

#include <array>
#include <random>
#include <stdint.h>
#include <stdio.h>
#include <limits>

const size_t c_maxNumSamples = 1000000;
const char* c_fileName = "results.csv";

template <size_t DiceRange, size_t DiceCount, size_t NumBuckets>
void DumpBucketCountsAddRandomNumbers (size_t numSamples, const std::array<size_t, NumBuckets>& bucketCounts)
{
// open file for append if we can
FILE* file = fopen(c_fileName, "a+t");
if (!file)
return;

// write the info
float mean = float(DiceCount) * float(DiceRange - 1.0f) / 2.0f;
float variance = float(DiceCount) * (DiceRange * DiceRange) / 12.0f;
if (numSamples == 1)
{
fprintf(file, "\"%zu random numbers [0,%zu) added together (sum %zud%zu). %zu buckets.  Mean = %0.2f.  Variance = %0.2f.  StdDev = %0.2f.\"\n", DiceCount, DiceRange, DiceCount, DiceRange, NumBuckets, mean, variance, std::sqrt(variance));
fprintf(file, "\"\"");
for (size_t i = 0; i < NumBuckets; ++i)
fprintf(file, ",\"%zu\"", i);
fprintf(file, "\n");
}
fprintf(file, "\"%zu samples\",", numSamples);

// report the samples
for (size_t count : bucketCounts)
fprintf(file, "\"%zu\",", count);

fprintf(file, "\"\"\n");
if (numSamples == c_maxNumSamples)
fprintf(file, "\n");

// close file
fclose(file);
}

template <size_t DiceSides, size_t DiceCount>
{
std::mt19937 rng;
rng.seed(std::random_device()());
std::uniform_int_distribution<size_t> dist(size_t(0), DiceSides - 1);

std::array<size_t, (DiceSides - 1) * DiceCount + 1> bucketCounts = { 0 };

size_t nextDump = 1;
for (size_t i = 0; i < c_maxNumSamples; ++i)
{
size_t sum = 0;
for (size_t j = 0; j < DiceCount; ++j)
sum += dist(rng);

bucketCounts[sum]++;

if (i + 1 == nextDump)
{
nextDump *= 10;
}
}
}

template <size_t NumBuckets>
void DumpBucketCountsCountBits (size_t numSamples, const std::array<size_t, NumBuckets>& bucketCounts)
{
// open file for append if we can
FILE* file = fopen(c_fileName, "a+t");
if (!file)
return;

// write the info
float mean = float(NumBuckets-1) * 1.0f / 2.0f;
float variance = float(NumBuckets-1) * 3.0f / 12.0f;
if (numSamples == 1)
{
fprintf(file, "\"%zu random bits (coin flips) added together. %zu buckets.  Mean = %0.2f.  Variance = %0.2f.  StdDev = %0.2f.\"\n", NumBuckets - 1, NumBuckets, mean, variance, std::sqrt(variance));
fprintf(file, "\"\"");
for (size_t i = 0; i < NumBuckets; ++i)
fprintf(file, ",\"%zu\"", i);
fprintf(file, "\n");
}
fprintf(file, "\"%zu samples\",", numSamples);

// report the samples
for (size_t count : bucketCounts)
fprintf(file, "\"%zu\",", count);

fprintf(file, "\"\"\n");
if (numSamples == c_maxNumSamples)
fprintf(file, "\n");

// close file
fclose(file);
}

template <size_t NumBits> // aka NumCoinFlips!
void CountBitsTest ()
{

size_t maxValue = 0;
for (size_t i = 0; i < NumBits; ++i)
maxValue = (maxValue << 1) | 1;

std::mt19937 rng;
rng.seed(std::random_device()());
std::uniform_int_distribution<size_t> dist(0, maxValue);

std::array<size_t, NumBits + 1> bucketCounts = { 0 };

size_t nextDump = 1;
for (size_t i = 0; i < c_maxNumSamples; ++i)
{
size_t sum = 0;
size_t number = dist(rng);
while (number)
{
if (number & 1)
++sum;
number = number >> 1;
}

bucketCounts[sum]++;

if (i + 1 == nextDump)
{
DumpBucketCountsCountBits(nextDump, bucketCounts);
nextDump *= 10;
}
}
}

float GenerateNormalRandomNumber (float mean, float variance)
{
static std::mt19937 rng;
static std::uniform_int_distribution<uint64_t> dist(0, (uint64_t)-1);

static bool seeded = false;
if (!seeded)
{
seeded = true;
rng.seed(std::random_device()());
}

// generate our normal distributed random number from 0 to 65.
//
float sum = 0.0f;
uint64_t number = dist(rng);
while (number)
{
if (number & 1)
sum += 1.0f;
number = number >> 1;
}

// convert from: mean 32, variance 16, stddev 4
// to: mean 0, variance 1, stddev 1
float ret = sum;
ret -= 32.0f;
ret /= 4.0f;

// convert to the specified mean and variance
ret *= std::sqrt(variance);
ret += mean;
return ret;
}

void VerifyGenerateNormalRandomNumber (float mean, float variance)
{
// open file for append if we can
FILE* file = fopen(c_fileName, "a+t");
if (!file)
return;

// write info
fprintf(file, "\"Normal Distributed Random Numbers. mean = %0.2f.  variance = %0.2f.  stddev = %0.2f\"\n", mean, variance, std::sqrt(variance));

// write some random numbers
fprintf(file, "\"100 numbers\"");
for (size_t i = 0; i < 100; ++i)
fprintf(file, ",\"%f\"", GenerateNormalRandomNumber(mean, variance));
fprintf(file, "\n\n");

// close file
fclose(file);
}

int main (int argc, char **argv)
{
// clear out the file
FILE* file = fopen(c_fileName, "w+t");
if (file)
fclose(file);

// coin flips
{
// flip a fair coin

// flip two coins and sum them

// sum 3 coin flips

// sum 100 coin flips
}

// dice rolls
{
// roll a 4 sided die

// sum two 4 sided dice

// sum three 4 sided dice

// sum one hundred 4 sided dice

// sum one hundred 6 sided dice
}

CountBitsTest<8>();
CountBitsTest<16>();
CountBitsTest<32>();
CountBitsTest<64>();

VerifyGenerateNormalRandomNumber(0.0f, 20.0f);

VerifyGenerateNormalRandomNumber(0.0f, 10.0f);

VerifyGenerateNormalRandomNumber(5.0f, 10.0f);

return 0;
}


# SIMD / GPU Friendly Branchless Binary Search

The other day I was thinking about how you might do a binary search branchlessly. I came up with a way, and I’m pretty sure I’m not the first to come up with it, but it was fun to think about and I wanted to share my solution.

Here it is searching a list of 8 items in 3 steps:

size_t BinarySearch8 (size_t needle, const size_t haystack[8])
{
size_t ret = (haystack[4] <= needle) ? 4 : 0;
ret += (haystack[ret + 2] <= needle) ? 2 : 0;
ret += (haystack[ret + 1] <= needle) ? 1 : 0;
return ret;
}


The three steps it does are:

1. The list has 8 items in it. We test index 4 to see if we need to continue searching index 0 to 3, or index 4 to 7. The returned index becomes either 0xx or 1xx.
2. The list has 4 items in it now. We test index 2 to see if we need to continue searching index 0 to 1, or index 2 to 3. The returned index becomes one of the following: 00x, 01x, 10x or 11x.
3. The list has 2 items in it. We test index 1 to see if we need to take the left item or the right item. The returned index becomes: 000, 001, 010, 011, 100, 101, 110, or 111.

## But Big O Complexity is Worse!

Usually a binary search can take up to $O(log N)$ steps, where $N$ is the number of items in the list. In this post’s solution, it always takes $log_2N$ steps.

It probably seems odd that this branchless version could be considered an improvement when it has big O complexity that is always the worst case of a regular binary search. That is strange, but in the world of SIMD and shader programs, going branchless can be a big win that is not captured by looking at big O complexity. (Note that cache coherancy and thread contention are two other things not captured by looking at big O complexity).

Also, when working in something like video games or other interactive simulations, an even frame rate is more important than a high frame rate for making a game look and feel smooth. Because of this, if you have algorithms that have very fast common cases but much slower worst cases, you may actually prefer to use an algorithm that is slower in the common case but faster in the worst case just to keep performance more consistent. Using an algorithm such as this, which has a constant amount of work regardless of input can be a good trade off there.

Lastly, in cryptographic applications, attackers can gather secret information by seeing how long certain operations take. For instance, if you use a shorter password than other people, an attacker may be able to detect that by seeing that it consistently takes you a little bit less time to login than other people. They now have an idea of the length of your password, and maybe will brute force you, knowing that you are low hanging fruit!

These timing based attacks can be thwarted by algorithms which run at a constant time regardless of input. This algorithm is one of those algorithms.

As an example of another algorithm that runs in constant time regardless of input, check out CORDIC math. I really want to write up a post on that someday, it’s pretty cool stuff.

You might have noticed that if the item you are searching for isn’t in the list, the function doesn’t return anything indicating that, and you might think that’s strange.

This function actually just returns the largest index that isn’t greater than the value you are searching for. If all the numbers are greater than the value you are searching for, it returns zero.

This might seem odd but this can actually come in handy if the list you are searching represents something like animation data, where there are keyframes sorted by time, and you want to find which two keyframes you are between so that you can interpolate.

To actually test if your value was in the list, you could do an extra check:

    size_t searchValue = 3;
size_t index = BinarySearch8(searchValue, list);
bool found = (list[index] == searchValue);


If you need that extra check, it’s easy enough to add, and if you don’t need that extra check, it’s nice to not have it.

## Without Ternary Operator

If in your setup you don’t have a ternary operator, or if the ternary operator isn’t branchless for you, you get the same results using multiplication:

size_t BinarySearch8 (size_t needle, const size_t haystack[8])
{
size_t ret = (haystack[4] <= needle) * 4;
ret += (haystack[ret + 2] <= needle) * 2;
ret += (haystack[ret + 1] <= needle) * 1;
return ret;
}


Note that on some platforms, the less than or equal test will be a branch! None of the platforms or compilers I tested had that issue but if you find yourself hitting that issue, you can do a branchless test via subtraction or similar.

Here is a godbolt link that lets you view the assembly for various compilers. When you open the link you’ll see clang doing this work branchlessly.
View Assembly

@adamjmiles from twitter also verified that GCN does it branchlessly, which you can see at the link below. Thanks for that!
View GCN Assembly

Something to keep in mind for the non GPU case though is that if you were doing this in SIMD, you’d be using SIMD intrinsics.

## Larger Lists

It’s trivial to search larger numbers of values. Here it is searching 16 items in 4 steps:

size_t BinarySearch16 (size_t needle, const size_t haystack[16])
{
size_t ret = (haystack[8] <= needle) ? 8 : 0;
ret += (haystack[ret + 4] <= needle) ? 4 : 0;
ret += (haystack[ret + 2] <= needle) ? 2 : 0;
ret += (haystack[ret + 1] <= needle) ? 1 : 0;
return ret;
}


And here it is searching 32 items in 5 steps:

size_t BinarySearch32 (size_t needle, const size_t haystack[32])
{
size_t ret = (haystack[16] <= needle) ? 16 : 0;
ret += (haystack[ret + 8] <= needle) ? 8 : 0;
ret += (haystack[ret + 4] <= needle) ? 4 : 0;
ret += (haystack[ret + 2] <= needle) ? 2 : 0;
ret += (haystack[ret + 1] <= needle) ? 1 : 0;
return ret;
}


## Non Power of 2 Lists

Let’s say that your list is not a perfect power of two in length. GASP!

You can still use the technique, but you treat it as if it has the next power of 2 up items, and then make sure your indices stay in range. The nice part here is that you don’t have to do extra work on the index at each step of the way, only in the places where it’s possible for the index to go out of range.

Here it is searching an array of size 7 in 3 steps:

size_t BinarySearch7 (size_t needle, const size_t haystack[7])
{
size_t ret = 0;
size_t testIndex = 0;

// test index is at most 4, so is within range.
testIndex = ret + 4;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 6, so is within range.
testIndex = ret + 2;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 7, so could be out of range.
// use min() to make sure the index stays in range.
testIndex = std::min<size_t>(ret + 1, 6);
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

return ret;
}


There are some other techniques for dealing with non power of 2 sized lists that you can find in the links at the bottom, but there was one particularly interesting that my friend and ex boss James came up with.

Basically, you start out with something like this if you were searching a list of 7 items:

    // 7 because the list has 7 items in it.
// 4 because it's half of the next power of 2 that is >= 7.
ret = (haystack[4] <= needle) * (7-4);


The result is that instead of having ret go to either 0 or 4, it goes to 0 or 3.

From there, in both cases you have 4 items in your sublist remaining, so you don’t need to worry about the index going out of bounds from that point on.

## Code

Here’s some working code demonstrating the ideas above, as well as it’s output.

#include <algorithm>
#include <stdlib.h>

size_t BinarySearch8 (size_t needle, const size_t haystack[8])
{
// using ternary operator
size_t ret = (haystack[4] <= needle) ? 4 : 0;
ret += (haystack[ret + 2] <= needle) ? 2 : 0;
ret += (haystack[ret + 1] <= needle) ? 1 : 0;
return ret;
}

size_t BinarySearch8b (size_t needle, const size_t haystack[8])
{
// using multiplication
size_t ret = (haystack[4] <= needle) * 4;
ret += (haystack[ret + 2] <= needle) * 2;
ret += (haystack[ret + 1] <= needle) * 1;
return ret;
}

size_t BinarySearch7 (size_t needle, const size_t haystack[7])
{
// non perfect power of 2.  use min() to keep it from going out of bounds.
size_t ret = 0;
size_t testIndex = 0;

// test index is 4, so is within range.
testIndex = ret + 4;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 6, so is within range.
testIndex = ret + 2;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 7, so could be out of range.
// use min() to make sure the index stays in range.
testIndex = std::min<size_t>(ret + 1, 6);
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

return ret;
}

int main (int argc, char **argv)
{
// search a list of size 8
{
// show the data
printf("Seaching through a list with 8 items:\n");
size_t data[8] = { 1, 3, 5, 6, 9, 11, 15, 21 };
printf("data = [");
for (size_t i = 0; i < sizeof(data)/sizeof(data[0]); ++i)
{
if (i > 0)
printf(", ");
printf("%zu", data[i]);
}
printf("]\n");

// do some searches on it using ternary operation based function
printf("\nTernary based searches:\n");
#define FIND(needle) printf("Find " #needle ": index = %zu, value = %zu, found = %s\n", BinarySearch8(needle, data), data[BinarySearch8(needle, data)], data[BinarySearch8(needle, data)] == needle ? "true" : "false");
FIND(2);
FIND(3);
FIND(0);
FIND(22);
FIND(16);
FIND(15);
FIND(21);
#undef FIND

// do some searches on it using multiplication based function
printf("\nMultiplication based searches:\n");
#define FIND(needle) printf("Find " #needle ": index = %zu, value = %zu, found = %s\n", BinarySearch8b(needle, data), data[BinarySearch8b(needle, data)], data[BinarySearch8b(needle, data)] == needle ? "true" : "false");
FIND(2);
FIND(3);
FIND(0);
FIND(22);
FIND(16);
FIND(15);
FIND(21);
#undef FIND

printf("\n\n\n\n");
}

// search a list of size 7
{
// show the data
printf("Seaching through a list with 7 items:\n");
size_t data[7] = { 1, 3, 5, 6, 9, 11, 15};
printf("data = [");
for (size_t i = 0; i < sizeof(data)/sizeof(data[0]); ++i)
{
if (i > 0)
printf(", ");
printf("%zu", data[i]);
}
printf("]\n");

// do some searches on it using ternary operation based function
printf("\nTernary based searches:\n");
#define FIND(needle) printf("Find " #needle ": index = %zu, value = %zu, found = %s\n", BinarySearch7(needle, data), data[BinarySearch7(needle, data)], data[BinarySearch7(needle, data)] == needle ? "true" : "false");
FIND(2);
FIND(3);
FIND(0);
FIND(22);
FIND(16);
FIND(15);
FIND(21);
#undef FIND

printf("\n\n\n\n");
}

system("pause");
return 0;
}


## Closing

Another facet of binary searching is that it isn’t the most cache friendly algorithm out there. There might be some value in combining the above with the information in the link below.

Cache-friendly binary search

If you like this sort of thing, here is an interesting paper from this year (2017):
Array Layouts For Comparison-Based Searching

And further down the rabbit hole a bit, this talks about re-ordering the search array to fit things into a cache line better:
https://people.mpi-inf.mpg.de/~rgemulla/publications/schlegel09search.pdf

Taking the next step is Intel and Oracle’s FAST paper:
http://www.timkaldewey.de/pubs/FAST__TODS11.pdf

Florian Gross from twitch made me aware of the last two links and also mentioned his master’s these in this area (thank you Florian!):

@rygorous mentioned on twitter some improvements such as ternary and quaternary search, as well as a way to handle the case of non power of 2 sized lists without extra index checks:

Thanks to everyone who gave feedback. It’s a very interesting topic, of which this post only seems to scratch this surface!

Hopefully you found this interesting. Questions, comments, corrections, let me know!

# When Random Numbers Are Too Random: Low Discrepancy Sequences

Random numbers can be useful in graphics and game development, but they have a pesky and sometimes undesirable habit of clumping together.

This is a problem in path tracing and monte carlo integration when you take N samples, but the samples aren’t well spread across the sampling range.

This can also be a problem for situations like when you are randomly placing objects in the world or generating treasure for a treasure chest. You don’t want your randomly placed trees to only be in one part of the forest, and you don’t want a player to get only trash items or only godly items when they open a treasure chest. Ideally you want to have some randomness, but you don’t want the random number generator to give you all of the same or similar random numbers.

The problem is that random numbers can be TOO random, like in the below where you can see clumps and large gaps between the 100 samples.

For cases like that, when you want random numbers that are a little bit more well distributed, you might find some use in low discrepancy sequences.

The standalone C++ code (one source file, standard headers, no libraries to link to) I used to generate the data and images are at the bottom of this post, as well as some links to more resources.

# What Is Discrepancy?

In this context, discrepancy is a measurement of the highest or lowest density of points in a sequence. High discrepancy means that there is either a large area of empty space, or that there is an area that has a high density of points. Low discrepancy means that there are neither, and that your points are more or less pretty evenly distributed.

The lowest discrepancy possible has no randomness at all, and in the 1 dimensional case means that the points are evenly distributed on a grid. For monte carlo integration and the game dev usage cases I mentioned, we do want some randomness, we just want the random points to be spread out a little more evenly.

If more formal math notation is your thing, discrepancy is defined as:
$D_{N}(P)=\sup _{{B\in J}}\left|{\frac {A(B;P)}{N}}-\lambda _{s}(B)\right|$

Equidistributed sequence

For monte carlo integration specifically, this is the behavior each thing gives you:

• High Discrepancy: Random Numbers / White Noise aka Uniform Distribution – At lower sample counts, convergance is slower (and have higher variance) due to the possibility of not getting good coverage over the area you integrating. At higher sample counts, this problem disappears. (Hint: real time graphics and preview renderings use a smaller number of samples)
• Lowest Discrepancy: Regular Grid – This will cause aliasing, unlike the other “random” based sampling, which trade aliasing for noise. Noise is preferred over aliasing.
• Low Discrepancy: Low Discrepancy Sequences – In lower numbers of samples, this will have faster convergence by having better coverage of the sampling space, but will use randomness to get rid of aliasing by introducing noise.

Also interesting to note, Quasi Monte Carlo has provably better asymptotic convergence than regular monte carlo integration.

# 1 Dimensional Sequences

We’ll first look at 1 dimensional sequences.

## Grid

Here are 100 samples evenly spaced:

## Random Numbers (White Noise)

This is actually a high discrepancy sequence. To generate this, you just use a standard random number generator to pick 100 points between 0 and 1. I used std::mt19937 with a std::uniform_real_distribution from 0 to 1:

## Subrandom Numbers

Subrandom numbers are ways to decrease the discrepancy of white noise.

One way to do this is to break the sampling space in half. You then generate even numbered samples in the first half of the space, and odd numbered samples in the second half of the space.

There’s no reason you can’t generalize this into more divisions of space though.

This splits the space into 4 regions:

8 regions:

16 regions:

32 regions:

There are other ways to generate subrandom numbers though. One way is to generate random numbers between 0 and 0.5, and add them to the last sample, plus 0.5. This gives you a random walk type setup.

Here is that:

## Uniform Sampling + Jitter

If you take the first subrandom idea to the logical maximum, you break your sample space up into N sections and place one point within those N sections to make a low discrepancy sequence made up of N points.

Another way to look at this is that you do uniform sampling, but add some random jitter to the samples, between +/- half a uniform sample size, to keep the samples in their own areas.

This is that:

I have heard that Pixar invented this technique interestingly.

# Irrational Numbers

Rational numbers are numbers which can be described as fractions, such as 0.75 which can be expressed as 3/4. Irrational numbers are numbers which CANNOT be described as fractions, such as pi, or the golden ratio, or the square root of a prime number.

Interestingly you can use irrational numbers to generate low discrepancy sequences. You start with some value (could be 0, or could be a random number), add the irrational number, and modulus against 1.0. To get the next sample you add the irrational value again, and modulus against 1.0 again. Rinse and repeat until you get as many samples as you want.

Some values work better than others though, and apparently the golden ratio is provably the best choice (1.61803398875…), says Wikipedia.

Here is the golden ratio, using 4 different random (white noise) starting values:

Here I’ve used the square root of 2, with 4 different starting random numbers again:

Lastly, here is pi, with 4 random starting values:

## Van der Corput Sequence

The Van der Corput sequence is the 1d equivelant of the Halton sequence which we’ll talk about later.

How you generate values in the Van der Corput sequence is you convert the index of your sample into some base.

For instance if it was base 2, you would convert your index to binary. If it was base 16, you would convert your index to hexadecimal.

Now, instead of treating the digits as if they are $B^0$, $B^1$, $B^2$, etc (where B is the base), you instead treat them as $B^{-1}$, $B^{-2}$, $B^{-3}$ and so on. In other words, you multiply each digit by a fraction and add up the results.

To show a couple quick examples, let’s say we wanted sample 6 in the sequence of base 2.

First we convert 6 to binary which is 110. From right to left, we have 3 digits: a 0 in the 1’s place, a 1 in the 2’s place, and a 1 in the 4’s place. $0*1 + 1*2 + 1*4 = 6$, so we can see that 110 is in fact 6 in binary.

To get the Van der Corput value for this, instead of treating it as the 1’s, 2’s and 4’s digit, we treat it as the 1/2, 1/4 and 1/8’s digit.

$0 * 1/2 + 1 * 1/4 + 1 * 1/8 = 3/8$.

So, sample 6 in the Van der Corput sequence using base 2 is 3/8.

Let’s try sample 21 in base 3.

First we convert 21 to base 3 which is 210. We can verify this is right by seeing that $0 * 1 + 1 * 3 + 2 * 9 = 21$.

Instead of a 1’s, 3’s and 9’s digit, we are going to treat it like a 1/3, 1/9 and 1/27 digit.

$0 * 1/3 + 1 * 1/9 + 2 * 1/27 = 5/27$

So, sample 21 in the Van der Corput sequence using base 3 is 5/27.

Here is the Van der Corput sequence for base 2:

Here it is for base 3:

Base 4:

Base 5:

## Sobol

One dimensional Sobol is actually just the Van der Corput sequence base 2 re-arranged a little bit, but it’s generated differently.

You start with 0 (either using it as sample 0 or sample -1, doesn’t matter which), and for each sample you do this:

1. Calculate the Ruler function value for the current sample’s index(more info in a second)
2. Make the direction vector by shifting 1 left (in binary) 31 – ruler times.
3. XOR the last sample by the direction vector to get the new sample
4. To interpret the sample as a floating point number you divide it by $2^{32}$

That might sound completely different than the Van der Corput sequence but it actually is the same thing – just re-ordered.

In the final step when dividing by $2^{32}$, we are really just interpreting the binary number as a fraction just like before, but it’s the LEFT most digit that is the 1/2 spot, not the RIGHT most digit.

The Ruler Function goes like: 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, …

It’s pretty easy to calculate too. Calculating the ruler function for an index (starting at 1) is just the zero based index of the right most 1’s digit after converting the number to binary.

1 in binary is 001 so Ruler(1) is 0.
2 in binary is 010 so Ruler(2) is 1.
3 in binary is 011 so Ruler(3) is 0.
4 in binary is 100 so Ruler(4) is 2.
5 in binary is 101 so Ruler(5) is 0.
and so on.

Here is 1D Sobol:

# Hammersley

In one dimension, the Hammersley sequence is the same as the base 2 Van der Corput sequence, and in the same order. If that sounds strange that it’s the same, it’s a 2d sequence I broke down into a 1d sequence for comparison. The one thing Hammersley has that makes it unique in the 1d case is that you can truncate bits.

It doesn’t seem that useful for 1d Hammersley to truncate bits but knowing that is useful info too I guess. Look at the 2d version of Hammersley to get a fairer look at it, because it’s meant to be a 2d sequence.

Here is Hammersley:

With 1 bit truncated:

With 2 bits truncated:

# Poisson Disc

Poisson disc points are points which are densely packed, but have a minimum distance from each other.

Computer scientists are still working out good algorithms to generate these points efficiently.

I use “Mitchell’s Best-Candidate” which means that when you want to generate a new point in the sequence, you generate N new points, and choose whichever point is farthest away from the other points you’ve generated so far.

Here it is where N is 100:

# 2 Dimensional Sequences

Next up, let’s look at some 2 dimensional sequences.

## Grid

Below is 2d uniform samples on a grid.

Note that uniform grid is not particularly low discrepancy for the 2d case! More info here: Is it expected that uniform points would have non zero discrepancy?

## Random

Here are 100 random points:

## Uniform Grid + Jitter

Here is a uniform grid that has random jitter applied to the points. Jittered grid is a pretty commonly used low discrepancy sampling technique that has good success.

## Subrandom

Just like in 1 dimensions, you can apply the subrandom ideas to 2 dimensions where you divide the X and Y axis into so many sections, and randomly choose points in the sections.

If you divide X and Y into the same number of sections though, you are going to have a problem because some areas are not going to have any points in them.

@Reedbeta pointed out that instead of using i%x and i%y, that you could use i%x and (i/x)%y to make it pick points in all regions.

Picking different numbers for X and Y can be another way to give good results. Here’s dividing X and Y into 2 and 3 sections respectively:

If you choose co-prime numbers for divisions for each axis you can get maximal period of repeats. 2 and 3 are coprime so the last example is a good example of that, but here is 3 and 11:

Here is 3 and 97. 97 is large enough that with only doing 100 samples, we are almost doing jittered grid on the y axis.

Here is the other subrandom number from 1d, where we start with a random value for X and Y, and then add a random number between 0 and 0.5 to each, also adding 0.5, to make a “random walk” type setup again:

## Halton

The Halton sequence is just the Van der Corput sequence, but using a different base on each axis.

Here is the Halton sequence where X and Y use bases 2 and 3:

Here it is using bases 5 and 7:

Here are bases 13 and 9:

# Irrational Numbers

The irrational numbers technique can be used for 2d as well but I wasn’t able to find out how to make it give decent looking output that didn’t have an obvious diagonal pattern in them. Bart Wronski shared a neat paper that explains how to use the golden ratio in 2d with great success: Golden Ratio Sequences For Low-Discrepancy Sampling

This uses the golden ratio for the X axis and the square root of 2 for the Y axis. Below that is the same, with a random starting point, to make it give a different sequence.

Here X axis uses square root of 2 and Y axis uses square root of 3. Below that is a random starting point, which gives the same discrepancy.

## Hammersley

In 2 dimensions, the Hammersley sequence uses the 1d Hammersley sequence for the X axis: Instead of treating the binary version of the index as binary, you treat it as fractions like you do for Van der Corput and sum up the fractions.

For the Y axis, you just reverse the bits and then do the same!

Here is the Hammersley sequence. Note we would have to take 128 samples (not just the 100 we did) if we wanted it to fill the entire square with samples.

Truncating bits in 2d is a bit useful. Here is 1 bit truncated:

2 bits truncated:

## Poisson Disc

Using the same method we did for 1d, we can generate points in 2d space:

## N Rooks

There is a sampling pattern called N-Rooks where you put N rooks onto a chess board and arrange them such that no two are in the same row or column.

A way to generate these samples is to realize that there will be only one rook per row, and that none of them will ever be in the same column. So, you make an array that has numbers 0 to N-1, and then shuffle the array. The index into the array is the row, and the value in the array is the column.

Here are 100 rooks:

## Sobol

Sobol in two dimensions is more complex to explain so I’ll link you to the source I used: Sobol Sequences Made Simple.

The 1D sobol already covered is used for the X axis, and then something more complex was used for the Y axis:

Bart Wronski has a really great series on a related topic: Dithering in Games

Wikipedia: Low Discrepancy Sequence

Wikipedia: Halton Sequence

Wikipedia: Van der Corput Sequence

Using Fibonacci Sequence To Generate Colors

Deeper info and usage cases for low discrepancy sequences

Poisson-Disc Sampling

Low discrepancy sequences are related to blue noise. Where white noise contains all frequencies evenly, blue noise has more high frequencies and fewer low frequencies. Blue noise is essentially the ultimate in low discrepancy, but can be expensive to compute. Here are some pages on blue noise:

Free Blue Noise Textures

The problem with 3D blue noise

Stippling and Blue Noise

Vegetation placement in “The Witness”

Here are some links from @marc_b_reynolds:
Sobol (low-discrepancy) sequence in 1-3D, stratified in 2-4D.
Classic binary-reflected gray code.
Sobol.h

Weyl Sequence

## Code

#define _CRT_SECURE_NO_WARNINGS

#include <windows.h>  // for bitmap headers and performance counter.  Sorry non windows people!
#include <vector>
#include <stdint.h>
#include <random>
#include <array>
#include <algorithm>
#include <stdlib.h>
#include <set>

typedef uint8_t uint8;

#define NUM_SAMPLES 100  // to simplify some 2d code, this must be a square
#define NUM_SAMPLES_FOR_COLORING 100

// Turning this on will slow things down significantly because it's an O(N^5) operation for 2d!
#define CALCULATE_DISCREPANCY 0

#define IMAGE1D_WIDTH 600
#define IMAGE1D_HEIGHT 50
#define IMAGE2D_WIDTH 300
#define IMAGE2D_HEIGHT 300

#define AXIS_HEIGHT 40
#define DATA_HEIGHT 20
#define DATA_WIDTH 2

#define COLOR_FILL SColor(255,255,255)
#define COLOR_AXIS SColor(0, 0, 0)

//======================================================================================
struct SImageData
{
SImageData ()
: m_width(0)
, m_height(0)
{ }

size_t m_width;
size_t m_height;
size_t m_pitch;
std::vector<uint8> m_pixels;
};

struct SColor
{
SColor (uint8 _R = 0, uint8 _G = 0, uint8 _B = 0)
: R(_R), G(_G), B(_B)
{ }

uint8 B, G, R;
};

//======================================================================================
bool SaveImage (const char *fileName, const SImageData &image)
{
// open the file if we can
FILE *file;
file = fopen(fileName, "wb");
if (!file) {
printf("Could not save %s\n", fileName);
return false;
}

// write the data and close the file
fclose(file);

return true;
}

//======================================================================================
void ImageInit (SImageData& image, size_t width, size_t height)
{
image.m_width = width;
image.m_height = height;
image.m_pitch = 4 * ((width * 24 + 31) / 32);
image.m_pixels.resize(image.m_pitch * image.m_width);
std::fill(image.m_pixels.begin(), image.m_pixels.end(), 0);
}

//======================================================================================
void ImageClear (SImageData& image, const SColor& color)
{
uint8* row = &image.m_pixels[0];
for (size_t rowIndex = 0; rowIndex < image.m_height; ++rowIndex)
{
SColor* pixels = (SColor*)row;
std::fill(pixels, pixels + image.m_width, color);

row += image.m_pitch;
}
}

//======================================================================================
void ImageBox (SImageData& image, size_t x1, size_t x2, size_t y1, size_t y2, const SColor& color)
{
for (size_t y = y1; y < y2; ++y)
{
uint8* row = &image.m_pixels[y * image.m_pitch];
SColor* start = &((SColor*)row)[x1];
std::fill(start, start + x2 - x1, color);
}
}

//======================================================================================
float Distance (float x1, float y1, float x2, float y2)
{
float dx = (x2 - x1);
float dy = (y2 - y1);

return std::sqrtf(dx*dx + dy*dy);
}

//======================================================================================
SColor DataPointColor (size_t sampleIndex)
{
SColor ret;
float percent = (float(sampleIndex) / (float(NUM_SAMPLES_FOR_COLORING) - 1.0f));

ret.R = uint8((1.0f - percent) * 255.0f);
ret.G = 0;
ret.B = uint8(percent * 255.0f);

float mag = (float)sqrt(ret.R*ret.R + ret.G*ret.G + ret.B*ret.B);
ret.R = uint8((float(ret.R) / mag)*255.0f);
ret.G = uint8((float(ret.G) / mag)*255.0f);
ret.B = uint8((float(ret.B) / mag)*255.0f);

return ret;
}

//======================================================================================
float RandomFloat (float min, float max)
{
static std::random_device rd;
static std::mt19937 mt(rd());
std::uniform_real_distribution<float> dist(min, max);
return dist(mt);
}

//======================================================================================
size_t Ruler (size_t n)
{
size_t ret = 0;
while (n != 0 && (n & 1) == 0)
{
n /= 2;
++ret;
}
return ret;
}

//======================================================================================
float CalculateDiscrepancy1D (const std::array<float, NUM_SAMPLES>& samples)
{
// some info about calculating discrepancy
// https://math.stackexchange.com/questions/1681562/how-to-calculate-discrepancy-of-a-sequence

// Calculates the discrepancy of this data.
// Assumes the data is [0,1) for valid sample range
std::array<float, NUM_SAMPLES> sortedSamples = samples;
std::sort(sortedSamples.begin(), sortedSamples.end());

float maxDifference = 0.0f;
for (size_t startIndex = 0; startIndex <= NUM_SAMPLES; ++startIndex)
{
// startIndex 0 = 0.0f.  startIndex 1 = sortedSamples[0]. etc

float startValue = 0.0f;
if (startIndex > 0)
startValue = sortedSamples[startIndex - 1];

for (size_t stopIndex = startIndex; stopIndex <= NUM_SAMPLES; ++stopIndex)
{
// stopIndex 0 = sortedSamples[0].  startIndex[N] = 1.0f. etc

float stopValue = 1.0f;
if (stopIndex < NUM_SAMPLES)
stopValue = sortedSamples[stopIndex];

float length = stopValue - startValue;

// open interval (startValue, stopValue)
size_t countInside = 0;
for (float sample : samples)
{
if (sample > startValue &&
sample < stopValue)
{
++countInside;
}
}
float density = float(countInside) / float(NUM_SAMPLES);
float difference = std::abs(density - length);
if (difference > maxDifference)
maxDifference = difference;

// closed interval [startValue, stopValue]
countInside = 0;
for (float sample : samples)
{
if (sample >= startValue &&
sample <= stopValue)
{
++countInside;
}
}
density = float(countInside) / float(NUM_SAMPLES);
difference = std::abs(density - length);
if (difference > maxDifference)
maxDifference = difference;
}
}
return maxDifference;
}

//======================================================================================
float CalculateDiscrepancy2D (const std::array<std::array<float, 2>, NUM_SAMPLES>& samples)
{
// some info about calculating discrepancy
// https://math.stackexchange.com/questions/1681562/how-to-calculate-discrepancy-of-a-sequence

// Calculates the discrepancy of this data.
// Assumes the data is [0,1) for valid sample range.

// Get the sorted list of unique values on each axis
std::set<float> setSamplesX;
std::set<float> setSamplesY;
for (const std::array<float, 2>& sample : samples)
{
setSamplesX.insert(sample[0]);
setSamplesY.insert(sample[1]);
}
std::vector<float> sortedXSamples;
std::vector<float> sortedYSamples;
sortedXSamples.reserve(setSamplesX.size());
sortedYSamples.reserve(setSamplesY.size());
for (float f : setSamplesX)
sortedXSamples.push_back(f);
for (float f : setSamplesY)
sortedYSamples.push_back(f);

// Get the sorted list of samples on the X axis, for faster interval testing
std::array<std::array<float, 2>, NUM_SAMPLES> sortedSamplesX = samples;
std::sort(sortedSamplesX.begin(), sortedSamplesX.end(),
[] (const std::array<float, 2>& itemA, const std::array<float, 2>& itemB)
{
return itemA[0] < itemB[0];
}
);

// calculate discrepancy
float maxDifference = 0.0f;
for (size_t startIndexY = 0; startIndexY <= sortedYSamples.size(); ++startIndexY)
{
float startValueY = 0.0f;
if (startIndexY > 0)
startValueY = *(sortedYSamples.begin() + startIndexY - 1);

for (size_t startIndexX = 0; startIndexX <= sortedXSamples.size(); ++startIndexX)
{
float startValueX = 0.0f;
if (startIndexX > 0)
startValueX = *(sortedXSamples.begin() + startIndexX - 1);

for (size_t stopIndexY = startIndexY; stopIndexY <= sortedYSamples.size(); ++stopIndexY)
{
float stopValueY = 1.0f;
if (stopIndexY < sortedYSamples.size())
stopValueY = sortedYSamples[stopIndexY];

for (size_t stopIndexX = startIndexX; stopIndexX <= sortedXSamples.size(); ++stopIndexX)
{
float stopValueX = 1.0f;
if (stopIndexX < sortedXSamples.size())
stopValueX = sortedXSamples[stopIndexX];

// calculate area
float length = stopValueX - startValueX;
float height = stopValueY - startValueY;
float area = length * height;

// open interval (startValue, stopValue)
size_t countInside = 0;
for (const std::array<float, 2>& sample : samples)
{
if (sample[0] > startValueX &&
sample[1] > startValueY &&
sample[0] < stopValueX &&
sample[1] < stopValueY)
{
++countInside;
}
}
float density = float(countInside) / float(NUM_SAMPLES);
float difference = std::abs(density - area);
if (difference > maxDifference)
maxDifference = difference;

// closed interval [startValue, stopValue]
countInside = 0;
for (const std::array<float, 2>& sample : samples)
{
if (sample[0] >= startValueX &&
sample[1] >= startValueY &&
sample[0] <= stopValueX &&
sample[1] <= stopValueY)
{
++countInside;
}
}
density = float(countInside) / float(NUM_SAMPLES);
difference = std::abs(density - area);
if (difference > maxDifference)
maxDifference = difference;
}
}
}
}

return maxDifference;
}

//======================================================================================
void Test1D (const char* fileName, const std::array<float, NUM_SAMPLES>& samples)
{
// create and clear the image
SImageData image;

// setup the canvas
ImageClear(image, COLOR_FILL);

// calculate the discrepancy
#if CALCULATE_DISCREPANCY
float discrepancy = CalculateDiscrepancy1D(samples);
printf("%s Discrepancy = %0.2f%%\n", fileName, discrepancy*100.0f);
#endif

// draw the sample points
size_t i = 0;
for (float f: samples)
{
size_t pos = size_t(f * float(IMAGE1D_WIDTH)) + IMAGE_PAD;
ImageBox(image, pos, pos + 1, IMAGE1D_CENTERY - DATA_HEIGHT / 2, IMAGE1D_CENTERY + DATA_HEIGHT / 2, DataPointColor(i));
++i;
}

// draw the axes lines. horizontal first then the two vertical
ImageBox(image, IMAGE_PAD, IMAGE_PAD + 1, IMAGE1D_CENTERY - AXIS_HEIGHT / 2, IMAGE1D_CENTERY + AXIS_HEIGHT / 2, COLOR_AXIS);
ImageBox(image, IMAGE1D_WIDTH + IMAGE_PAD, IMAGE1D_WIDTH + IMAGE_PAD + 1, IMAGE1D_CENTERY - AXIS_HEIGHT / 2, IMAGE1D_CENTERY + AXIS_HEIGHT / 2, COLOR_AXIS);

// save the image
SaveImage(fileName, image);
}

//======================================================================================
void Test2D (const char* fileName, const std::array<std::array<float,2>, NUM_SAMPLES>& samples)
{
// create and clear the image
SImageData image;

// setup the canvas
ImageClear(image, COLOR_FILL);

// calculate the discrepancy
#if CALCULATE_DISCREPANCY
float discrepancy = CalculateDiscrepancy2D(samples);
printf("%s Discrepancy = %0.2f%%\n", fileName, discrepancy*100.0f);
#endif

// draw the sample points
size_t i = 0;
for (const std::array<float, 2>& sample : samples)
{
size_t posx = size_t(sample[0] * float(IMAGE2D_WIDTH)) + IMAGE_PAD;
size_t posy = size_t(sample[1] * float(IMAGE2D_WIDTH)) + IMAGE_PAD;
ImageBox(image, posx - 1, posx + 1, posy - 1, posy + 1, DataPointColor(i));
++i;
}

// horizontal lines

// vertical lines

// save the image
SaveImage(fileName, image);
}

//======================================================================================
void TestUniform1D (bool jitter)
{
// calculate the sample points
const float c_cellSize = 1.0f / float(NUM_SAMPLES+1);
std::array<float, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i] = float(i+1) / float(NUM_SAMPLES+1);
if (jitter)
samples[i] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);
}

// save bitmap etc
if (jitter)
Test1D("1DUniformJitter.bmp", samples);
else
Test1D("1DUniform.bmp", samples);
}

//======================================================================================
void TestUniformRandom1D ()
{
// calculate the sample points
const float c_halfJitter = 1.0f / float((NUM_SAMPLES + 1) * 2);
std::array<float, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
samples[i] = RandomFloat(0.0f, 1.0f);

// save bitmap etc
Test1D("1DUniformRandom.bmp", samples);
}

//======================================================================================
void TestSubRandomA1D (size_t numRegions)
{
const float c_randomRange = 1.0f / float(numRegions);

// calculate the sample points
const float c_halfJitter = 1.0f / float((NUM_SAMPLES + 1) * 2);
std::array<float, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i] = RandomFloat(0.0f, c_randomRange);
samples[i] += float(i % numRegions) / float(numRegions);
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "1DSubRandomA_%zu.bmp", numRegions);
Test1D(fileName, samples);
}

//======================================================================================
void TestSubRandomB1D ()
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
float sample = RandomFloat(0.0f, 0.5f);
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
sample = std::fmodf(sample + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
samples[i] = sample;
}

// save bitmap etc
Test1D("1DSubRandomB.bmp", samples);
}

//======================================================================================
void TestVanDerCorput (size_t base)
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i] = 0.0f;
float denominator = float(base);
size_t n = i;
while (n > 0)
{
size_t multiplier = n % base;
samples[i] += float(multiplier) / denominator;
n = n / base;
denominator *= base;
}
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "1DVanDerCorput_%zu.bmp", base);
Test1D(fileName, samples);
}

//======================================================================================
void TestIrrational1D (float irrational, float seed)
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
float sample = seed;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
sample = std::fmodf(sample + irrational, 1.0f);
samples[i] = sample;
}

// save bitmap etc
char irrationalStr[256];
sprintf(irrationalStr, "%f", irrational);
char seedStr[256];
sprintf(seedStr, "%f", seed);
char fileName[256];
sprintf(fileName, "1DIrrational_%s_%s.bmp", &irrationalStr[2], &seedStr[2]);
Test1D(fileName, samples);
}

//======================================================================================
void TestSobol1D ()
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
size_t sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
size_t ruler = Ruler(i + 1);
size_t direction = size_t(size_t(1) << size_t(31 - ruler));
sampleInt = sampleInt ^ direction;
samples[i] = float(sampleInt) / std::pow(2.0f, 32.0f);
}

// save bitmap etc
Test1D("1DSobol.bmp", samples);
}

//======================================================================================
void TestHammersley1D (size_t truncateBits)
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
size_t sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
size_t n = i >> truncateBits;
float base = 1.0f / 2.0f;
samples[i] = 0.0f;
while (n)
{
if (n & 1)
samples[i] += base;
n /= 2;
base /= 2.0f;
}
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "1DHammersley_%zu.bmp", truncateBits);
Test1D(fileName, samples);
}

//======================================================================================
float MinimumDistance1D (const std::array<float, NUM_SAMPLES>& samples, size_t numSamples, float x)
{
// Used by poisson.
// This returns the minimum distance that point (x) is away from the sample points, from [0, numSamples).
float minimumDistance = 0.0f;
for (size_t i = 0; i < numSamples; ++i)
{
float distance = std::abs(samples[i] - x);
if (i == 0 || distance < minimumDistance)
minimumDistance = distance;
}
return minimumDistance;
}

//======================================================================================
void TestPoisson1D ()
{
// every time we want to place a point, we generate this many points and choose the one farthest away from all the other points (largest minimum distance)
const size_t c_bestOfAttempts = 100;

// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
for (size_t sampleIndex = 0; sampleIndex < NUM_SAMPLES; ++sampleIndex)
{
// generate some random points and keep the one that has the largest minimum distance from any of the existing points
float bestX = 0.0f;
float bestMinDistance = 0.0f;
for (size_t attempt = 0; attempt < c_bestOfAttempts; ++attempt)
{
float attemptX = RandomFloat(0.0f, 1.0f);
float minDistance = MinimumDistance1D(samples, sampleIndex, attemptX);

if (minDistance > bestMinDistance)
{
bestX = attemptX;
bestMinDistance = minDistance;
}
}
samples[sampleIndex] = bestX;
}

// save bitmap etc
Test1D("1DPoisson.bmp", samples);
}

//======================================================================================
void TestUniform2D (bool jitter)
{
// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
const float c_cellSize = 1.0f / float(c_oneSide+1);
for (size_t iy = 0; iy < c_oneSide; ++iy)
{
for (size_t ix = 0; ix < c_oneSide; ++ix)
{
size_t sampleIndex = iy * c_oneSide + ix;

samples[sampleIndex][0] = float(ix + 1) / (float(c_oneSide + 1));
if (jitter)
samples[sampleIndex][0] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);

samples[sampleIndex][1] = float(iy + 1) / (float(c_oneSide) + 1.0f);
if (jitter)
samples[sampleIndex][1] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);
}
}

// save bitmap etc
if (jitter)
Test2D("2DUniformJitter.bmp", samples);
else
Test2D("2DUniform.bmp", samples);
}

//======================================================================================
void TestUniformRandom2D ()
{
// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
const float c_halfJitter = 1.0f / float((c_oneSide + 1) * 2);
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i][0] = RandomFloat(0.0f, 1.0f);
samples[i][1] = RandomFloat(0.0f, 1.0f);
}

// save bitmap etc
Test2D("2DUniformRandom.bmp", samples);
}

//======================================================================================
void TestSubRandomA2D (size_t regionsX, size_t regionsY)
{
const float c_randomRangeX = 1.0f / float(regionsX);
const float c_randomRangeY = 1.0f / float(regionsY);

// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i][0] = RandomFloat(0.0f, c_randomRangeX);
samples[i][0] += float(i % regionsX) / float(regionsX);

samples[i][1] = RandomFloat(0.0f, c_randomRangeY);
samples[i][1] += float(i % regionsY) / float(regionsY);
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "2DSubRandomA_%zu_%zu.bmp", regionsX, regionsY);
Test2D(fileName, samples);
}

//======================================================================================
void TestSubRandomB2D ()
{
// calculate the sample points
float samplex = RandomFloat(0.0f, 0.5f);
float sampley = RandomFloat(0.0f, 0.5f);
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samplex = std::fmodf(samplex + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
sampley = std::fmodf(sampley + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
samples[i][0] = samplex;
samples[i][1] = sampley;
}

// save bitmap etc
Test2D("2DSubRandomB.bmp", samples);
}

//======================================================================================
void TestHalton (size_t basex, size_t basey)
{
// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
const float c_halfJitter = 1.0f / float((c_oneSide + 1) * 2);
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
// x axis
samples[i][0] = 0.0f;
{
float denominator = float(basex);
size_t n = i;
while (n > 0)
{
size_t multiplier = n % basex;
samples[i][0] += float(multiplier) / denominator;
n = n / basex;
denominator *= basex;
}
}

// y axis
samples[i][1] = 0.0f;
{
float denominator = float(basey);
size_t n = i;
while (n > 0)
{
size_t multiplier = n % basey;
samples[i][1] += float(multiplier) / denominator;
n = n / basey;
denominator *= basey;
}
}
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "2DHalton_%zu_%zu.bmp", basex, basey);
Test2D(fileName, samples);
}

//======================================================================================
void TestSobol2D ()
{
// calculate the sample points

// x axis
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
size_t sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
size_t ruler = Ruler(i + 1);
size_t direction = size_t(size_t(1) << size_t(31 - ruler));
sampleInt = sampleInt ^ direction;
samples[i][0] = float(sampleInt) / std::pow(2.0f, 32.0f);
}

// y axis
// uses numbers: new-joe-kuo-6.21201

// Direction numbers
std::vector<size_t> V;
V.resize((size_t)ceil(log((double)NUM_SAMPLES) / log(2.0)));
V[0] = size_t(1) << size_t(31);
for (size_t i = 1; i < V.size(); ++i)
V[i] = V[i - 1] ^ (V[i - 1] >> 1);

// Samples
sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i) {
size_t ruler = Ruler(i + 1);
sampleInt = sampleInt ^ V[ruler];
samples[i][1] = float(sampleInt) / std::pow(2.0f, 32.0f);
}

// save bitmap etc
Test2D("2DSobol.bmp", samples);
}

//======================================================================================
void TestHammersley2D (size_t truncateBits)
{
// figure out how many bits we are working in.
size_t value = 1;
size_t numBits = 0;
while (value < NUM_SAMPLES)
{
value *= 2;
++numBits;
}

// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
size_t sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
// x axis
samples[i][0] = 0.0f;
{
size_t n = i >> truncateBits;
float base = 1.0f / 2.0f;
while (n)
{
if (n & 1)
samples[i][0] += base;
n /= 2;
base /= 2.0f;
}
}

// y axis
samples[i][1] = 0.0f;
{
size_t n = i >> truncateBits;
size_t mask = size_t(1) << (numBits - 1 - truncateBits);

float base = 1.0f / 2.0f;
{
samples[i][1] += base;
base /= 2.0f;
}
}
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "2DHammersley_%zu.bmp", truncateBits);
Test2D(fileName, samples);
}

//======================================================================================
void TestRooks2D ()
{
// make and shuffle rook positions
std::random_device rd;
std::mt19937 mt(rd());
std::array<size_t, NUM_SAMPLES> rookPositions;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
rookPositions[i] = i;
std::shuffle(rookPositions.begin(), rookPositions.end(), mt);

// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
// x axis
samples[i][0] = float(rookPositions[i]) / float(NUM_SAMPLES-1);

// y axis
samples[i][1] = float(i) / float(NUM_SAMPLES - 1);
}

// save bitmap etc
Test2D("2DRooks.bmp", samples);
}

//======================================================================================
void TestIrrational2D (float irrationalx, float irrationaly, float seedx, float seedy)
{
// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
float samplex = seedx;
float sampley = seedy;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samplex = std::fmodf(samplex + irrationalx, 1.0f);
sampley = std::fmodf(sampley + irrationaly, 1.0f);

samples[i][0] = samplex;
samples[i][1] = sampley;
}

// save bitmap etc
char irrationalxStr[256];
sprintf(irrationalxStr, "%f", irrationalx);
char irrationalyStr[256];
sprintf(irrationalyStr, "%f", irrationaly);
char seedxStr[256];
sprintf(seedxStr, "%f", seedx);
char seedyStr[256];
sprintf(seedyStr, "%f", seedy);
char fileName[256];
sprintf(fileName, "2DIrrational_%s_%s_%s_%s.bmp", &irrationalxStr[2], &irrationalyStr[2], &seedxStr[2], &seedyStr[2]);
Test2D(fileName, samples);
}

//======================================================================================
float MinimumDistance2D (const std::array<std::array<float, 2>, NUM_SAMPLES>& samples, size_t numSamples, float x, float y)
{
// Used by poisson.
// This returns the minimum distance that point (x,y) is away from the sample points, from [0, numSamples).
float minimumDistance = 0.0f;
for (size_t i = 0; i < numSamples; ++i)
{
float distance = Distance(samples[i][0], samples[i][1], x, y);
if (i == 0 || distance < minimumDistance)
minimumDistance = distance;
}
return minimumDistance;
}

//======================================================================================
void TestPoisson2D ()
{
// every time we want to place a point, we generate this many points and choose the one farthest away from all the other points (largest minimum distance)
const size_t c_bestOfAttempts = 100;

// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
for (size_t sampleIndex = 0; sampleIndex < NUM_SAMPLES; ++sampleIndex)
{
// generate some random points and keep the one that has the largest minimum distance from any of the existing points
float bestX = 0.0f;
float bestY = 0.0f;
float bestMinDistance = 0.0f;
for (size_t attempt = 0; attempt < c_bestOfAttempts; ++attempt)
{
float attemptX = RandomFloat(0.0f, 1.0f);
float attemptY = RandomFloat(0.0f, 1.0f);
float minDistance = MinimumDistance2D(samples, sampleIndex, attemptX, attemptY);

if (minDistance > bestMinDistance)
{
bestX = attemptX;
bestY = attemptY;
bestMinDistance = minDistance;
}
}
samples[sampleIndex][0] = bestX;
samples[sampleIndex][1] = bestY;
}

// save bitmap etc
Test2D("2DPoisson.bmp", samples);
}

//======================================================================================
int main (int argc, char **argv)
{
// 1D tests
{
TestUniform1D(false);
TestUniform1D(true);

TestUniformRandom1D();

TestSubRandomA1D(2);
TestSubRandomA1D(4);
TestSubRandomA1D(8);
TestSubRandomA1D(16);
TestSubRandomA1D(32);

TestSubRandomB1D();

TestVanDerCorput(2);
TestVanDerCorput(3);
TestVanDerCorput(4);
TestVanDerCorput(5);

// golden ratio mod 1 aka (sqrt(5) - 1)/2
TestIrrational1D(0.618034f, 0.0f);
TestIrrational1D(0.618034f, 0.385180f);
TestIrrational1D(0.618034f, 0.775719f);
TestIrrational1D(0.618034f, 0.287194f);

// sqrt(2) - 1
TestIrrational1D(0.414214f, 0.0f);
TestIrrational1D(0.414214f, 0.385180f);
TestIrrational1D(0.414214f, 0.775719f);
TestIrrational1D(0.414214f, 0.287194f);

// PI mod 1
TestIrrational1D(0.141593f, 0.0f);
TestIrrational1D(0.141593f, 0.385180f);
TestIrrational1D(0.141593f, 0.775719f);
TestIrrational1D(0.141593f, 0.287194f);

TestSobol1D();

TestHammersley1D(0);
TestHammersley1D(1);
TestHammersley1D(2);

TestPoisson1D();
}

// 2D tests
{
TestUniform2D(false);
TestUniform2D(true);

TestUniformRandom2D();

TestSubRandomA2D(2, 2);
TestSubRandomA2D(2, 3);
TestSubRandomA2D(3, 11);
TestSubRandomA2D(3, 97);

TestSubRandomB2D();

TestHalton(2, 3);
TestHalton(5, 7);
TestHalton(13, 9);

TestSobol2D();

TestHammersley2D(0);
TestHammersley2D(1);
TestHammersley2D(2);

TestRooks2D();

// X axis = golden ratio mod 1 aka (sqrt(5)-1)/2
// Y axis = sqrt(2) mod 1
TestIrrational2D(0.618034f, 0.414214f, 0.0f, 0.0f);
TestIrrational2D(0.618034f, 0.414214f, 0.775719f, 0.264045f);

// X axis = sqrt(2) mod 1
// Y axis = sqrt(3) mod 1
TestIrrational2D(std::fmodf((float)std::sqrt(2.0f), 1.0f), std::fmodf((float)std::sqrt(3.0f), 1.0f), 0.0f, 0.0f);
TestIrrational2D(std::fmodf((float)std::sqrt(2.0f), 1.0f), std::fmodf((float)std::sqrt(3.0f), 1.0f), 0.775719f, 0.264045f);

TestPoisson2D();
}

#if CALCULATE_DISCREPANCY
printf("\n");
system("pause");
#endif
}


# How to Train Neural Networks With Backpropagation

This post is an attempt to demystify backpropagation, which is the most common method for training neural networks. This post is broken into a few main sections:

1. Explanation
2. Working through examples
3. Simple sample C++ source code using only standard includes
4. Links to deeper resources to continue learning

Let’s talk about the basics of neural nets to start out, specifically multi layer perceptrons. This is a common type of neural network, and is the type we will be talking about today. There are other types of neural networks though such as convolutional neural networks, recurrent neural networks, Hopfield networks and more. The good news is that backpropagation applies to most other types of neural networks too, so what you learn here will be applicable to other types of networks.

# Basics of Neural Networks

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron. This network takes 3 floating point values as input, passes them through 4 neurons in a hidden layer and outputs two floating point values. The hidden layer neurons and the output layer neurons do processing of the values they are giving, but the input neurons do not.

To calculate the output value of a single neuron, you multiply every input into that neuron by a weight for that input, sum them up, and add a bias that is set for the neuron. This “weighted input” value is fed into an activation function and the result is the output value of that neuron. Here is a diagram for a single neuron:

The code for calculating the output of a single neuron could look like this:

float weightedInput = bias;

for (int i = 0; i < inputs.size(); ++i)
weightedInput += inputs[i] * weights[i];

float output = Activation(weightedInput);


To evaluate an entire network of neurons, you just repeat this process for all neurons in the network, going from left to right (from input to output).

Neural networks are basically black boxes. We train them to give specific ouputs when we give them specific inputs, but it is often difficult to understand what it is that they’ve learned, or what part of the data they are picking up on.

Training a neural network just means that we adjust the weight and bias values such that when we give specific inputs, we get the desired outputs from the network. Being able to figure out what weights and biases to use can be tricky, especially for networks with lots of layers and lots of neurons per layer. This post talks about how to do just that.

Regarding training, there is a funny story where some people trained a neural network to say whether or not a military tank was in a photograph. It had a very high accuracy rate with the test data they trained it with, but when they used it with new data, it had terrible accuracy. It turns out that the training data was a bit flawed. Pictures of tanks were all taken on a sunny day, and the pictures without tanks were taken on a cloudy day. The network learned how to detect whether a picture was of a sunny day or a cloudy day, not whether there was a tank in the photo or not!

This is one type of pitfall to watch out for when dealing with neural networks – having good training data – but there are many other pitfalls to watch out for too. Architecting and training neural networks is quite literally an art form. If it were painting, this post would be teaching you how to hold a brush and what the primary colors are. There are many, many techniques to learn beyond what is written here to use as tools in your toolbox. The information in this post will allow you to succeed in training neural networks, but there is a lot more to learn to get higher levels of accuracy from your nets!

# Neural Networks Learn Using Gradient Descent

Let’s take a look at a simple neural network where we’ve chosen random values for the weights and the bias:

If given two floating point inputs, we’d calculate the output of the network like this:

$Output = Activation(Input0 * Weight0 + Input1 * Weight1 + Bias)$

Plugging in the specific values for the weights and biases, it looks like this:

$Output = Activation(Input0 * 0.23 + Input1 * -0.1 + 0.3)$

Let’s say that we want this network to output a zero when we give an input of 1,0, and that we don’t care what it outputs otherwise. We’ll plug 1 and 0 in for Input0 and Input1 respectively and see what the output of the network is right now:

$Output = Activation(1* 0.23 + 0 * -0.1 + 0.3) \\ Output = Activation(0.53)$

For the activation function, we are going to use a common one called the sigmoid activation function, which is also sometimes called the logistic activation function. It looks like this:

$\sigma(x) = \frac{1}{1+e^{-x}}$

Without going into too much detail, the reason why sigmoid is so commonly used is because it’s a smoother and differentiable version of the step function.

Applying that activation function to our output neuron, we get this:

$Output = Activation(0.53) \\ Output = \sigma(0.53) \\ Output = 0.6295$

So, we plugged in 1 and 0, but instead of getting a 0 out, we got 0.6295. Our weights and biases are wrong, but how do we correct them?

The secret to correcting our weights and biases is whichever of these terms seem least scary to you: slopes, derivatives, gradients.

If “slope” was the least scary term to you, you probably remember the line formula $y=mx+b$ and that the m value was the “rise over run” or the slope of the line. Well believe it or not, that’s all a derivative is. A derivative is just the slope of a function at a specific point on that function. Even if a function is curved, you can pick a point on the graph and get a slope at that point. The notation for a derivative is $\frac{dy}{dx}$, which literally means “change in y divided by change in x”, or “delta y divided by delta x”, which is literally rise over run.

In the case of a linear function (a line), it has the same derivative over the entire thing, so you can take a step size of any size on the x axis and multiply that step size by $\frac{dy}{dx}$ to figure out how much to add or subtract from y to stay on the line.

In the case of a non linear function, the derivative can change from one point to the next, so this slope is only guaranteed to be accurate for an infinitely small step size. In practice, people just often use “small” step sizes and calling it good enough, which is what we’ll be doing momentarily.

Now that you realize you already knew what a derivative is, we have to talk about partial derivatives. There really isn’t anything very scary about them and they still mean the exact same thing – they are the slope! They are even calculated the exact same way, but they use a fancier looking d in their notation: $\frac{\partial y}{\partial x}$.

The reason partial derivatives even exist is because if you have a function of multiple variables like $z=f(x,y)=x^2+3y+2$, you have two variables that you can take the derivative of. You can calculate $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$. The first value tells you how much the z value changes for a change in x, the second value tells you how much the z value changes for a change in y.

By the way, if you are curious, the partial derivatives for that function above are below. When calculating partial derivatives, any variable that isn’t the one you care about, you just treat as a constant and do normal derivation.

$\frac{\partial z}{\partial x} = 2x\\ \frac{\partial z}{\partial y} = 3\\$

If you put both of those values together into a vector $(\frac{\partial z}{\partial x},\frac{\partial z}{\partial y})$ you have what is called the gradient vector.

The gradient vector has an interesting property, which is that it points in the direction that makes the function output grow the most. Basically, if you think of your function as a surface, it points up the steepest direction of the surface, from the point you evaluated the function at.

We are going to use that property to train our neural network by doing the following:

1. Calculate the gradient of a function that describes the error in our network. This means we will have the partial derivatives of all the weights and biases in the network.
2. Multiply the gradient by a small “learning rate” value, such as 0.05
3. Subtract these scaled derivatives from the weights and biases to decrease the error a small amount.

This technique is called steepest gradient descent (SGD) and when we do the above, our error will decrease by a small amount. The only exception is that if we use too large of a learning rate, it’s possible that we make the error grow, but usually the error will decrease.

We will do the above over and over, until either the error is small enough, or we’ve decided we’ve tried enough iterations that we think the neural network is never going to learn the things we want to teach it. If the network doesn’t learn, it means it needs to be re-architected with a different structure, different numbers of neurons and layers, different activation functions, etc. This is part of the “art” that I mentioned earlier.

Before moving on, there is one last thing to talk about: global minimums vs local minimums.

Imagine that the function describing the error in our network is visualized as bumpy ground. When we initialize our weights and biases to random numbers we are basically just choosing a random location on the ground to start at. From there, we act like a ball, and just roll down hill from wherever we are. We are definitely going to get to the bottom of SOME bump / hole in the ground, but there is absolutely no reason to except that we’ll get to the bottom of the DEEPEST bump / hole.

The problem is that SGD will find a LOCAL minimum – whatever we are closest too – but it might not find the GLOBAL minimum.

In practice, this doesn’t seem to be too large of a problem, at least for people casually using neural nets like you and me, but it is one of the active areas of research in neural networks: how do we do better at finding more global minimums?

You might notice the strange language I’m using where I say we have a function that describes the error, instead of just saying we use the error itself. The function I’m talking about is called the “cost function” and the reason for this is that different ways of describing the error give us different desirable properties.

For instance, a common cost function is to use mean squared error of the actual output compared to the desired output.

For a single training example, you plug the input into the network and calculate the output. You then plug the actual output and the target output into the function below:

$Cost = ||target-output||^2$

In other words, you take the vector of the neuron outputs, subtract it from the actual output that we wanted, calculate the length of the resulting vector and square it. This gives you the squared error.

The reason we use squared error in the cost function is because this way error in either direction is a positive number, so when gradient descent does it’s work, we’ll find the smallest magnitude of error, regardless of whether it’s positive or negative amounts. We could use absolute value, but absolute value isn’t differentiable, while squaring is.

To handle calculating the cost of multiple inputs and outputs, you just take the average of the squared error for each piece of training data. This gives you the mean squared error as the cost function across all inputs. You also average the derivatives to get the combined gradient.

# More on Training

Before we go into backpropagation, I want to re-iterate this point: Neural Networks Learn Using Gradient Descent.

All you need is the gradient vector of the cost function, aka the partial derivatives of all the weights and the biases for the cost.

Backpropagation gets you the gradient vector, but it isn’t the only way to do so!

Another way to do it is to use dual numbers which you can read about on my post about them: Multivariable Dual Numbers & Automatic Differentiation.

Using dual numbers, you would evaluate the output of the network, using dual numbers instead of floating point numbers, and at the end you’d have your gradient vector. It’s not quite as efficient as backpropagation (or so I’ve heard, I haven’t tried it), but if you know how dual numbers work, it’s super easy to implement.

Another way to get the gradient vector is by doing so numerically using finite differences. You can read about numerical derivatives on my post here: Finite Differences

Basically what you would do is if you were trying to calculate the partial derivative of a weight, like $\frac{\partial Cost}{\partial Weight0}$, you would first calculate the cost of the network as usual, then you would add a small value to Weight0 and evaluate the cost again. You subtract the new cost from the old cost, and divide by the small value you added to Weight0. This will give you the partial derivative for that weight value. You’d repeat this for all your weights and biases.

Since realistic neural networks often have MANY MANY weights and biases, calculating the gradient numerically is a REALLY REALLY slow process because of how many times you have to run the network to get cost values with adjusted weights. The only upside is that this method is even easier to implement than dual numbers. You can literally stop reading and go do this right now if you want to 😛

Lastly, there is a way to train neural networks which doesn’t use derivatives or the gradient vector, but instead uses the more brute force-ish method of genetic algorithms.

Using genetic algorithms to train neural networks is a huge topic even to summarize, but basically you create a bunch of random networks, see how they do, and try combining features of networks that did well. You also let some of the “losers” reproduce as well, and add in some random mutation to help stay out of local minimums. Repeat this for many many generations, and you can end up with a well trained network!

Here’s a fun video visualizing neural networks being trained by genetic algorithms: Youtube: Learning using a genetic algorithm on a neural network

# Backpropagation is Just the Chain Rule!

Going back to our talk of dual numbers for a second, dual numbers are useful for what is called “forward mode automatic differentiation”.

Backpropagation actually uses “reverse mode automatic differentiation”, so the two techniques are pretty closely tied, but they are both made possible by what is known as the chain rule.

The chain rule basically says that if you can write a derivative like this:

$dy/dx$

That you can also write it like this:

$dy/du*du/dx$

That might look weird or confusing, but since we know that derivatives are actual values, aka actual ratios, aka actual FRACTIONS, let’s think back to fractions for a moment.

$3/2 = 1.5$

So far so good? Now let’s choose some number out of the air – say, 5 – and do the same thing we did with the chain rule
$3/2 = \\ 3/5 * 5/2 = \\ 15/10 = \\ 3/2 = \\ 1.5$

Due to doing the reverse of cross cancellation, we are able to inject multiplicative terms into fractions (and derivatives!) and come up with the same answer.

Ok, but who cares?

Well, when we are evaluating the output of a neural network for given input, we have lots of equations nested in each other. We have neurons feeding into neurons feeding into neurons etc, with the logistic activation function at each step.

Instead of trying to figure out how to calculate the derivatives of the weights and biases for the entire monster equation (it’s common to have hundreds or thousands of neurons or more!), we can instead calculate derivatives for each step we do when evaluating the network and then compose them together.

Basically, we can break the problem into small bites instead of having to deal with the equation in it’s entirety.

Instead of calculating the derivative of how a specific weight affects the cost directly, we can instead calculate these:

1. dCost/dOutput: The derivative of how a neuron’s output affects cost
2. dOutput/dWeightedInput: The derivative of how the weighted input of a neuron affects a neuron’s output
3. dWeightedInput/dWeight: The derivative of how a weight affects the weighted input of a neuron

Then, when we multiply them all together, we get the real value that we want:
dCost/dOutput * dOutput/dWeightedInput * dWeightedInput/dWeight = dCost/dWeight

Now that we understand all the basic parts of back propagation, I think it’d be best to work through some examples of increasing complexity to see how it all actually fits together!

# Backpropagation Example 1: Single Neuron, One Training Example

This example takes one input and uses a single neuron to make one output. The neuron is only trained to output a 0 when given a 1 as input, all other behavior is undefined. This is implemented as the Example1() function in the sample code.

# Backpropagation Example 2: Single Neuron, Two Training Examples

This time, we are going to teach it not only that it should output 0 when given a 1, but also that it should output 1 when given a 0.

We have two training examples, and we are training the neuron to act like a NOT gate. This is implemented as the Example2() function in the sample code.

The first thing we do is calculate the derivatives (gradient vector) for each of the inputs.

We already calculated the “input 1, output 0” derivatives in the last example:
$\frac{\partial Cost}{\partial Weight} = 0.1476 \\ \frac{\partial Cost}{\partial Bias} = 0.1476$

If we follow the same steps with the “input 0, output 1” training example we get these:
$\frac{\partial Cost}{\partial Weight} = 0.0 \\ \frac{\partial Cost}{\partial Bias} = -0.0887$

To get the actual derivatives to train the network with, we just average them!
$\frac{\partial Cost}{\partial Weight} = 0.0738 \\ \frac{\partial Cost}{\partial Bias} = 0.0294$

From there, we do the same adjustments as before to the weight and bias values to get a weight of 0.2631 and a bias of 0.4853.

If you are wondering how to calculate the cost, again you just take the cost of each training example and average them. Adjusting the weight and bias values causes the cost to drop from 0.1547 to 0.1515, so we have made progress.

It takes 10 times as many iterations with these two training examples to get the same level of error as it did with only one training example though.

As we saw in the last example, after 10,000 iterations, the error was 0.007176.

In this example, after 100,000 iterations, the error is 0.007141. At that point, weight is -9.879733 and bias is 4.837278

# Backpropagation Example 3: Two Neurons in One Layer

Here is the next example, implemented as Example3() in the sample code. Two input neurons feed to two neurons in a single layer giving two outputs.

Let’s look at how we’d calculate the derivatives needed to train this network using the training example that when we give the network 01 as input that it should give out 10 as output.

First comes the forward pass where we calculate the network’s output when we give it 01 as input.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0*0.2+1*0.8+0.5 \\ Z0=1.3 \\ \\ O0=\sigma(1.3) \\ O0=0.7858\\ \\ Z1=input0*weight2+input0*weight3+bias1\\ Z1=0*0.6+1*0.4+0.1\\ Z1=0.5\\ \\ O1=\sigma(0.5)\\ O1=0.6225$

Next we calculate a cost. We don’t strictly need to do this step since we don’t use this value during backpropagation, but this will be useful to verify that we’ve improved things after an iteration of training.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(1,0)-(0.7858,0.6225)||^2\\ Cost=0.5*||(0.2142,-0.6225)||^2\\ Cost=0.5*0.6583^2\\ Cost=0.2167$

Now we begin the backwards pass to calculate the derivatives that we’ll need for training.

Let’s calculate dCost/dZ0 aka the error in neuron 0. We’ll do this by calculating dCost/dO0, then dO0/dZ0 and then multiplying them together to get dCost/dZ0. Just like before, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias0.

$\frac{\partial Cost}{\partial O0}=O0-target0\\ \frac{\partial Cost}{\partial O0}=0.7858-1\\ \frac{\partial Cost}{\partial O0}=-0.2142\\ \\ \frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.7858 * 0.2142\\ \frac{\partial O0}{\partial Z0} = 0.1683\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = -0.2142 * 0.1683\\ \frac{\partial Cost}{\partial Z0} = -0.0360\\ \\ \frac{\partial Cost}{\partial Bias0} = -0.0360$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1 by multiplying it by dZ0/dWeight0 and dZ0/dWeight1, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0} \\ \frac{\partial Cost}{\partial Weight0} = -0.0360 * 0 \\ \frac{\partial Cost}{\partial Weight0} = 0\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1} \\ \frac{\partial Cost}{\partial Weight1} = -0.0360 * 1 \\ \frac{\partial Cost}{\partial Weight1} = -0.0360$

Next we need to calculate dCost/dZ1 aka the error in neuron 1. We’ll do this like before. We’ll calculate dCost/dO1, then dO1/dZ1 and then multiplying them together to get dCost/dZ1. Again, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias1.

$\frac{\partial Cost}{\partial O1}=O1-target1\\ \frac{\partial Cost}{\partial O1}=0.6225-0\\ \frac{\partial Cost}{\partial O1}=0.6225\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.6225 * 0.3775\\ \frac{\partial O1}{\partial Z1} = 0.235\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.6225 * 0.235\\ \frac{\partial Cost}{\partial Z1} = 0.1463\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.1463$

Just like with neuron 0, we can use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3 by multiplying it by dZ1/dWeight2 and dZ1/dWeight2, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2} \\ \frac{\partial Cost}{\partial Weight2} = 0.1463 * 0 \\ \frac{\partial Cost}{\partial Weight2} = 0\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3} \\ \frac{\partial Cost}{\partial Weight3} = 0.1463 * 1 \\ \frac{\partial Cost}{\partial Weight3} = 0.1463$

After using these derivatives to update the weights and biases with a learning rate of 0.5, they become:
Weight0 = 0.2
Weight1 = 0.818
Weight2 = 0.6
Weight3 = 0.3269
Bias0 = 0.518
Bias1 = 0.0269

Using these values, the cost becomes 0.1943, which dropped from 0.2167, so we have indeed made progress with our learning!

Interestingly, it takes about twice as many trainings as example 1 to get a similar level of error. In this case, 20,000 iterations of learning results in an error of 0.007142.

If we have the network learn the four patterns below instead:
00 = 00
01 = 10
10 = 10
11 = 11

It takes 520,000 learning iterations to get to an error of 0.007223.

# Backpropagation Example 4: Two Layers, Two Neurons Each

This is the last example, implemented as Example4() in the sample code. Two input neurons feed to two neurons in a hidden layer, feeding into two neurons in the output layer giving two outputs. This is the exact same network that is walked through on this page which is also linked to at the end of this post: A Step by Step Backpropagation Example

First comes the forward pass where we calculate the network’s output. We’ll give it 0.05 and 0.1 as input, and we’ll say our desired output is 0.01 and 0.99.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0.05*0.15+0.1*0.2+0.35 \\ Z0=0.3775 \\ \\ O0=\sigma(0.3775) \\ O0=0.5933 \\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0.05*0.25+0.1*0.3+0.35\\ Z1=0.3925\\ \\ O1=\sigma(0.3925)\\ O1=0.5969\\ \\ Z2=O0*weight4+O1*weight5+bias2\\ Z2=0.5933*0.4+0.5969*0.45+0.6\\ Z2=1.106\\ \\ O2=\sigma(1.106)\\ O2=0.7514\\ \\ Z3=O0*weight6+O1*weight7+bias3\\ Z3=0.5933*0.5+0.5969*0.55+0.6\\ Z3=1.225\\ \\ O3=\sigma(1.225)\\ O3=0.7729$

Next we calculate the cost, taking O2 and O3 as our actual output, and 0.01 and 0.99 as our target (desired) output.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(0.01,0.99)-(0.7514,0.7729)||^2\\ Cost=0.5*||(-0.7414,-0.2171)||^2\\ Cost=0.5*0.7725^2\\ Cost=0.2984$

Now we start the backward pass to calculate the derivatives for training.

## Neuron 2

First we’ll calculate dCost/dZ2 aka the error in neuron 2, remembering that the value is also dCost/dBias2.

$\frac{\partial Cost}{\partial O2}=O2-target0\\ \frac{\partial Cost}{\partial O2}=0.7514-0.01\\ \frac{\partial Cost}{\partial O2}=0.7414\\ \\ \frac{\partial O2}{\partial Z2} = O2 * (1-O2)\\ \frac{\partial O2}{\partial Z2} = 0.7514 * 0.2486\\ \frac{\partial O2}{\partial Z2} = 0.1868\\ \\ \frac{\partial Cost}{\partial Z2} = \frac{\partial Cost}{\partial O2} * \frac{\partial O2}{\partial Z2}\\ \frac{\partial Cost}{\partial Z2} = 0.7414 * 0.1868\\ \frac{\partial Cost}{\partial Z2} = 0.1385\\ \\ \frac{\partial Cost}{\partial Bias2} = 0.1385$

We can use dCost/dZ2 to calculate dCost/dWeight4 and dCost/dWeight5.

$\frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight4}\\ \frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * O0\\ \frac{\partial Cost}{\partial Weight4} = 0.1385 * 0.5933\\ \frac{\partial Cost}{\partial Weight4} = 0.0822\\ \\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight5}\\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * O1\\ \frac{\partial Cost}{\partial Weight5} = 0.1385 * 0.5969\\ \frac{\partial Cost}{\partial Weight5} = 0.0827\\$

## Neuron 3

Next we’ll calculate dCost/dZ3 aka the error in neuron 3, which is also dCost/dBias3.

$\frac{\partial Cost}{\partial O3}=O3-target1\\ \frac{\partial Cost}{\partial O3}=0.7729-0.99\\ \frac{\partial Cost}{\partial O3}=-0.2171\\ \\ \frac{\partial O3}{\partial Z3} = O3 * (1-O3)\\ \frac{\partial O3}{\partial Z3} = 0.7729 * 0.2271\\ \frac{\partial O3}{\partial Z3} = 0.1755\\ \\ \frac{\partial Cost}{\partial Z3} = \frac{\partial Cost}{\partial O3} * \frac{\partial O3}{\partial Z3}\\ \frac{\partial Cost}{\partial Z3} = -0.2171 * 0.1755\\ \frac{\partial Cost}{\partial Z3} = -0.0381\\ \\ \frac{\partial Cost}{\partial Bias3} = -0.0381$

We can use dCost/dZ3 to calculate dCost/dWeight6 and dCost/dWeight7.

$\frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight6}\\ \frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * O0\\ \frac{\partial Cost}{\partial Weight6} = -0.0381 * 0.5933\\ \frac{\partial Cost}{\partial Weight6} = -0.0226\\ \\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight7}\\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * O1\\ \frac{\partial Cost}{\partial Weight7} = -0.0381 * 0.5969\\ \frac{\partial Cost}{\partial Weight7} = -0.0227\\$

## Neuron 0

Next, we want to calculate dCost/dO0, but doing that requires us to do something new. Neuron 0 affects both neuron 2 and neuron 3, which means that it affects the cost through those two neurons as well. That means our calculation for dCost/dO0 is going to be slightly different, where we add the derivatives of both paths together. Let’s work through it:

$\frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O0} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O0}\\ \frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * Weight4 + \frac{\partial Cost}{\partial Z3} * Weight6\\ \frac{\partial Cost}{\partial O0} = 0.1385 * 0.4 - 0.0381 * 0.5\\ \frac{\partial Cost}{\partial O0} = 0.0364$

We can then continue and calculate dCost/dZ0, which is also dCost/dBias0, and the error in neuron 0.

$\frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.5933 * 0.4067\\ \frac{\partial O0}{\partial Z0} = 0.2413\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = 0.0364 * 0.2413\\ \frac{\partial Cost}{\partial Z0} = 0.0088\\ \\ \frac{\partial Cost}{\partial Bias0} = 0.0088$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0}\\ \frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * input0\\ \frac{\partial Cost}{\partial Weight0} = 0.0088 * 0.05\\ \frac{\partial Cost}{\partial Weight0} = 0.0004\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1}\\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * input1\\ \frac{\partial Cost}{\partial Weight1} = 0.0088 * 0.1\\ \frac{\partial Cost}{\partial Weight1} = 0.0009\\$

## Neuron 1

We are almost done, so hang in there. For our home stretch, we need to calculate dCost/dO1 similarly as we did for dCost/dO0, and then use that to calculate the derivatives of bias1 and weight2 and weight3.

$\frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O1} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O1}\\ \frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * Weight5 + \frac{\partial Cost}{\partial Z3} * Weight7\\ \frac{\partial Cost}{\partial O1} = 0.1385 * 0.45 - 0.0381 * 0.55\\ \frac{\partial Cost}{\partial O1} = 0.0414\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.5969 * 0.4031\\ \frac{\partial O1}{\partial Z1} = 0.2406\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.0414 * 0.2406\\ \frac{\partial Cost}{\partial Z1} = 0.01\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.01$

Lastly, we will use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2}\\ \frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * input0\\ \frac{\partial Cost}{\partial Weight2} = 0.01 * 0.05\\ \frac{\partial Cost}{\partial Weight2} = 0.0005\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3}\\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * input1\\ \frac{\partial Cost}{\partial Weight3} = 0.01 * 0.1\\ \frac{\partial Cost}{\partial Weight3} = 0.001\\$

## Backpropagation Done

Phew, we have all the derivatives we need now.

Here’s our new weights and biases using a learning rate of 0.5:

Weight0 = 0.15 – (0.5 * 0.0004) = 0.1498
Weight1 = 0.2 – (0.5 * 0.0009) = 0.1996
Weight2 = 0.25 – (0.5 * 0.0005) = 0.2498
Weight3 = 0.3 – (0.5 * 0.001) = 0.2995
Weight4 = 0.4 – (0.5 * 0.0822) = 0.3589
Weight5 = 0.45 – (0.5 * 0.0827) = 0.4087
Weight6 = 0.5 – (0.5 * -0.0226) = 0.5113
Weight7 = 0.55 – (0.5 * -0.0227) = 0.5614
Bias0 = 0.35 – (0.5 * 0.0088) = 0.3456
Bias1 = 0.35 – (0.5 * 0.01) = 0.345
Bias2 = 0.6 – (0.5 * 0.1385) = 0.5308
Bias3 = 0.6 – (0.5 * -0.0381) = 0.6191

Using these new values, the cost function value drops from 0.2984 to 0.2839, so we have made progress!

Interestingly, it only takes 5,000 iterations of learning for this network to reach an error of 0.007157, when it took 10,000 iterations of learning for example 1 to get to 0.007176.

Before moving on, take a look at the weight adjustments above. You might notice that the derivatives for the weights are much smaller for weights 0,1,2,3 compared to weights 4,5,6,7. The reason for this is because weights 0,1,2,3 appear earlier in the network. The problem is that earlier layer neurons don’t learn as fast as later layer neurons and this is caused by the nature of the neuron activation functions – specifically, that the sigmoid function has a long tail near 0 and 1 – and is called the “vanishing gradient problem”. The opposite effect can also happen however, where earlier layer gradients explode to super huge numbers, so the more general term is called the “unstable gradient problem”. This is an active area of research on how to address, and this becomes more and more of a problem the more layers you have in your network.

You can use other activation functions such as tanh, identity, relu and others to try and get around this problem. If trying different activation functions, the forward pass (evaluation of a neural network) as well as the backpropagation of error pass remain the same, but of course the calculation for getting O from Z changes, and of course, calculating the derivative deltaO/deltaZ becomes different. Everything else remains the same.

# Sample Code

Below is the sample code which implements all the back propagation examples we worked through above.

Note that this code is meant to be readable and understandable. The code is not meant to be re-usable or highly efficient.

A more efficient implementation would use SIMD instructions, multithreading, stochastic gradient descent, and other things.

It’s also useful to note that calculating a neuron’s Z value is actually a dot product and an addition and that the addition can be handled within the dot product by adding a “fake input” to each neuron that is a constant of 1. This lets you do a dot product to calculate the Z value of a neuron, which you can take further and combine into matrix operations to calculate multiple neuron values at once. You’ll often see neural networks described in matrix notation because of this, but I have avoided that in this post to try and make things more clear to programmers who may not be as comfortable thinking in strictly matrix notation.

#include <stdio.h>
#include <array>

// Nonzero value enables csv logging.
#define LOG_TO_CSV_NUMSAMPLES() 50

// ===== Example 1 - One Neuron, One training Example =====

void Example1RunNetwork (
float input, float desiredOutput,
float weight, float bias,
float& error, float& cost, float& actualOutput,
float& deltaCost_deltaWeight, float& deltaCost_deltaBias, float& deltaCost_deltaInput
) {
// calculate Z (weighted input) and O (activation function of weighted input) for the neuron
float Z = input * weight + bias;
float O = 1.0f / (1.0f + std::exp(-Z));

// the actual output of the network is the activation of the neuron
actualOutput = O;

// calculate error
error = std::abs(desiredOutput - actualOutput);

// calculate cost
cost = 0.5f * error * error;

// calculate how much a change in neuron activation affects the cost function
// deltaCost/deltaO = O - target
float deltaCost_deltaO = O - desiredOutput;

// calculate how much a change in neuron weighted input affects neuron activation
// deltaO/deltaZ = O * (1 - O)
float deltaO_deltaZ = O * (1 - O);

// calculate how much a change in a neuron's weighted input affects the cost function.
// This is deltaCost/deltaZ, which equals deltaCost/deltaO * deltaO/deltaZ
// This is also deltaCost/deltaBias and is also refered to as the error of the neuron
float neuronError = deltaCost_deltaO * deltaO_deltaZ;
deltaCost_deltaBias = neuronError;

// calculate how much a change in the weight affects the cost function.
// deltaCost/deltaWeight = deltaCost/deltaO * deltaO/deltaZ * deltaZ/deltaWeight
// deltaCost/deltaWeight = neuronError * deltaZ/deltaWeight
// deltaCost/deltaWeight = neuronError * input
deltaCost_deltaWeight = neuronError * input;

// As a bonus, calculate how much a change in the input affects the cost function.
// Follows same logic as deltaCost/deltaWeight, but deltaZ/deltaInput is the weight.
// deltaCost/deltaInput = neuronError * weight
deltaCost_deltaInput = neuronError * weight;
}

void Example1 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example1.csv","w+t");
if (file != nullptr)
fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 10000;

// training data
// input: 1, output: 0
const std::array<float, 2> c_trainingData = {1.0f, 0.0f};

// starting weight and bias values
float weight = 0.3f;
float bias = 0.5f;

// iteratively train the network
float error = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
// run the network to get error and derivatives
float output = 0.0f;
float cost = 0.0f;
float deltaCost_deltaWeight = 0.0f;
float deltaCost_deltaBias = 0.0f;
float deltaCost_deltaInput = 0.0f;
Example1RunNetwork(c_trainingData[0], c_trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, error, cost, weight, bias, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);
}
#endif

weight -= deltaCost_deltaWeight * c_learningRate;
bias -= deltaCost_deltaBias * c_learningRate;
}

printf("Example1 Final Error: %fn", error);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 2 - One Neuron, Two training Examples =====

void Example2 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example2.csv","w+t");
if (file != nullptr)
fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 100000;

// training data
// input: 1, output: 0
// input: 0, output: 1
const std::array<std::array<float, 2>, 2> c_trainingData = { {
{1.0f, 0.0f},
{0.0f, 1.0f}
} };

// starting weight and bias values
float weight = 0.3f;
float bias = 0.5f;

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
avgError = 0.0f;
float avgOutput = 0.0f;
float avgCost = 0.0f;
float avgDeltaCost_deltaWeight = 0.0f;
float avgDeltaCost_deltaBias = 0.0f;
float avgDeltaCost_deltaInput = 0.0f;

// run the network to get error and derivatives for each training example
for (const std::array<float, 2>& trainingData : c_trainingData)
{
float error = 0.0f;
float output = 0.0f;
float cost = 0.0f;
float deltaCost_deltaWeight = 0.0f;
float deltaCost_deltaBias = 0.0f;
float deltaCost_deltaInput = 0.0f;
Example1RunNetwork(trainingData[0], trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

avgError += error;
avgOutput += output;
avgCost += cost;
avgDeltaCost_deltaWeight += deltaCost_deltaWeight;
avgDeltaCost_deltaBias += deltaCost_deltaBias;
avgDeltaCost_deltaInput += deltaCost_deltaInput;
}

avgError /= (float)c_trainingData.size();
avgOutput /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
avgDeltaCost_deltaWeight /= (float)c_trainingData.size();
avgDeltaCost_deltaBias /= (float)c_trainingData.size();
avgDeltaCost_deltaInput /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, avgError, avgCost, weight, bias, avgDeltaCost_deltaWeight, avgDeltaCost_deltaBias, avgDeltaCost_deltaInput);
}
#endif

weight -= avgDeltaCost_deltaWeight * c_learningRate;
bias -= avgDeltaCost_deltaBias * c_learningRate;
}

printf("Example2 Final Error: %fn", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 3 - Two inputs, two neurons in one layer =====

struct SExample3Training
{
std::array<float, 2> m_input;
std::array<float, 2> m_output;
};

void Example3RunNetwork (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, 4>& weights, const std::array<float, 2>& biases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, 4>& deltaCost_deltaWeights, std::array<float, 2>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {

// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// the actual output of the network is the activation of the neurons
actualOutput[0] = O0;
actualOutput[1] = O1;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = O0 - target0
float deltaCost_deltaO0 = O0 - desiredOutput[0];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaCost_deltaBiases[0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaCost_deltaWeights[0] = neuron0Error * input[0];
deltaCost_deltaWeights[1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = O1 - target1
float deltaCost_deltaO1 = O1 - desiredOutput[1];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO0/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaCost_deltaBiases[1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO/deltaZ1 * deltaZ0/deltaWeight1
// deltaCost/deltaWeight2 = neuron1Error * deltaZ/deltaWeight1
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaCost_deltaWeights[2] = neuron1Error * input[0];
deltaCost_deltaWeights[3] = neuron1Error * input[1];

//----- Input -----

// As a bonus, calculate how much a change in the inputs affect the cost function.
// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
// That means that...
// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
//                       = neuron0Error * weight0 + neuron1Error * weight2
// and
// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
//                       = neuron0Error * weight1 + neuron1Error * weight3
deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example3 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example3.csv","w+t");
if (file != nullptr)
fprintf(file, ""training index","error","cost"n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 520000;

// training data: OR/AND
// input: 00, output: 00
// input: 01, output: 10
// input: 10, output: 10
// input: 11, output: 11
const std::array<SExample3Training, 4> c_trainingData = { {
{{0.0f, 0.0f}, {0.0f, 0.0f}},
{{0.0f, 1.0f}, {1.0f, 0.0f}},
{{1.0f, 0.0f}, {1.0f, 0.0f}},
{{1.0f, 1.0f}, {1.0f, 1.0f}},
} };

// starting weight and bias values
std::array<float, 4> weights = { 0.2f, 0.8f, 0.6f, 0.4f };
std::array<float, 2> biases = { 0.5f, 0.1f };

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
//float avgCost = 0.0f;
std::array<float, 2> avgOutput = { 0.0f, 0.0f };
std::array<float, 4> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaBiases = { 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
avgError = 0.0f;
float avgCost = 0.0;

// run the network to get error and derivatives for each training example
for (const SExample3Training& trainingData : c_trainingData)
{
float error = 0.0f;
std::array<float, 2> output = { 0.0f, 0.0f };
float cost = 0.0f;
std::array<float, 4> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaBiases = { 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
Example3RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

avgError += error;
avgCost += cost;
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] += output[i];
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
}

avgError /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
}
#endif

for (size_t i = 0; i < weights.size(); ++i)
weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
for (size_t i = 0; i < biases.size(); ++i)
biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
}

printf("Example3 Final Error: %fn", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 4 - Two layers with two neurons in each layer =====

void Example4RunNetwork (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, 8>& weights, const std::array<float, 4>& biases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, 8>& deltaCost_deltaWeights, std::array<float, 4>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {
// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// calculate Z2 and O2 for neuron2
float Z2 = O0 * weights[4] + O1 * weights[5] + biases[2];
float O2 = 1.0f / (1.0f + std::exp(-Z2));

// calculate Z3 and O3 for neuron3
float Z3 = O0 * weights[6] + O1 * weights[7] + biases[3];
float O3 = 1.0f / (1.0f + std::exp(-Z3));

// the actual output of the network is the activation of the output layer neurons
actualOutput[0] = O2;
actualOutput[1] = O3;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 2 -----

// calculate how much a change in neuron 2 activation affects the cost function
// deltaCost/deltaO2 = O2 - target0
float deltaCost_deltaO2 = O2 - desiredOutput[0];

// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
// deltaO2/deltaZ2 = O2 * (1 - O2)
float deltaO2_deltaZ2 = O2 * (1 - O2);

// calculate how much a change in neuron 2 weighted input affects the cost function.
// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
deltaCost_deltaBiases[2] = neuron2Error;

// calculate how much a change in weight4 affects the cost function.
// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * O0
// similar thing for weight5
deltaCost_deltaWeights[4] = neuron2Error * O0;
deltaCost_deltaWeights[5] = neuron2Error * O1;

//----- Neuron 3 -----

// calculate how much a change in neuron 3 activation affects the cost function
// deltaCost/deltaO3 = O3 - target1
float deltaCost_deltaO3 = O3 - desiredOutput[1];

// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
// deltaO3/deltaZ3 = O3 * (1 - O3)
float deltaO3_deltaZ3 = O3 * (1 - O3);

// calculate how much a change in neuron 3 weighted input affects the cost function.
// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
deltaCost_deltaBiases[3] = neuron3Error;

// calculate how much a change in weight6 affects the cost function.
// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * O0
// similar thing for weight7
deltaCost_deltaWeights[6] = neuron3Error * O0;
deltaCost_deltaWeights[7] = neuron3Error * O1;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
float deltaCost_deltaO0 = neuron2Error * weights[4] + neuron3Error * weights[6];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaCost_deltaBiases[0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaCost_deltaWeights[0] = neuron0Error * input[0];
deltaCost_deltaWeights[1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
float deltaCost_deltaO1 = neuron2Error * weights[5] + neuron3Error * weights[7];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO1/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaCost_deltaBiases[1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaCost_deltaWeights[2] = neuron1Error * input[0];
deltaCost_deltaWeights[3] = neuron1Error * input[1];

//----- Input -----

// As a bonus, calculate how much a change in the inputs affect the cost function.
// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
// That means that...
// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
//                       = neuron0Error * weight0 + neuron1Error * weight2
// and
// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
//                       = neuron0Error * weight1 + neuron1Error * weight3
deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example4 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example4.csv","w+t");
if (file != nullptr)
fprintf(file, ""training index","error","cost"n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 5000;

// training data: 0.05, 0.1 in = 0.01, 0.99 out
const std::array<SExample3Training, 1> c_trainingData = { {
{{0.05f, 0.1f}, {0.01f, 0.99f}},
} };

// starting weight and bias values
std::array<float, 8> weights = { 0.15f, 0.2f, 0.25f, 0.3f, 0.4f, 0.45f, 0.5f, 0.55f};
std::array<float, 4> biases = { 0.35f, 0.35f, 0.6f, 0.6f };

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
std::array<float, 2> avgOutput = { 0.0f, 0.0f };
std::array<float, 8> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 4> avgDeltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
avgError = 0.0f;
float avgCost = 0.0;

// run the network to get error and derivatives for each training example
for (const SExample3Training& trainingData : c_trainingData)
{
float error = 0.0f;
std::array<float, 2> output = { 0.0f, 0.0f };
float cost = 0.0f;
std::array<float, 8> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 4> deltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
Example4RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

avgError += error;
avgCost += cost;
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] += output[i];
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
}

avgError /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
}
#endif

for (size_t i = 0; i < weights.size(); ++i)
weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
for (size_t i = 0; i < biases.size(); ++i)
biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
}

printf("Example4 Final Error: %fn", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

int main (int argc, char **argv)
{
Example1();
Example2();
Example3();
Example4();
system("pause");
return 0;
}


The sample code outputs csv files showing how the values of the networks change over time. One of the reasons for this is because I want to show you error over time.

Below is example 4’s error over time, as we do it’s 5,000 learning iterations.

The other examples show a similarly shaped graph, where there is a lot of learning in the very beginning, and then there is a very long tail of learning very slowly.

When you train neural networks as I’ve described them, you will almost always see this, and sometimes will also see a slow learning time at the BEGINNING of the training.

This issue is also due to the activation function used, just like the unstable gradient problem, and is also an active area of research.

To help fix this issue, there is something called a “cross entropy cost function” which you can use instead of the mean squared error cost function I have been using.

That cost function essentially cancels out the non linearity of the activation function so that you get nicer linear learning progress, and can get networks to learn more quickly and evenly. However, it only cancels out the non linearity for the LAST layer in the network. This means it’s still a problem for networks that have more layers.

Lastly, there is an entirely different thing you can use backpropagation for. We adjusted the weights and biases to get our desired output for the desired inputs. What if instead we adjusted our inputs to give us the desired outputs?

You can do that by using backpropagation to calculate the dCost/dInput derivatives and using those to adjust the input, in the exact same way we adjusted the weights and biases.

You can use this to do some interesting things, including:

1. finding images that a network will recognize as a familiar object, that a human wouldn’t. Start with static as input to the network, and adjust inputs to give the desired output.
2. Modifying images that a network recognizes, into images it doesn’t recognize, but a human would. Start with a well recognized image, and adjust inputs using gradient ASCENT (add the derivatives, don’t subtract them) until the network stops recognizing it.

Believe it or not, this is how all those creepy “deep dream” images were made that came out of google as well, like the one below.

Now that you know the basics, you are ready to learn some more if you are interested. If you still have some questions about things I did or didn’t talk about, these resources might help you make sense of it too. I used these resources and they were all very helpful! You can also give me a shout in the comments below, or on twitter at @Atrix256.

# A Geometric Interpretation of Neural Networks

In the 90s before I was a professional programmer / game developer I looked at neural networks and found them interesting but got scared off by things like back propagation, which I wasn’t yet ready to understand.

With all the interesting machine learning things going on in modern times, I decided to have a look again and have been pleasantly surprised at how simple they are to understand. If you have knowledge of partial derivatives and gradients (like, if you’ve done any ray marching), you have the knowledge it takes to understand it.

Here are some really great resources I recomend for learning the nuts and bolts of how modern neural networks actually work:
Learn TensorFlow and deep learning, without a Ph.D.
Neural Networks and Deep Learning
A Neural Network Playground (Web Based NN sand box from google)

This post doesn’t require any understanding of neural networks or partial derivatives, so don’t worry if you don’t have that knowledge. The harder math comes up when training a neural network, but we are only going to be dealing with evaluating neural networks, which is much simpler.

## A Geometric Interpretation of a Neuron

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron.

To calculate the output value of a neuron, you multiply every input into that neuron by a weight for that input, and add a bias. This value is fed into an activation function (more on activation functions shortly) and the result is the output value of that neuron. Here is a diagram for a single neuron:

A more formal definition of a neuron’s output is below. $b$ is the bias, $w_j$ is the j’th weight and $i_j$ is the j’th input.
$Output = b+\sum_{j=0}^nw_ji_j$

You might notice that the above is really just the dot product of the weight vector and the input vector, with a value added on the end. We could re-write it like that:
$Output = Dot(w,i)+b$

Interestingly, that is the same equation that you use to find the distance of a point to a plane. Let’s say that we have a plane defined by a unit length normal N and a distance to the origin d, and we want to calculate the distance of a point P to that plane. We’d use this formula:
$Distance = Dot(N,P)+d$

This would give us a signed distance, where the value will be negative if we are in the negative half space defined by the plane, and positive otherwise.

This equation works if you are working in 3 dimensional space, but also works in general for any N dimensional point and plane definition.

What does this mean? Well, this tells us that every neuron in a neural network is essentially deciding what side of a hyperplane a point is on. Each neuron is doing a linear classification, saying if something is on side A or side B, and giving a distance of how far it is into A or B.

This also means that when you combine multiple neurons into a network, that an output neuron of that neural network tells you whether the input point is inside or outside of some shape, and by how much.

I find this interesting for two reason.

Firstly, it means that a neural network can be interpreted as encoding SHAPES, which means it could be used for modeling shapes. I’m interested in seeing what sort of shapes it’s capable of, and any sorts of behaviors this representation might have. I don’t expect it to be useful for, say, main stream game development (bonus if it is useful!) but at minimum it ought to be an interesting investigation to help understand neural networks a bit better.

Secondly, there is another machine learning algorithm called Support Vector Machines which are also based on being able to tell you which side of a separation a data point is on. However, unlike the above, SVM separations are not limited to plane tests and can use arbitrary shapes for separation. Does this mean that we are leaving something on the table for neural networks? Could we do better than we are to make networks with fewer layers and fewer neurons that do better classification by doing non linear separation tests?

Quick side note: besides classification, neural nets can help us with something called regression, which is where you fit a network to some analog data, instead of the more discrete problem of classification, which tells you what group something belongs to.

It turns out that the activation function of a neuron can have a pretty big influence on what sort of shapes are possible, which makes it so we aren’t strictly limited to shapes made up of planes and lines, and also means we aren’t necessarily leaving things on the table compared to SVM’s.

This all sort of gives me the feeling though that modern neural networks may not be the best possible algorithm for the types of things we use them for. I feel like we may need to generalize them beyond biological limitations to allow things like multiplications between weighted inputs instead of only sums. I feel like that sort of setup will be closer to whatever the real ideal “neural computation” model might be. However, the modern main stream neural models do have the benefit that they are evaluated very efficiently via dot products. They are particularly well suited for execution on GPUs which excel at performing homogenous operations on lots and lots of data. So, a more powerful and more general “neuron” may come at the cost of increased computational costs, which may make them less desirable in the end.

As a quick tangent, here is a paper from 2011 which shows a neural network model which does in fact allow for multiplication between neuron inputs. You then will likely be wanting exponents and other things, so while it’s a step in the right direction perhaps, it doesn’t yet seem to be the end all be all!
Sum-Product Networks: A New Deep Architecture

It’s also worth while to note that there are other flavors of neural networks, such as convolutional neural networks, which work quite a bit differently.

Let’s investigate this geometric interpretation of neurons as binary classifiers a bit, focusing on some different activation functions!

## Step Activation Function

The Heaviside step function is very simple. If you give it a value greater than zero, it returns a 1, else it returns a 0. That makes our neuron just spit out binary: either a 0 or a 1. The output of a neuron using the step activation function is just the below:

$Output = Dot(w,i)+b > 0$

The output of a neuron using the step activation function is true if the input point is in the positive half space of the plane that this neuron describes, else it returns false.

Let’s think about this in 2d. Let’s make a neural network that takes x and y as input and spits out a value. We can make an image that visualizes the range from (-1,-1) to (1,1). Negative values can be shown in blue, zero in white, and positive values in orange.

To start out, we’ll make a 2d plane (aka a line) that runs vertically and passes through the origin. That means it is a 2d plane with a normal of (1,0) and a distance from the origin of 0. In other words, our network will have a single neuron that has weights of (1,0) and a bias of 0. This is what the visualization looks like:

You can actually play around with the experiments of this post and do your own using an interactive visualization I made for this post. Click here to see this first experiment: Experiment: Vertical Seperation

We can change the normal (weights) to change the angle of the line, and we can also change the bias to move the line to it’s relative left or right. Here is the same network that has it’s weights adjusted to (1,1) with a bias of 0.1.

Experiment: Diagonal Separation

The normal (1,1) isn’t normalized though, which makes it so the distance from origin (aka the bias) isn’t really 0.1 units. The distance from origin is actually divided by the length of the normal to get the REAL distance to origin, so in the above image, where the normal is a bit more than 1.0, the line is actually less than 0.1 units from the origin.

Below is the visualization if we normalize the weights to (0.707,0.707) but leave the bias at 0.1 units. The result is that the line is actually 0.1 units away from the origin.

Experiment: Normalized Diagonal Separation

Recalling the description of our visualization, white is where the network outputs zero, while orange is where the network outputs a positive number (which is 1 in this case).

If we define three lines such that their negative half spaces don’t completely overlap, we can get a triangle where the network outputs a zero, while everywhere else it outputs a positive value. We do this by having three sibling neurons in the first layer which define three separate lines, and then in the output neuron we give them all a weight of 1. This makes it so the area outside the triangle is always a positive value, which step turns into 1, but inside the triangle, the value remains at 0.

We can turn this negative space triangle into a positive space triangle however by making the output neuron have a weight on the inputs of -1, and adding a bias of 0.1. This makes it so that pixels in the positive space of any of the lines will become a negative value. The negative space of those three lines get a small bias to make it be a positive value, resulting in the step function making the values be 0 outside of the triangle, and 1 inside the triangle. This gives us a positive space triangle:

Taking this a bit further, we can make the first layer define 6 lines, which make up two different triangles – a bigger one and a smaller one. We can then have a second layer which has two neurons – one which makes a positive space larger triangle, and one which makes a positive space smaller triangle. Then, in the output neuron we can give the larger triangle neuron a weight of 1, while giving the smaller triangle neuron a weight of -1. The end result is that we have subtracted the smaller triangle from the larger one:

Using the step function we have so far been limited to line based shapes. This has been due to the fact that we can only test our inputs against lines. There is a way around this though: Pass non linear input into the network!

Below is a circle with radius 0.5. The neural network has only a single input which is sqrt(x*x+y*y). The output neuron has a bias of -0.5 and a weight of 1. It’s the bias value which controls the radius of the circle.

You could pass other non linear inputs into the network to get a whole host of other shapes. You could pass sin(x) as an input for example, or x squared.

While the step function is inherently very much limited to linear tests, you can still do quite a lot of interesting non linear shapes (and data separations) by passing non linear input into the network.

Unfortunately though, you as a human would have to know the right non linear inputs to provide. The network can’t learn how to make optimal non linear separations when using the step function. That’s quite a limitation, but as I understand it, that’s how it works with support vector machines as well: a human can define non linear separations, but the human must decide the details of that separation.

BTW it seems like there could be a fun puzzle game here. Something like you have a fixed number of neurons that you can arrange in however many layers you want, and your goal is to separate blue data points from orange ones, or something like that. If you think that’d be a fun game, go make it with my blessing! I don’t have time to pursue it, so have at it (:

## Identity and Relu Activation Functions

The identity activation function doesn’t do anything. It’s the same as if no activation function is used. I’ve heard that it can be useful in regression, but it can also be useful for our geometric interpretation.

Below is the same circle experiment as before, but using the identity activation function instead of the step activation function:

Remembering that orange is positive, blue is negative, and white is zero, you can see that outside the circle is orange (positive) and inside the circle is blue (negative), while the outline of the circle itself is white. We are looking at a signed distance field of the circle! Every point on this image is a scalar value that says how far inside or outside that point is from the surface of the shape.

Signed distance fields are a popular way of rendering vector graphics on the GPU. They are often approximated by storing the distance field in a texture and sampling that texture at runtime. Storing them in a texture only requires a single color channel for storage, and as you zoom in to the shape, they preserve their shape a lot better than regular images. You can read more about SDF textures on my post: Distance Field Textures.

Considering the machine learning perspective, having a signed distance field is also an interesting proposition. It would allow you to do classification of input, but also let you know how deeply that input point is classified within it’s group. This could be a confidence level maybe, or could be interpreted in some other way, but it gives a sort of analog value to classification, which definitely seems like it could come in handy sometimes.

If we take our negative space triangle example from the last section and change it from using step activation to identity activation, we find that our technique doesn’t generalize naively though, as we see below. (It doesn’t generalize for the positive space triangle either)

The reason it doesn’t generalize is that the negatives and positives of pixel distances to each of the lines cancel out. Consider a pixel on the edge of the triangle: you are going to have a near zero value for the edge it’s on, and two larger magnitude negative values from the other edges it is in the negative half spaces of. Adding those all together is going to be a negative value.

To help sort this out we can use an activation function called “relu”, which returns 0 if the value it’s given is less than zero, otherwise it returns the value. This means that all our negative values become 0 and don’t affect the distance summation. If we switch all the neurons to using relu activation, we get this:

If you squint a bit, you can see a triangle in the white. If you open the experiment and turn on “discrete output” to force 0 to orange you get a nice image that shows you that the triangle is in fact still there.

Our result with relu is better than identity, but there are two problems with our resulting distance field.

Firstly it isn’t a signed distance field – there is no blue as you might notice. It only gives positive distances, for pixels that are outside the shape. This isn’t actually that big of an issue from a rendering perspective, as unsigned distance fields are still useful. It also doesn’t seem that big of an issue from a machine learning perspective, as it still gives some information about how deeply something is classified, even though it is only from one direction.

I think with some clever operations, you could probably create the internal negative signed distance using different operations, and then could compose it with the external positive distance in the output neuron by adding them together.

The second problem is a bigger deal though: The distance field is no longer accurate!

By summing the distance values, the distance is incorrect for points where the closest feature of the triangle is a vertex, because multiple lines are contributing their distance to the final value.

I can’t think of any good ways to correct that problem in the context of a neural network, but the distance is an approximation, and is correct for the edges, and also gets more correct the closer you get to the object, so this is still useful for rendering, and probably still useful for machine learning despite it being an imperfect measurement.

## Sigmoid and Hyperbolic Tangent Activation Function

The sigmoid function is basically a softer version of the step function and gives output between 0 and 1. The hyperbolic tangent activation function is also a softer version of the step function but gives output between -1 and 1.

Sigmoid:

Hyperbolic Tangent:

(images from Wolfram Mathworld)

They have different uses in machine learning, but I’ve found them to be visibly indistinguishable in my experiments after compensating for the different range of output values. It makes me think that smoothstep could probably be a decent activation function too, so long as your input was in the 0 to 1 range (maybe you could clamp input to 0 and 1?).

These activation functions let you get some non linearity in your neural network in a way that the learning algorithms can adjust, which is pretty neat. That puts us more into the realm where a neural net can find a good non linear separation for learned data. For the geometric perspective, this also lets us make more interesting non linear shapes.

Unfortunately, I haven’t been able to get a good understanding of how to use these functions to craft desired results.

It seems like if you add even numbers of hyperbolic tangents together in a neural network that you end up getting a lot of white, like below:

However, if you add an odd number of them together, it starts to look a bit more interesting, like this:

Other than that, it’s been difficult seeing a pattern that I can use to craft things. The two examples above were made by modifying the negative space triangle to use tanh instead of step.

## Closing

We’ve wandered a bit in the idea of interpreting neural networks geometrically but I feel like we’ve only barely scratched the surface. This also hasn’t been a very rigorous exploration, but more has just been about exploring the problem space to get a feeling for what might be possible.

It would be interesting to look more deeply into some of these areas, particularly for the case of distance field generation of shapes, or combining activation functions to get more diverse results.

While stumbling around, it seems like we may have gained some intuition about how neural networks work as well.

It feels like whenever you add a layer, you are adding the ability for a “logical operation” to happen within the network.

For instance, in the triangle cutout experiment, the first layer after the inputs defines the 6 individual lines of the two triangles and classifies input accordingly. The next layer combines those values into the two different triangle shapes. The layer after that converts them from negative space triangles to positive space triangles. Lastly, the output layer subtracts the smaller triangle’s values from the larger triangle’s values to make the final triangle outline shape.

Each layer has a logical operation it performs, which is based on the steps previous to it.

Another piece of intuition I’ve found is that it seems like adding more neurons to a layer allows it to do more work in parallel.

For instance, in the triangle cutout experiment, we created those two triangles in parallel, reserving some neurons in each layer for each triangle. It was only in the final step that we combined the values into a single output.

Since neurons can only access data from the previous network layer, it seems as though adding more neurons to layers can help push data forward from previous layers, to later layers that might need the data. But, it seems like it is most efficient to process input data as early as possible, so that you don’t have to shuttle it forward and waste layers / neurons / memory and computing power.

Here is some info on other activation functions:
Wikipedia:Activation Function

Here’s a link that talks about how perceptrons (step activated neural networks) relate to SVMs:
Hyperplane based Classification: Perceptron and (Intro to) Support Vector Machines

By the way, did I mention you can visualize neural networks in three dimensions as well?

Experiment: 3d Visualization

Here are the two visualizers of neural networks I made for this post using WebGL2:
Neural Network Visualization 2D
Neural Network Visualization 3D

If you play around with this stuff and find anything interesting, please share!

# Evaluating Points on Analytical Surfaces and in Analytical Volumes Using the GPU Texture Sampler

This is an extension of a paper I wrote which shows how to use the linear texture sampling capabilities of the GPU to calculate points on Bezier curves. You store the control points in the texture, then sample along the texture’s diagonal to get points on the curve:
GPU Texture Sampler Bezier Curve Evaluation

This extension shows how to use the technique to evaluate points on surfaces and inside of volumes, where those surfaces and volumes are defined either by Bezier curves or polynomials (Tensor products of polynomials to be more specific).

As an example of what this post will allow you to do:

• By taking a single sample of a 3d RGBA volume texture, you’ll be able to get a bicubic interpolated value (a bicubic surface).
• Alternately, taking a single sample of a 3d RGBA volume texture will allow you to get a linear interpolation between two biquadratic surfaces (a linear/biquadratic volume).
• This post also covers how to extend this to higher degree surfaces and volumes.

Here are two images generated by the WebGL2 demos I made for this post which utilize this technique for rendering surfaces, fog volumes, and solid volumes. (link to the demos at bottom of post!)

All textures are size 2 on each axis which makes it a cache friendly technique (you can grow the texture sizes for piecewise curves/surfaces/volumes though). It leverages the hardware interpolation which makes it a relatively computationally inexpensive technique, and it supports all polynomials within the limitations of floating point math, so is also very flexible and expressive. You could even extend this to rational polynomial surfaces and volumes which among other things would allow perfect representations of conic sections.

The animated Bezier curve images in this post came from wikipedia. Go have a look and drop them a few bucks if you find wikipedia useful!
Wikipedia: Bézier curve

# Curves

If you’ve read my curve paper and understand the basics you can skip this section and go onto the section “Before Going Into Surfaces”.

Let’s talk about how to store curves of various degrees in textures and evaluate points on them using the GPU Texture sampler. We’ll need this info when we are working with surfaces and volumes because a higher degree curve is dual to a section of lower degree surface or an even lower degree volume.

The three ways we’ll be talking about controlling the order of curves are:

1. Texture Dimensionality – 1d texture vs 2d texture vs 3d texture vs 4d texture.
2. Number of Color Channels – How many color channels are used? R? RG? RGB? RGBA?
3. Multiple Texture Samples – Doing multiple texture reads.

## Texture Dimensionality

By texture dimensionality I mean how many dimensions the texture has. In all cases, the size of the texture is going to be 2 on each axis.

Starting with a 1d texture, we have a single texture coordinate (u) to sample along. As we change the u value from 0 to 1, we are just linearly interpolating between the two values. A 1d texture that has 2 pixels in it can store a degree 1 curve, also known as a linear Bezier curve. With linear texture sampling, the GPU hardware will do this linear interpolation for you.

The equation for linear interpolation between two values A and B which are at t=0 and t=1 respectively is:
$A*(1-t) + B*t$

Here’s the 1d texture:

Here’s a linear curve:

Going to a 2d texture it gets more interesting. We now have two texture coordinates to sample along (u,v). Using linear sampling, the hardware will do bilinear interpolation (linear interpolation across each axis) to get the value at a specific (u,v) texture coordinate.

Here is the equation for bilinear interpolation between 4 values A,B,C,D which are at texture coordinates (0,0), (1,0), (0,1), (1,1) respectively, being sampled at (u,v):

$(A*(1-u) + B*u)*(1-v) + (C*(1-u) + D*u) * v$

That equation interpolates from A to B by u (x axis), and from C to D by u (x axis), and then interpolates from the first result to the second by v (y axis). Note that it doesn’t actually matter which axis is interpolated by first. An equivelant equation would be one that interpolates from A to B by v (y axis) and from B to C by v (y axis) and then between those results by u (x axis).

With that equation, something interesting starts to happen if you use the same value (t) for u and v, expand and simplify, and end up at this equation:

$A*(1-t)^2 + (B+C)*(1-t)t + Dt^2$

That equation is very close to the quadratic Bezier formula, which is below:

$A*(1-t)^2 + B*2(1-t)t + Ct^2$

To get to that equation, we just make B and C the same value (B), and rename D to C since that letter is unused. This tells us how we need to set up our 2d texture such that when we sample along the diagonal, we get the correct point on our quadratic Bezier curve:

Here’s a quadratic Bezier curve in action. You can see how it is a linear interpolation between two linear interpolations, just like taking a bilinearly interpolated sample on our texture is.

Taking this to a 3d texture, we now have three texture coordinates to sample along (u,v,w). Again, with linear sampling turned on, the hardware will do trilinear interpolation to get the value at a specific (u,v,w) texture coordinate.

If we follow the same process as the 2d texture, we will wind up with the equation for a cubic Bezier curve:

$A*(1-t)^3 + B*3(1-t)^2t + C*3(1-t)t^2 + Dt^3$

Here’s how the texture is laid out:

Here’s a cubic Bezier curve in action, where you can see 3 levels of linear interpolations, just like how trilinear interpolation works:

While I have never used a 4d texture it appears that directx supports them and there looks to be an OpenGL extension to support them as well.

If we took this to a 4d texture, we would end up with the equation for a quartic curve. If you have trouble visualizing what a 4d texture even looks like, you aren’t alone. You have four texture coordinates to sample along (u,v,w,t). When you sample it, there are two 3d volume textures that are sampled at (u,v,w), resulting in two values as a result. These values are then interpolated by t to give you the final value. A fourth dimensional texture lookup is just an interpolation between 2 three dimensional texture lookups. That is true of all dimensional texture lookups in fact. An N dimensional texture lookup is just the linear interpolation between two N-1 dimensional texture lookups. For example, a three dimensional texture lookup is just an interpolation between 2 two dimensional texture lookups. This “hierchical interpolation” is the link I noticed between texture interpolation and the De Casteljau algorithm, since that is also a hierchical interpolation algorithm, just with fewer values interpolated between.

Here’s how the 4d texture is laid out:

Here’s the quartic Bezier equation, which is what you get the answer to if you sample a 4d texture at (t,t,t,t):

$A*(1-t)^4 + B*4(1-t)^3t + C*6(1-t)^2t^2 + D*4(1-t)t^3 + Et^4$

Here’s a quartic Bezier curve in action, showing 4 levels of linear interpolation, just like how quadrilinear interpolation works with 4d textures:

So, the bottom line of this section is that if we sample along the diagonal of an N dimensional texture which has one color channel, we will get points on a degree N curve.

## Number of Color Channels

Another way we can control the degree of a curve stored in a texture is by the number of color channels that are stored in the texture.

In the section above we showed a 1d texture that stored a linear curve. it had only one color channel:

Let’s add another color channel. A,B will be stored in the red channel, and B,C will be stored in the green channel:

When we read that texture at location (t), we will get the following values:

1. R: The linear interpolation between A and B at time t.
2. G: The linear interpolation between B and C at time t.

Now, if we just lerp between R and G in our shader, for time t, we will get the point at time t, on the cubic Bezier curve defined by control points A,B,C.

Pretty cool right?

What happens if we add another color channel, blue?

Well, when we sample the texture at time t, we get the following values:

1. R: The linear interpolation between A and B at time t.
2. G: The linear interpolation between B and C at time t.
3. B: The linear interpolation between C and D at time t.

We can combine these values using the quadratic Bezier curve formula, as if these were each a control point:

$R*(1-t)^2 + G*2(1-t)t + Bt^2$

The result we get is a point on the CUBIC curve defined by the four control points A,B,C,D.

In the previous section, it took a 3d volume texture to calculate a cubic curve. In this section we were able to do it with a 1d RGB texture, but it came at the cost of of having to do some calculation in the shader code after sampling the texture to combine the color channels and get the final result.

How exactly does adding a color channel affect the degree though? Each color channel added increases the degree by 1.

You can see this is true by seeing in the last section how a 3 dimensional texture can evaluate a cubic, and a 4 dimensional texture can evalaute a quartic, but the 4th dimensional texture was just two 3 dimensional textures. Adding a second color channel just doubles the size of your data (and adding two tripples, and adding three quadruples), so having a 3d volume texture that has two color channels is the same as having a 4d volume texture with a single color channel. In both cases, you are just interpolating between two 3d texture samples.

## Multiple Texture Samples

Multiple texture samples is the last way to control curve degree that we are going to talk about.

Taking extra texture samples is a lot like adding color channels.

If you have a 1d RGB texture, you get a result of 3 lerps – R,G,B – which you can use to calculate a cubic curve point (order 3). If you take a second sample, you get R0,G0,B0,R1,G1,B1 which is a result of 6 lerps, which gives you a point on a sextic curve (order 6).

If you have a 2d RGBA texture, you get the result of 4 quadratic interpolations – R,G,B,A – which gives you an order 5 curve point. Taking another texture read gives you 8 quadratic interpolation results, which you can put together to make an order 9 curve point. Taking a third texture read would get you up to order 13.

Just like adding color channels, taking extra texture samples requires you to combine the multiple results in your shader, which increases computational cost.

Besides that, you are also doing more texture reads, which can be another source of performance loss. The textures are small (up to 2x2x2x2) so are texture cache friendly, but if you have multiple textures, it could start to add up I’m sure.

IMO this option should be avoided in favor of the others, when possible.

# Before Going Into Surfaces

Before we start on surfaces, I want to mention a few things.

Even though we’ve been talking about Bezier curves specifically, a previous post explained how to convert any polynomial from power basis form into Bernstein basis form (aka you can turn any polynomial into a Bezier curve that is exactly equivelant). So, this generalizes to polynomials, and even rational polynomials if you do division in your shader code, but I’ll point you towards that post for more information on that: Evaluating Polynomials with the GPU Texture Sampler.

You can also extend the above for piecewise curves easily enough. You just set up a different curve (or surface or volume, as we describe below) for different ranges of your parameter space values. From time 0 to 1, you may use one texture, and from time 1 to 2, you may use another. Better yet, you would store both curves in a single texture, and just make the texture be a little larger, instead of having two separate textures.

Also, many other types of curves – B-splines, nurbs – can be broken down exactly into piecewise Bezier curves (rational, if the source curve is rational). Check these links for more info:
Algorithms for B-Spline Curves
Wikipedia: De Boor’s Algorithm.

# Surfaces

Finally onto surfaces!

I’m going to show how to extend the curve calculation technique to calculating points on Bezier rectangles. A Bezier rectangle is a rectangular surface which has one or more bezier curves across the X axis and one or more bezier curves across the Y axis. The degree of the curve on each axis doesn’t need to match so it could be quadratic on one axis and cubic on the other as an example.

To actually evaluate a point on the surface at location (u,v), you evaluate a point on each x axis curve for time u, and then you use those resulting values as control points in another curve that you evaluate at time v.

Just like linear interpolation, it doesn’t matter which axis you evaluate first for a Bezier rectangle surface so you could switch the order of the axis evaluation if you want to.

The image above shows a bicubic surface, the blue lines show the x axis cubic bezier curves, while the yellow lines show the y axis cubic bezier curves. Those lines are called “isolines” or “isocurves”. The 16 control points are shown in magenta.

Another name for a Bezier rectangle is a tensor product surface. This is a more generalized term as it isn’t limited to Bezier curves.

Note: there is another type of Bezier surface called a Bezier Triangle but I haven’t worked much with them so can’t say if any of these techniques work with them or not. It would be interesting to explore how these techniques apply to Bezier triangles, if at all.

Hopefully it should come as no surprise that a 2d texture using regular bilinear interpolation is in fact a Bezier rectangle which is linear on the x axis and linear on the y axis. It has a degree of (1,1) and is stored in a 2d texture (2×2 pixels), where the four control points are just stored in the four pixels. You just sample the texture at (u,v) to get that point on the surface. Pretty simple stuff.

Order (1,1) Bezier Rectangle:

Something interesting to note is that while the isolines (edges) of the rectangle are linear, the surface itself is curved. In fact, we know that the diagonal of this surface is in fact a quadratic Bezier curve because we calculate curves by sampling along the diagonal! (if the middle corners are different, it’s the same as if they were both replaced with the average of their values).

There are other ways to store this Degree (1,1) surface in a texture besides how i described. You could also have a 1 dimensional texture with two color channels, where you sample it along the u axis, and then interpolate your R and G values, using the v axis value. This would come at the cost of doing a lerp in the shader code, instead of having the texture sampler hardware do it for you.

Now that the simplest case is out of the way, how about the next simplest? What if we want a surface where we linearly interpolate between two quadratic curves? That is, what if we want to make a degree (2,1) Bezier rectangle?

Order (2,1) Bezier Rectangle:

Well if you think about it geometrically, we can store a quadratic curve in a 2d texture (2×2) with a single color channel. To linearly interpolate between two of those, we need two of those to interpolate between. So, we need a 3d texture, since that is just an interpolation between two 2d textures.

When we sample that texture, we use the coordinates (u,u,v). That will make it quadratic in u, but linear in v.

Stepping up the complexity again, what if we wanted to make a biquadratic surface – aka degree (2,2)?

Order (2,2) Bezier Rectangle:

Well, to make a quadratic curve we need 3 control points, so for a biquadratic surface we need 3 quadratic curves to quadratically interpolate between.

One way to do this would be with a 4d texture, sampling along (u,u,v,v) to make it quadratic in both u and v.

But, because 4d textures are kind of exotic and may not be supported, we can achieve this by instead having a 3d texture with two color channels: R,G.

When we sample that texture, we sample at (u,u,v) to get two values: R,G. Next we linearly interpolate from R to G using v. This makes us quadratic in both u and v.

There are other ways to encode this surface as well, but i’ll leave that to you to think about if you want to (:

Lastly, what if we wanted a bicubic surface? A cubic curve has 4 control points, so we need 4 cubic curves to cubically interpolate between to make our final surface.

Order (3,3) Bezier Rectangle:

Thinking back to the first section, a 3d texture can evaluate a cubic curve. Since we need four cubic curves, let’s just use all four color channels RGBA. We would sample our texture at (u,u,u) to get four cubic curves in RGBA and then would use the cubic Bezier formula to combine those four values using v into our final result.

# Surfaces Generalized

Generalizing surface calculations a bit, there are basically two steps.

First is you need to figure out what your requirements for the x axis is as far as texture storage for the desired degree you want. From there, you figure out what degree you want on your y axis, and that degree is what you multiply the x axis texture storage requirements for.

It can be a little bit like tetris trying to figure out how to fit various degree surfaces into various texture sizes and layouts, but it gets easier with a little practice.

It’s also important to remember that the x axis being the first axis is by convention only. It could easily be the y axis that defines the texture storage requirements, and is multiplied by the degree of the x axis.

# Volumes

Volumes aren’t a whole lot more complex than surfaces, but they are a lot hungrier for texture space and linear interpolations!

Extending the generalization of surfaces, you once again figure out requirements for the x axis, multiply those by the degree of the y axis, and then multiply that result by the degree of the z axis.

The simplest case for volumes is the trilinear case, aka the Degree (1,1,1) Bezier rectangle.

Order (1,1,1) Bezier Cube:

It’s a bit difficult to understand what’s going on in that picture by seeing the data as just fog density, so the demos let you specify a surface threshold such that if the fog is denser than that amount, it shows it as a surface. Here is the same trilinear Bezier volume with a surface threshold.

Order (1,1,1) Bezier Cube:

You just store your 8 values in the 8 corners of the 2x2x2 texture cube, and sample at (u,v,w) to get your trilinear result.

The next simplest case is that you want to quadratically interpolate between two linear surfaces – a Degree (1,1,2) Bezier rectangle.

Order (1,1,2) Bezier Cube:

To do this, you need 3 bilinear surfaces to interpolate between.

One way to do this would be to have a 2d Texture with R,G,B color channels. Sample the texture at (u,v), then quadratically interpolate R,G,B using w.

Another way to do this would be to have a 3d texture with R and G channels. When sampling, you sample the 3d texture at (u,v,w) to get your R and G results. You then linearly interpolate from R to G by w to get the final value.

Yet another way to do this would be to use a 4d texture if you have support for it, and sample along (u,v,w,w) to get your curve point using only hardware interpolation.

The next simplest volume type is a linear interpolation between two biquadratic surfaces – a Degree (2,2,1) Bezier rectangle.

Order (2,2,1) Bezier Cube:

From the surfaces section, we saw we could store a biquadratic surface in a 3d texture using two color channels R,G. After sampling at (u,u,v) you interpolate from R to G by v.

To make a volume that linearly interpolates between two biquadratic surfaces, we need two biquadratic surfaces, so need to double the storage we had before.

We can use a 3d texture with 4 color channels to make this happen by storing the first biquadratic in R,G and the second in B,A, sampling this texture at (u,u,v). Next, we interpolate between R and G by v, and also interpolate between B and A by v. Lastly, we linearly interpolate between those two results using w.

The next higher surface would be a triquadratic volume, which is degree (2,2,2). Since you can store a biquadratic surface in a 3d 2x2x2 texture with two color channels, and a triquadratic volume needs 3 of those, we need a 3d texture 6 color channels. Since that doesn’t exist, we could do something like store 2 of the quadratic surfaces in a 2x2x2 RGBA texture, and the other quadratic surface in a 2x2x2 RG texture. We would take two texture samples and combine the 6 results into our final value.

Tricubic is actually pretty simple to conceptualize luckily. We know that we can store a bicubic surface in a 3d 2x2x2 RGBA texture. We also know that we would need 4 of those if we want to make a tricubic volume. So, we could do 4 texture reads (one for each of our bicubic surfaces) and then combine those 4 samples across w to get our final volume value.

# Closing

Hopefully you were able to follow along and see that this stuff is potentially pretty powerful.

Some profiling needs to be done to better understand the performance characteristics of using the texture sampler in this way, versus other methods of curve, surface and volume calculation. I have heard that even when your texture samples are in the texture cache, that it can still take like ~100 cycles to get the information back on a texture read. That means that this is probably not going to be as fast as using shader instructions to calculate the points on the curve. However, if you are compute bound and can offload some work to the texture sampler, or if you are already using a texture to store 1d/2d/3d data (or beyond) that you can aproximate with this technique, that you will have a net win.

One thing I really like about this is that it makes use of non programmable hardware to do useful work. It feels like if you were compute bound, that you could offload some work to the texture sampler if you had some polynomials to evaluate (or surfaces/volumes to sample), and get some perf back.

I also think this could possibly be an interesting way to make concise representations (and evaluation) of non polygonal models. I imagine it would have to be piecewise to make things that look like real world objects, but you do have quite a bit of control with Bezier curves, surfaces and volumes, especially if you use rational ones by doing a divide in your shader.

Here’s a few specifica areas I think this technique could help out with:

• Higher order texture interpolation with fewer samples – You’d have to preprocess textures and would spend more memory on them, but it may be worth while in some situations for higher quality results with a single texture read.
• 2D signed distance field rendering – SDF textures are great for making pseudo vector art. They do break down in some cases and at some magnification levels though. It would be interesting to see if using this technique could improve things either with higher order interpolation, or maybe by encoding (signed) distances differently. Possibly also just useful for describing 2d vector art in a polynomial form?
• 3d signed distance field rendering – Ray marching can make use of signed distance fields to render 3d objects. It can also make use of functions which can only give you inside or outside status based on a point. It would be interesting to explore encoding and decoding both of these types of functions within textures using this technique, to sample shapes during ray marching.

If you are interested in the above, or curious to learn more, here are some good links!

If you have any questions, corrections, feedback, ideas for extensions, etc please let me know! You can leave a comment below, or contact me on twitter at @Atrix256.

## Feedback / Ideas

@anders_breakin had some ideas that could possibly pan out:

1. The derivative of a Bezier curve is another Bezier curve (Derivatives of a Bézier Curve). You could encode the derivative curve(s) in a texture and use that to get the normal instead of using the central differences method. That might give higher quality normals, but should also decrease the number of texture reads needed to get the normal.
2. If you want more accuracy, you may subdivide the curve into more numerous piecewise curves. The texture interpolator only has 8 bits of decimal precision (X.8 fixed point) when interpolating, but if you give it less of the curve/surface/volume to interpolate over at a time, it seems like that would result in more effective precision.

@Vector_GL suggested reading the values in the vertex shader and using the results in the pixel shader. I think something like this could work where you read the control points in the VS, and pass them to the PS, which would then be able to ray march the tensor product surface by evaluating it without texture reads. So long as you have fewer VS instances than PS instances (the triangles are not subpixel!) that this could be an interesting thing to try. It doesn’t take advantage of the texture interpolator, but maybe there would be a way to combine the techniques. If not, this still seems very pragmatic.

I was thinking maybe this could be done via “rasterization” by drawing a bunch of unit cubes and having the PS do the ray marching. With some careful planning, you could probably use Z-testing on this too, to quickly cull hidden pixels without having to ray march them.

# Failed Experiment: The GPU Texture Sampler is Turing Complete But That Fact is Pretty Useless

While it’s true that the GPU texture sampler can evaluate digital logic circuits, it turns out there’s a much better and simpler way to evaluate logic with textures. That better and simpler way isn’t even that useful unfortunately!

This post will show the path I took from the initially intriguing possibilities to the more mundane final answer. You may be able to see mistakes in my reasoning along the way, or be able to get to the punch line sooner (:

This was meant to be an extension to a paper I wrote talking about how you can evaluate Bezier curves by storing only the control points in a texture and then sampling along the texture diagonal:
GPU Texture Sampler Bezier Curve Evaluation

The ideas from this post started with a tweet from @marcosalvi:

Because the last post showed how to evaluate arbitrary polynomials using the texture sampler, and digital circuits can be described as as polynomials in Algebraic Normal Form (ANF), that means we can use the texture sampler to evaluate digital logic circuits. Let’s check it out!

First up, we need to be able to convert logic into ANF. Oddly enough, I already have a post about how to do that, with working C++ source code, so go check it out: Turning a Truth Table Into A digital Circuit (ANF).

As an example, let’s work with a circuit that takes 3 input bits, and adds them together to make a 2 bit result. We’ll need one ANF expression per output bit. $O_0$ will be the 1’s place output bit (least significant bit), and $O_1$ will be the 2’s place output bit (most significant bit). Our 3 input bits will be u,v,w.

$O_0 = u \oplus v \oplus w$
$O_1 = uv \oplus vw \oplus uw$

If we want to use our polynomial evaluation technique, we need equations that are univariate (one variable) instead of multivariate (multiple variables). We can try just using a single variable x in place of u,v and w. Remember that in ANF, you work with polynomials mod 2 (aka $\mathbb{Z}_2$), and that XOR ($\oplus$) is addition while AND is multiplication. This gives the formulas below:

$O_0 = x + x + x = 3x$
$O_1 = xx + xx + xx = 3x^2$

The next thing we need to use the technique is to know the Bezier control points that make a Bezier curve that is equivalent to this polynomial. Since we have 3 input variables into our digital circuit, if they were all 3 multiplied together (ANDed together), we would have a cubic equation, so we need to convert those polynomials to cubic Bernstein basis polynomials. We can use the technique from the last post to get the control points of that equivalent curve.

$O_0 \begin{array}{c|c|c|c|c} 0 & 0 / 1 = 0 & 1 & 2 & 3 \\ 3 & 3 / 3 = 1 & 1 & 1 & \\ 0 & 0 / 3 = 0 & 0 & & \\ 0 & 0 / 1 = 0 & & & \\ \end{array}$

$O_1 \begin{array}{c|c|c|c|c} 0 & 0 / 1 = 0 & 0 & 1 & 3 \\ 0 & 0 / 3 = 0 & 1 & 2 & \\ 3 & 3 / 3 = 1 & 1 & & \\ 0 & 0 / 1 = 0 & & & \\ \end{array}$

Now that we have our control points, we can set up our textures to evaluate our two cubic Bezier curves (one for $O_0$, one for $O_1$). We’ll need to use 3d textures and we’ll need to set up the control points like the below, so that when we sample along the diagonal of the texture we get the points on our curves.

The picture below shows where each control point goes, to set up a cubic Bezier texture. The blue dot is the origin (0,0,0) and the red dot is the extreme value of the cube (1,1,1). The grey line represents the diagonal that we sample along.

Coincidentally, our control points for the $O_0$ curve are actually 0,1,2,3 so that cube above is what our 3d texture needs to look like for the $O_0$ curve.

Below is what the $O_1$ curve’s 3d texture looks like. Note that in reality, we could store these both in a single 3d texture, just use say the red color channel for $O_0$ and the green color channel for $O_1$.

Now that we have our textures set up let’s try it out. Let’s make a table where we have our three input bits, and we use those as texture coordinates in our 3d textures (the texture cubes above) to see what values we get. (Quick note – things are slightly simplified here vs reality. The pixel’s actual value is at a half pixel offset from the texture coordinates, so we’d be sampling between (0.5,0.5,0.5) and (1.5,1.5,1.5) instead of from (0,0,0) to (1,1,1), but we can ignore that detail for now to make this stuff clearer.)

$\begin{array}{|c|c|c|c|c|} \hline u & v & w & O_1 & O_0 \\ \hline 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 & 2 \\ 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 2 \\ 1 & 1 & 0 & 1 & 2 \\ 1 & 1 & 1 & 3 & 3 \\ \hline \end{array}$

Now, let’s modulus the result by 2 since ANF expects to work mod 2 ($\mathbb{Z}_2$ to be more precise), and put the decimal value of the result next to it.

$\begin{array}{|c|c|c|c|c|c|} \hline u & v & w & O_1 \% 2 & O_0 \% 2 & Result \\ \hline 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 1 & 0 & 2 \\ 1 & 0 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 1 & 0 & 2 \\ 1 & 1 & 0 & 1 & 0 & 2 \\ 1 & 1 & 1 & 1 & 1 & 3 \\ \hline \end{array}$

It worked! The result value is the count of the input bits set to 1.

Unfortunately we have a problem. When we converted the multivariate equation into a univariate equation, we just replaced u,v,w with x. This is only valid if the function is symmetric – if u,v,w can be interchanged with each other and not affect the result of the function. This bit adding digital circuit we made happened to have that property, but most digital circuits do not have that property – most of the time, not all input bits are treated equal. If we made a circuit that added two 2-bit numbers and have a 3-bit result for instance, the high bits of the input numbers have a very different meaning than the low bits and this technique falls apart. (Quick note – we are actually doing the reverse of the polynomial blossoming thing i mentioned in the last post. Blossoming is the act of taking a univariate function and breaking it into a multivariate function that is linear in each variable. The term is called symmetric multiaffine equation if you want to find out more about that.)

This turns out not to be a deal breaker though because it turns out we didn’t have to do a lot of the work that we did to get these volume textures. It turns out we don’t need to calculate the Bezier curve control points, and we don’t even need to make an ANF expression of the digital circuit we want to evaluate.

Let’s recap what we are trying to do. We have 3 input values which are either 0 or 1, we have a 3d texture which is 2x2x2, and we are ultimately using those 3 input values as texture coordinates (u,v,w) to do a lookup into a texture to get a single bit value out.

Here’s a big aha moment. We are just making a binary 3d lookup table, so can take our truth table of whatever it is we are trying to do, and then directly make the final 3d textures described above.

Not only does it work for the example we gave, with a lot less effort and math, it also works for the broken case I mentioned of the function not being symmetric, and not all input bits being equal.

Something else to note is that because we are only sampling at 0 or 1, we don’t need linear texture interpolation at all and can use nearest neighbor (point) sampling on our textures for increased performance. Also because the texture data is just a binary 0 or 1, we could use 1 bit textures.

The second aha moment comes up when you realize that all we are doing is taking some number of binary input bits, using those as texture coordinates, and then looking up a value in a texture.

You can actually use a 1D texture for this!

You take your input bits and form an integer, then look up the value at that pixel location. You build your texture lookup table using this same mapping.

So… it turns out this technique led to a dead end. It was just extra complexity to do nothing special.

Before it all fell apart, I was also thinking this might be a good avenue for doing homomorphic encryption on the GPU, but I don’t believe this aids that at all. (Super Simple Symmetric Leveled Homomorphic Encryption Implementation)

# But Wait – Analog Valued Logic?

One thought I had while all this was unraveling was that maybe this was still useful, because if you put an analog value in (not a 0 or 1, but say 0.3), that maybe this could be used as a sort of “Fuzzy Logic” type logic evaluation.

Unfortunately, it looks like that doesn’t work either!

You can see how it breaks down and some more info here:
Computer Science Stack Exchange: Using analog values with Algebraic Normal Form?

# Oh Well

Sometimes when exploring new frontiers (even if they are just new to us) we hit dead ends, our ideas fail etc. It happens. It’s part of the learning process, and also is useful sometimes to know what doesn’t work and why, instead of just always knowing what DOES work.

Anyways… posts on using the texture sampler for calculating points on data surfaces and data volumes are coming next (:

To give a brief taste of how that is going to play out:

• Doing a single texture read of a 3d RGBA texture can give you a triquadratic interpolated value.
• Alternately, doing a single texture read of a 3d RGBA texture can give you a bicubic interpolated value.

# The Secret to Writing Fast Code / How Fast Code Gets Slow

This is a “soft tech” post. If that isn’t your thing, don’t worry, I’ll be returning to some cool “hard tech” and interesting algorithms after this. I’ve been abusing the heck out of the GPU texture sampler lately, so be on the lookout for some posts on that soon (;

I’m about to show you some of the fastest code there is. It’s faster than the fastest real time raytracer, it’s faster than Duff’s Device.

Heck, despite the fact that it runs on a classical computer, it runs faster than Shor’s Algorithm which uses quantum computing to factor integers so quickly that it breaks modern cryptographic algorithms.

This code also runs faster than Grover’s Algorithm which is another quantum algorithm that can search an unsorted list in O(sqrt(N)).

Even when compiled in debug it runs faster than all of those things.

Are you ready? here it is…

// Some of the fastest code the world has ever seen
int main (int argc, char **argc)
{
return 0;
}


Yes, the code does nothing and that is precisely why it runs so fast.

# The Secret to Writing Fast Code

The secret to writing fast code, no matter what you are writing is simple: Don’t do anything that is too slow.

Let’s say you started with a main() function like i showed above and you decided you want to make a real time raytracer that runs on the CPU.

First thing you do is figure out what frame rate you want it to run at, at the desired resolution. From there, you know how many milliseconds you have to render each frame, and now you have a defined budget you need to stay inside of. If you stay in that budget, you’ll consider it a real time raytracer. If you go outside of that budget, it will no longer be real time, and will be a failed program.

You may get camera control working and primary rays intersecting a plane, and find you’ve used 10% of your budget and 90% of the budget remains. So far so good.

Next up you add some spheres and boxes, diffuse and specular shade them with a directional light and a couple point lights. You find that you’ve used 40% of your budget, and 60% remains. We are still looking good.

Next you decide you want to add reflection and refraction, allowing up to 3 ray bounces. You find you are at 80% of your budget and are still looking good. We are still running fast enough to be considered real time.

Now you say to yourself “You know what? I’m going to do 4x super sampling for anti aliasing!”, so you shoot 4 rays out per pixel instead of 1 and average them.

You profile and uh oh! You are at 320% of your budget! Your ray tracer is no longer real time!

What do you do now? Well, hopefully it’s obvious: DON’T DO THAT, IT’S TOO SLOW!

So you revert it and maybe drop in some FXAA as a post processing pass on your render each frame. Now you are at 95% of your budget let’s say.

Now you may want to add another feature, but with only 5% of your budget left you probably don’t have much performance to spare to do it.

So, you implement whatever it is, find that you are at 105% of your budget.

Unlike the 4x super sampling, which was 220% overbudget, this new feature being only 5% over budget isn’t THAT much. At this point you could profile something that already exists (maybe even your new feature) and see if you can improve it’s performance, or if you can find some clever solution that gives you a performance boost, at the cost of things you don’t care about, you can do that to get some performance back. This is a big part of the job as a successful programmer / software engineer – make trade offs where you gain benefits you care about, at the cost of things you do not care about.

At this point, you can also decide if this new feature is more desired than any of the existing features. If it is, and you can cut an old feature you don’t care about anymore, go for it and make the trade.

Rinse and repeat this process with new features and functionality until you have the features you want, that fit within the performance budget you have set.

Follow this recipe and you too will have your very own real time raytracer (BTW related:Making a Ray Traced Snake Game in Shadertoy).

Maintaining a performance budget isn’t magic. It’s basically subtractive synthesis. Carve time away from your performance budget by adding a feature, then optimize or remove features if you are over budget. Rinse and repeat until the sun burns out.

Ok, so if it’s so easy, why do we EVER have performance problems?

## How Fast Code Gets Slow

Performance problems come up when we are not paying attention. Sometimes we cause them for ourselves, and sometimes things outside of our control cause them.

The biggest way we cause performance problems for ourselves is by NOT MEASURING.

If you don’t know how your changes affect performance, and performance is something you care about, you are going to have a bad time.

If you care about performance, measure performance regularly! Profile before and after your changes and compare the differences. Have automated tests that profile your software and report the results. Understand how your code behaves in the best and worst case. Watch out for algorithms that sometimes take a lot longer than their average case. Stable algorithms make for stable experiences (and stable frame rates in games). This is because algorithms that have “perf spikes” sometimes line up on the same frame, and you’ll have more erratic frame rate, which makes your game seem much worse than having a stable but lower frame rate.

But, again, performance problems aren’t always the programmers fault. Sometimes things outside of our control change and cause us perf problems.

Well, let’s say that you are tasked with writing some very light database software which keeps track of all employee’s birthdays.

Maybe you use a hash map to store birthdays. The key is the string of the person’s name, and the value is a unix epoch timestamp.

Simple and to the point. Not over-engineered.

Now, someone else has a great idea – we have this database software you wrote, what if we use it to keep track of all of our customers and end user birthdays as well?

So, while you are out on vacation, they make this happen. You come back and the “database” software you made is running super slow. There are hundreds of thousands of people stored in the database, and it takes several seconds to look up a single birthday. OUCH!

So hotshot, looks like your code isn’t so fast huh? Actually no, it’s just that your code was used for something other than the original intended usage case. If this was included in the original specs, you would have done something different (and more complex) to handle this need.

This was an exaggerated example, but this sort of thing happens ALL THE TIME.

If you are working on a piece of software, and the software requirements change, it could turn any of your previous good decisions into poor decisions in light of the new realities.

However, you likely don’t have time to go back and re-think and possibly re-work every single thing you had written up to that point. You move onward and upward, a little more heavy hearted.

The target moved, causing your code to rot a bit, and now things are likely in a less than ideal situation. You wouldn’t have planned for the code you have with the info you have now, but it’s the code you do have, and the code you have to stick with for the time being.

Every time that happens, you incur a little more tech debt / code complexity and likely performance problems as well.

You’ll find that things run a little slower than they should, and that you spend more time fighting symptoms with small changes and somewhat arbitrary rules – like telling people not to use name lengths more than 32 characters for maximum performance of your birthday database.

Unfortunately change is part of life, and very much part of software development, and it’s impossible for anyone to fully predict what sort of changes might be coming.

Those changes are often due to business decisions (feedback on product, jockying for a new position in the marketplace, etc), so are ultimately what give us our paychecks and are ultimately good things. Take it from me, who has worked at ~7 companies in 15 years. Companies that don’t change/adapt die.

So, change sucks for our code, but it’s good for our wallets and keeps us employed 😛

Eventually the less than ideal choices of the past affecting the present will reach some threshold where something will have to be done about it. This will likely happen at the point that it’s easier to refactor some code, than to keep fighting the problems it’s creating by being less than ideal, or when something that really NEEDS to happen CAN’T happen without more effort than the refactor would take.

When that happens, the refactor comes in, where you DO get to go back and rethink your decisions, with knowledge of the current realities.

The great thing about the refactor is that you probably have a lot of stuff that your code is doing which it doesn’t really even NEED to be doing.

Culling that dead functionality feels great, and it’s awesome watching your code become simple again. It’s also nice not having to explain why that section of code behaves the way it does (poorly) and the history of it coming to be. “No really, I do know better, but…!!!”

One of the best feelings as a programmer is looking at a complex chunk of code that has been a total pain, pressing the delete key, and getting a little bit closer back to the fastest code in the world:

// Some of the fastest code the world has ever seen
int main (int argc, char **argc)
{
return 0;
}


PS: Another quality of a successful engineer is being able to constantly improve software as it’s touched. If you are working in an area of code, and you see something ugly that can be fixed quickly and easily, do it while you are there. Since the only constant in software development is change, and change causes code quality to continually degrade, make yourself a force of continual code improvement and help reverse the flow of the code flowing into the trash can.

## Engines

In closing, I want to talk about game engines – 3rd party game engines, and re-using an engine from a different project. This also applies to using middleware.

Existing engines are great in that when you and your team know how to use them, you can get things set up very quickly. It lets you hit the ground running.

However, no engine is completely generic. No engine is completely flexible.

That means that when you use an existing engine, there will be some amount of features and functionality which were made without your specific usage case in mind.

You will be stuck in the world where from day 1 you are incurring the tech debt type problems I describe above, but you will likely be unable or unwilling to refactor everything to suit your needs specifically.

I don’t mention this to say that engines are bad. Lots of successful games have used engines made by other people, or re-used engines from previous projects.

However, it’s a different kind of beast using an existing engine.

Instead of making things that suit your needs, and then using them, you’ll be spending your time figuring out how to use the existing puzzle pieces to do what you want. You’ll also be spending time backtracking as you hit dead ends, or where your first cobbled together solution didn’t hold up to the new realities, and you need to find a new path to success that is more robust.

Just something to be aware of when you are looking at licensing or re-using an engine, and thinking that it’ll solve all your problems and be wonderful. Like all things, it comes at a cost!

Using an existing engine does put you ahead of the curve: At day 1 you already have several months of backlogged technical debt!

Unfortunately business realities mean we can’t all just always write brand new engines all the time. It’s unsustainable

Agree / Disagree / Have something to say?