## Solving N equations and N unknowns: The Fine Print (Gauss Jordan Elimination)

In basic algebra we were taught that if we have three unknowns (variables), it takes three equations to solve for them.

There’s some fine print though that isn’t talked about until quite a bit later.

Let’s have a look at three unknowns in two equations:

$A + B + C = 2 \\ B = 5$

If we just need a third equation to solve this, why not just modify the second equation to make a third?

$-B = -5$

That obviously doesn’t work, because it doesn’t add any new information! If you try it out, you’ll find that adding that equation doesn’t get you any closer to solving for the variables.

So, it takes three equations to solve for three unknowns, but the three equations have to provide unique, meaningful information. That is the fine print.

How can we know if an equation provides unique, meaningful information though?

It turns out that linear algebra gives us a neat technique for simplifying a system of equations. It actually solves for individual variables if it’s able to, and also gets rid of redundant equations that don’t add any new information.

This simplest form is called the Reduced Row Echelon Form (Wikipedia) which you may also see abbreviated as “rref” (perhaps a bit of a confusing term for programmers) and it involves you putting the equations into a matrix and then performing an algorithm, such as Gauss–Jordan elimination (Wikipedia) to get the rref.

# Equations as a Matrix

Putting n set of equations into a matrix is really straight forward.

Each row of a matrix is a separate equation, and each column represents the coefficient of a variable.

Let’s see how with this set of equations:

$3x + y = 5\\ 2y = 7\\ y + z = 14$

Not every equation has every variable in it, so let’s fix that by putting in zero terms for the missing variables, and let’s make the one terms explicit as well:

$3x + 1y + 0z = 5\\ 0x + 2y + 0z = 7\\ 0x + 1y + 1z = 14$

Putting those equations into a matrix looks like this:

$\left[\begin{array}{rrr} 3 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 1 & 1 \end{array}\right]$

If you also include the constants on the right side of the equation, you get what is called an augmented matrix, which looks like this:

$\left[\begin{array}{rrr|r} 3 & 1 & 0 & 5 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]$

# Reduced Row Echelon Form

Wikipedia explains the reduced row echelon form this way:

• all nonzero rows (rows with at least one nonzero element) are above any rows of all zeroes (all zero rows, if any, belong at the bottom of the matrix), and
• the leading coefficient (the first nonzero number from the left, also called the pivot) of a nonzero row is always strictly to the right of the leading coefficient of the row above it.
• Every leading coefficient is 1 and is the only nonzero entry in its column.

This is an example of a 3×5 matrix in reduced row echelon form:
$\left[\begin{array}{rrrrr} 1 & 0 & a_1 & 0 & b_1 \\ 0 & 1 & a_2 & 0 & b_2 \\ 0 & 0 & 0 & 1 & b_3 \end{array}\right]$

Basically, the lower left triangle of the matrix (the part under the diagonal) needs to be zero, and the first number in each row needs to be one.

Looking back at the augmented matrix we made:

$\left[\begin{array}{rrr|r} 3 & 1 & 0 & 5 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]$

If we put it into reduced row echelon form, we get this:

$\left[\begin{array}{rrr|r} 1 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 3.5 \\ 0 & 0 & 1 & 10.5 \end{array}\right]$

There’s something really neat about the reduced row echelon form. If we take the above augmented matrix and turn it back into equations, look what we get:

$1x + 0y + 0z = 0.5\\ 0x + 1y + 0z = 3.5\\ 0x + 0y + 1z = 10.5$

Or if we simplify that:

$x = 0.5\\ y = 3.5\\ z = 10.5$

Putting it into reduced row echelon form simplified our set of equations so much that it actually solved for our variables. Neat!

How do we put a matrix into rref? We can use Gauss–Jordan elimination.

# Gauss–Jordan Elimination

Gauss Jordan Elimination is a way of doing operations on rows to be able to manipulate the matrix to get it into the desired form.

It’s often explained that there are three row operations you can do:

• Type 1: Swap the positions of two rows.
• Type 2: Multiply a row by a nonzero scalar.
• Type 3: Add to one row a scalar multiple of another.

You might notice that the first two rules are technically just cases of using the third rule. I find that easier to remember, maybe you will too.

The algorithm for getting the rref is actually pretty simple.

1. Starting with the first column of the matrix, find a row which has a non zero in that column, and make that row be the first row by swapping it with the first row.
2. Multiply the first row by a value so that the first column has a 1 in it.
3. Subtract a multiple of the first row from every other row in the matrix so that they have a zero in the first column.

You’ve now handled one column (one variable) so move onto the next.

1. Continuing on, we consider the second column. Find a row which has a non zero in that column and make that row be the second row by swapping it with the second row.
2. Multiply the second row by a value so that the second column has a 1 in it.
3. Subtract a multiple of the second row from every other row in the matrix so that they have a zero in the second column.

You repeat this process until you either run out of rows or columns, at which point you are done.

Note that if you ever find a column that has only zeros in it, you just skip that row.

Let’s work through the example augmented matrix to see how we got it into rref. Starting with this:

$\left[\begin{array}{rrr|r} 3 & 1 & 0 & 5 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]$

We already have a non zero in the first column, so we multiply the top row by 1/3 to get this:

$\left[\begin{array}{rrr|r} 1 & 0.3333 & 0 & 1.6666 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]$

All the other rows have a zero in the first column so we move to the second row and the second column. The second row already has a non zero in the second column, so we multiply the second row by 1/2 to get this:

$\left[\begin{array}{rrr|r} 1 & 0.3333 & 0 & 1.6666 \\ 0 & 1 & 0 & 3.5 \\ 0 & 1 & 1 & 14 \end{array}\right]$

To make sure the second row is the only row that has a non zero in the second column, we subtract the second row times 1/3 from the first row. We also subtract the second row from the third row. That gives us this:

$\left[\begin{array}{rrr|r} 1 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 3.5 \\ 0 & 0 & 1 & 10.5 \end{array}\right]$

Since the third row has a 1 in the third column, and all other rows have a 0 in that column we are done.

That’s all there is to it! We put the matrix into rref, and we also solved the set of equations. Neat huh?

You may notice that the ultimate rref of a matrix is just the identity matrix. This is true unless the equations can’t be fully solved.

# Overdetermined, Underdetermined & Inconsistent Equations

Systems of equations are overdetermined when they have more equations than unknowns, like the below which has three equations and two unknowns:

$x + y = 3 \\ x = 1 \\ y = 2 \\$

Putting that into (augmented) matrix form gives you this:

$\left[\begin{array}{rr|r} 1 & 1 & 3 \\ 1 & 0 & 1 \\ 0 & 1 & 2 \end{array}\right]$

If you put that into rref, you end up with this:

$\left[\begin{array}{rr|r} 1 & 0 & 1 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{array}\right]$

The last row became zeroes, which shows us that there was redundant info in the system of equations that disappeared. We can easily see that x = 1 and y = 2, and that satisfies all three equations.

Just like we talked about in the opening of this post, if you have equations that don’t add useful information beyond what the other equations already give, it will disappear when you put it into rref. That made our over-determined system become just a determined system.

What happens though if we change the third row in the overdetermined system to be something else? For instance, we can say y=10 instead of y=2:

$x + y = 3 \\ x = 1 \\ y = 10 \\$

The augmented matrix for that is this:

$\left[\begin{array}{rr|r} 1 & 1 & 3 \\ 1 & 0 & 1 \\ 0 & 1 & 10 \end{array}\right]$

If we put that in rref, we get the identity matrix out which seems like everything is ok:

$\left[\begin{array}{rr|r} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right]$

However, if we turn it back into a set of equations, we can see that we have a problem:

$x = 0 \\ x = 0 \\ 0 = 1 \\$

The result says that 0 = 1, which is not true. Having a row of “0 = 1” in rref is how you detect that a system of equations is inconsistent, or in other words, that the equations give contradictory information.

A system of equations can also be underderdetermined, meaning there isn’t enough information to solve the equations. Let’s use the example from the beginning of the post:

$A + B + C = 2 \\ B = 5 \\$

In an augmented matrix, that looks like this:

$\left[\begin{array}{rrr|r} 1 & 1 & 1 & 2 \\ 0 & 1 & 0 & 5 \\ \end{array}\right]$

Putting that in rref we get this:

$\left[\begin{array}{rrr|r} 1 & 0 & 1 & -3 \\ 0 & 1 & 0 & 5 \\ \end{array}\right]$

Converting the matrix back into equations we get this:

$A + C = -3 \\ B = 5 \\$

This says there isn’t enough information to fully solve the equations, and shows how A and C are related, even though B is completely determined.

Note that another way of looking at this is that “A” and “C” are “free variables”. That means that if your equations specify constraints, that you are free to choose a value for either A or C. If you choose a value for one, the other becomes defined. B is not a free variable because it’s value is determined.

Let’s finish the example from the beginning of the post, showing what happens when we “make up” an equation by transforming one of the equations we already have:

$A + B + C = 2 \\ B = 5\\ -B = -5$

The augmented matrix looks like this:

$\left[\begin{array}{rrr|r} 1 & 1 & 1 & 2 \\ 0 & 1 & 0 & 5 \\ 0 & -1 & 0 & -5 \\ \end{array}\right]$

Putting it in rref, we get this:

$\left[\begin{array}{rrr|r} 1 & 0 & 1 & -3 \\ 0 & 1 & 0 & 5 \\ 0 & 0 & 0 & 0 \\ \end{array}\right]$

Which as you can see, our rref matrix is the same as it was without the extra “made up” equation besides the extra row of zeros in the result.

The number of non zero rows in a matrix in rref is known as the rank of the matrix. In these last two examples, the rank of the matrix was two in both cases. That means that you can tell if adding an equation to a system of equations adds any new, meaningful information or not by seeing if it changes the rank of the matrix for the set of equations. If the rank is the same before and after adding the new equation, it doesn’t add anything new. If the rank does change, that means it does add new information.

This concept of “adding new, meaningful information” actually has a formalized term: linear independence. If a new equation is linearly independent from the other equations in the system, it will change the rank of the rref matrix, else it won’t.

The rank of a matrix for a system of equations just tells you the number of linearly independent equations there actually are, and actually gives you what those equations are in their simplest form.

Lastly I wanted to mention that the idea of a system of equations being inconsistent is completely separate from the idea of a system of equations being under determined or over determined. They can be both over determined and inconsistent, under determined and inconsistent, over determined and consistent or under determined and consistent . The two ideas are completely separate, unrelated things.

# Inverting a Matrix

Interestingly, Gauss-Jordan elimination is also a common way for efficiently inverting a matrix!

How you do that is make an augmented matrix where on the left side you have the matrix you want to invert, and on the right side you have the identity matrix.

Let’s invert a matrix I made up pretty much at random:

$\left[\begin{array}{rrr|rrr} 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 3 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1\\ \end{array}\right]$

Putting that matrix in rref, we get this:

$\left[\begin{array}{rrr|rrr} 1 & 0 & 0 & 1 & 0 & -1 \\ 0 & 1 & 0 & 0 & 0.3333 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1\\ \end{array}\right]$

The equation on the right is the inverse of the original matrix we had on the left!

You can double check by using an online matrix inverse calculator if you want: Inverse Matrix Calculator

Note that not all matrices are invertible though! When you get an inconsistent result, or the result is not the identity matrix, it wasn’t invertible.

# Solving Mx = b

Let’s say that you have two vectors x and b, and a matrix M. Let’s say that we know the matrix M and the vector b, and that we are trying to solve for the vector x.

This comes up more often that you might suspect, including when doing “least squares fitting” of an equation to a set of data points (more info on that: Incremental Least Squares Curve Fitting).

One way to solve this equation would be to calculate the inverse matrix of M and multiply that by vector b to get vector x:

$Mx = b\\ x = M^{-1} * b$

However, Gauss-Jordan elimination can help us here too.

If we make an augmented matrix where on the left we have M, and on the right we have b, we can put the matrix into rref, which will essentially multiply vector b by the inverse of M, leaving us with the vector x.

For instance, on the left is our matrix M that scales x,y,z by 2. On the right is our vector b, which is the matrix M times our unknown vector x:

$\left[\begin{array}{rrr|r} 2 & 0 & 0 & 2 \\ 0 & 2 & 0 & 4 \\ 0 & 0 & 2 & 8 \\ \end{array}\right]$

Putting that into rref form we get this:

$\left[\begin{array}{rrr|r} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 4 \\ \end{array}\right]$

From this, we know that the value of vector x is the right side of the augmented matrix: (1,2,4)

This only works when the matrix is invertible (aka when the rref goes to an identity matrix).

# Source Code

Here is some C++ source code which does Gauss-Jordan elimination. It’s written mainly to be readable, not performant!

#include <stdio.h>
#include <array>
#include <vector>
#include <assert.h>

// Define a vector as an array of floats
template<size_t N>
using TVector = std::array<float, N>;

// Define a matrix as an array of vectors
template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

// Helper function to fill out a matrix
template <size_t M, size_t N>
TMatrix<M, N> MakeMatrix (std::initializer_list<std::initializer_list<float>> matrixData)
{
TMatrix<M, N> matrix;

size_t m = 0;
assert(matrixData.size() == M);
for (const std::initializer_list<float>& rowData : matrixData)
{
assert(rowData.size() == N);
size_t n = 0;
for (float value : rowData)
{
matrix[m][n] = value;
++n;
}
++m;
}

return matrix;
}

// Make a specific row have a 1 in the colIndex, and make all other rows have 0 there
template <size_t M, size_t N>
bool MakeRowClaimVariable (TMatrix<M, N>& matrix, size_t rowIndex, size_t colIndex)
{
// Find a row that has a non zero value in this column and swap it with this row
{
// Find a row that has a non zero value
size_t nonZeroRowIndex = rowIndex;
while (nonZeroRowIndex < M && matrix[nonZeroRowIndex][colIndex] == 0.0f)
++nonZeroRowIndex;

// If there isn't one, nothing to do
if (nonZeroRowIndex == M)
return false;

// Otherwise, swap the row
if (rowIndex != nonZeroRowIndex)
std::swap(matrix[rowIndex], matrix[nonZeroRowIndex]);
}

// Scale this row so that it has a leading one
float scale = 1.0f / matrix[rowIndex][colIndex];
for (size_t normalizeColIndex = colIndex; normalizeColIndex < N; ++normalizeColIndex)
matrix[rowIndex][normalizeColIndex] *= scale;

// Make sure all rows except this one have a zero in this column.
// Do this by subtracting this row from other rows, multiplied by a multiple that makes the column disappear.
for (size_t eliminateRowIndex = 0; eliminateRowIndex < M; ++eliminateRowIndex)
{
if (eliminateRowIndex == rowIndex)
continue;

float scale = matrix[eliminateRowIndex][colIndex];
for (size_t eliminateColIndex = 0; eliminateColIndex < N; ++eliminateColIndex)
matrix[eliminateRowIndex][eliminateColIndex] -= matrix[rowIndex][eliminateColIndex] * scale;
}

return true;
}

// make matrix into reduced row echelon form
template <size_t M, size_t N>
void GaussJordanElimination (TMatrix<M, N>& matrix)
{
size_t rowIndex = 0;
for (size_t colIndex = 0; colIndex < N; ++colIndex)
{
if (MakeRowClaimVariable(matrix, rowIndex, colIndex))
{
++rowIndex;
if (rowIndex == M)
return;
}
}
}

int main (int argc, char **argv)
{
auto matrix = MakeMatrix<3, 4>(
{
{ 2.0f, 0.0f, 0.0f, 2.0f },
{ 0.0f, 2.0f, 0.0f, 4.0f },
{ 0.0f, 0.0f, 2.0f, 8.0f },
});

GaussJordanElimination(matrix);

return 0;
}


I hope you enjoyed this post and/or learned something from it. This is a precursor to an interesting (but maybe obscure) topic for my next blog post, which involves a graphics / gamedev thing.

Any comments, questions or corrections, let me know in the comments below or on twitter at @Atrix256

## Orthogonal Projection Matrix Plainly Explained

“Scratch a Pixel” has a really nice explanation of perspective and orthogonal projection matrices.

It inspired me to make a very simple / plain explanation of orthogonal projection matrices that hopefully will help them be less opaque for folks and more intuitive.

Original article: Scratch A Pixel: The Perspective and Orthographic Projection Matrix

# Let’s Get To It!

The whole purpose of an orthogonal matrix is to take x,y and z as input and output x,y and z such that valid points on the screen will have x,y,z values between -1 and 1.

If we transform a point and get an x,y or z that is outside of that range, we know the point is outside of the screen either because it’s too far left, right, up or down, or because it’s too close or too far on the z axis.

Let’s think about how we’d do this, thinking only about the x coordinate for now.

To map some range of x values from -1 to 1, we’ll need to decide on what x value maps to -1 and what x value maps to 1. We’ll call these “left” and “right”.

Given a left and right value, and an x value we want to map to the range, perhaps the most straight forward way to do it would be this:

$XOut = \frac{X-Left}{Right-Left} * 2 - 1$

The division calculates the percentage of how far X is between left and right. Multiplying that by 2 and subtracting 1 changes it so instead of valid points being from 0 to 1 (aka from 0% to 100%), they are instead between -1 and 1.

Let’s change this formula so that there is one term that is multiplied by X and another term that has everything else. (Wondering why? It’s because I’m cheating and know the final form. Don’t feel bad if it isn’t intuitive why we’d do this!)

$\frac{X-Left}{Right-Left} * 2 - 1 =\\ \\ \frac{2(X-Left)}{Right-Left} - 1 =\\ \\ \frac{2X-2*Left}{Right-Left} - 1 =\\ \\ \frac{2X-2*Left}{Right-Left} - \frac{Right-Left}{Right-Left} =\\ \\ \frac{2X-2*Left-(Right-Left)}{Right-Left} =\\ \\ \frac{2X-2*Left-Right+Left}{Right-Left} =\\ \\ \frac{2X-Left-Right}{Right-Left} =\\ \\ \frac{2X}{Right-Left} - \frac{Right+Left}{Right-Left} =\\ \\ \frac{2}{Right-Left}X - \frac{Right+Left}{Right-Left}\\$

Setting up the formula this way allows us to transform the x component of an (x,y,z,1) point using a dot product:

$(x,y,z,1) \cdot (\frac{2}{Right-Left},0,0,-\frac{Right+Left}{Right-Left}) = \frac{2}{Right-Left}X - \frac{Right+Left}{Right-Left}$

A dot product is what happens during matrix multiplication, so if we put this into a 4×4 matrix, we get the same result. Let’s check that out.

We start with an identity matrix. If we use it to transform an (x,y,z,1) point, we get the same point as output aka nothing happens.

$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}$

Now let’s put the x transform we came up with into the matrix:

$\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -\frac{Right+Left}{Right-Left} & 0 & 0 & 1 \\ \end{bmatrix}$

If we use that matrix to transform an (x,y,z,1) point, it will transform our x component as we described (valid ranges of x that are between left and right will be between -1 and 1), while leaving the other components of the point alone.

As you might imagine, it’s pretty simple to get our formulas for y and z as well. Starting with the x formula, we can just change x with y and z, and right/left with top/bottom and far/near.

$XOut = \frac{2}{Right-Left}X - \frac{Right+Left}{Right-Left} \\ \\ YOut = \frac{2}{Top-Bottom}Y - \frac{Top+Bottom}{Top-Bottom} \\ \\ ZOut = \frac{2}{Far-Near}Z - \frac{Far+Near}{Far-Near}$

We can put those into our matrix to get a full orthographic projection matrix.

$\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & \frac{2}{Far-Near} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}$

There we go, that’s all there is to making an orthographic projection matrix. It’s whole purpose is to convert x,y,z values to be between -1 and 1 so that the GPU knows whether points are inside our outside the screen – and thus whether they need to be clipped or not.

## Variations

While the projection matrix we made is a valid orthographic projection matrix in OpenGL, we actually need a tweak for it to be valid for DirectX. The reason for this is because while in OpenGL the clip space for z is between -1 and 1, it’s actually between 0 and 1 for DirectX!

If you leave off the *2-1 for the z formula, but leave it for x and y, you’ll end up with a matrix like this one:

$\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & \frac{1}{Near-Far} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Near}{Near-Far} & 1 \\ \end{bmatrix}$

Another variation you’ll see is a version where the camera is centered on the origin for the x and y axis. In other words, left = -right, and top = -bottom. When that is true, right+left and top+bottom become zero which simplifies the matrix to this:

$\begin{bmatrix} \frac{2}{Width} & 0 & 0 & 0 \\ 0 & \frac{2}{Height} & 0 & 0 \\ 0 & 0 & \frac{2}{Far-Near} & 0 \\ 0 & 0 & -\frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}$

Another variation you’ll see is that the matrix is transposed. You’ll see this when switching between pre and post multiplication, or when switching from column major matrices to row matrices. Either is valid and it’s basically just a notation and convention thing. Here is the origional matrix we made transposed.

$\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & -\frac{Right+Left}{Right-Left} \\ 0 & \frac{2}{Top-Bottom} & 0 & -\frac{Top+Bottom}{Top-Bottom} \\ 0 & 0 & \frac{2}{Far-Near} & -\frac{Far+Near}{Far-Near} \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}$

Lastly, the above matrices were all for a “left handed” system. That means that it assumes the positive x axis goes to the right, the positive y axis goes up, and the positive z goes into your screen (aka, the camera is looking down the positive z axis). Positive Z values will map to the valid -1 to 1 range, while negative z values will be outside the valid range.

A variation on the orthographic projection matrix we made that you’ll see is the matrix being a “right handed” matrix which is the same as the left handed matrix, except that the positive z axis goes out from your screen (aka the camera is looking down the negative z axis). Negative Z values will map to the valid -1 to 1 range, while positive z values will be outside the valid range.

To switch the handedness of the matrix, you just flip the sign of the element at (3,3), so here is our original orthographic projection matrix, but converted to right handed instead of left handed.

$\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & -\frac{2}{Far-Near} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}$

You may also just see the denominator changed from $Far-Near$ to $Near-Far$ which has the same effect, and would give you something like this:

$\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & \frac{2}{Near-Far} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}$

Fun trivia: the term “sinister” comes from latin, meaning “left handed”. So, when talking to someone about their graphics engine, you can ask them whether or not they use sinister projection 😛

Scratch A Pixel: The Perspective and Orthographic Projection Matrix

D3DXMatrixOrthoRH (DirectX) – shows the resulting matrix. Also links to left handed and off center variants.

glOrtho (OpenGL) – shows resulting matrix.

## Plastic Bag Ban – Semi Reusable Bag Kiosks a Better Solution?

I an in favor of people generating less trash, and have been amazed that where I live (southern California), people have taken a “plastic bag ban” so well in stride. It felt like one of those things where we couldn’t live without it, but it turns out we can quite easily. Maybe there’s a lesson there, but that’s not the point of this post so I’ll get back to it (;

Where I live, you can BUY plastic bags for use from the grocery store for about 10 cents each, which is cost prohibitive enough that people tend to bring their own bags.

Now let me tell you, I’m no dummy. I am pretty sure those re-usable plastic bags everyone has gotten use far more resources to make and create more pollution in the process so seem like they will take some usage before they hit a break even point compared to the disposable bags. Google seems to give mixed results, saying that some bags are better than others: Google: are reusable bags better for the environment.

I think that the sentiment of what’s going on is good for sure though, and I’m hoping it’s a net positive (?), and I think that there is some winning to be had here environmentally – if done right.

However, let’s put the environmental concerns on the back burner for a second.

Grocery store baggers now get handed all sorts of differently shaped bags of various capacity, and folks often want their stuff bagged specific ways to make sure it all fits in however many they brought. It’s also very common for people to forget their bags at home, leave them in the car, etc. This makes things a bit awkward and definitely not as fast and smooth as it used to be with disposable bags.

My idea to address these issues is this:

• When you check out from the store, there are semi-reusable bags in bails similar to how disposable bags are in bails now. You pay to use the bags, but the cost is mostly a refundable deposit.
• You take the bags home, unload, etc like normal, but ideally you do not throw the bags away.
• The next time you come to the store, you feed your empty bags into a kiosk at the front of the store (similar to the coin to cash machines many currently have) and it prints you out a voucher saying how many bags you returned.
• The machine bails the bags back up for use by the store employees at the register, or perhaps some off site service does this and washes the bags, gets rid of damaged ones etc.
• You shop as normal, but when you check out, you give them your bag voucher. If the voucher is for fewer bags than you need for this trip, you can pay for some extra ones. If your voucher is for more than you need this time, you get the deposit refunded on the ones you don’t need.

The idea here is that at the end of the day…

• The baggers have a uniform type of bag to work with, which makes their job easier and allows them to bag more quickly (like the old days!).
• The bags are ideally as environmentally friendly made as possible (heck, they could be made from corn, hemp, burlap, whatever).
• It isn’t a big deal if you forgot your bags at home or in the car – you pay a little extra to use new bags, but when you return them, you get most of that back.

There are some obvious issues to work through, including:

• Getting those kiosks into the stores, and ideally having all stores use compatible bags.
• Dealing with damaged or dirty bags. It would be nice if the machine or whatever process bails them is able to detect and segregate them somehow, and ideally have some washing / minimal recycling process off site to make new ones (like pulp and re-form?)
• The plastic bags are a profit center for grocery stores now. They would need sufficient incentive, or pressure from customers to make it happen.

So, that’s my idea. Environmentally friendly semi-reusable bags – but with the convenience we all have come to enjoy from disposable bags. The best of both worlds.

As a video game programmer this is far outside of my interest and ability, so please take this business idea if you have the desire and the means. Let’s make it happen!

## Neural Network Recipe: Recognize Handwritten Digits With 95% Accuracy

This post is a recipe for making a neural network which is able to recognize hand written numeric digits (0-9) with 95% accuracy.

The intent is that you can use this recipe (and included simple C++ code, and interactive web demo!) as a starting point for some hands on experimentation.

A recent post of mine talks about all the things used in this recipe so give it a read if you want more info about anything: How to Train Neural Networks With Backpropagation.

This recipe is also taken straight from this amazing website (but coded from scratch in C++ by myself), where it’s implemented in python: Using neural nets to recognize handwritten digits.

# Recipe

The neural network takes as input 28×28 greyscale images, so there will be 784 input neurons.

There is one hidden layer that has 30 neurons.

The final layer is the output layer which has 10 neurons.

The output neuron with the highest activation is the digit that was recognized. For instance if output neuron 0 had the highest activation, the network detected a 0. If output neuron 2 was highest, the network detected a 2.

The neurons use the sigmoid activation function, and the cost function used is half mean squared error.

Training uses a learning rate of 3.0 and the training data is processed by the network 30 times (aka 30 training epochs), using a minibatch size of 10.

A minibatch size of 10 just means that we calculate the gradient for 10 training samples at a time and adjust the weights and biases using that gradient. We do that for the entire (shuffled) 60,000 training items and call that a single epoch. 30 epochs mean we do this full process 30 times.

There are 60,000 items in the training data, mapping 28×28 greyscale images to what digit 0-9 they actually represent.

Besides the 60,000 training data items, there are also 10,000 separate items that are the test data. These test data items are items never seen by the network during training and are just used as a way to see how well the network has learned about the problem in general, versus learning about the specific training data items.

The test and training data is the MNIST data set. I have a link to zip file I made with the data in it below, but this is where I got the data from: The MNIST database of handwritten digits.

That is the entire recipe!

## Results

The 30 training epochs took 1 minute 22 seconds on my machine in release x64 (with REPORT_ERROR_WHILE_TRAINING() set to 0 to speed things up), but the code could be made to run faster by using SIMD, putting it on the GPU, getting multithreading involved or other things.

Below is a graph of the accuracy during the training epochs.

Notice that most learning happened very early on and then only slowly improved from there. This is due to our neuron activation functions and also our cost function. There are better choices for both, but this is also an ongoing area of research to improve in neural networks.

The end result of my training run is 95.32% accuracy but you may get slightly higher or lower due to random initialization of weights and biases. That sounds pretty high, but if you were actually using this, 4 or 5 numbers wrong out of 100 is a pretty big deal! The record for MNIST is 99.77% accuracy using “a committee of convolutional networks” where they distorted the input data during training to make it learn in a more generalized way (described as “committee of 35 conv. net, 1-20-P-40-P-150-10 [elastic distortions]”).

A better cost function would probably be the cross entropy cost function, a better activation function than sigmoid would probably be an ELU (Exponential Linear Unit). A soft max layer could be used instead of just taking the maximum output neuron as the answer. The weights could be initialized to smarter values. We could also use a convolutional layer to help let the network learn features in a way that didn’t also tie the features to specific locations in the images.

Many of these things are described in good detail at http://neuralnetworksanddeeplearning.com/, particularly in chapter 3 where they make a python implementation of a convolutional neural network which performs better than this one. I highly recommend checking that website out!

# HTML5 Demo

You can play with a network created with this recipe here: Recognize Handwritten Digit 95% Accuracy

Here is an example of it correctly detecting that I drew a 4.

The demo works “pretty well” but it does have a little less than 95% accuracy.

The reason for this though is that it isn’t comparing apples to apples.

A handwritten digit isn’t quite the same as a digit drawn with a mouse. Check out the image below to see 100 of the training images and see what i mean.

The demo finds the bounding box of the drawn image and rescales that bounding box to a 20×20 image, preserving the aspect ratio. It then puts that into a 28×28 image, using the center of mass of the pixels to center the smaller image in the larger one. This is how the MNIST data was generated, so makes the demo more accurate, but it also has the nice side effect of making it so you can draw a number of any size, in any part of the box, and it will treat it the same as if you drew it at a difference size, or in a different part of the box.

The code that goes with this post outputs the weights, biases and network structure in a json format that is very easy to drop into the html5 demo. This way, if you want to try different things in the network, it should be fairly low effort to adjust the demo to try your adjustments there as well.

Lastly, it might be interesting to get the derivatives of the inputs and play around with the input you gave it. Some experiments I can think of:

1. When it misclassifies what number you drew, have it adjust what you drew (the input) to be more like what the network would expect to see for that digit. This could help show why it misclassified your number.
2. Start with a well classified number and make it morph into something recognized by the network as a different number.
3. Start with a random static (noise) image and adjust it until the network recognizes it as a digit. It would be interesting to see if it looked anything like a number, or if it was still just static.

# Source Code

The source code and mnist data is on github at MNIST1, but is also included below for your convenience.

If grabbing the source code from below instead of github, you will need to extract this zip file into the working directory of the program as well. It contains the test data used for training the network.
mnist.zip

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <random>
#include <array>
#include <vector>
#include <algorithm>
#include <chrono>

typedef uint32_t uint32;
typedef uint16_t uint16;
typedef uint8_t uint8;

// Set to 1 to have it show error after each training and also writes it to an Error.csv file.
// Slows down the process a bit (+~50% time on my machine)
#define REPORT_ERROR_WHILE_TRAINING() 1

const size_t c_numInputNeurons = 784;
const size_t c_numHiddenNeurons = 30;  // NOTE: setting this to 100 hidden neurons can give better results, but also can be worse other times.
const size_t c_numOutputNeurons = 10;

const size_t c_trainingEpochs = 30;
const size_t c_miniBatchSize = 10;
const float c_learningRate = 3.0f;

// ============================================================================================
//                                     SBlockTimer
// ============================================================================================
// times a block of code
struct SBlockTimer
{
SBlockTimer (const char* label)
{
m_start = std::chrono::high_resolution_clock::now();
m_label = label;
}

~SBlockTimer ()
{
std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
printf("%s%0.2f seconds\n", m_label, seconds.count());
}

std::chrono::high_resolution_clock::time_point m_start;
const char* m_label;
};

// ============================================================================================
// ============================================================================================

inline uint32 EndianSwap (uint32 a)
{
return (a<<24) | ((a<<8) & 0x00ff0000) |
((a>>8) & 0x0000ff00) | (a>>24);
}

// MNIST data and file format description is from http://yann.lecun.com/exdb/mnist/
class CMNISTData
{
public:
CMNISTData ()
{
m_labelData = nullptr;
m_imageData = nullptr;

m_imageCount = 0;
m_labels = nullptr;
m_pixels = nullptr;
}

{
// set the expected image count
m_imageCount = training ? 60000 : 10000;

const char* labelsFileName = training ? "train-labels.idx1-ubyte" : "t10k-labels.idx1-ubyte";
FILE* file = fopen(labelsFileName,"rb");
if (!file)
{
printf("could not open %s for reading.\n", labelsFileName);
return false;
}
fseek(file, 0, SEEK_END);
long fileSize = ftell(file);
fseek(file, 0, SEEK_SET);
m_labelData = new uint8[fileSize];
fclose(file);

const char* imagesFileName = training ? "train-images.idx3-ubyte" : "t10k-images.idx3-ubyte";
file = fopen(imagesFileName, "rb");
if (!file)
{
printf("could not open %s for reading.\n", imagesFileName);
return false;
}
fseek(file, 0, SEEK_END);
fileSize = ftell(file);
fseek(file, 0, SEEK_SET);
m_imageData = new uint8[fileSize];
fclose(file);

// endian swap label file if needed, just first two uint32's.  The rest is uint8's.
uint32* data = (uint32*)m_labelData;
if (data[0] == 0x01080000)
{
data[0] = EndianSwap(data[0]);
data[1] = EndianSwap(data[1]);
}

// verify that the label file has the right header
if (data[0] != 2049 || data[1] != m_imageCount)
{
return false;
}
m_labels = (uint8*)&(data[2]);

// endian swap the image file if needed, just first 4 uint32's. The rest is uint8's.
data = (uint32*)m_imageData;
if (data[0] == 0x03080000)
{
data[0] = EndianSwap(data[0]);
data[1] = EndianSwap(data[1]);
data[2] = EndianSwap(data[2]);
data[3] = EndianSwap(data[3]);
}

// verify that the image file has the right header
if (data[0] != 2051 || data[1] != m_imageCount || data[2] != 28 || data[3] != 28)
{
return false;
}
m_pixels = (uint8*)&(data[4]);

// convert the pixels from uint8 to float
m_pixelsFloat.resize(28 * 28 * m_imageCount);
for (size_t i = 0; i < 28 * 28 * m_imageCount; ++i)
m_pixelsFloat[i] = float(m_pixels[i]) / 255.0f;

// success!
return true;
}

~CMNISTData ()
{
delete[] m_labelData;
delete[] m_imageData;
}

size_t NumImages () const { return m_imageCount; }

const float* GetImage (size_t index, uint8& label) const
{
label = m_labels[index];
return &m_pixelsFloat[index * 28 * 28];
}

private:
void* m_labelData;
void* m_imageData;

size_t m_imageCount;
uint8* m_labels;
uint8* m_pixels;

std::vector<float> m_pixelsFloat;
};

// ============================================================================================
//                                    NEURAL NETWORK
// ============================================================================================

template <size_t INPUTS, size_t HIDDEN_NEURONS, size_t OUTPUT_NEURONS>
class CNeuralNetwork
{
public:
CNeuralNetwork ()
{
// initialize weights and biases to a gaussian distribution random number with mean 0, stddev 1.0
std::random_device rd;
std::mt19937 e2(rd());
std::normal_distribution<float> dist(0, 1);

for (float& f : m_hiddenLayerBiases)
f = dist(e2);

for (float& f : m_outputLayerBiases)
f = dist(e2);

for (float& f : m_hiddenLayerWeights)
f = dist(e2);

for (float& f : m_outputLayerWeights)
f = dist(e2);
}

void Train (const CMNISTData& trainingData, size_t miniBatchSize, float learningRate)
{
// shuffle the order of the training data for our mini batches
if (m_trainingOrder.size() != trainingData.NumImages())
{
m_trainingOrder.resize(trainingData.NumImages());
size_t index = 0;
for (size_t& v : m_trainingOrder)
{
v = index;
++index;
}
}
static std::random_device rd;
static std::mt19937 e2(rd());
std::shuffle(m_trainingOrder.begin(), m_trainingOrder.end(), e2);

// process all minibatches until we are out of training examples
size_t trainingIndex = 0;
while (trainingIndex < trainingData.NumImages())
{
// Clear out minibatch derivatives.  We sum them up and then divide at the end of the minimatch
std::fill(m_miniBatchHiddenLayerBiasesDeltaCost.begin(), m_miniBatchHiddenLayerBiasesDeltaCost.end(), 0.0f);
std::fill(m_miniBatchOutputLayerBiasesDeltaCost.begin(), m_miniBatchOutputLayerBiasesDeltaCost.end(), 0.0f);
std::fill(m_miniBatchHiddenLayerWeightsDeltaCost.begin(), m_miniBatchHiddenLayerWeightsDeltaCost.end(), 0.0f);
std::fill(m_miniBatchOutputLayerWeightsDeltaCost.begin(), m_miniBatchOutputLayerWeightsDeltaCost.end(), 0.0f);

// process the minibatch
size_t miniBatchIndex = 0;
while (miniBatchIndex < miniBatchSize && trainingIndex < trainingData.NumImages())
{
// get the training item
uint8 imageLabel = 0;
const float* pixels = trainingData.GetImage(m_trainingOrder[trainingIndex], imageLabel);

// run the forward pass of the network
uint8 labelDetected = ForwardPass(pixels, imageLabel);

// run the backward pass to get derivatives of the cost function
BackwardPass(pixels, imageLabel);

// add the current derivatives into the minibatch derivative arrays so we can average them at the end of the minibatch via division.
for (size_t i = 0; i < m_hiddenLayerBiasesDeltaCost.size(); ++i)
m_miniBatchHiddenLayerBiasesDeltaCost[i] += m_hiddenLayerBiasesDeltaCost[i];
for (size_t i = 0; i < m_outputLayerBiasesDeltaCost.size(); ++i)
m_miniBatchOutputLayerBiasesDeltaCost[i] += m_outputLayerBiasesDeltaCost[i];
for (size_t i = 0; i < m_hiddenLayerWeightsDeltaCost.size(); ++i)
m_miniBatchHiddenLayerWeightsDeltaCost[i] += m_hiddenLayerWeightsDeltaCost[i];
for (size_t i = 0; i < m_outputLayerWeightsDeltaCost.size(); ++i)
m_miniBatchOutputLayerWeightsDeltaCost[i] += m_outputLayerWeightsDeltaCost[i];

// note that we've added another item to the minibatch, and that we've consumed another training example
++trainingIndex;
++miniBatchIndex;
}

// divide minibatch derivatives by how many items were in the minibatch, to get the average of the derivatives.
// NOTE: instead of doing this explicitly like in the commented code below, we'll do it implicitly
// by dividing the learning rate by miniBatchIndex.
/*
for (float& f : m_miniBatchHiddenLayerBiasesDeltaCost)
f /= float(miniBatchIndex);
for (float& f : m_miniBatchOutputLayerBiasesDeltaCost)
f /= float(miniBatchIndex);
for (float& f : m_miniBatchHiddenLayerWeightsDeltaCost)
f /= float(miniBatchIndex);
for (float& f : m_miniBatchOutputLayerWeightsDeltaCost)
f /= float(miniBatchIndex);
*/

float miniBatchLearningRate = learningRate / float(miniBatchIndex);

// apply training to biases and weights
for (size_t i = 0; i < m_hiddenLayerBiases.size(); ++i)
m_hiddenLayerBiases[i] -= m_miniBatchHiddenLayerBiasesDeltaCost[i] * miniBatchLearningRate;
for (size_t i = 0; i < m_outputLayerBiases.size(); ++i)
m_outputLayerBiases[i] -= m_miniBatchOutputLayerBiasesDeltaCost[i] * miniBatchLearningRate;
for (size_t i = 0; i < m_hiddenLayerWeights.size(); ++i)
m_hiddenLayerWeights[i] -= m_miniBatchHiddenLayerWeightsDeltaCost[i] * miniBatchLearningRate;
for (size_t i = 0; i < m_outputLayerWeights.size(); ++i)
m_outputLayerWeights[i] -= m_miniBatchOutputLayerWeightsDeltaCost[i] * miniBatchLearningRate;
}
}

// This function evaluates the network for the given input pixels and returns the label it thinks it is from 0-9
uint8 ForwardPass (const float* pixels, uint8 correctLabel)
{
// first do hidden layer
for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
{
float Z = m_hiddenLayerBiases[neuronIndex];

for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
Z += pixels[inputIndex] * m_hiddenLayerWeights[HiddenLayerWeightIndex(inputIndex, neuronIndex)];

m_hiddenLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
}

// then do output layer
for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
float Z = m_outputLayerBiases[neuronIndex];

for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
Z += m_hiddenLayerOutputs[inputIndex] * m_outputLayerWeights[OutputLayerWeightIndex(inputIndex, neuronIndex)];

m_outputLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
}

// calculate error.
// this is the magnitude of the vector that is Desired - Actual.
// Commenting out because it's not needed.
/*
{
error = 0.0f;
for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;
float diff = (desiredOutput - m_outputLayerOutputs[neuronIndex]);
error += diff * diff;
}
error = std::sqrt(error);
}
*/

// find the maximum value of the output layer and return that index as the label
float maxOutput = m_outputLayerOutputs[0];
uint8 maxLabel = 0;
for (uint8 neuronIndex = 1; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
if (m_outputLayerOutputs[neuronIndex] > maxOutput)
{
maxOutput = m_outputLayerOutputs[neuronIndex];
maxLabel = neuronIndex;
}
}
return maxLabel;
}

// Functions to get weights/bias values. Used to make the JSON file.
const std::array<float, HIDDEN_NEURONS>& GetHiddenLayerBiases () const { return m_hiddenLayerBiases; }
const std::array<float, OUTPUT_NEURONS>& GetOutputLayerBiases () const { return m_outputLayerBiases; }
const std::array<float, INPUTS * HIDDEN_NEURONS>& GetHiddenLayerWeights () const { return m_hiddenLayerWeights; }
const std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>& GetOutputLayerWeights () const { return m_outputLayerWeights; }

private:

static size_t HiddenLayerWeightIndex (size_t inputIndex, size_t hiddenLayerNeuronIndex)
{
return hiddenLayerNeuronIndex * INPUTS + inputIndex;
}

static size_t OutputLayerWeightIndex (size_t hiddenLayerNeuronIndex, size_t outputLayerNeuronIndex)
{
return outputLayerNeuronIndex * HIDDEN_NEURONS + hiddenLayerNeuronIndex;
}

// this function uses the neuron output values from the forward pass to backpropagate the error
// of the network to calculate the gradient needed for training.  It figures out what the error
// is by comparing the label it came up with to the label it should have come up with (correctLabel).
void BackwardPass (const float* pixels, uint8 correctLabel)
{
// since we are going backwards, do the output layer first
for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
// calculate deltaCost/deltaBias for each output neuron.
// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
//
// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
//
// deltaCost/deltaO = O - desiredOutput
// deltaO/deltaZ = O * (1 - O)
//
float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;

float deltaCost_deltaO = m_outputLayerOutputs[neuronIndex] - desiredOutput;
float deltaO_deltaZ = m_outputLayerOutputs[neuronIndex] * (1.0f - m_outputLayerOutputs[neuronIndex]);

m_outputLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

// calculate deltaCost/deltaWeight for each weight going into the neuron
//
// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
// deltaCost/deltaWeight = deltaCost/deltaBias * input
//
for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
m_outputLayerWeightsDeltaCost[OutputLayerWeightIndex(inputIndex, neuronIndex)] = m_outputLayerBiasesDeltaCost[neuronIndex] * m_hiddenLayerOutputs[inputIndex];
}

// then do the hidden layer
for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
{
// calculate deltaCost/deltaBias for each hidden neuron.
// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
//
// deltaCost/deltaO =
//   Sum for each output of this neuron:
//
// deltaTargetZ/deltaSourceO is the value of the weight connecting the source and target neuron.
//
// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
// deltaO/deltaZ = O * (1 - O)
//
float deltaCost_deltaO = 0.0f;
for (size_t destinationNeuronIndex = 0; destinationNeuronIndex < OUTPUT_NEURONS; ++destinationNeuronIndex)
deltaCost_deltaO += m_outputLayerBiasesDeltaCost[destinationNeuronIndex] * m_outputLayerWeights[OutputLayerWeightIndex(neuronIndex, destinationNeuronIndex)];
float deltaO_deltaZ = m_hiddenLayerOutputs[neuronIndex] * (1.0f - m_hiddenLayerOutputs[neuronIndex]);
m_hiddenLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

// calculate deltaCost/deltaWeight for each weight going into the neuron
//
// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
// deltaCost/deltaWeight = deltaCost/deltaBias * input
//
for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
m_hiddenLayerWeightsDeltaCost[HiddenLayerWeightIndex(inputIndex, neuronIndex)] = m_hiddenLayerBiasesDeltaCost[neuronIndex] * pixels[inputIndex];
}
}

private:

// biases and weights
std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiases;
std::array<float, OUTPUT_NEURONS>					m_outputLayerBiases;

std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeights;
std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeights;

// neuron activation values aka "O" values
std::array<float, HIDDEN_NEURONS>					m_hiddenLayerOutputs;
std::array<float, OUTPUT_NEURONS>					m_outputLayerOutputs;

// derivatives of biases and weights for a single training example
std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiasesDeltaCost;
std::array<float, OUTPUT_NEURONS>					m_outputLayerBiasesDeltaCost;

std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeightsDeltaCost;
std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeightsDeltaCost;

// derivatives of biases and weights for the minibatch. Average of all items in minibatch.
std::array<float, HIDDEN_NEURONS>					m_miniBatchHiddenLayerBiasesDeltaCost;
std::array<float, OUTPUT_NEURONS>					m_miniBatchOutputLayerBiasesDeltaCost;

std::array<float, INPUTS * HIDDEN_NEURONS>			m_miniBatchHiddenLayerWeightsDeltaCost;
std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_miniBatchOutputLayerWeightsDeltaCost;

// used for minibatch generation
std::vector<size_t>									m_trainingOrder;
};

// ============================================================================================
//                                   DRIVER PROGRAM
// ============================================================================================

// training and test data
CMNISTData g_trainingData;
CMNISTData g_testData;

// neural network
CNeuralNetwork<c_numInputNeurons, c_numHiddenNeurons, c_numOutputNeurons> g_neuralNetwork;

float GetDataAccuracy (const CMNISTData& data)
{
size_t correctItems = 0;
for (size_t i = 0, c = data.NumImages(); i < c; ++i)
{
uint8 label;
const float* pixels = data.GetImage(i, label);
uint8 detectedLabel = g_neuralNetwork.ForwardPass(pixels, label);

if (detectedLabel == label)
++correctItems;
}
return float(correctItems) / float(data.NumImages());
}

void ShowImage (const CMNISTData& data, size_t imageIndex)
{
uint8 label = 0;
const float* pixels = data.GetImage(imageIndex, label);
printf("showing a %i\n", label);
for (int iy = 0; iy < 28; ++iy)
{
for (int ix = 0; ix < 28; ++ix)
{
if (*pixels < 0.125)
printf(" ");
else
printf("+");
++pixels;
}
printf("\n");
}
}

int main (int argc, char** argv)
{
// load the MNIST data if we can
{
printf("Could not load mnist data, aborting!\n");
system("pause");
return 1;
}

#if REPORT_ERROR_WHILE_TRAINING()
FILE *file = fopen("Error.csv","w+t");
if (!file)
{
printf("Could not open Error.csv for writing, aborting!\n");
system("pause");
return 2;
}
fprintf(file, "\"Training Data Accuracy\",\"Testing Data Accuracy\"\n");
#endif

{
SBlockTimer timer("Training Time:  ");

// train the network, reporting error before each training
for (size_t epoch = 0; epoch < c_trainingEpochs; ++epoch)
{
#if REPORT_ERROR_WHILE_TRAINING()
float accuracyTraining = GetDataAccuracy(g_trainingData);
float accuracyTest = GetDataAccuracy(g_testData);
printf("Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
printf("Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);
fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
#endif

printf("Training epoch %zu / %zu...\n", epoch+1, c_trainingEpochs);
g_neuralNetwork.Train(g_trainingData, c_miniBatchSize, c_learningRate);
printf("\n");
}
}

// report final error
float accuracyTraining = GetDataAccuracy(g_trainingData);
float accuracyTest = GetDataAccuracy(g_testData);
printf("\nFinal Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
printf("Final Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);

#if REPORT_ERROR_WHILE_TRAINING()
fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
fclose(file);
#endif

// Write out the final weights and biases as JSON for use in the web demo
{
FILE* file = fopen("WeightsBiasesJSON.txt", "w+t");
fprintf(file, "{\n");

// network structure
fprintf(file, "  \"InputNeurons\":%zu,\n", c_numInputNeurons);
fprintf(file, "  \"HiddenNeurons\":%zu,\n", c_numHiddenNeurons);
fprintf(file, "  \"OutputNeurons\":%zu,\n", c_numOutputNeurons);

// HiddenBiases
auto hiddenBiases = g_neuralNetwork.GetHiddenLayerBiases();
fprintf(file, "  \"HiddenBiases\" : [\n");
for (size_t i = 0; i < hiddenBiases.size(); ++i)
{
fprintf(file, "    %f", hiddenBiases[i]);
if (i < hiddenBiases.size() -1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ],\n");

// HiddenWeights
auto hiddenWeights = g_neuralNetwork.GetHiddenLayerWeights();
fprintf(file, "  \"HiddenWeights\" : [\n");
for (size_t i = 0; i < hiddenWeights.size(); ++i)
{
fprintf(file, "    %f", hiddenWeights[i]);
if (i < hiddenWeights.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ],\n");

// OutputBiases
auto outputBiases = g_neuralNetwork.GetOutputLayerBiases();
fprintf(file, "  \"OutputBiases\" : [\n");
for (size_t i = 0; i < outputBiases.size(); ++i)
{
fprintf(file, "    %f", outputBiases[i]);
if (i < outputBiases.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ],\n");

// OutputWeights
auto outputWeights = g_neuralNetwork.GetOutputLayerWeights();
fprintf(file, "  \"OutputWeights\" : [\n");
for (size_t i = 0; i < outputWeights.size(); ++i)
{
fprintf(file, "    %f", outputWeights[i]);
if (i < outputWeights.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ]\n");

// done
fprintf(file, "}\n");
fclose(file);
}

// You can use the code like the below to visualize an image if you want to.
//ShowImage(g_testData, 0);

system("pause");
return 0;
}


Thanks for reading, and if you have any questions, comments, or just want to chat, hit me up in the comments below, or on twitter at @Atrix256.

## Neural Network Gradients: Backpropagation, Dual Numbers, Finite Differences

In the post How to Train Neural Networks With Backpropagation I said that you could also calculate the gradient of a neural network by using dual numbers or finite differences.

By special request, this is that post!

If you want an explanation of dual numbers, check out these posts:

If you want an explanation of finite differences, check out this post:
Finite Differences

Since the fundamentals are explained in the links above, we’ll go straight to the code.

We’ll be getting the gradient (learning values) for the network in example 4 in the backpropagation post:

Note that I am using “central differences” for the gradient, but it would be more efficient to do a forward or backward difference, at the cost of some accuracy. I’m using an epsilon of 0.001.

I didn’t compare the running times of each method as my code is meant to be readable, not fast, and the code isn’t doing enough work to make a meaningful performance test IMO. If you did do a speed test, the finite differences should be by far the slowest, backpropagation should be the fastest, and dual numbers are probably going to be closer to backpropagation than to finite differences.

The output of the program is below. Both backpropagation and dual numbers get the exact derivatives (within the tolerances of floating point math of course!) because they use the chain rule, whereas finite differences is a numerical approximation. This shows up in the fact that backpropagation and dual numbers agree for all values, where finite differences has some small error in the derivatives.

And here is the code:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>

#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

//----------------------------------------------------------------------
// Dual Number Class - CDualNumber
//----------------------------------------------------------------------

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

// constructor to make a constant
CDualNumber (float f = 0.0f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
CDualNumber (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// Set a constant value.
void Set (float f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// Set a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
void Set (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// storage for real and dual values
float                           m_real;
std::array<float, NUMVARIABLES> m_dual;
};

//----------------------------------------------------------------------
// Dual Number Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real + b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real - b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real * b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real / b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
return ret;
}

// NOTE: the "special functions" below all just use the chain rule, which you can also use to add more functions

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
float sqrtReal = sqrt(a.m_real);
ret.m_real = sqrtReal;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = pow(a.m_real, y);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> exp (const CDualNumber<NUMVARIABLES>& a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = exp(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * exp(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = sin(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = cos(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Driver Program
//----------------------------------------------------------------------

enum EWeightsBiases
{
e_weight0 = 0,
e_weight1,
e_weight2,
e_weight3,
e_weight4,
e_weight5,
e_weight6,
e_weight7,
e_bias0,
e_bias1,
e_bias2,
e_bias3,

e_count
};

// our dual number needs a "dual" for every value we want a derivative for: aka every weight and bias
typedef CDualNumber<EWeightsBiases::e_count> TDualNumber;

// templated so it can work for both the dual numbers, as well as the float finite differences
template <typename TBaseType>
void ForwardPass (const std::array<TBaseType, 2>& input, const std::array<TBaseType, 2>& desiredOutput, const std::array<TBaseType, EWeightsBiases::e_count>& weightsBiases, TBaseType& cost)
{
// calculate hidden layer neuron activations
TBaseType Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
TBaseType O0 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z0 * TBaseType(-1.0f)));

TBaseType Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
TBaseType O1 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z1 * TBaseType(-1.0f)));

// calculate output layer neuron activations
TBaseType Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
TBaseType O2 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z2 * TBaseType(-1.0f)));

TBaseType Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
TBaseType O3 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z3 * TBaseType(-1.0f)));

// calculate the cost: 0.5 * ||target-actual||^2
// aka cost = half (error squared)
TBaseType diff1 = TBaseType(desiredOutput[0]) - O2;
TBaseType diff2 = TBaseType(desiredOutput[1]) - O3;
cost = TBaseType(0.5f) * (diff1*diff1 + diff2*diff2);
}

// backpropagation
void ForwardPassAndBackpropagation (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, EWeightsBiases::e_count>& weightsBiases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, EWeightsBiases::e_count>& deltaWeightsBiases
) {
// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// calculate Z2 and O2 for neuron2
float Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
float O2 = 1.0f / (1.0f + std::exp(-Z2));

// calculate Z3 and O3 for neuron3
float Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
float O3 = 1.0f / (1.0f + std::exp(-Z3));

// the actual output of the network is the activation of the output layer neurons
actualOutput[0] = O2;
actualOutput[1] = O3;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 2 -----

// calculate how much a change in neuron 2 activation affects the cost function
// deltaCost/deltaO2 = O2 - target0
float deltaCost_deltaO2 = O2 - desiredOutput[0];

// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
// deltaO2/deltaZ2 = O2 * (1 - O2)
float deltaO2_deltaZ2 = O2 * (1 - O2);

// calculate how much a change in neuron 2 weighted input affects the cost function.
// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
deltaWeightsBiases[e_bias2] = neuron2Error;

// calculate how much a change in weight4 affects the cost function.
// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * O0
// similar thing for weight5
deltaWeightsBiases[e_weight4] = neuron2Error * O0;
deltaWeightsBiases[e_weight5] = neuron2Error * O1;

//----- Neuron 3 -----

// calculate how much a change in neuron 3 activation affects the cost function
// deltaCost/deltaO3 = O3 - target1
float deltaCost_deltaO3 = O3 - desiredOutput[1];

// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
// deltaO3/deltaZ3 = O3 * (1 - O3)
float deltaO3_deltaZ3 = O3 * (1 - O3);

// calculate how much a change in neuron 3 weighted input affects the cost function.
// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
deltaWeightsBiases[e_bias3] = neuron3Error;

// calculate how much a change in weight6 affects the cost function.
// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * O0
// similar thing for weight7
deltaWeightsBiases[e_weight6] = neuron3Error * O0;
deltaWeightsBiases[e_weight7] = neuron3Error * O1;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
float deltaCost_deltaO0 = neuron2Error * weightsBiases[e_weight4] + neuron3Error * weightsBiases[e_weight6];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaWeightsBiases[e_bias0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaWeightsBiases[e_weight0] = neuron0Error * input[0];
deltaWeightsBiases[e_weight1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
float deltaCost_deltaO1 = neuron2Error * weightsBiases[e_weight5] + neuron3Error * weightsBiases[e_weight7];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO1/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaWeightsBiases[e_bias1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaWeightsBiases[e_weight2] = neuron1Error * input[0];
deltaWeightsBiases[e_weight3] = neuron1Error * input[1];
}

int main (int argc, char **argv)
{

// weights and biases, inputs and desired output
const std::array<float, EWeightsBiases::e_count> weightsBiases =
{
0.15f, // e_weight0
0.2f,  // e_weight1
0.25f, // e_weight2
0.3f,  // e_weight3
0.4f,  // e_weight4
0.45f, // e_weight5
0.5f,  // e_weight6
0.55f, // e_weight7
0.35f, // e_bias0
0.35f, // e_bias1
0.6f,  // e_bias2
0.6f   // e_bias3
};

const std::array<float, 2> inputs =
{
0.05f,
0.1f
};

std::array<float, 2> desiredOutput = {
0.01f,
0.99f
};

// =====================================================
// ===== FINITE DIFFERENCES CALCULATED DERIVATIVES =====
// =====================================================

std::array<float, EWeightsBiases::e_count> weightsBiasesFloat;
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
weightsBiasesFloat[i] = weightsBiases[i];

// use central differences to approximate the gradient
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
{
float costSample1 = 0.0f;
weightsBiasesFloat[i] = weightsBiases[i] - EPSILON;
ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample1);

float costSample2 = 0.0f;
weightsBiasesFloat[i] = weightsBiases[i] + EPSILON;
ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample2);

gradientFiniteDifferences[i] = (costSample2 - costSample1) / (EPSILON * 2.0f);

weightsBiasesFloat[i] = weightsBiases[i];
}

// ==============================================
// ===== DUAL NUMBER CALCULATED DERIVATIVES =====
// ==============================================

// dual number weights and biases
std::array<TDualNumber, EWeightsBiases::e_count> weightsBiasesDual;
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
weightsBiasesDual[i].Set(weightsBiases[i], i);

// dual number inputs and desired output
std::array<TDualNumber, 2> inputsDual;
for (size_t i = 0; i < inputsDual.size(); ++i)
inputsDual[i].Set(inputs[i]);

std::array<TDualNumber, 2> desiredOutputDual;
for (size_t i = 0; i < desiredOutputDual.size(); ++i)
desiredOutputDual[i].Set(desiredOutput[i]);

// dual number derivatives

// ==================================================
// ===== BACKPROPAGATION CALCULATED DERIVATIVES =====
// ==================================================

float error;
float cost;
std::array<float, 2> actualOutput;
ForwardPassAndBackpropagation(inputs, desiredOutput, weightsBiases, error, cost, actualOutput, gradientBackPropagation);

// ==========================
// ===== Report Results =====
// ==========================

printf("Neural Network Gradient\n\nBackpropagation     Dual Numbers (Error)       Finite Differences (Error)\n");
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
{
printf("% 08f,         % 08f (% 08f),     % 08f (% 08f)\n",
);
}
printf("\n");

system("pause");
return 0;
}


## How to Train Neural Networks With Backpropagation

This post is an attempt to demystify backpropagation, which is the most common method for training neural networks. This post is broken into a few main sections:

1. Explanation
2. Working through examples
3. Simple sample C++ source code using only standard includes
4. Links to deeper resources to continue learning

Let’s talk about the basics of neural nets to start out, specifically multi layer perceptrons. This is a common type of neural network, and is the type we will be talking about today. There are other types of neural networks though such as convolutional neural networks, recurrent neural networks, Hopfield networks and more. The good news is that backpropagation applies to most other types of neural networks too, so what you learn here will be applicable to other types of networks.

# Basics of Neural Networks

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron. This network takes 3 floating point values as input, passes them through 4 neurons in a hidden layer and outputs two floating point values. The hidden layer neurons and the output layer neurons do processing of the values they are giving, but the input neurons do not.

To calculate the output value of a single neuron, you multiply every input into that neuron by a weight for that input, sum them up, and add a bias that is set for the neuron. This “weighted input” value is fed into an activation function and the result is the output value of that neuron. Here is a diagram for a single neuron:

The code for calculating the output of a single neuron could look like this:

float weightedInput = bias;

for (int i = 0; i < inputs.size(); ++i)
weightedInput += inputs[i] * weights[i];

float output = Activation(weightedInput);


To evaluate an entire network of neurons, you just repeat this process for all neurons in the network, going from left to right (from input to output).

Neural networks are basically black boxes. We train them to give specific ouputs when we give them specific inputs, but it is often difficult to understand what it is that they’ve learned, or what part of the data they are picking up on.

Training a neural network just means that we adjust the weight and bias values such that when we give specific inputs, we get the desired outputs from the network. Being able to figure out what weights and biases to use can be tricky, especially for networks with lots of layers and lots of neurons per layer. This post talks about how to do just that.

Regarding training, there is a funny story where some people trained a neural network to say whether or not a military tank was in a photograph. It had a very high accuracy rate with the test data they trained it with, but when they used it with new data, it had terrible accuracy. It turns out that the training data was a bit flawed. Pictures of tanks were all taken on a sunny day, and the pictures without tanks were taken on a cloudy day. The network learned how to detect whether a picture was of a sunny day or a cloudy day, not whether there was a tank in the photo or not!

This is one type of pitfall to watch out for when dealing with neural networks – having good training data – but there are many other pitfalls to watch out for too. Architecting and training neural networks is quite literally an art form. If it were painting, this post would be teaching you how to hold a brush and what the primary colors are. There are many, many techniques to learn beyond what is written here to use as tools in your toolbox. The information in this post will allow you to succeed in training neural networks, but there is a lot more to learn to get higher levels of accuracy from your nets!

# Neural Networks Learn Using Gradient Descent

Let’s take a look at a simple neural network where we’ve chosen random values for the weights and the bias:

If given two floating point inputs, we’d calculate the output of the network like this:

$Output = Activation(Input0 * Weight0 + Input1 * Weight1 + Bias)$

Plugging in the specific values for the weights and biases, it looks like this:

$Output = Activation(Input0 * 0.23 + Input1 * -0.1 + 0.3)$

Let’s say that we want this network to output a zero when we give an input of 1,0, and that we don’t care what it outputs otherwise. We’ll plug 1 and 0 in for Input0 and Input1 respectively and see what the output of the network is right now:

$Output = Activation(1* 0.23 + 0 * -0.1 + 0.3) \\ Output = Activation(0.53)$

For the activation function, we are going to use a common one called the sigmoid activation function, which is also sometimes called the logistic activation function. It looks like this:

$\sigma(x) = \frac{1}{1+e^{-x}}$

Without going into too much detail, the reason why sigmoid is so commonly used is because it’s a smoother and differentiable version of the step function.

Applying that activation function to our output neuron, we get this:

$Output = Activation(0.53) \\ Output = \sigma(0.53) \\ Output = 0.6295$

So, we plugged in 1 and 0, but instead of getting a 0 out, we got 0.6295. Our weights and biases are wrong, but how do we correct them?

The secret to correcting our weights and biases is whichever of these terms seem least scary to you: slopes, derivatives, gradients.

If “slope” was the least scary term to you, you probably remember the line formula $y=mx+b$ and that the m value was the “rise over run” or the slope of the line. Well believe it or not, that’s all a derivative is. A derivative is just the slope of a function at a specific point on that function. Even if a function is curved, you can pick a point on the graph and get a slope at that point. The notation for a derivative is $\frac{dy}{dx}$, which literally means “change in y divided by change in x”, or “delta y divided by delta x”, which is literally rise over run.

In the case of a linear function (a line), it has the same derivative over the entire thing, so you can take a step size of any size on the x axis and multiply that step size by $\frac{dy}{dx}$ to figure out how much to add or subtract from y to stay on the line.

In the case of a non linear function, the derivative can change from one point to the next, so this slope is only guaranteed to be accurate for an infinitely small step size. In practice, people just often use “small” step sizes and calling it good enough, which is what we’ll be doing momentarily.

Now that you realize you already knew what a derivative is, we have to talk about partial derivatives. There really isn’t anything very scary about them and they still mean the exact same thing – they are the slope! They are even calculated the exact same way, but they use a fancier looking d in their notation: $\frac{\partial y}{\partial x}$.

The reason partial derivatives even exist is because if you have a function of multiple variables like $z=f(x,y)=x^2+3y+2$, you have two variables that you can take the derivative of. You can calculate $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$. The first value tells you how much the z value changes for a change in x, the second value tells you how much the z value changes for a change in y.

By the way, if you are curious, the partial derivatives for that function above are below. When calculating partial derivatives, any variable that isn’t the one you care about, you just treat as a constant and do normal derivation.

$\frac{\partial z}{\partial x} = 2x\\ \frac{\partial z}{\partial y} = 3\\$

If you put both of those values together into a vector $(\frac{\partial z}{\partial x},\frac{\partial z}{\partial y})$ you have what is called the gradient vector.

The gradient vector has an interesting property, which is that it points in the direction that makes the function output grow the most. Basically, if you think of your function as a surface, it points up the steepest direction of the surface, from the point you evaluated the function at.

We are going to use that property to train our neural network by doing the following:

1. Calculate the gradient of a function that describes the error in our network. This means we will have the partial derivatives of all the weights and biases in the network.
2. Multiply the gradient by a small “learning rate” value, such as 0.05
3. Subtract these scaled derivatives from the weights and biases to decrease the error a small amount.

This technique is called steepest gradient descent (SGD) and when we do the above, our error will decrease by a small amount. The only exception is that if we use too large of a learning rate, it’s possible that we make the error grow, but usually the error will decrease.

We will do the above over and over, until either the error is small enough, or we’ve decided we’ve tried enough iterations that we think the neural network is never going to learn the things we want to teach it. If the network doesn’t learn, it means it needs to be re-architected with a different structure, different numbers of neurons and layers, different activation functions, etc. This is part of the “art” that I mentioned earlier.

Before moving on, there is one last thing to talk about: global minimums vs local minimums.

Imagine that the function describing the error in our network is visualized as bumpy ground. When we initialize our weights and biases to random numbers we are basically just choosing a random location on the ground to start at. From there, we act like a ball, and just roll down hill from wherever we are. We are definitely going to get to the bottom of SOME bump / hole in the ground, but there is absolutely no reason to except that we’ll get to the bottom of the DEEPEST bump / hole.

The problem is that SGD will find a LOCAL minimum – whatever we are closest too – but it might not find the GLOBAL minimum.

In practice, this doesn’t seem to be too large of a problem, at least for people casually using neural nets like you and me, but it is one of the active areas of research in neural networks: how do we do better at finding more global minimums?

You might notice the strange language I’m using where I say we have a function that describes the error, instead of just saying we use the error itself. The function I’m talking about is called the “cost function” and the reason for this is that different ways of describing the error give us different desirable properties.

For instance, a common cost function is to use mean squared error of the actual output compared to the desired output.

For a single training example, you plug the input into the network and calculate the output. You then plug the actual output and the target output into the function below:

$Cost = ||target-output||^2$

In other words, you take the vector of the neuron outputs, subtract it from the actual output that we wanted, calculate the length of the resulting vector and square it. This gives you the squared error.

The reason we use squared error in the cost function is because this way error in either direction is a positive number, so when gradient descent does it’s work, we’ll find the smallest magnitude of error, regardless of whether it’s positive or negative amounts. We could use absolute value, but absolute value isn’t differentiable, while squaring is.

To handle calculating the cost of multiple inputs and outputs, you just take the average of the squared error for each piece of training data. This gives you the mean squared error as the cost function across all inputs. You also average the derivatives to get the combined gradient.

# More on Training

Before we go into backpropagation, I want to re-iterate this point: Neural Networks Learn Using Gradient Descent.

All you need is the gradient vector of the cost function, aka the partial derivatives of all the weights and the biases for the cost.

Backpropagation gets you the gradient vector, but it isn’t the only way to do so!

Another way to do it is to use dual numbers which you can read about on my post about them: Multivariable Dual Numbers & Automatic Differentiation.

Using dual numbers, you would evaluate the output of the network, using dual numbers instead of floating point numbers, and at the end you’d have your gradient vector. It’s not quite as efficient as backpropagation (or so I’ve heard, I haven’t tried it), but if you know how dual numbers work, it’s super easy to implement.

Another way to get the gradient vector is by doing so numerically using finite differences. You can read about numerical derivatives on my post here: Finite Differences

Basically what you would do is if you were trying to calculate the partial derivative of a weight, like $\frac{\partial Cost}{\partial Weight0}$, you would first calculate the cost of the network as usual, then you would add a small value to Weight0 and evaluate the cost again. You subtract the new cost from the old cost, and divide by the small value you added to Weight0. This will give you the partial derivative for that weight value. You’d repeat this for all your weights and biases.

Since realistic neural networks often have MANY MANY weights and biases, calculating the gradient numerically is a REALLY REALLY slow process because of how many times you have to run the network to get cost values with adjusted weights. The only upside is that this method is even easier to implement than dual numbers. You can literally stop reading and go do this right now if you want to 😛

Lastly, there is a way to train neural networks which doesn’t use derivatives or the gradient vector, but instead uses the more brute force-ish method of genetic algorithms.

Using genetic algorithms to train neural networks is a huge topic even to summarize, but basically you create a bunch of random networks, see how they do, and try combining features of networks that did well. You also let some of the “losers” reproduce as well, and add in some random mutation to help stay out of local minimums. Repeat this for many many generations, and you can end up with a well trained network!

Here’s a fun video visualizing neural networks being trained by genetic algorithms: Youtube: Learning using a genetic algorithm on a neural network

# Backpropagation is Just the Chain Rule!

Going back to our talk of dual numbers for a second, dual numbers are useful for what is called “forward mode automatic differentiation”.

Backpropagation actually uses “reverse mode automatic differentiation”, so the two techniques are pretty closely tied, but they are both made possible by what is known as the chain rule.

The chain rule basically says that if you can write a derivative like this:

$dy/dx$

That you can also write it like this:

$dy/du*du/dx$

That might look weird or confusing, but since we know that derivatives are actual values, aka actual ratios, aka actual FRACTIONS, let’s think back to fractions for a moment.

$3/2 = 1.5$

So far so good? Now let’s choose some number out of the air – say, 5 – and do the same thing we did with the chain rule
$3/2 = \\ 3/5 * 5/2 = \\ 15/10 = \\ 3/2 = \\ 1.5$

Due to doing the reverse of cross cancellation, we are able to inject multiplicative terms into fractions (and derivatives!) and come up with the same answer.

Ok, but who cares?

Well, when we are evaluating the output of a neural network for given input, we have lots of equations nested in each other. We have neurons feeding into neurons feeding into neurons etc, with the logistic activation function at each step.

Instead of trying to figure out how to calculate the derivatives of the weights and biases for the entire monster equation (it’s common to have hundreds or thousands of neurons or more!), we can instead calculate derivatives for each step we do when evaluating the network and then compose them together.

Basically, we can break the problem into small bites instead of having to deal with the equation in it’s entirety.

Instead of calculating the derivative of how a specific weight affects the cost directly, we can instead calculate these:

1. dCost/dOutput: The derivative of how a neuron’s output affects cost
2. dOutput/dWeightedInput: The derivative of how the weighted input of a neuron affects a neuron’s output
3. dWeightedInput/dWeight: The derivative of how a weight affects the weighted input of a neuron

Then, when we multiply them all together, we get the real value that we want:
dCost/dOutput * dOutput/dWeightedInput * dWeightedInput/dWeight = dCost/dWeight

Now that we understand all the basic parts of back propagation, I think it’d be best to work through some examples of increasing complexity to see how it all actually fits together!

# Backpropagation Example 1: Single Neuron, One Training Example

This example takes one input and uses a single neuron to make one output. The neuron is only trained to output a 0 when given a 1 as input, all other behavior is undefined. This is implemented as the Example1() function in the sample code.

Let’s work through the training process manually.

We put a 1 as input and calculate Z which is the “weighted input” of the neuron.

$Z = input*weight+bias \\ Z = 1*0.3+0.5 \\ Z = 0.8$

Next, we need to calculate O which is the neuron output. We calculate this by putting the weighted input through the sigmoid activation function.

$O = \sigma(0.8) \\ O = 0.6900$

The last step of the “forward pass” is to calculate the cost. We are going to use half squared error as our cost, to slightly simplify some math.

$Cost = 0.5*||target - actual||^2\\ Cost = 0.5*||0-0.6900||^2\\ Cost = 0.5*0.4761\\ Cost = 0.238$

Now the backpropagation pass begins, which will give us the partial derivatives we need to be able to adjust this network and do one iteration of training.

First we want to calculate dCost/dO which tells us how much the cost changes when the neuron output changes. Thanks to the nice form of our cost function, that value is simple to calculate:

$\frac{\partial Cost}{\partial O} = O - target\\ \frac{\partial Cost}{\partial O} = 0.69 - 0\\ \frac{\partial Cost}{\partial O} = 0.69$

Next we want to calculate dO/dZ which tells us how much the neuron output changes when the neuron’s weighted input changes. The activation function we used also makes this very easy to calculate:

$\frac{\partial O}{\partial Z} = O * (1-O)\\ \frac{\partial O}{\partial Z} = 0.69 * 0.31\\ \frac{\partial O}{\partial Z} = 0.2139$

Next we are going to multiply these two values together to get dCost/dZ which tells us how much the cost changes when the weighted input to the neuron changes.

$\frac{\partial Cost}{\partial Z} = \frac{\partial Cost}{\partial O} * \frac{\partial O}{\partial Z}\\ \frac{\partial Cost}{\partial Z} = 0.69 * 0.2139 \\ \frac{\partial Cost}{\partial Z} = 0.1476$

This value has two special meanings. Firstly, this value represents the amount of error in this neuron. Secondly, this value represents dCost/dBias, which is one of the values we need to do training!

$\frac{\partial Cost}{\partial bias} = \frac{\partial Cost}{\partial Z}\\ \frac{\partial Cost}{\partial bias} = 0.1476\\$

Next, we need dZ/dWeight, which tells us how much the weighted input of the neuron changes when you change the weight. This is just the input value, which is 1. This makes intuitive sense because if you added 1.0 to the weight, Z would grow by whatever the input is.

Now that we have dCost/dZ and dZ/dWeight, we can calculate dCost/dWeight:

$\frac{\partial Cost}{\partial Weight} = \frac{\partial Cost}{\partial Z} * \frac{\partial Z}{\partial Weight}\\ \frac{\partial Cost}{\partial Weight} = 0.1476 * 1\\ \frac{\partial Cost}{\partial Weight} = 0.1476$

We now have the partial derivatives we need to be able to train our network!

Using a learning rate of 0.5, we’ll first update our weight:

$Weight = Weight - \frac{\partial Cost}{\partial Weight} * 0.5\\ Weight = 0.3 - 0.1476 * 0.5\\ Weight = 0.2262$

Then we’ll update our bias:

$Bias = Bias - \frac{\partial Cost}{\partial Bias } * 0.5\\ Bias = 0.5 - 0.1476 * 0.5\\ Bias = 0.4262$

Our network has learned a very small amount!

To verify this, let’s calculate the network’s output and cost with these new values.

$Z = input*weight+bias \\ Z = 1*0.2262+0.4262 \\ Z = 0.6524\\ \\ O = \sigma(0.6524) \\ O = 0.6576\\ \\ Cost = 0.5*||target - actual||^2\\ Cost = 0.5*||0-0.6576||^2\\ Cost = 0.5*0.4324\\ Cost = 0.2162$

Our cost decreased from 0.238 to 0.2162 due to our training, so we have indeed improved the error of the network, hooray!

After 10,000 iterations of this, the cost drops down to 0.000026 (weight = -2.564909, bias = -2.364907). That sounds pretty good, and it is decent, but since that is based on error squared, it looks more accurate than it is. The error at that point is 0.007176. Specifically, that means that when we input a 1, it outputs a 0.007176 instead of zero.

With a larger number of trainings it improves though. At 100,000 iterations it gives 0.002246 as output, and at 1,000,000 iterations it gives 0.000728 as output.

You could also try adjusting the learning rate parameter to see if you can make it get to a higher accuracy more quickly, and then perhaps dropping it down to a smaller number once you got there, to get the higher accuracy. Again, neural networks are an art and there are all sorts of techniques you can use to attempt to make them better (more accurate, learn faster, etc).

# Backpropagation Example 2: Single Neuron, Two Training Examples

This time, we are going to teach it not only that it should output 0 when given a 1, but also that it should output 1 when given a 0.

We have two training examples, and we are training the neuron to act like a NOT gate. This is implemented as the Example2() function in the sample code.

The first thing we do is calculate the derivatives (gradient vector) for each of the inputs.

We already calculated the “input 1, output 0” derivatives in the last example:
$\frac{\partial Cost}{\partial Weight} = 0.1476 \\ \frac{\partial Cost}{\partial Bias} = 0.1476$

If we follow the same steps with the “input 0, output 1” training example we get these:
$\frac{\partial Cost}{\partial Weight} = 0.0 \\ \frac{\partial Cost}{\partial Bias} = -0.0887$

To get the actual derivatives to train the network with, we just average them!
$\frac{\partial Cost}{\partial Weight} = 0.0738 \\ \frac{\partial Cost}{\partial Bias} = 0.0294$

From there, we do the same adjustments as before to the weight and bias values to get a weight of 0.2631 and a bias of 0.4853.

If you are wondering how to calculate the cost, again you just take the cost of each training example and average them. Adjusting the weight and bias values causes the cost to drop from 0.1547 to 0.1515, so we have made progress.

It takes 10 times as many iterations with these two training examples to get the same level of error as it did with only one training example though.

As we saw in the last example, after 10,000 iterations, the error was 0.007176.

In this example, after 100,000 iterations, the error is 0.007141. At that point, weight is -9.879733 and bias is 4.837278

# Backpropagation Example 3: Two Neurons in One Layer

Here is the next example, implemented as Example3() in the sample code. Two input neurons feed to two neurons in a single layer giving two outputs.

Let’s look at how we’d calculate the derivatives needed to train this network using the training example that when we give the network 01 as input that it should give out 10 as output.

First comes the forward pass where we calculate the network’s output when we give it 01 as input.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0*0.2+1*0.8+0.5 \\ Z0=1.3 \\ \\ O0=\sigma(1.3) \\ O0=0.7858\\ \\ Z1=input0*weight2+input0*weight3+bias1\\ Z1=0*0.6+1*0.4+0.1\\ Z1=0.5\\ \\ O1=\sigma(0.5)\\ O1=0.6225$

Next we calculate a cost. We don’t strictly need to do this step since we don’t use this value during backpropagation, but this will be useful to verify that we’ve improved things after an iteration of training.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(1,0)-(0.7858,0.6225)||^2\\ Cost=0.5*||(0.2142,-0.6225)||^2\\ Cost=0.5*0.6583^2\\ Cost=0.2167$

Now we begin the backwards pass to calculate the derivatives that we’ll need for training.

Let’s calculate dCost/dZ0 aka the error in neuron 0. We’ll do this by calculating dCost/dO0, then dO0/dZ0 and then multiplying them together to get dCost/dZ0. Just like before, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias0.

$\frac{\partial Cost}{\partial O0}=O0-target0\\ \frac{\partial Cost}{\partial O0}=0.7858-1\\ \frac{\partial Cost}{\partial O0}=-0.2142\\ \\ \frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.7858 * 0.2142\\ \frac{\partial O0}{\partial Z0} = 0.1683\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = -0.2142 * 0.1683\\ \frac{\partial Cost}{\partial Z0} = -0.0360\\ \\ \frac{\partial Cost}{\partial Bias0} = -0.0360$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1 by multiplying it by dZ0/dWeight0 and dZ0/dWeight1, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0} \\ \frac{\partial Cost}{\partial Weight0} = -0.0360 * 0 \\ \frac{\partial Cost}{\partial Weight0} = 0\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1} \\ \frac{\partial Cost}{\partial Weight1} = -0.0360 * 1 \\ \frac{\partial Cost}{\partial Weight1} = -0.0360$

Next we need to calculate dCost/dZ1 aka the error in neuron 1. We’ll do this like before. We’ll calculate dCost/dO1, then dO1/dZ1 and then multiplying them together to get dCost/dZ1. Again, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias1.

$\frac{\partial Cost}{\partial O1}=O1-target1\\ \frac{\partial Cost}{\partial O1}=0.6225-0\\ \frac{\partial Cost}{\partial O1}=0.6225\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.6225 * 0.3775\\ \frac{\partial O1}{\partial Z1} = 0.235\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.6225 * 0.235\\ \frac{\partial Cost}{\partial Z1} = 0.1463\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.1463$

Just like with neuron 0, we can use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3 by multiplying it by dZ1/dWeight2 and dZ1/dWeight2, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2} \\ \frac{\partial Cost}{\partial Weight2} = 0.1463 * 0 \\ \frac{\partial Cost}{\partial Weight2} = 0\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3} \\ \frac{\partial Cost}{\partial Weight3} = 0.1463 * 1 \\ \frac{\partial Cost}{\partial Weight3} = 0.1463$

After using these derivatives to update the weights and biases with a learning rate of 0.5, they become:
Weight0 = 0.2
Weight1 = 0.818
Weight2 = 0.6
Weight3 = 0.3269
Bias0 = 0.518
Bias1 = 0.0269

Using these values, the cost becomes 0.1943, which dropped from 0.2167, so we have indeed made progress with our learning!

Interestingly, it takes about twice as many trainings as example 1 to get a similar level of error. In this case, 20,000 iterations of learning results in an error of 0.007142.

If we have the network learn the four patterns below instead:
00 = 00
01 = 10
10 = 10
11 = 11

It takes 520,000 learning iterations to get to an error of 0.007223.

# Backpropagation Example 4: Two Layers, Two Neurons Each

This is the last example, implemented as Example4() in the sample code. Two input neurons feed to two neurons in a hidden layer, feeding into two neurons in the output layer giving two outputs. This is the exact same network that is walked through on this page which is also linked to at the end of this post: A Step by Step Backpropagation Example

First comes the forward pass where we calculate the network’s output. We’ll give it 0.05 and 0.1 as input, and we’ll say our desired output is 0.01 and 0.99.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0.05*0.15+0.1*0.2+0.35 \\ Z0=0.3775 \\ \\ O0=\sigma(0.3775) \\ O0=0.5933 \\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0.05*0.25+0.1*0.3+0.35\\ Z1=0.3925\\ \\ O1=\sigma(0.3925)\\ O1=0.5969\\ \\ Z2=O0*weight4+O1*weight5+bias2\\ Z2=0.5933*0.4+0.5969*0.45+0.6\\ Z2=1.106\\ \\ O2=\sigma(1.106)\\ O2=0.7514\\ \\ Z3=O0*weight6+O1*weight7+bias3\\ Z3=0.5933*0.5+0.5969*0.55+0.6\\ Z3=1.225\\ \\ O3=\sigma(1.225)\\ O3=0.7729$

Next we calculate the cost, taking O2 and O3 as our actual output, and 0.01 and 0.99 as our target (desired) output.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(0.01,0.99)-(0.7514,0.7729)||^2\\ Cost=0.5*||(-0.7414,-0.2171)||^2\\ Cost=0.5*0.7725^2\\ Cost=0.2984$

Now we start the backward pass to calculate the derivatives for training.

## Neuron 2

First we’ll calculate dCost/dZ2 aka the error in neuron 2, remembering that the value is also dCost/dBias2.

$\frac{\partial Cost}{\partial O2}=O2-target0\\ \frac{\partial Cost}{\partial O2}=0.7514-0.01\\ \frac{\partial Cost}{\partial O2}=0.7414\\ \\ \frac{\partial O2}{\partial Z2} = O2 * (1-O2)\\ \frac{\partial O2}{\partial Z2} = 0.7514 * 0.2486\\ \frac{\partial O2}{\partial Z2} = 0.1868\\ \\ \frac{\partial Cost}{\partial Z2} = \frac{\partial Cost}{\partial O2} * \frac{\partial O2}{\partial Z2}\\ \frac{\partial Cost}{\partial Z2} = 0.7414 * 0.1868\\ \frac{\partial Cost}{\partial Z2} = 0.1385\\ \\ \frac{\partial Cost}{\partial Bias2} = 0.1385$

We can use dCost/dZ2 to calculate dCost/dWeight4 and dCost/dWeight5.

$\frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight4}\\ \frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * O0\\ \frac{\partial Cost}{\partial Weight4} = 0.1385 * 0.5933\\ \frac{\partial Cost}{\partial Weight4} = 0.0822\\ \\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight5}\\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * O1\\ \frac{\partial Cost}{\partial Weight5} = 0.1385 * 0.5969\\ \frac{\partial Cost}{\partial Weight5} = 0.0827\\$

## Neuron 3

Next we’ll calculate dCost/dZ3 aka the error in neuron 3, which is also dCost/dBias3.

$\frac{\partial Cost}{\partial O3}=O3-target1\\ \frac{\partial Cost}{\partial O3}=0.7729-0.99\\ \frac{\partial Cost}{\partial O3}=-0.2171\\ \\ \frac{\partial O3}{\partial Z3} = O3 * (1-O3)\\ \frac{\partial O3}{\partial Z3} = 0.7729 * 0.2271\\ \frac{\partial O3}{\partial Z3} = 0.1755\\ \\ \frac{\partial Cost}{\partial Z3} = \frac{\partial Cost}{\partial O3} * \frac{\partial O3}{\partial Z3}\\ \frac{\partial Cost}{\partial Z3} = -0.2171 * 0.1755\\ \frac{\partial Cost}{\partial Z3} = -0.0381\\ \\ \frac{\partial Cost}{\partial Bias3} = -0.0381$

We can use dCost/dZ3 to calculate dCost/dWeight6 and dCost/dWeight7.

$\frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight6}\\ \frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * O0\\ \frac{\partial Cost}{\partial Weight6} = -0.0381 * 0.5933\\ \frac{\partial Cost}{\partial Weight6} = -0.0226\\ \\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight7}\\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * O1\\ \frac{\partial Cost}{\partial Weight7} = -0.0381 * 0.5969\\ \frac{\partial Cost}{\partial Weight7} = -0.0227\\$

## Neuron 0

Next, we want to calculate dCost/dO0, but doing that requires us to do something new. Neuron 0 affects both neuron 2 and neuron 3, which means that it affects the cost through those two neurons as well. That means our calculation for dCost/dO0 is going to be slightly different, where we add the derivatives of both paths together. Let’s work through it:

$\frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O0} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O0}\\ \frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * Weight4 + \frac{\partial Cost}{\partial Z3} * Weight6\\ \frac{\partial Cost}{\partial O0} = 0.1385 * 0.4 - 0.0381 * 0.5\\ \frac{\partial Cost}{\partial O0} = 0.0364$

We can then continue and calculate dCost/dZ0, which is also dCost/dBias0, and the error in neuron 0.

$\frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.5933 * 0.4067\\ \frac{\partial O0}{\partial Z0} = 0.2413\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = 0.0364 * 0.2413\\ \frac{\partial Cost}{\partial Z0} = 0.0088\\ \\ \frac{\partial Cost}{\partial Bias0} = 0.0088$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0}\\ \frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * input0\\ \frac{\partial Cost}{\partial Weight0} = 0.0088 * 0.05\\ \frac{\partial Cost}{\partial Weight0} = 0.0004\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1}\\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * input1\\ \frac{\partial Cost}{\partial Weight1} = 0.0088 * 0.1\\ \frac{\partial Cost}{\partial Weight1} = 0.0009\\$

## Neuron 1

We are almost done, so hang in there. For our home stretch, we need to calculate dCost/dO1 similarly as we did for dCost/dO0, and then use that to calculate the derivatives of bias1 and weight2 and weight3.

$\frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O1} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O1}\\ \frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * Weight5 + \frac{\partial Cost}{\partial Z3} * Weight7\\ \frac{\partial Cost}{\partial O1} = 0.1385 * 0.45 - 0.0381 * 0.55\\ \frac{\partial Cost}{\partial O1} = 0.0414\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.5969 * 0.4031\\ \frac{\partial O1}{\partial Z1} = 0.2406\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.0414 * 0.2406\\ \frac{\partial Cost}{\partial Z1} = 0.01\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.01$

Lastly, we will use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2}\\ \frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * input0\\ \frac{\partial Cost}{\partial Weight2} = 0.01 * 0.05\\ \frac{\partial Cost}{\partial Weight2} = 0.0005\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3}\\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * input1\\ \frac{\partial Cost}{\partial Weight3} = 0.01 * 0.1\\ \frac{\partial Cost}{\partial Weight3} = 0.001\\$

## Backpropagation Done

Phew, we have all the derivatives we need now.

Here’s our new weights and biases using a learning rate of 0.5:

Weight0 = 0.15 – (0.5 * 0.0004) = 0.1498
Weight1 = 0.2 – (0.5 * 0.0009) = 0.1996
Weight2 = 0.25 – (0.5 * 0.0005) = 0.2498
Weight3 = 0.3 – (0.5 * 0.001) = 0.2995
Weight4 = 0.4 – (0.5 * 0.0822) = 0.3589
Weight5 = 0.45 – (0.5 * 0.0827) = 0.4087
Weight6 = 0.5 – (0.5 * -0.0226) = 0.5113
Weight7 = 0.55 – (0.5 * -0.0227) = 0.5614
Bias0 = 0.35 – (0.5 * 0.0088) = 0.3456
Bias1 = 0.35 – (0.5 * 0.01) = 0.345
Bias2 = 0.6 – (0.5 * 0.1385) = 0.5308
Bias3 = 0.6 – (0.5 * -0.0381) = 0.6191

Using these new values, the cost function value drops from 0.2984 to 0.2839, so we have made progress!

Interestingly, it only takes 5,000 iterations of learning for this network to reach an error of 0.007157, when it took 10,000 iterations of learning for example 1 to get to 0.007176.

Before moving on, take a look at the weight adjustments above. You might notice that the derivatives for the weights are much smaller for weights 0,1,2,3 compared to weights 4,5,6,7. The reason for this is because weights 0,1,2,3 appear earlier in the network. The problem is that earlier layer neurons don’t learn as fast as later layer neurons and this is caused by the nature of the neuron activation functions – specifically, that the sigmoid function has a long tail near 0 and 1 – and is called the “vanishing gradient problem”. The opposite effect can also happen however, where earlier layer gradients explode to super huge numbers, so the more general term is called the “unstable gradient problem”. This is an active area of research on how to address, and this becomes more and more of a problem the more layers you have in your network.

You can use other activation functions such as tanh, identity, relu and others to try and get around this problem. If trying different activation functions, the forward pass (evaluation of a neural network) as well as the backpropagation of error pass remain the same, but of course the calculation for getting O from Z changes, and of course, calculating the derivative deltaO/deltaZ becomes different. Everything else remains the same.

# Sample Code

Below is the sample code which implements all the back propagation examples we worked through above.

Note that this code is meant to be readable and understandable. The code is not meant to be re-usable or highly efficient.

A more efficient implementation would use SIMD instructions, multithreading, stochastic gradient descent, and other things.

It’s also useful to note that calculating a neuron’s Z value is actually a dot product and an addition and that the addition can be handled within the dot product by adding a “fake input” to each neuron that is a constant of 1. This lets you do a dot product to calculate the Z value of a neuron, which you can take further and combine into matrix operations to calculate multiple neuron values at once. You’ll often see neural networks described in matrix notation because of this, but I have avoided that in this post to try and make things more clear to programmers who may not be as comfortable thinking in strictly matrix notation.

#include <stdio.h>
#include <array>

// Nonzero value enables csv logging.
#define LOG_TO_CSV_NUMSAMPLES() 50

// ===== Example 1 - One Neuron, One training Example =====

void Example1RunNetwork (
float input, float desiredOutput,
float weight, float bias,
float& error, float& cost, float& actualOutput,
float& deltaCost_deltaWeight, float& deltaCost_deltaBias, float& deltaCost_deltaInput
) {
// calculate Z (weighted input) and O (activation function of weighted input) for the neuron
float Z = input * weight + bias;
float O = 1.0f / (1.0f + std::exp(-Z));

// the actual output of the network is the activation of the neuron
actualOutput = O;

// calculate error
error = std::abs(desiredOutput - actualOutput);

// calculate cost
cost = 0.5f * error * error;

// calculate how much a change in neuron activation affects the cost function
// deltaCost/deltaO = O - target
float deltaCost_deltaO = O - desiredOutput;

// calculate how much a change in neuron weighted input affects neuron activation
// deltaO/deltaZ = O * (1 - O)
float deltaO_deltaZ = O * (1 - O);

// calculate how much a change in a neuron's weighted input affects the cost function.
// This is deltaCost/deltaZ, which equals deltaCost/deltaO * deltaO/deltaZ
// This is also deltaCost/deltaBias and is also refered to as the error of the neuron
float neuronError = deltaCost_deltaO * deltaO_deltaZ;
deltaCost_deltaBias = neuronError;

// calculate how much a change in the weight affects the cost function.
// deltaCost/deltaWeight = deltaCost/deltaO * deltaO/deltaZ * deltaZ/deltaWeight
// deltaCost/deltaWeight = neuronError * deltaZ/deltaWeight
// deltaCost/deltaWeight = neuronError * input
deltaCost_deltaWeight = neuronError * input;

// As a bonus, calculate how much a change in the input affects the cost function.
// Follows same logic as deltaCost/deltaWeight, but deltaZ/deltaInput is the weight.
// deltaCost/deltaInput = neuronError * weight
deltaCost_deltaInput = neuronError * weight;
}

void Example1 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example1.csv","w+t");
if (file != nullptr)
fprintf(file, "\"training index\",\"error\",\"cost\",\"weight\",\"bias\",\"dCost/dWeight\",\"dCost/dBias\",\"dCost/dInput\"\n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 10000;

// training data
// input: 1, output: 0
const std::array<float, 2> c_trainingData = {1.0f, 0.0f};

// starting weight and bias values
float weight = 0.3f;
float bias = 0.5f;

// iteratively train the network
float error = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
// run the network to get error and derivatives
float output = 0.0f;
float cost = 0.0f;
float deltaCost_deltaWeight = 0.0f;
float deltaCost_deltaBias = 0.0f;
float deltaCost_deltaInput = 0.0f;
Example1RunNetwork(c_trainingData[0], c_trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, "\"%zi\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\n", trainingIndex, error, cost, weight, bias, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);
}
#endif

weight -= deltaCost_deltaWeight * c_learningRate;
bias -= deltaCost_deltaBias * c_learningRate;
}

printf("Example1 Final Error: %f\n", error);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 2 - One Neuron, Two training Examples =====

void Example2 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example2.csv","w+t");
if (file != nullptr)
fprintf(file, "\"training index\",\"error\",\"cost\",\"weight\",\"bias\",\"dCost/dWeight\",\"dCost/dBias\",\"dCost/dInput\"\n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 100000;

// training data
// input: 1, output: 0
// input: 0, output: 1
const std::array<std::array<float, 2>, 2> c_trainingData = { {
{1.0f, 0.0f},
{0.0f, 1.0f}
} };

// starting weight and bias values
float weight = 0.3f;
float bias = 0.5f;

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
avgError = 0.0f;
float avgOutput = 0.0f;
float avgCost = 0.0f;
float avgDeltaCost_deltaWeight = 0.0f;
float avgDeltaCost_deltaBias = 0.0f;
float avgDeltaCost_deltaInput = 0.0f;

// run the network to get error and derivatives for each training example
for (const std::array<float, 2>& trainingData : c_trainingData)
{
float error = 0.0f;
float output = 0.0f;
float cost = 0.0f;
float deltaCost_deltaWeight = 0.0f;
float deltaCost_deltaBias = 0.0f;
float deltaCost_deltaInput = 0.0f;
Example1RunNetwork(trainingData[0], trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

avgError += error;
avgOutput += output;
avgCost += cost;
avgDeltaCost_deltaWeight += deltaCost_deltaWeight;
avgDeltaCost_deltaBias += deltaCost_deltaBias;
avgDeltaCost_deltaInput += deltaCost_deltaInput;
}

avgError /= (float)c_trainingData.size();
avgOutput /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
avgDeltaCost_deltaWeight /= (float)c_trainingData.size();
avgDeltaCost_deltaBias /= (float)c_trainingData.size();
avgDeltaCost_deltaInput /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, "\"%zi\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\n", trainingIndex, avgError, avgCost, weight, bias, avgDeltaCost_deltaWeight, avgDeltaCost_deltaBias, avgDeltaCost_deltaInput);
}
#endif

weight -= avgDeltaCost_deltaWeight * c_learningRate;
bias -= avgDeltaCost_deltaBias * c_learningRate;
}

printf("Example2 Final Error: %f\n", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 3 - Two inputs, two neurons in one layer =====

struct SExample3Training
{
std::array<float, 2> m_input;
std::array<float, 2> m_output;
};

void Example3RunNetwork (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, 4>& weights, const std::array<float, 2>& biases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, 4>& deltaCost_deltaWeights, std::array<float, 2>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {

// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// the actual output of the network is the activation of the neurons
actualOutput[0] = O0;
actualOutput[1] = O1;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = O0 - target0
float deltaCost_deltaO0 = O0 - desiredOutput[0];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaCost_deltaBiases[0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaCost_deltaWeights[0] = neuron0Error * input[0];
deltaCost_deltaWeights[1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = O1 - target1
float deltaCost_deltaO1 = O1 - desiredOutput[1];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO0/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaCost_deltaBiases[1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO/deltaZ1 * deltaZ0/deltaWeight1
// deltaCost/deltaWeight2 = neuron1Error * deltaZ/deltaWeight1
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaCost_deltaWeights[2] = neuron1Error * input[0];
deltaCost_deltaWeights[3] = neuron1Error * input[1];

//----- Input -----

// As a bonus, calculate how much a change in the inputs affect the cost function.
// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
// That means that...
// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
//                       = neuron0Error * weight0 + neuron1Error * weight2
// and
// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
//                       = neuron0Error * weight1 + neuron1Error * weight3
deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example3 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example3.csv","w+t");
if (file != nullptr)
fprintf(file, "\"training index\",\"error\",\"cost\"\n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 520000;

// training data: OR/AND
// input: 00, output: 00
// input: 01, output: 10
// input: 10, output: 10
// input: 11, output: 11
const std::array<SExample3Training, 4> c_trainingData = { {
{{0.0f, 0.0f}, {0.0f, 0.0f}},
{{0.0f, 1.0f}, {1.0f, 0.0f}},
{{1.0f, 0.0f}, {1.0f, 0.0f}},
{{1.0f, 1.0f}, {1.0f, 1.0f}},
} };

// starting weight and bias values
std::array<float, 4> weights = { 0.2f, 0.8f, 0.6f, 0.4f };
std::array<float, 2> biases = { 0.5f, 0.1f };

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
//float avgCost = 0.0f;
std::array<float, 2> avgOutput = { 0.0f, 0.0f };
std::array<float, 4> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaBiases = { 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
avgError = 0.0f;
float avgCost = 0.0;

// run the network to get error and derivatives for each training example
for (const SExample3Training& trainingData : c_trainingData)
{
float error = 0.0f;
std::array<float, 2> output = { 0.0f, 0.0f };
float cost = 0.0f;
std::array<float, 4> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaBiases = { 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
Example3RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

avgError += error;
avgCost += cost;
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] += output[i];
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
}

avgError /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, "\"%zi\",\"%f\",\"%f\"\n", trainingIndex, avgError, avgCost);
}
#endif

for (size_t i = 0; i < weights.size(); ++i)
weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
for (size_t i = 0; i < biases.size(); ++i)
biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
}

printf("Example3 Final Error: %f\n", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 4 - Two layers with two neurons in each layer =====

void Example4RunNetwork (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, 8>& weights, const std::array<float, 4>& biases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, 8>& deltaCost_deltaWeights, std::array<float, 4>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {
// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// calculate Z2 and O2 for neuron2
float Z2 = O0 * weights[4] + O1 * weights[5] + biases[2];
float O2 = 1.0f / (1.0f + std::exp(-Z2));

// calculate Z3 and O3 for neuron3
float Z3 = O0 * weights[6] + O1 * weights[7] + biases[3];
float O3 = 1.0f / (1.0f + std::exp(-Z3));

// the actual output of the network is the activation of the output layer neurons
actualOutput[0] = O2;
actualOutput[1] = O3;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 2 -----

// calculate how much a change in neuron 2 activation affects the cost function
// deltaCost/deltaO2 = O2 - target0
float deltaCost_deltaO2 = O2 - desiredOutput[0];

// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
// deltaO2/deltaZ2 = O2 * (1 - O2)
float deltaO2_deltaZ2 = O2 * (1 - O2);

// calculate how much a change in neuron 2 weighted input affects the cost function.
// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
deltaCost_deltaBiases[2] = neuron2Error;

// calculate how much a change in weight4 affects the cost function.
// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * O0
// similar thing for weight5
deltaCost_deltaWeights[4] = neuron2Error * O0;
deltaCost_deltaWeights[5] = neuron2Error * O1;

//----- Neuron 3 -----

// calculate how much a change in neuron 3 activation affects the cost function
// deltaCost/deltaO3 = O3 - target1
float deltaCost_deltaO3 = O3 - desiredOutput[1];

// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
// deltaO3/deltaZ3 = O3 * (1 - O3)
float deltaO3_deltaZ3 = O3 * (1 - O3);

// calculate how much a change in neuron 3 weighted input affects the cost function.
// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
deltaCost_deltaBiases[3] = neuron3Error;

// calculate how much a change in weight6 affects the cost function.
// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * O0
// similar thing for weight7
deltaCost_deltaWeights[6] = neuron3Error * O0;
deltaCost_deltaWeights[7] = neuron3Error * O1;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
float deltaCost_deltaO0 = neuron2Error * weights[4] + neuron3Error * weights[6];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaCost_deltaBiases[0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaCost_deltaWeights[0] = neuron0Error * input[0];
deltaCost_deltaWeights[1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
float deltaCost_deltaO1 = neuron2Error * weights[5] + neuron3Error * weights[7];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO1/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaCost_deltaBiases[1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaCost_deltaWeights[2] = neuron1Error * input[0];
deltaCost_deltaWeights[3] = neuron1Error * input[1];

//----- Input -----

// As a bonus, calculate how much a change in the inputs affect the cost function.
// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
// That means that...
// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
//                       = neuron0Error * weight0 + neuron1Error * weight2
// and
// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
//                       = neuron0Error * weight1 + neuron1Error * weight3
deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example4 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example4.csv","w+t");
if (file != nullptr)
fprintf(file, "\"training index\",\"error\",\"cost\"\n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 5000;

// training data: 0.05, 0.1 in = 0.01, 0.99 out
const std::array<SExample3Training, 1> c_trainingData = { {
{{0.05f, 0.1f}, {0.01f, 0.99f}},
} };

// starting weight and bias values
std::array<float, 8> weights = { 0.15f, 0.2f, 0.25f, 0.3f, 0.4f, 0.45f, 0.5f, 0.55f};
std::array<float, 4> biases = { 0.35f, 0.35f, 0.6f, 0.6f };

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
std::array<float, 2> avgOutput = { 0.0f, 0.0f };
std::array<float, 8> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 4> avgDeltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
avgError = 0.0f;
float avgCost = 0.0;

// run the network to get error and derivatives for each training example
for (const SExample3Training& trainingData : c_trainingData)
{
float error = 0.0f;
std::array<float, 2> output = { 0.0f, 0.0f };
float cost = 0.0f;
std::array<float, 8> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 4> deltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
Example4RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

avgError += error;
avgCost += cost;
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] += output[i];
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
}

avgError /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, "\"%zi\",\"%f\",\"%f\"\n", trainingIndex, avgError, avgCost);
}
#endif

for (size_t i = 0; i < weights.size(); ++i)
weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
for (size_t i = 0; i < biases.size(); ++i)
biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
}

printf("Example4 Final Error: %f\n", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

int main (int argc, char **argv)
{
Example1();
Example2();
Example3();
Example4();
system("pause");
return 0;
}


The sample code outputs csv files showing how the values of the networks change over time. One of the reasons for this is because I want to show you error over time.

Below is example 4’s error over time, as we do it’s 5,000 learning iterations.

The other examples show a similarly shaped graph, where there is a lot of learning in the very beginning, and then there is a very long tail of learning very slowly.

When you train neural networks as I’ve described them, you will almost always see this, and sometimes will also see a slow learning time at the BEGINNING of the training.

This issue is also due to the activation function used, just like the unstable gradient problem, and is also an active area of research.

To help fix this issue, there is something called a “cross entropy cost function” which you can use instead of the mean squared error cost function I have been using.

That cost function essentially cancels out the non linearity of the activation function so that you get nicer linear learning progress, and can get networks to learn more quickly and evenly. However, it only cancels out the non linearity for the LAST layer in the network. This means it’s still a problem for networks that have more layers.

Lastly, there is an entirely different thing you can use backpropagation for. We adjusted the weights and biases to get our desired output for the desired inputs. What if instead we adjusted our inputs to give us the desired outputs?

You can do that by using backpropagation to calculate the dCost/dInput derivatives and using those to adjust the input, in the exact same way we adjusted the weights and biases.

You can use this to do some interesting things, including:

1. finding images that a network will recognize as a familiar object, that a human wouldn’t. Start with static as input to the network, and adjust inputs to give the desired output.
2. Modifying images that a network recognizes, into images it doesn’t recognize, but a human would. Start with a well recognized image, and adjust inputs using gradient ASCENT (add the derivatives, don’t subtract them) until the network stops recognizing it.

Believe it or not, this is how all those creepy “deep dream” images were made that came out of google as well, like the one below.

Now that you know the basics, you are ready to learn some more if you are interested. If you still have some questions about things I did or didn’t talk about, these resources might help you make sense of it too. I used these resources and they were all very helpful! You can also give me a shout in the comments below, or on twitter at @Atrix256.

## Multivariable Dual Numbers & Automatic Differentiation

In a previous post I showed how to use dual numbers to be able to get both the value and derivative of a function at the same time:
Dual Numbers & Automatic Differentiation

That post mentions that you can extend it to multivariable functions but doesn’t explain how. This post is that explanation, including simple working C++ code!

Extending this to multivariable functions is useful for ray marching, calculating analytical surface normals and also likely useful for training a neural network if you want an alternative to back propagation. I’m not sure about the efficiency comparison of this versus back propagation but I intend on looking into it (:

# How Does it Work?

It turns out to be really simple to use dual numbers with multivariable functions. The end result is that you want a partial derivative for each variable in the equation, so to do that, you just have a dual number per variable, and process the entire equation for each of those dual numbers separately.

We’ll work through an example. Let’s find the partial derivatives of x and y of the function $3x^2-2y^3$, at input (5,2).

We’ll start by finding the derivative of x, and then the derivative of y.

# Example: df/dx

We start by making a dual number for our x value, remembering that the real part is the actual value for x, and the dual part is the derivative of x, which is 1:

$5+1\epsilon$

or:

$5+\epsilon$

We multiply that value by itself to get the $x^2$ value, keeping in mind that $\epsilon^2$ is zero:
$(5+\epsilon)*(5+\epsilon)= \\ 25+10\epsilon+\epsilon^2= \\ 25+10\epsilon \\$

Next we need to multiply that by 3 to get the $3x^2$ term:

$3*(25+10\epsilon) = 75+30\epsilon$

Putting that aside for a moment, we need to make the $2y^3$ term. We start by making our y value:

$2+0\epsilon$

or:

$2$

If you are wondering why it has a zero for the epsilon term, it’s because when we are calculating the partial derivative of x, y is a constant, so has a derivative of zero.

Next, we multiply this y value by itself twice to get the $y^3$ value:

$2*2*2=8$

We then multiply it by 2 to get the $2y^3$ term:

$8*2=16$

Now that we have our two terms, we subtract the y term from the x term to get our final result:

$75+30\epsilon-16 = \\ 59+30\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30.

That checks out, let’s move on to the derivative of y!

# Example: df/dy

Calculating the derivative of y is very similar to calculating the derivative of x, except that it’s the x term that has an epsilon value (derivative) of 0, instead of the y term. The y term has the epsilon value of 1 this time as well. We’ll work through it to see how it plays out.

First up, we need to make the value for x:

$5+0\epsilon$

or:

$5$

Next we square it and multiply it by 3 to get the $3x^2$ term:

$5*5*3=75$

Next we need to make the value for y, remembering that we use an epsilon value of 1 since the derivative of y is 1 this time around.

$2+\epsilon$

We cube that value and multiply by 2 to get the $2y^3$ term:
$2*(2+\epsilon)*(2+\epsilon)*(2+\epsilon)= \\ 2*(2+\epsilon)*(4+4\epsilon+\epsilon^2)= \\ 2*(2+\epsilon)*(4+4\epsilon)= \\ 2*(8+12\epsilon+4\epsilon^2)= \\ 2*(8+12\epsilon)= \\ 16+24\epsilon$

Now we subtract the y term from the x term to get the final result:

$75 - (16+24\epsilon)= \\ 59-24\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of y at that point is -24.

That also checks out, so we got the correct value and partial derivatives for the equation.

# Reducing Redundancy

There was quite a bit of redundancy when working through the x and y derivatives wasn’t there? Increasing the number of variables will increase the amount of redundancy too, so it doesn’t scale up well.

Luckily there is a way to address this. Basically, instead of making two dual numbers which have two items, you make them share the real value (since it’s the same for both, as is the work to make it) and append the dual values for x and y to it.

$x'=(a+b\epsilon) \\ y'=(a+b\epsilon)$

You have:

$(a+b\epsilon_x+c\epsilon_y)$

Then, in your math or in your program, you treat it as if it’s two different dual numbers packed into one. This lets you do the work for the real number once instead of twice, but still lets you do your dual number work for each variable independently.

While it’s probably easiest to think of these as two dual numbers packed into one value, there is actually a mathematical basis for it as well, which may or may not surprise you.

Check out what happens when we multiply two of these together, keeping in mind that multiplying ANY two epsilon values together becomes zero, even if they are different epsilons:

$(a+b\epsilon_x+c\epsilon_y) * (d+e\epsilon_x+f\epsilon_y)= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + be\epsilon_x^2 + bf\epsilon_x\epsilon_y + cd\epsilon_y + ce\epsilon_x\epsilon_y + cf\epsilon_y^2= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + cd\epsilon_y= \\ ad + (ae+bd)\epsilon_x + (af+cd)\epsilon_y$

The interesting thing is that the above result gives you the same values as if you did the same work for two dual numbers individually.

Let’s see this three component dual number in action by re-doing the example again. Note that this pattern scales up to ANY number of variables!

# Example: Both Derivatives (Gradient Vector)

Our goal is to calculate the value and partial derivatives of the function $3x^2-2y^3$ at location (5,2).

First we make our x value:

$5 + 1\epsilon_x + 0\epsilon_y$

or:

$5 + \epsilon_x$

We square that and multiply it by 3 to get our $3x^2$ term:

$3*(5 + \epsilon_x)*(5 + \epsilon_x)= \\ 3*(25+10\epsilon_x+\epsilon_x^2)= \\ 3*(25+10\epsilon_x)= \\ 75+30\epsilon_x$

Next, we make our y value:

$2 + 0\epsilon_x + 1\epsilon_y$

or:

$2 + \epsilon_y$

We cube it and multiply it by 2 to get our $2x^3$ term:

$16+24\epsilon_y$

Lastly we subtract the y term from the x term to get our final answer:

$(75+30\epsilon_x) - (16+24\epsilon_y)= \\ 59+30\epsilon_x-24\epsilon_y$

The result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30, and the derivative of y at that point is -24.

Neat, right?!

# Example Code

Here is the example code output:

Here is the code that generated it:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>

#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

// constructor to make a constant
CDualNumber (float f = 0.0f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
CDualNumber (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// storage for real and dual values
float							m_real;
std::array<float, NUMVARIABLES> m_dual;
};

//----------------------------------------------------------------------
// Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real + b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real - b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real * b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real / b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
float sqrtReal = sqrt(a.m_real);
ret.m_real = sqrtReal;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = pow(a.m_real, y);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = sin(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = cos(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Test Functions
//----------------------------------------------------------------------

void TestSmoothStep (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// calculate value and derivative using dual numbers
CDualNumber<1> y = SmoothStep(x);

// calculate numeric derivative using central differences
float derivNumeric = (SmoothStep(input + EPSILON) - SmoothStep(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 6.0f * input - 6.0f * input * input;

// show value and derivatives
printf("(smoothstep) y=3x^2-2x^3  (x=%0.4f)\n", input);
printf("  y = %0.4f\n", y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

void TestTrig (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// sin
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = sin(x);

// calculate numeric derivative using central differences
float derivNumeric = (sin(input + EPSILON) - sin(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = cos(input);

// show value and derivatives
printf("sin(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// cos
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = cos(x);

// calculate numeric derivative using central differences
float derivNumeric = (cos(input + EPSILON) - cos(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = -sin(input);

// show value and derivatives
printf("cos(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// tan
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = tan(x);

// calculate numeric derivative using central differences
float derivNumeric = (tan(input + EPSILON) - tan(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.0f / (cos(input)*cos(input));

// show value and derivatives
printf("tan(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// atan
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = atan(x);

// calculate numeric derivative using central differences
float derivNumeric = (atan(input + EPSILON) - atan(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.0f / (1.0f + input * input);

// show value and derivatives
printf("atan(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}
}

void TestSimple (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// sqrt
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = CDualNumber<1>(3.0f) / sqrt(x);

// calculate numeric derivative using central differences
float derivNumeric = ((3.0f / sqrt(input + EPSILON)) - (3.0f / sqrt(input - EPSILON))) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = -3.0f / (2.0f * pow(input, 3.0f / 2.0f));

// show value and derivatives
printf("3/sqrt(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// pow
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = pow(x + CDualNumber<1>(1.0f), 1.337f);

// calculate numeric derivative using central differences
float derivNumeric = ((pow(input + 1.0f + EPSILON, 1.337f)) - (pow(input + 1.0f - EPSILON, 1.337f))) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.337f * pow(input + 1.0f, 0.337f);

// show value and derivatives
printf("(%0.4f+1)^1.337 = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}
}

void Test2D (float inputx, float inputy)
{
// create dual numbers as the value of x and y
CDualNumber<2> x(inputx, 0);
CDualNumber<2> y(inputy, 1);

// z = 3x^2 - 2y^3
{
// calculate value and partial derivatives using dual numbers
CDualNumber<2> z = CDualNumber<2>(3.0f) * x * x - CDualNumber<2>(2.0f) * y * y * y;

// calculate numeric partial derivatives using central differences
auto f = [] (float x, float y) {
return 3.0f * x * x - 2.0f * y * y * y;
};
float derivNumericX = (f(inputx + EPSILON, inputy) - f(inputx - EPSILON, inputy)) / (2.0f * EPSILON);
float derivNumericY = (f(inputx, inputy + EPSILON) - f(inputx, inputy - EPSILON)) / (2.0f * EPSILON);

// calculate actual partial derivatives
float derivActualX = 6.0f * inputx;
float derivActualY = -6.0f * inputy * inputy;

// show value and derivatives
printf("z=3x^2-2y^3 (x = %0.4f, y = %0.4f)\n", inputx, inputy);
printf("  z = %0.4f\n", z.m_real);
printf("  dual# dz/dx = %0.4f\n", z.m_dual[0]);
printf("  dual# dz/dy = %0.4f\n", z.m_dual[1]);
printf("  actual dz/dx = %0.4f\n", derivActualX);
printf("  actual dz/dy = %0.4f\n", derivActualY);
printf("  numeric dz/dx = %0.4f\n", derivNumericX);
printf("  numeric dz/dy = %0.4f\n\n", derivNumericY);
}
}

void Test3D (float inputx, float inputy, float inputz)
{
// create dual numbers as the value of x and y
CDualNumber<3> x(inputx, 0);
CDualNumber<3> y(inputy, 1);
CDualNumber<3> z(inputz, 2);

// w = sin(x*cos(2*y)) / tan(z)
{
// calculate value and partial derivatives using dual numbers
CDualNumber<3> w = sin(x * cos(CDualNumber<3>(2.0f)*y)) / tan(z);

// calculate numeric partial derivatives using central differences
auto f = [] (float x, float y, float z) {
return sin(x*cos(2.0f*y)) / tan(z);
};
float derivNumericX = (f(inputx + EPSILON, inputy, inputz) - f(inputx - EPSILON, inputy, inputz)) / (2.0f * EPSILON);
float derivNumericY = (f(inputx, inputy + EPSILON, inputz) - f(inputx, inputy - EPSILON, inputz)) / (2.0f * EPSILON);
float derivNumericZ = (f(inputx, inputy, inputz + EPSILON) - f(inputx, inputy, inputz - EPSILON)) / (2.0f * EPSILON);

// calculate actual partial derivatives
float derivActualX = cos(inputx*cos(2.0f*inputy))*cos(2.0f * inputy) / tan(inputz);
float derivActualY = cos(inputx*cos(2.0f*inputy)) *-2.0f*inputx*sin(2.0f*inputy) / tan(inputz);
float derivActualZ = sin(inputx * cos(2.0f * inputy)) / -(sin(inputz) * sin(inputz));

// show value and derivatives
printf("w=sin(x*cos(2*y))/tan(z) (x = %0.4f, y = %0.4f, z = %0.4f)\n", inputx, inputy, inputz);
printf("  w = %0.4f\n", w.m_real);
printf("  dual# dw/dx = %0.4f\n", w.m_dual[0]);
printf("  dual# dw/dy = %0.4f\n", w.m_dual[1]);
printf("  dual# dw/dz = %0.4f\n", w.m_dual[2]);
printf("  actual dw/dx = %0.4f\n", derivActualX);
printf("  actual dw/dy = %0.4f\n", derivActualY);
printf("  actual dw/dz = %0.4f\n", derivActualZ);
printf("  numeric dw/dx = %0.4f\n", derivNumericX);
printf("  numeric dw/dy = %0.4f\n", derivNumericY);
printf("  numeric dw/dz = %0.4f\n\n", derivNumericZ);
}
}

int main (int argc, char **argv)
{
TestSmoothStep(0.5f);
TestSmoothStep(0.75f);
TestTrig(PI * 0.25f);
TestSimple(3.0f);
Test2D(1.5f, 3.28f);
Test3D(7.12f, 8.93f, 12.01f);
return 0;
}


# Closing

One of the neatest things about dual numbers is that they give precise results. They are not approximations and they are not numerical methods, unlike the central differences method that I compared them to in the example program (More info on numerical derivatives here: Finite Differences). Using dual numbers gives you exact derivatives, within the limitations of (eg) floating point math.

It turns out that backpropagation (the method that is commonly used to train neural networks) is just steepest gradient descent. You can read about that here: Backpropogation is Just Steepest Descent with Automatic Differentiation

That makes me wonder how dual numbers would do in run time speed compared to back propagation as well as numerical methods for getting the gradient to adjust a neural network during training.

If I had to guess, I’d say that dual numbers may be slightly slower than backpropagation, but not as slow as numerical methods which are going to be much, much slower. We’ll see though. It may be much easier to implement neural network learning using dual numbers compared to backpropagation, so may be worth an exploration and write up, even if only to make neural networks a little bit more accessible to people.

Comments, corrections, etc? Let me know in the comments below, or contact me on twitter at @Atrix256

## A Geometric Interpretation of Neural Networks

In the 90s before I was a professional programmer / game developer I looked at neural networks and found them interesting but got scared off by things like back propagation, which I wasn’t yet ready to understand.

With all the interesting machine learning things going on in modern times, I decided to have a look again and have been pleasantly surprised at how simple they are to understand. If you have knowledge of partial derivatives and gradients (like, if you’ve done any ray marching), you have the knowledge it takes to understand it.

Here are some really great resources I recomend for learning the nuts and bolts of how modern neural networks actually work:
Learn TensorFlow and deep learning, without a Ph.D.
Neural Networks and Deep Learning
A Neural Network Playground (Web Based NN sand box from google)

This post doesn’t require any understanding of neural networks or partial derivatives, so don’t worry if you don’t have that knowledge. The harder math comes up when training a neural network, but we are only going to be dealing with evaluating neural networks, which is much simpler.

## A Geometric Interpretation of a Neuron

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron.

To calculate the output value of a neuron, you multiply every input into that neuron by a weight for that input, and add a bias. This value is fed into an activation function (more on activation functions shortly) and the result is the output value of that neuron. Here is a diagram for a single neuron:

A more formal definition of a neuron’s output is below. $b$ is the bias, $w_j$ is the j’th weight and $i_j$ is the j’th input.
$Output = b+\sum_{j=0}^nw_ji_j$

You might notice that the above is really just the dot product of the weight vector and the input vector, with a value added on the end. We could re-write it like that:
$Output = Dot(w,i)+b$

Interestingly, that is the same equation that you use to find the distance of a point to a plane. Let’s say that we have a plane defined by a unit length normal N and a distance to the origin d, and we want to calculate the distance of a point P to that plane. We’d use this formula:
$Distance = Dot(N,P)+d$

This would give us a signed distance, where the value will be negative if we are in the negative half space defined by the plane, and positive otherwise.

This equation works if you are working in 3 dimensional space, but also works in general for any N dimensional point and plane definition.

What does this mean? Well, this tells us that every neuron in a neural network is essentially deciding what side of a hyperplane a point is on. Each neuron is doing a linear classification, saying if something is on side A or side B, and giving a distance of how far it is into A or B.

This also means that when you combine multiple neurons into a network, that an output neuron of that neural network tells you whether the input point is inside or outside of some shape, and by how much.

I find this interesting for two reason.

Firstly, it means that a neural network can be interpreted as encoding SHAPES, which means it could be used for modeling shapes. I’m interested in seeing what sort of shapes it’s capable of, and any sorts of behaviors this representation might have. I don’t expect it to be useful for, say, main stream game development (bonus if it is useful!) but at minimum it ought to be an interesting investigation to help understand neural networks a bit better.

Secondly, there is another machine learning algorithm called Support Vector Machines which are also based on being able to tell you which side of a separation a data point is on. However, unlike the above, SVM separations are not limited to plane tests and can use arbitrary shapes for separation. Does this mean that we are leaving something on the table for neural networks? Could we do better than we are to make networks with fewer layers and fewer neurons that do better classification by doing non linear separation tests?

Quick side note: besides classification, neural nets can help us with something called regression, which is where you fit a network to some analog data, instead of the more discrete problem of classification, which tells you what group something belongs to.

It turns out that the activation function of a neuron can have a pretty big influence on what sort of shapes are possible, which makes it so we aren’t strictly limited to shapes made up of planes and lines, and also means we aren’t necessarily leaving things on the table compared to SVM’s.

This all sort of gives me the feeling though that modern neural networks may not be the best possible algorithm for the types of things we use them for. I feel like we may need to generalize them beyond biological limitations to allow things like multiplications between weighted inputs instead of only sums. I feel like that sort of setup will be closer to whatever the real ideal “neural computation” model might be. However, the modern main stream neural models do have the benefit that they are evaluated very efficiently via dot products. They are particularly well suited for execution on GPUs which excel at performing homogenous operations on lots and lots of data. So, a more powerful and more general “neuron” may come at the cost of increased computational costs, which may make them less desirable in the end.

As a quick tangent, here is a paper from 2011 which shows a neural network model which does in fact allow for multiplication between neuron inputs. You then will likely be wanting exponents and other things, so while it’s a step in the right direction perhaps, it doesn’t yet seem to be the end all be all!
Sum-Product Networks: A New Deep Architecture

It’s also worth while to note that there are other flavors of neural networks, such as convolutional neural networks, which work quite a bit differently.

Let’s investigate this geometric interpretation of neurons as binary classifiers a bit, focusing on some different activation functions!

## Step Activation Function

The Heaviside step function is very simple. If you give it a value greater than zero, it returns a 1, else it returns a 0. That makes our neuron just spit out binary: either a 0 or a 1. The output of a neuron using the step activation function is just the below:

$Output = Dot(w,i)+b > 0$

The output of a neuron using the step activation function is true if the input point is in the positive half space of the plane that this neuron describes, else it returns false.

Let’s think about this in 2d. Let’s make a neural network that takes x and y as input and spits out a value. We can make an image that visualizes the range from (-1,-1) to (1,1). Negative values can be shown in blue, zero in white, and positive values in orange.

To start out, we’ll make a 2d plane (aka a line) that runs vertically and passes through the origin. That means it is a 2d plane with a normal of (1,0) and a distance from the origin of 0. In other words, our network will have a single neuron that has weights of (1,0) and a bias of 0. This is what the visualization looks like:

You can actually play around with the experiments of this post and do your own using an interactive visualization I made for this post. Click here to see this first experiment: Experiment: Vertical Seperation

We can change the normal (weights) to change the angle of the line, and we can also change the bias to move the line to it’s relative left or right. Here is the same network that has it’s weights adjusted to (1,1) with a bias of 0.1.

Experiment: Diagonal Separation

The normal (1,1) isn’t normalized though, which makes it so the distance from origin (aka the bias) isn’t really 0.1 units. The distance from origin is actually divided by the length of the normal to get the REAL distance to origin, so in the above image, where the normal is a bit more than 1.0, the line is actually less than 0.1 units from the origin.

Below is the visualization if we normalize the weights to (0.707,0.707) but leave the bias at 0.1 units. The result is that the line is actually 0.1 units away from the origin.

Experiment: Normalized Diagonal Separation

Recalling the description of our visualization, white is where the network outputs zero, while orange is where the network outputs a positive number (which is 1 in this case).

If we define three lines such that their negative half spaces don’t completely overlap, we can get a triangle where the network outputs a zero, while everywhere else it outputs a positive value. We do this by having three sibling neurons in the first layer which define three separate lines, and then in the output neuron we give them all a weight of 1. This makes it so the area outside the triangle is always a positive value, which step turns into 1, but inside the triangle, the value remains at 0.

We can turn this negative space triangle into a positive space triangle however by making the output neuron have a weight on the inputs of -1, and adding a bias of 0.1. This makes it so that pixels in the positive space of any of the lines will become a negative value. The negative space of those three lines get a small bias to make it be a positive value, resulting in the step function making the values be 0 outside of the triangle, and 1 inside the triangle. This gives us a positive space triangle:

Taking this a bit further, we can make the first layer define 6 lines, which make up two different triangles – a bigger one and a smaller one. We can then have a second layer which has two neurons – one which makes a positive space larger triangle, and one which makes a positive space smaller triangle. Then, in the output neuron we can give the larger triangle neuron a weight of 1, while giving the smaller triangle neuron a weight of -1. The end result is that we have subtracted the smaller triangle from the larger one:

Using the step function we have so far been limited to line based shapes. This has been due to the fact that we can only test our inputs against lines. There is a way around this though: Pass non linear input into the network!

Below is a circle with radius 0.5. The neural network has only a single input which is sqrt(x*x+y*y). The output neuron has a bias of -0.5 and a weight of 1. It’s the bias value which controls the radius of the circle.

You could pass other non linear inputs into the network to get a whole host of other shapes. You could pass sin(x) as an input for example, or x squared.

While the step function is inherently very much limited to linear tests, you can still do quite a lot of interesting non linear shapes (and data separations) by passing non linear input into the network.

Unfortunately though, you as a human would have to know the right non linear inputs to provide. The network can’t learn how to make optimal non linear separations when using the step function. That’s quite a limitation, but as I understand it, that’s how it works with support vector machines as well: a human can define non linear separations, but the human must decide the details of that separation.

BTW it seems like there could be a fun puzzle game here. Something like you have a fixed number of neurons that you can arrange in however many layers you want, and your goal is to separate blue data points from orange ones, or something like that. If you think that’d be a fun game, go make it with my blessing! I don’t have time to pursue it, so have at it (:

## Identity and Relu Activation Functions

The identity activation function doesn’t do anything. It’s the same as if no activation function is used. I’ve heard that it can be useful in regression, but it can also be useful for our geometric interpretation.

Below is the same circle experiment as before, but using the identity activation function instead of the step activation function:

Remembering that orange is positive, blue is negative, and white is zero, you can see that outside the circle is orange (positive) and inside the circle is blue (negative), while the outline of the circle itself is white. We are looking at a signed distance field of the circle! Every point on this image is a scalar value that says how far inside or outside that point is from the surface of the shape.

Signed distance fields are a popular way of rendering vector graphics on the GPU. They are often approximated by storing the distance field in a texture and sampling that texture at runtime. Storing them in a texture only requires a single color channel for storage, and as you zoom in to the shape, they preserve their shape a lot better than regular images. You can read more about SDF textures on my post: Distance Field Textures.

Considering the machine learning perspective, having a signed distance field is also an interesting proposition. It would allow you to do classification of input, but also let you know how deeply that input point is classified within it’s group. This could be a confidence level maybe, or could be interpreted in some other way, but it gives a sort of analog value to classification, which definitely seems like it could come in handy sometimes.

If we take our negative space triangle example from the last section and change it from using step activation to identity activation, we find that our technique doesn’t generalize naively though, as we see below. (It doesn’t generalize for the positive space triangle either)

The reason it doesn’t generalize is that the negatives and positives of pixel distances to each of the lines cancel out. Consider a pixel on the edge of the triangle: you are going to have a near zero value for the edge it’s on, and two larger magnitude negative values from the other edges it is in the negative half spaces of. Adding those all together is going to be a negative value.

To help sort this out we can use an activation function called “relu”, which returns 0 if the value it’s given is less than zero, otherwise it returns the value. This means that all our negative values become 0 and don’t affect the distance summation. If we switch all the neurons to using relu activation, we get this:

If you squint a bit, you can see a triangle in the white. If you open the experiment and turn on “discrete output” to force <= 0 to blue and > 0 to orange you get a nice image that shows you that the triangle is in fact still there.

Our result with relu is better than identity, but there are two problems with our resulting distance field.

Firstly it isn’t a signed distance field – there is no blue as you might notice. It only gives positive distances, for pixels that are outside the shape. This isn’t actually that big of an issue from a rendering perspective, as unsigned distance fields are still useful. It also doesn’t seem that big of an issue from a machine learning perspective, as it still gives some information about how deeply something is classified, even though it is only from one direction.

I think with some clever operations, you could probably create the internal negative signed distance using different operations, and then could compose it with the external positive distance in the output neuron by adding them together.

The second problem is a bigger deal though: The distance field is no longer accurate!

By summing the distance values, the distance is incorrect for points where the closest feature of the triangle is a vertex, because multiple lines are contributing their distance to the final value.

I can’t think of any good ways to correct that problem in the context of a neural network, but the distance is an approximation, and is correct for the edges, and also gets more correct the closer you get to the object, so this is still useful for rendering, and probably still useful for machine learning despite it being an imperfect measurement.

## Sigmoid and Hyperbolic Tangent Activation Function

The sigmoid function is basically a softer version of the step function and gives output between 0 and 1. The hyperbolic tangent activation function is also a softer version of the step function but gives output between -1 and 1.

Sigmoid:

Hyperbolic Tangent:

(images from Wolfram Mathworld)

They have different uses in machine learning, but I’ve found them to be visibly indistinguishable in my experiments after compensating for the different range of output values. It makes me think that smoothstep could probably be a decent activation function too, so long as your input was in the 0 to 1 range (maybe you could clamp input to 0 and 1?).

These activation functions let you get some non linearity in your neural network in a way that the learning algorithms can adjust, which is pretty neat. That puts us more into the realm where a neural net can find a good non linear separation for learned data. For the geometric perspective, this also lets us make more interesting non linear shapes.

Unfortunately, I haven’t been able to get a good understanding of how to use these functions to craft desired results.

It seems like if you add even numbers of hyperbolic tangents together in a neural network that you end up getting a lot of white, like below:

However, if you add an odd number of them together, it starts to look a bit more interesting, like this:

Other than that, it’s been difficult seeing a pattern that I can use to craft things. The two examples above were made by modifying the negative space triangle to use tanh instead of step.

## Closing

We’ve wandered a bit in the idea of interpreting neural networks geometrically but I feel like we’ve only barely scratched the surface. This also hasn’t been a very rigorous exploration, but more has just been about exploring the problem space to get a feeling for what might be possible.

It would be interesting to look more deeply into some of these areas, particularly for the case of distance field generation of shapes, or combining activation functions to get more diverse results.

While stumbling around, it seems like we may have gained some intuition about how neural networks work as well.

It feels like whenever you add a layer, you are adding the ability for a “logical operation” to happen within the network.

For instance, in the triangle cutout experiment, the first layer after the inputs defines the 6 individual lines of the two triangles and classifies input accordingly. The next layer combines those values into the two different triangle shapes. The layer after that converts them from negative space triangles to positive space triangles. Lastly, the output layer subtracts the smaller triangle’s values from the larger triangle’s values to make the final triangle outline shape.

Each layer has a logical operation it performs, which is based on the steps previous to it.

Another piece of intuition I’ve found is that it seems like adding more neurons to a layer allows it to do more work in parallel.

For instance, in the triangle cutout experiment, we created those two triangles in parallel, reserving some neurons in each layer for each triangle. It was only in the final step that we combined the values into a single output.

Since neurons can only access data from the previous network layer, it seems as though adding more neurons to layers can help push data forward from previous layers, to later layers that might need the data. But, it seems like it is most efficient to process input data as early as possible, so that you don’t have to shuttle it forward and waste layers / neurons / memory and computing power.

Here is some info on other activation functions:
Wikipedia:Activation Function

Here’s a link that talks about how perceptrons (step activated neural networks) relate to SVMs:
Hyperplane based Classification: Perceptron and (Intro to) Support Vector Machines

By the way, did I mention you can visualize neural networks in three dimensions as well?

Experiment: 3d Visualization

Here are the two visualizers of neural networks I made for this post using WebGL2:
Neural Network Visualization 2D
Neural Network Visualization 3D

If you play around with this stuff and find anything interesting, please share!

## My Old Master: How to Correct as a Mentor or a Teacher

Preface: I studied martial arts for a little over a decade (shaolin kempo at USSD) and learned a lot while I was there. Our teacher was a great guy who genuinely cared about his students, and in particular, taught my friends and I some really interesting things when we became instructors. I’d like to share some of that information with you in the “My Old Master” posts category. As cliched as it is, many things he taught us apply to all aspects of life, not just martial arts and I’d like other people to benefit from them.

My old master used to say…

“Praise, Correct, Praise. Even if you have to make something up, you need to say something positive.”

Let’s say that I’m teaching you how to punch and you aren’t quite doing it right.

Here are two things I might say to try and correct it:

1. “Keep your wrist straight when you are punching so you don’t hurt your hand” or…
2. “I really love how you are keeping your left hand up while you punch with your right. It’s doing a great job of protecting your head against your opponent hitting you back. Now try keeping your wrist straight when you punch so that you don’t hurt your hand.” Then I watch you try again and I say “Great, just like that, keep it up!”

Think about how those two responses make you feel for a second.

The first one likely makes you feel like you are messing up and need to fix it (a negative thing), while the second makes you feel like you were doing well and are now are doing even better.

What’s the difference? Well, like the opening quote says, I praised, corrected, and then praised. First I found something you were doing well, complimented you on it, gave a suggestion for improvement, and then praised you on doing (or attempting!) the correction.

This can be a great way to give people feedback, in a way that makes them feel better about themselves, and feel better about the feedback you are giving them. Instead of being a negative thing, it becomes a positive thing.

Pretty simple stuff, and if you practice this technique it starts to become second nature.

The quote says that if you can’t find anything positive to say, you should make it up. It shouldn’t be your first choice to make something up, because the more genuine you are about the praise, the better it will be. However, if you really can’t find anything nice to say, yes, you should make something up.

A person’s ego and self worth is a measurable quantity that is increased with praise and decreased with corrections or negative feedback (aka “you suck!”). When this tank of self worth gets too low, your student or mentee will feel worthless, get frustrated and/or start to get resentful at you.

This technique is useful because it allows you to give a correction while minimizing hit to the person’s self worth. In the end allows you to give MORE correction and help them more in the long run, just by phrasing your corrections differently. Another term for this is “complement sandwich” which you may have heard of before.

Another thing to be mindful of however is that you can only give so much feedback at any one time. The ego/self worth tank needs to refill after it’s diminished, and frankly, the person needs to absorb and internalize what you’ve taught them before they are ready for more.

Our teacher would say “it’s better having a mediocre black belt, than having a stellar white belt who quits out of frustration” and that’s very true. It’s better for them since a mediocre black belt is FAR SUPERIOR to a stellar white belt and much better able to protect themselves and their loved ones, but also better for the organization, since we are often teaching or mentoring in a “for profit” situation where the person we are trying to help is either a customer or a co-worker which the company is interested in keeping around.

Before wrapping it up, I heard a funny story regarding this topic about a special needs child and his or her parents. Just like everyone else, this child has a concept of self worth, however being disabled makes it very easy to feel depressed when you realize there are so many things you can’t do that other people can do. It’s difficult too for the parents to help the child feel better about themselves if they really can’t find an area he shines in. One day the parents noticed he loved to use tools and it clicked. They started loosening screws in the house and asking him to tighten them for him. “Jimmy, can you come tighten this screw up for us? You are so good at it!”.

I think that’s a cute story but really shows how we work as humans. Your job as (an effective) teacher, mentor, parent, boss or leader, is to teach whatever you need to teach, correct whatever you need to correct, but also to make sure you do so in a way that is least damaging to the person’s ego and self worth. They feel better about themselves, but you are also more effective at getting the job done. It matters!

So go out there and serve some compliment sandwiches, making sure to be as genuine as possible with your praise!

P.S. Yes people can have over inflated egos and feeding them more is only going to make things worse. That’s the topic of another post (;

## Raytracing Reflection, Refraction, Fresnel, Total Internal Reflection, and Beer’s Law

This post talks about how to render images like the below in real time using ray tracing. Some realism in the images come from reflection and refraction, but the real icing on the cake comes from Fresnel, total internal reflection and Beer’s law. We’ll look at the contributions of these features individually and talk about how to properly combine them for the greatest and most realistic results (:

The renderings come from a shadertoy that accompanies this post: Shadertoy: Reflect Refract TIR Fresnel RayT

My motivation for learning more about this stuff is that I’m starting to make a marble madness inspired game, and I’m planning to do hybrid rendering between rasterization and ray based techniques.

This post will focus more on how to make these features work for you in ray tracing, and less about the reasons for why they work the way they do. This post is more about practical implementations, and less about rigorous explanations.

If you have any questions, comments, corrections or additions, feel free to leave in the comments section at the bottom, or feel free to hit me up on twitter at @Atrix256.

This post is assumes you know how to do basic raytracing with ambient, diffuse and specular lighting, like the image below, which we are going to start with:

# Reflection

The first thing to talk about is reflection. More specifically we are going to be talking about SPECULAR reflection.

Specular is defined by the dictionary to mean “of, relating to, or having the properties of a mirror.”, so what we normally think of as reflection (like from a mirror) is in fact specular reflection.

There is non specular reflection, aka diffuse reflection, which just means that light is reflected off of a surface in a non mirror like way. This is accomplished with diffuse lighting where commonly we dot product the normal of a surface with the direction from the surface to the light to know how much to light that point on the surface.

If specular reflection is how mirrors reflect, and diffuse reflection is how regular diffuse surfaces work, then you might wonder what specular lighting is all about.

A specular highlight is actually just a cheap approximation to doing a mirror like specular reflection of a light source, so it is a cheap kind of specular reflection.

Let’s talk about how to do real mirror like specular reflection in a ray tracer.

When light hits a surface, some amount of the light will be reflected, and some amount of the light will be transmitted into the object. Transmitted light is used for the diffuse shading.

As you might imagine, the amount of light reflected plus the amount of light transmitted must equal the total amount of light that hit the surface. (note that some transmitted energy may be absorbed and given off as heat, or the object may be glowing, so may give off more light than received but let’s ignore that stuff for now.)

A common way to deal with this is to define a reflectivity of a surface as the amount of light it reflects in percent and use 100% minus that amount as the transmitted amount.

You might say that 2% of light that hits a surface reflects. That means that 98% of the light that hits a surface is transmitted and used for the diffuse shading.

When shading a point on the surface, you would calculate both the reflected color at that point on the surface, and the diffuse shaded color at that point, but you multiply the reflected color by 0.02 and the diffuse shaded color by 0.98 and add them together. Doing this gives a result like the below:

The higher the reflectivity percent, the more reflection you get, and the less diffuse shading you get. Turning down reflectivity has the opposite effect.

How do you calculate the reflected color of a point on a surface though?

You simply calculate the ray that reflected off of the surface, and calculate the color of what that ray hit.

To calculate a reflected ray, you need to know the direction that the ray was traveling when it hit the surface (The incident direction), you need to know the location that the ray hit the surface (the surface location), and you need to know the normal of the surface at the intersection point.

If you have those things you can calculate the reflected ray direction as this:

$ReflectRayLocation = SurfaceLocation \\ ReflectRayDirection = IncidentDirection - 2 * SurfaceNormal * DotProduct(IncidentDirection, SurfaceNormal)$

Note that in hlsl and glsl there is a function called “reflect” which takes the incident direction, and the surface normal, and returns you the reflected direction.

Doing the above is mathematically correct, but if you then try to raytrace from that ray, you may hit the same object you are reflecting off. One way to fight that problem is to push the ray positin a small amount away from the surface to make sure it misses it. You can do that for example like this:

$ReflectRayLocation = ReflectRayLocation + ReflectRayDirection * 0.01$

There are other ways to make sure you don’t hit the same object again, but pushing the ray away from the object a small amount really does work pretty nicely in practice.

# Refraction

I mentioned in the last section that whatever light wasn’t reflected when it hit a surface was transmitted. I also said that the transmitted light was used for diffuse shading.

In reality, it’s passed through the “bidirectional scattering distribution function” aka BSDF. You may have heard the term “bidirectional reflection distribution function” aka BRDF. A BRDF only deals with the hemisphere (half a sphere) that surrounds the surface normal. The BSDF on the other hand deals with the full sphere surrounding a surface normal so BRDF is a subset of what is possible with the BSDF.

Because the BSDF deals with an entire sphere, that means that it can handle reflection (specular and non specular) like the BRDF can, but it can also deal with transparency and refraction, where some of the light travels THROUGH an object.

In a path tracer where everything is physically accurate and mathematically precise, we would be interested in dealing with a BSDF and integrating over the sphere, but since we are working with a ray tracer, our physical accuracy needs are a lot lower – we only want a result that looks plausible.

What we are going to do for our transparency is calculate a direction that the ray is going to travel through the object, ray trace that ray to get a color, and use the transmitted light (the portion of light that isn’t reflected) as a multiplier of that color, that we add to the reflected amount of light.

If we have an object with 10% reflectivity, and 90% transmittance, but use that transmitted light for transparency, we’ll have a rendering like below:

Now that we have transparency, let’s talk about refraction. Refraction is a physical phenomenon where light bends (“changes speed” i guess but that sounds a bit suspicious for light), as it hits a boundary between two different surfaces.

Every material has a refractive index, and in fact, may have different refractive indices per light frequency. For our purposes, we’ll assume surfaces have the same refractive index for every frequency of light. There’s a list of refractive indices for a lot of different materials here: Index of refraction

How a ray bends when it refracts depends on the ration of the refractive index that it’s leaving to the refractive index that it’s entering. AKA outside/inside.

HLSL and GLSL have a function called refract which take the incident vector, the surface normal vector, and that ration of refractive indices, and return the refracted ray.

When you do a raytrace down the refracted ray, you will have the same problem as when tracing the reflected ray, that you may hit the same surface you just hit again erroneously. To help that, you once again just move the ray slightly away from the surface.

$RefractRayLocation = RefractRayLocation + RefractRayDirection * 0.01$

Here is a rendering where the sphere has 10% reflectivity, 90% transmittance, an air refractive index of 1.0, and a refractive index of 1.125 for the sphere. You can see how the light bends as it goes through the object and looks pretty neat!

# Fresnel

There is an interesting property in our world: If you look at something straight on, it’s the least reflective it will be. If you turn it nearly 90 degrees where it’s nearly edge on, it will be the most reflective it can be. Many surfaces will become almost perfectly reflective when you view them almost edge on. Go try it out with a wall in your house or a glass or other things.

Weird huh? That’s called Fresnel, and is something we can also make use of in ray tracing to get a more realistic image.

Instead of just always using the reflectivity amount of the surface, we use the Fresnel equation to figure out how much reflectivity an object should have based on the view angle versus the surface normal, so that when it’s more edge on it becomes more reflective. At minimum the reflectivity will be the reflectivity of the surface, and at maximum the reflectivity will be 100%.

The amount we transmit for either refraction or diffuse will be 100% minus however much percentage is reflective.

Here is the image showing normal reflection again:

Here is the image with Fresnel:

It looks quite a bit better with fresnel doesn’t it?!

Here’s a GLSL function of Schlick’s Fresnel approximation function. Notice that it takes the surface normal and incident vector, as well as the refractive index being left (n1) and the refractive index being entered (n2):

float FresnelReflectAmount (float n1, float n2, vec3 normal, vec3 incident)
{
// Schlick aproximation
float r0 = (n1-n2) / (n1+n2);
r0 *= r0;
float cosX = -dot(normal, incident);
if (n1 > n2)
{
float n = n1/n2;
float sinT2 = n*n*(1.0-cosX*cosX);
// Total internal reflection
if (sinT2 > 1.0)
return 1.0;
cosX = sqrt(1.0-sinT2);
}
float x = 1.0-cosX;
float ret = r0+(1.0-r0)*x*x*x*x*x;

// adjust reflect multiplier for object reflectivity
ret = (OBJECT_REFLECTIVITY + (1.0-OBJECT_REFLECTIVITY) * ret);
return ret;
}


Our tale of reflection is complete, so let’s go back to refraction / transparency and finish up the last two items.

# Total Internal Reflection

The way that the Fresnel equation works, it’s possible that when moving from a material with a higher index of refraction to a lower one, that the amount of refraction can actually drop to zero percent. In this case, the light doesn’t exit the higher refractive index object and instead ONLY reflects back into the object, because the reflective amount becomes 100%.

When this happens, it’s called “Total Internal Reflection” for hopefully obvious reasons (:

There isn’t a whole lot to say about this, because you can see that this is even accounted for in the GLSL Fresnel function from the last section.

However, if you are ever under water in a swimming pool and look up to see the water surface looking like a mirror, that is total internal reflection in action.

You can also see it in the render below, where you can see reflections on the inside of the walls of the object, especially on the bottom (floor) of the object:

# Beer’s Law

Beer’s law is the last item to talk about, and relates to transparent surfaces. Beer’s law deals with light being absorbed over distance as it travels through a material.

Beer’s law is why a thin piece of jello is mostly colorless, but a thicker piece of jello has a much richer and deeper color.

Here’s a cube with beer’s law absorbing red and green light over distance. You should notice that where the light travels less distance through the cube that it’s not as blue, because not as much red and green light has been absorbed:

To apply beer’s law, you first figure out how long a ray has traveled through the absorbing material (by tracing the ray inside the object to see where it hits the back side). You then do this calculation:

vec3 color = RayTrace(rayPos, rayDir);
vec3 absorb = exp(-OBJECT_ABSORB * absorbDistance);
color *= absorb;


OBJECT_ABSORB is an RGB value that describes how much of each color channel absorbs over distance. For example, a value of (8.0, 2.0, 0.1) would make red get absorbed the fastest, then green, then blue, so would result in a blueish, slightly green color object.

# Putting it All Together

Now that we have the individual components worked out let’s talk about how to put it together.

Firstly, when a ray hits any surface, you need to use the Fresnel equation to get the multiplier for the reflected and transmitted light.

Next you calculate the reflected and transmitted light by recursively raytracing. The transmitted light is either used for the diffuse shading, the transparency/refracted ray, or some combination of those two (technically, it’s all up to the BSDF we are approximating).

Then, you multiply the reflected light by the reflected amount from the Fresnel equation, and the transmitted amount by 100%-reflectionAmount and add the results together.

(Quick side note, if you have emissive color on the surface aka the object glows, you would also add that in here).

Since raytracing is recursive, you would do this each time a ray intersected with an object. In other words, each intersection causes the ray to split into two rays.

As you can imagine, this can make the rendering quite complex, especially on the GPU where you can’t even do real recursion.

One way to help limit the recursiveness a bit is when you are calculating your reflection and transmittance amounts, you can choose a threshold like say 1% where if the reflection is under 1% it clamps it to 0%, and if it’s greater than 99% it clamps it at 100%. You can then choose not to recurse down a specific ray if the ray’s multiplier is 0. The end result will be that reflections or transmittance rays that don’t contribute much to the end result won’t be followed at all.

If you are willing to sacrifice some visual quality to not have to split your ray into two at each object intersection, you could also figure out if reflection or transmittance has the higher multiplier, and make the ray only follow one of the paths. If you were doing this in a path tracer, you could choose which one to follow randomly, using the multiplier as a weight for the random selection.

The problem in both of these two optimizations is that the multiplier is only half of the information though so may incorrectly choose the less meaningful contribution. The other half of the information is the color of the ray if you followed it. The reason for this is that you might have a low multiplier for a really bright spot (caustics have this problem commonly!), or vice versa you may have a high multiplier for a dull featureless spot. With path tracing, if you take enough samples, it all washes out in the averages, and with ray tracing, maybe you accept that it will do the wrong thing sometimes, to stay a real time algorithm, but it’s important to know how this type of choice can fail for you. (Side note, this sort of stuff is what importance sampling in path tracing is all about – trying to make rays follow more meaningful paths to get better results quicker).

When doing real time raytracing you also often have to decide how many times you want to allow a ray to bounce around. In the shadertoy that goes with this post, that parameter is MAX_RAY_BOUNCES and I have it set to 10.

Interestingly, setting it to 1 has no visible impact on the sphere at all, which is a nice improvement. For the box, a value of 3 seems to be the maximum it needs. 3 also seems to be the magic number for the geometric gem type shape.

So, 10 is overkill, but i left it at that in case people play with parameters and change them to values which would require more bounces.

Lastly I wanted to mention that in the scene that I rendered, I did a small “trick” to make it so I didn’t need to do full recursive splitting of rays at each intersection. I did this by making it so the main object in the center of the scene was the only object that had reflection.

This way, I only need to split the ray into two if i hit the main object. Furthermore, when I’m splitting the ray at the main object, the ray that gets the color for the outside world (versus the inside of the object) is a single non recursive ray cast since it can’t hit anything reflective. The result is that at each intersection of the sphere, i do a simple non recursive ray cast for the ray that is going outside of the main object, then i continue the iterative ray on the inside of the object until i run out of bounces.

Doing this causes a recursive process to become an iterative one, which is much friendlier on the gpu.

Below is the final render again from the shadertoy. The parameters are:

• The refractive index of the air is 1.00029
• The refractive index inside the objects are 1.125
• Reflectivity is 1%
• The absorption for beers law is (8.0, 8.0, 3.0)