Improved Storage Space Efficiency of GPU Texture Sampler Bezier Curve Evaluation

This is an extension of a paper I wrote which shows how to use the linear texture sampling capabilities of the GPU to calculate points on Bezier curves (also just polynomials in general as well as rational polynomials, and also surfaces and volumes made by tensor products). You store the control points in the texture, then sample along the texture’s diagonal to get points on the curve:
GPU Texture Sampler Bezier Curve Evaluation

This extension improves on the efficiency of the storage space usage, allowing a higher density of curve data per pixel, but the post also talks about some caveats and limitations.

This post is divided into the following sections:

  1. Basic Idea of Extension
  2. 2D Texture / Quadratic Piecewise Curves
  3. 2D Texture / Quadratic Piecewise Curves – C0 Continuity
  4. 2D Texture / Quadratic Piecewise Curves – Storage Efficiency
  5. Real World Limitations
  6. 3D Texture / Cubic Piecewise Curves
  7. 3D Texture / Cubic Piecewise Curves – Multiple Curves?
  8. 3D Texture / Cubic Piecewise Curves – C0 Continuity
  9. 3D Texture / Cubic Piecewise Curves – Storage Efficiency
  10. Generalizing The Unit Hyper Cube
  11. Closing
  12. Code

1. Basic Idea of Extension

Let’s talk about the base technique before going into the details of the extension.

The image below shows how bilinear interpolation across the diagonal between pixels can calculate points on curves. Bilinear interpolation is exactly equivalent to the De Casteljau algorithm when the u and v coordinate are the same value.

Linear interpolation between two values A and B at time t is done with this formula:
A(1-t) + Bt

I’ve found useful to replace (1-t) with it’s own symbol s. That makes it become this:
As + Bt

Now, if you bilinear interpolate between 4 values, you have two rows. One row has A,B in it and the other row has C,D in it. To bilinear interpolate between these four values at time (t,t), the formula is this:
(As + Bt)s + (Cs+Dt)t

If you expand that and collect like terms you come up with this equation:
As^2 + (B+C)st + Dt^2

At this point, the last step is to make B and C the same value (make them both into B) and then rename D to C since that letter is unused. The resulting formula turns out to be the formula for a quadratic Bezier curve. This shows that mathematically, bilinear interpolation can be made to be mathematically the same as the quadratic Bezier formula. (Note: there are extensions to get higher order curves and surfaces as well)
As^2 + 2Bst + Ct^2

However, for this extension we are going to take one step back to the prior equation:
As^2 + (B+C)st + Dt^2

What you may notice is that the two values in the corners of the 2×2 bilinear interpolation don’t have to be the exact value of the middle control point of the quadratic Bezier curve – they only have to AVERAGE to that value.

This is interesting because to encode two different piecewise quadratic curves (C0-C2 and C3-C5) into a 2d texture before this extension, I would do it like this:

A = C_0 \\ B = C_1 \\ C = C_2 \\ D = C_3 \\ E = C_4 \\ F = C_5\\

That uses 8 pixels to store the 6 control points of the two quadratic curves.

With the ideas of this extension, one way it could look now is this:

A = C_0 \\ B + C = 2*C_1 : B = 2*C_1 - C_3 \\ D = C_2 \\ C = C_3 \\ D + E = 2*C_4 : E = 2*C_4 - C_2\\ F = C_5\\

The result is that 6 pixels are used instead of 8, for storing the 6 control points of the two quadratic curves.

That isn’t the only result though, so let’s explore the details (:

2. 2D Texture / Quadratic Piecewise Curves

Let’s start by more formally looking at the 2d texture / quadratic curve case.

We are going to number the pixels by their texture coordinate location (in the form of Pyx) instead of using letters. Later on that will help show a pattern of generalization. We are still using the same notation for control points where C0 is the first control point, C1 is the second control point and so on.

Looking at a single quadratic curve we have this texture which has these constraints on it’s pixel values:

P_{00} = C_0 \\ P_{01}+P_{10} = 2*C_1 \\ P_{11} = C_2 \\

To analyze this, let’s make an augmented matrix. The left matrix is a 3×4 matrix where each column is a pixel and each row is the left side of the equation for a constraint. The right matrix is a 3×3 matrix where each column is a control point and each row is the right side of the equation for a constraint. The first row of the matrix is column labels to help see what’s going on more easily.

Note that i put my pixel columns and control point columns in ascending order in the matrix, but if you put them in a different order, you’d get the same (or equivalent) results as I did. It’s just my convention they are in this order.

\left[\begin{array}{rrrr|rrr} P_{00} & P_{01} & P_{10} & P_{11} & C_0 & C_1 & C_2 \\ 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 2 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 1 \end{array}\right]

The next step would be to put this matrix into reduced row echelon form to solve the equations to see what the values of the pixels need to be, but the matrix is in fact already in rref form! (For more information on rref, check out my last post: Solving N equations and N unknowns: The Fine Print (Gauss Jordan Elimination))

What we can see by looking at the rref of the matrix is that either P01 or P10 can be a free variable – meaning we can choose whatever value we want for it. After we choose a value for either of those variables (pixels), the rest of the pixels are fully defined.

Deciding that P10 is the free variable (just by convention that it isn’t the leading non zero value), the second equation (constraint) becomes P01 = 2*C1-P10.

If we choose the value C1 for P10, that means that P01 must equal C1 too (this is how the original technique worked). If we choose 0 for P10, that means that P01 must equal 2*C1. This is because P01 must always equal 2*C1-P10. We then are in the new territory of this extension, where the pixels representing the middle control point have some freedom about what values they can take on, so long as they average to the middle control point value.

Let’s add a row of pixels and try encoding a second quadratic curve:

P_{00} = C_0 \\ P_{01}+P_{10} = 2*C_1 \\ P_{11} = C_2 \\ P_{10} = C_3 \\ P_{11}+P_{20} = 2*C_4 \\ P_{21} = C_5

Let’s again make an augmented matrix with pixels on the left and control points on the right.

\left[\begin{array}{rrrrrr|rrrrrr} P_{00} & P_{01} & P_{10} & P_{11} & P_{20} & P_{21} & C_0 & C_1 & C_2 & C_3 & C_4 & C_5\\ 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 0 & 0 & 0 & 2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1  \end{array}\right]

Putting that into rref to solve for the pixel values we get this:

\left[\begin{array}{rrrrrr|rrrrrr} P_{00} & P_{01} & P_{10} & P_{11} & P_{20} & P_{21} & C_0 & C_1 & C_2 & C_3 & C_4 & C_5\\ 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1  \end{array}\right]

We got the identity matrix on the left, so we don’t have any inconsistencies or free variables.

If we turn that matrix back into equations we get this:

P_{00} = C_0 \\ P_{01} = 2*C_1 - C_3 \\ P_{10} = C_3 \\ P_{11} = C_2 \\ P_{20} = 2*C_4 - C_2 \\ P_{21} = C_5

We were successful! We can store two piecewise Bezier curves in 6 pixels by setting the pixel values to these specific values.

The last example we’ll show is the next stage, where it falls apart. We’ll add another row of pixels and try to encode 3 Bezier curves (9 control points) into those 8 pixels.

P_{00} = C_0 \\ P_{01}+P_{10} = 2*C_1 \\ P_{11} = C_2 \\ P_{10} = C_3 \\ P_{11}+P_{20} = 2*C_4 \\ P_{21} = C_5 \\ P_{20} = C_6 \\ P_{21}+P_{30} = 2*C_7 \\ P_{31} = C_8

This is the augmented matrix with pixels on the left and control points on the right:

\left[\begin{array}{rrrrrrrr|rrrrrrrrr} P_{00} & P_{01} & P_{10} & P_{11} & P_{20} & P_{21} & P_{30} & P_{31} & C_0 & C_1 & C_2 & C_3 & C_4 & C_5 & C_6 & C_7 & C_8\\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \end{array}\right]

The rref form is:

\left[\begin{array}{rrrrrrrr|rrrrrrrrr} P_{00} & P_{01} & P_{10} & P_{11} & P_{20} & P_{21} & P_{30} & P_{31} & C_0 & C_1 & C_2 & C_3 & C_4 & C_5 & C_6 & C_7 & C_8\\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & -1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -2 & 0 & 1 & 0 & 0 \\ \end{array}\right]

Let’s turn that back into equations.

P_{00} = C_0 \\ P_{01} = 2*C_1 - C_3 \\ P_{10} = C_3 \\ P_{11} = 2*C_4 - C_6 \\ P_{20} = C_6 \\ P_{21} = C_5 \\ P_{30} = 2*C_7 - C_5 \\ P_{31} = C_8 \\ 0 = C_2 - 2*C_4 + C_6

We have a problem unfortunately! The bottom row says this:

0 = C_2 - 2*C_4 + C_6

That means that we can only store these curves in this pixel configuration if we limit the values of the control points 2,4,6 to values that make that last equation true.

Since my desire is to be able to store curves in textures without “unusual” restrictions on what the control points can be, I’m going to count this as a failure for a general case solution.

It only gets worse from here for the case of trying to add another row of pixels for each curve you want to add.

It looks like storing two quadratic curves in a 2×6 group of pixels is the most optimal (data dense) storage. If you go any higher, it puts restrictions on the control points. If you go any lower, you have a free variable, which means you aren’t making full use of all of the pixels you have.

This means that if you are storing piecewise quadratic curves in 2d textures, doing it this way will cause you to use 3/4 as many pixels as doing it the other way, and you will be using 1 pixel per control point stored, instead of 1.333 pixels per control point stored.

This isn’t the end of the story though, so let’s continue (:

3. 2D Texture / Quadratic Piecewise Curves – C0 Continuity

If we add the requirement that our piecewise curves must be connected (aka that they have C0 continuity), we can actually do something pretty interesting. Take a look at this setup:

P_{00} = C_0 \\ P_{01}+P_{10} = 2*C_1 \\ P_{11} = C_2 \\ P_{11} = C_3 \\ P_{10}+P_{21} = 2*C_4 \\ P_{20} = C_5

Putting this into matrix form looks like this:

\left[\begin{array}{rrrrrr|rrrrrr} P_{00} & P_{01} & P_{10} & P_{11} & P_{20} & P_{21} & C_0 & C_1 & C_2 & C_3 & C_4 & C_5\\ 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 0 & 0 & 0 & 2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \end{array}\right]

In rref it becomes this:

\left[\begin{array}{rrrrrr|rrrrrr} P_{00} & P_{01} & P_{10} & P_{11} & P_{20} & P_{21} & C_0 & C_1 & C_2 & C_3 & C_4 & C_5\\ 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & -1 & 0 & 2 & 0 & 0 & -2 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\ \end{array}\right]

Turning the rref back into equations we get:

P_{00} = C_0 \\ P_{01} - P_{21} = 2*C_1-2*C_4 \\ P_{10} + P_{21} = 2*C_4 \\ P_{11} = C_3 \\ P_{20} = C_5 \\ 0 = C_2 - C_3

P21 is a free variable, so we can set it to whatever we want. Once we choose a value, the pixel values P01 and P10 will be fully defined.

The bottom equation might have you worried, because it looks like an inconsistency (aka restriction) but it is actually expected.

That last equation says 0 = C2-C3 which can be rearranged into C2 = C3. That just means that the end of our first curve has to equal the beginning of our second curve. That is C0 just the continuity we already said we’d agree to.

So, it worked! Let’s try adding a row of pixels and another curve to see what happens.

P_{00} = C_0 \\ P_{01}+P_{10} = 2*C_1 \\ P_{11} = C_2 \\ P_{11} = C_3 \\ P_{10}+P_{21} = 2*C_4 \\ P_{20} = C_5\\ P_{20} = C_6\\ P_{21}+P_{30} = 2*C_7\\ P_{31} = C_8

Putting that into matrix form:

\left[\begin{array}{rrrrrrrr|rrrrrrrrr} P_{00} & P_{01} & P_{10} & P_{11} & P_{20} & P_{21} & P_{30} & P_{31} & C_0 & C_1 & C_2 & C_3 & C_4 & C_5 & C_6 & C_7 & C_8\\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\ \end{array}\right]

And in rref:

\left[\begin{array}{rrrrrrrr|rrrrrrrrr} P_{00} & P_{01} & P_{10} & P_{11} & P_{20} & P_{21} & P_{30} & P_{31} & C_0 & C_1 & C_2 & C_3 & C_4 & C_5 & C_6 & C_7 & C_8\\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 2 & 0 & 0 & -2 & 0 & 0 & 2 & 0\\ 0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 2 & 0 & 0 & -2 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0\\ \end{array}\right]

Turning the rref back into equations:

P_{00} = C_0 \\ P_{01}+P_{30} = 2*C_1 - 2*C_4+2*C_7 \\ P_{10}-P_{30} = 2*C_4-2*C_7 \\ P_{11} = C_3 \\ P_{20} = C_6 \\ P_{21} + P_{30} = 2*C_7\\ P_{31} = C_8\\ 0 = C_2 - C_3\\ 0 = C_5 - C_6\\

We see that P30 is a free variable, and the last two rows show us we have the C0 continuity requirements: C2 = C3 and C5 = C6.

The last section without C0 continuity reached it’s limit of storage space efficiency after storing two curves (6 control points) in 6 pixels.

When we add the C0 continuity requirement, we were able to take it further and store 3 curves in 8 pixels. Technically those 3 curves have 9 control points, but because the end point of each curve has to be the same as the start point of the next curve it makes it so in reality there is only 3 control points for the first curve and then 2 additional control points for each additional curve. That makes 8 control points for 3 curves, not 9.

Unlike the last section, using this zigzag pattern with C0 continuity, you can encode any number of curves. I am not sure how to prove it, but from observation, there is no sign of any shrinking of capacity as we increase the number of curves, adding two more rows of pixels for each curve. If you know how to prove this more formally, please let me know!

Note that instead of explicitly having 3 control points per curve, where the first control point of a curve has to equal the last control point of the previous curve, you can instead describe the piecewise curves with fewer control points. You need 3 control points for the first curve, and then 2 control points for each curve after that.

Mathematically both ways are equivelant and you’ll get to the same answer. The accompanying source code works that way, but I show this example in this longer way to more explicitly show how things work.

4. 2D Texture / Quadratic Piecewise Curves – Storage Efficiency

Let’s compare the storage efficiency of the last two sections to each other, as well as to the original technique.

\begin{array}{|cccccc|} \hline & & \rlap{\text{2d / Quadratic - Extension}} & & & \\ \hline \text{Curves} & \text{Dimensions} & \text{Pixels} & \text{Control Points} & \text{Pixels Per Control Point} & \text{Pixels Per Curve} \\ \hline 1 & 2x2 & 4 & 3 & 1.33 & 4.00 \\ 2 & 2x3 & 6 & 6 & 1.00 & 3.00 \\ 3 & 2x5 & 10 & 9 & 1.11 & 3.33 \\ 4 & 2x6 & 12 & 12 & 1.00 & 3.00 \\ 5 & 2x8 & 16 & 15 & 1.06 & 3.20 \\ 6 & 2x9 & 18 & 18 & 1.00 & 3.00 \\ \hline \end{array}

\begin{array}{|cccccc|} \hline & & \rlap{\text{2d / Quadratic - Extension + C0 Continuity}} & & & \\ \hline \text{Curves} & \text{Dimensions} & \text{Pixels} & \text{Control Points} & \text{Pixels Per Control Point} & \text{Pixels Per Curve} \\ \hline 1 & 2x2 & 4 & 3 & 1.33 & 4.00 \\ 2 & 2x3 & 6 & 5 & 1.20 & 3.00 \\ 3 & 2x4 & 8 & 7 & 1.14 & 2.66 \\ 4 & 2x5 & 10 & 9 & 1.11 & 2.50 \\ 5 & 2x6 & 12 & 11 & 1.09 & 2.40 \\ 6 & 2x7 & 14 & 13 & 1.08 & 2.33 \\ \hline \end{array}

\begin{array}{|cccccc|} \hline & & \rlap{\text{2d / Quadratic - Original Technique}} & & & \\ \hline \text{Curves} & \text{Dimensions} & \text{Pixels} & \text{Control Points} & \text{Pixels Per Control Point} & \text{Pixels Per Curve} \\ \hline 1 & 2x2 & 4 & 3 & 1.33 & 4.00 \\ 2 & 2x4 & 8 & 6 & 1.33 & 4.00 \\ 3 & 2x6 & 12 & 9 & 1.33 & 4.00 \\ 4 & 2x8 & 16 & 12 & 1.33 & 4.00 \\ 5 & 2x10 & 20 & 15 & 1.33 & 4.00 \\ 6 & 2x12 & 24 & 18 & 1.33 & 4.00 \\ \hline \end{array}

The tables show that the first method uses fewer pixels per control point, while the second method uses fewer pixels per curve.

The first method can get you to what I believe to be the maximum density of 1 pixel per control point if you store an even number of curves. It can also give you a curve for every 3 pixels of storage.

The second method approaches the 1 pixel per control point as you store more and more curves and also approaches 2 pixels of storage per curve stored. Note that the second method’s table is using the convention of 3 control points are used for the first curve, and 2 additional control points for each curve after that.

The deciding factor for which method to use is probably going to be whether or not you want to force C0 continuity of your curve data. If so, you’d use the second technique, else you’d use the first.

The original technique uses a constant 1.33 pixels per control point, and 4 pixels to store each curve. Those numbers shows how this extension improves on the storage efficiency of the original technique.

5. Real World Limitations

This extension has a problem that the original technique does not have unfortunately.

While the stuff above is correct mathematically, there are limitations on the values we can store in actual textures. For instance, if we have 8 bit uint8 color channels we can only store values 0 to 255.

Looking at one of the equations P_{01} = 2*C_1 - C_3 , if C1 is 255 and C3 is 0, we are going to need to store 510 in the 8 bits we have for P01, which we can’t. If C1 is 0 and C3 is not zero, we are going to have to store a negative value in the 8 bits we have for P01, which we can’t.

This becomes less of a problem when using 16 bit floats per color channel, and is basically solved when using 32 bit floats per color channels, but that makes the technique hungrier for storage and less efficient again.

While that limits the usefulness of this extension, there are situations where this would still be appropriate – like if you already have your data stored in 16 or 32 bit color channels like some data (eg position data) would require..

The extension goes further, into 3d textures and beyond though, so let’s explore a little bit more.

6. 3D Texture / Cubic Piecewise Curves

The original technique talks about how to use a 2x2x2 3d volume texture to store a cubic Bezier curve (per color channel) and to retrieve it by doing a trilinear interpolated texture read.

If you have four control points A,B,C,D then the first slice of the volume texture will be a 2d texture storing the quadratic Bezier curve defined by A,B,C and the second slice will store B,C,D. You still sample along the diagonal of the texture but this time it’s a 3d diagonal instead of 2d. Here is that setup, where the texture is sampled along the diagonal from from A to D:

A = C_0 \\ B = C_1 \\ C = C_2 \\ D = C_3

Let’s look at what this extension means for 3d textures / cubic curves.

The equation for a cubic Bezier curve looks like this:

As^3 + 3Bs^2t + 3Cst^2 + Dt^3

If we derive that from trilinear interpolation between 8 points A,B,C,D,E,F,G,H, the second to last step would look like this:

As^3 + (B+C+E)s^2t + (D+F+G)st^2 + Ht^3

So, similar to our 2d setup, we have some freedom about our values.

In the original technique, B,C,E would have to be equal to the second control point, and D,F,G would have to be equal to the third control point. With the new extension, in both cases, those groups of values only have to AVERAGE to their specific control points. Once again, this gives us some freedoms for the values we can use, and lets us use our pixels more efficiently.

Here is the setup, again using texture coordinates (in the form Pzyx) for the pixels instead of letters.

P_{000} = C_0\\ P_{001}+P_{010}+P_{100} = 3*C_1\\ P_{011}+P_{101}+P_{110} = 3*C_2\\ P_{111} = C_3

here’s how the equations look in matrix form, which also happens to already be in rref:

\left[\begin{array}{rrrrrrrr|rrrr} P_{000} & P_{001} & P_{010} & P_{011} & P_{100} & P_{101} & P_{110} & P_{111} &    C_0 & C_1 & C_2 & C_3 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &    1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 &    0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 &    0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 &    0 & 0 & 0 & 1 \\ \end{array}\right]

P010 and P100 are free variables and so are P101 and P110, making a total of four free variables. They can be set to any value desired, which will then define the value that P001 and P011 need to be.

Let’s add another piecewise cubic Bezier curve, and another row of pixels to the texture to see what happens.

P_{000} = C_{0}\\ P_{001} + P_{010} + P_{100} = 3C_{1}\\ P_{011} + P_{101} + P_{110} = 3C_{2}\\ P_{111} = C_{3}\\ P_{010} = C_{4}\\ P_{011} + P_{020} + P_{110} = 3C_{5}\\ P_{021} + P_{111} + P_{120} = 3C_{6}\\ P_{121} = C_{7}\\

Here are the equations in matrix form:

\left[\begin{array}{rrrrrrrrrrrr|rrrrrrrr} P_{000} & P_{001} & P_{010} & P_{011} & P_{020} & P_{021} & P_{100} & P_{101} & P_{110} & P_{111} & P_{120} & P_{121} & C_{0} & C_{1} & C_{2} & C_{3} & C_{4} & C_{5} & C_{6} & C_{7} \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \end{array}\right]

Here it is in rref:

\left[\begin{array}{rrrrrrrrrrrr|rrrrrrrr} P_{000} & P_{001} & P_{010} & P_{011} & P_{020} & P_{021} & P_{100} & P_{101} & P_{110} & P_{111} & P_{120} & P_{121} & C_{0} & C_{1} & C_{2} & C_{3} & C_{4} & C_{5} & C_{6} & C_{7} \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & -3 & 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \end{array}\right]

Putting that back into equations we have this:

P_{000} = C_{0}\\ P_{001} + P_{100} = 3C_{1} + -C_{4}\\ P_{010} = C_{4}\\ P_{011} + P_{101} + P_{110} = 3C_{2}\\ P_{020} + -P_{101} = -3C_{2} + 3C_{5}\\ P_{021} + P_{120} = -C_{3} + 3C_{6}\\ P_{111} = C_{3}\\ P_{121} = C_{7}\\

The result is that we still have four free variables: P100, P101, P110 and P120. When we give values to those pixels, we will then be able to calculate the values for P001, P011, P020 and P021.

There is a limit to this pattern though. Where the maximum number of curves to follow the pattern was 2 with the 2d / quadratic case, the maximum number of curves to follow this pattern with the 3d / cubic case is 3. As soon as you try to put 4 curves in this pattern it fails by having constraints. Interestingly, we still have 4 free variables when putting 3 curves in there, so it doesn’t follow the 2d case where free variables disappeared as we put more curves in, indicating when the failure would happen.

If you know how to more formally analyze when these patterns of equations will fail, please let me know!

7. 3D Texture / Cubic Piecewise Curves – Multiple Curves?

Looking at the 3d texture case of 2x2x2 storing a single curve, I saw that there were 4 free variables. Since it takes 4 control points to define a cubic curve, I wondered if we could use those 4 free variables to encode another cubic curve.

Here’s a setup where the x axis is flipped for the second curve. It’s a little bit hard to tell from the diagram, but the blue line does still go through the center of the 3d cube. It goes from P001 to P110, while the first curve still goes from P000 to P111.

Here’s what the equations look like:

P_{000} = C_{0}\\ P_{001} + P_{010} + P_{100} = 3*C_{1}\\ P_{011} + P_{101} + P_{110} = 3*C_{2}\\ P_{111} = C_{3}\\ P_{001} = C_{4}\\ P_{000} + P_{011} + P_{101} = 3*C_{5}\\ P_{010} + P_{100} + P_{111} = 3*C_{6}\\ P_{110} = C_{7}\\

And in matrix form:

\left[\begin{array}{rrrrrrrr|rrrrrrrr} P_{000} & P_{001} & P_{010} & P_{011} & P_{100} & P_{101} & P_{110} & P_{111} & C_{0} & C_{1} & C_{2} & C_{3} & C_{4} & C_{5} & C_{6} & C_{7} \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \end{array}\right]

After putting the matrix in rref to solve the equations, we get this matrix:

\left[\begin{array}{rrrrrrrr|rrrrrrrr} P_{000} & P_{001} & P_{010} & P_{011} & P_{100} & P_{101} & P_{110} & P_{111} & C_{0} & C_{1} & C_{2} & C_{3} & C_{4} & C_{5} & C_{6} & C_{7} \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -3 & 0 & 0 & 3 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 3 & 0 & 0 & -3 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & \frac{1}{3} & \frac{-1}{3} & 0 & -1 & 0 \\ \end{array}\right]

Which is this set of equations:

P_{000} = -3C_{2} + 3C_{5} + C_{7}\\ P_{001} = C_{4}\\ P_{010} + P_{100} = -C_{3} + 3C_{6}\\ P_{011} + P_{101} = 3C_{2} + -C_{7}\\ P_{110} = C_{7}\\ P_{111} = C_{3}\\ 0 = C_{0} + 3C_{2} - 3C_{5} - C_{7}\\ 0 = C_{1} + C_{3}/3 - C_{4}/3 + -C_{6}\\

In the end there are 2 free variables, but also 2 constraints on the values that the control points can take. The constraints mean it doesn’t work which is unfortunate. That would have been a nice way to bring the 3d / cubic case to using 1 pixel per control point!

I also tried other variations like flipping y or z along with x (flipping all three just makes the first curve in the reverse direction!) but couldn’t find anything that worked. Too bad!

8. 3D Texture / Cubic Piecewise Curves – C0 Continuity

Since the regular 3d texture / cubic curve pattern has a limit (3 curves), let’s look at the C0 continuity version like we did for the 2d texture / quadratic case where we sample zig zag style.

Since the sampling has to pass through the center of the cube, we need to flip both x and z each curve.

That gives us a setup like this:

Here are the constraints for the pixel values:

P_{000} = C_{0}\\ P_{001} + P_{010} + P_{100} = 3C_{1}\\ P_{011} + P_{101} + P_{110} = 3C_{2}\\ P_{111} = C_{3}\\ P_{011} + P_{110} + P_{121} = 3C_{4}\\ P_{010} + P_{021} + P_{120} = 3C_{5}\\ P_{020} = C_{6}\\

Which looks like this in matrix form:

\left[\begin{array}{rrrrrrrrrrrr|rrrrrrr} P_{000} & P_{001} & P_{010} & P_{011} & P_{020} & P_{021} & P_{100} & P_{101} & P_{110} & P_{111} & P_{120} & P_{121} & C_{0} & C_{1} & C_{2} & C_{3} & C_{4} & C_{5} & C_{6} \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \end{array}\right]

Here is the matrix solved in rref:

\left[\begin{array}{rrrrrrrrrrrr|rrrrrrr} P_{000} & P_{001} & P_{010} & P_{011} & P_{020} & P_{021} & P_{100} & P_{101} & P_{110} & P_{111} & P_{120} & P_{121} & C_{0} & C_{1} & C_{2} & C_{3} & C_{4} & C_{5} & C_{6} \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 & -1 & 0 & 0 & 3 & 0 & 0 & 0 & -3 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 & 3 & 0 & -3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ \end{array}\right]

And here is that matrix put back into equations form:

P_{000} = C_{0}\\ P_{001} - P_{021} + P_{100} - P_{120} = 3C_{1} - 3C_{5}\\ P_{010} + P_{021} + P_{120} = 3C_{5}\\ P_{011} + P_{110} + P_{121} = 3C_{4}\\ P_{020} = C_{6}\\ P_{101} - P_{121} = 3C_{2} - 3C_{4}\\ P_{111} = C_{3}\\

It worked! It also has 5 free variables.

This pattern works for as many curves as i tried (21 of them), and each time you add another curve / row of this pattern you gain another free variable.

So, storing 2 curves results in 6 free variables, 3 curves has 7 free variables, 4 curves has 8 free variables and so on.

9. 3D Texture / Cubic Piecewise Curves – Storage Efficiency

Let’s compare storage efficiency of these 3d texture / cubic curve techniques like we did for the 2d texture / quadratic curve techniques.

\begin{array}{|cccccc|} \hline & & \rlap{\text{3d / Cubic - Extension}} & & & \\ \hline \text{Curves} & \text{Dimensions} & \text{Pixels} & \text{Control Points} & \text{Pixels Per Control Point} & \text{Pixels Per Curve} \\ \hline 1 & 2x2x2 & 8 & 4 & 2.00 & 8.00 \\ 2 & 2x3x2 & 12 & 8 & 1.50 & 6.00 \\ 3 & 2x4x2 & 16 & 12 & 1.33 & 5.33 \\ 4 & 2x6x2 & 24 & 16 & 1.50 & 6.00 \\ 5 & 2x7x2 & 28 & 20 & 1.40 & 5.60 \\ 6 & 2x8x2 & 32 & 24 & 1.33 & 5.33 \\ \hline \end{array}

\begin{array}{|cccccc|} \hline & & \rlap{\text{3d / Quadratic - Extension + C0 Continuity}} & & & \\ \hline \text{Curves} & \text{Dimensions} & \text{Pixels} & \text{Control Points} & \text{Pixels Per Control Point} & \text{Pixels Per Curve} \\ \hline 1 & 2x2x2 & 8 & 4 & 2.00 & 8.00 \\ 2 & 2x3x2 & 12 & 7 & 1.71 & 6.00 \\ 3 & 2x4x2 & 16 & 10 & 1.60 & 5.33 \\ 4 & 2x5x2 & 20 & 13 & 1.54 & 5.00 \\ 5 & 2x6x2 & 24 & 16 & 1.50 & 4.80 \\ 6 & 2x7x2 & 28 & 19 & 1.47 & 4.67 \\ \hline \end{array}

\begin{array}{|cccccc|} \hline & & \rlap{\text{3d / Cubic - Original Technique}} & & & \\ \hline \text{Curves} & \text{Dimensions} & \text{Pixels} & \text{Control Points} & \text{Pixels Per Control Point} & \text{Pixels Per Curve} \\ \hline 1 & 2x2x2 & 8 & 4 & 2.00 & 8.00 \\ 2 & 2x4x2 & 16 & 8 & 2.00 & 8.00 \\ 3 & 2x6x2 & 24 & 12 & 2.00 & 8.00 \\ 4 & 2x8x2 & 32 & 16 & 2.00 & 8.00 \\ 5 & 2x10x2 & 40 & 20 & 2.00 & 8.00 \\ 6 & 2x12x2 & 48 & 24 & 2.00 & 8.00 \\ \hline \end{array}

The original technique had a constant 2 pixels per control point and 8 pixels per cubic curve.

The basic extension lets you bring that down to 1.33 pixels per control point, and 5.33 pixels per curve.

If C0 continuity is desired, as you store more and more curves the extension can bring things down towards 1.33 pixels per control point, and 4 pixels per curve. (Remember that with the C0 extension you have 4 control points for the first curve and then 3 more for each subsequent curve, so that 1.33 pixels per control point isn’t exactly an apples to apples comparison vs the basic extension).

The pattern continues for 4D textures and higher (for higher than cubic curves too!), but working through the 2d and 3d cases for quadratic / cubic curves is the most likely usage case both because 4d textures and higher are kind of excessive (probably you’d need to do multiple texture reads to simulate them), but also when fitting curves to data, quadratic and cubic curves tend to do well in that they don’t usually overfit the data or have as many problems with ringing.

Despite that, I do think it’s useful to look at it from an N dimensional point of view to see the larger picture, so let’s do that next.

10. Generalizing The Unit Hyper Cube

Let’s ignore the zig zag sampling pattern and storing multiple curves in a texture and just get back to the basic idea.

Given an N dimensional texture that is 2x2x…x2 that you are going to sample across the diagonal to get a degree N Bezier curve from, how do you know what values to put in which control points to use this technique?

You could derive it from N-linear interpolation, but that is a lot of work.

The good news is it turns out there is a simple pattern, that is also pretty interesting.

Let’s check out the 1d, 2d and 3d cases to see what patterns we might be able to see.

1d Texture / linear Bezier / linear interpolation:

P_{0} = C_0 \\ P_{1} = C_1 \\

\left[\begin{array}{rr|rr} P_{0} & P_{1} & C_0 & C_1\\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ \end{array}\right]

2d Texture / Quadratic Bezier:

P_{00} = C_0 \\ P_{01}+P_{10} = 2*C_1 \\ P_{11} = C_2 \\

\left[\begin{array}{rrrr|rrr} P_{00} & P_{01} & P_{10} & P_{11} & C_0 & C_1 & C_2 \\ 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 2 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 1 \end{array}\right]

3d Texture / Cubic Bezier:

P_{000} = C_0\\ P_{001}+P_{010}+P_{100} = 3*C_1\\ P_{011}+P_{101}+P_{110} = 3*C_2\\ P_{111} = C_3

\left[\begin{array}{rrrrrrrr|rrrr} P_{000} & P_{001} & P_{010} & P_{011} & P_{100} & P_{101} & P_{110} & P_{111} &    C_0 & C_1 & C_2 & C_3 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &    1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 &    0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 &    0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 &    0 & 0 & 0 & 1 \\ \end{array}\right]

The first pattern you might see is that the right side of the equations for an N dimensional hypercube is the identity matrix, but instead of using 1 for the value along the diagonal, it uses values from Pascal’s Triangle (binomial coefficients).

To simplify this a bit though, we could also notice that the number on the right side of the equation equals the sum of the numbers on the left side of the equation. Mathematically it would be the same to say that the numbers on the left side of the equation have to sum up to 1. This would make the matrix on the right just be the identity matrix and we can forget about Pascal’s triangle numbers (they will show up implicitly as divisors of the left side equation coefficients but there’s no need to explicitly calculate them).

But then we are still left with the matrix on the left. How do we know which pixels belong in which rows?

It turns out there is another interesting pattern here. In all the matrices above it follows this pattern:

  • Row 0 has a “1” wherever the pixel coordinate has 0 ones set
  • Row 1 has a “1” wherever the pixel coordinate has 1 ones set
  • Row 2 has a “1” wherever the pixel coordinate has 2 ones set
  • Row 3 has a “1” wherever the pixel coordinate has 3 ones set
  • ….

That pattern continues indefinitely, but don’t forget that the numbers (coefficients) on the left side of the equation must add up to one.

Here is the matrix form of 1d / linear, 2d / quadratic, and 3d / cubic again with the right matrix being the identity matrix, and the equations below them. Notice the pattern about counts of one bits in each row!

1D:

\left[\begin{array}{rr|rr} P_{0} & P_{1} & C_0 & C_1\\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ \end{array}\right]

P_0 = C_0 \\ P_1 = C_1 \\

2D:

\left[\begin{array}{rrrr|rrr} P_{00} & P_{01} & P_{10} & P_{11} & C_0 & C_1 & C_2 \\ 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 1 \end{array}\right]

P_{00} = C_0 \\ P_{01}/2 + P_{10}/2 = C_1 \\ P_{11} = C_2 \\

3D:

\left[\begin{array}{rrrrrrrr|rrrr} P_{000} & P_{001} & P_{010} & P_{011} & P_{100} & P_{101} & P_{110} & P_{111} &    C_0 & C_1 & C_2 & C_3 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &    1 & 0 & 0 & 0 \\ 0 & \frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{3} & 0 & 0 & 0 &    0 & 1 & 0 & 0 \\ 0 & 0 & 0 & \frac{1}{3} & 0 & \frac{1}{3} & \frac{1}{3} & 0 &    0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 &    0 & 0 & 0 & 1 \\ \end{array}\right]

P_{000} = C_0 \\ P_{001}/3 + P_{010}/3 + P_{100}/3 = C_1 \\ P_{011}/3 + P_{101}/3 + P_{110}/3 = C_2 \\ P_{111} = C_3 \\

Here are the formulas for linear, quadratic and cubic Bezier curves to show a different way of looking at this. Below each is the same formula but with the 1d, 2d and 3d pixels in the formula instead of the control points, using the formulas above which relate pixel values to control point values. Note that I have replaced (1-t) with s for easier reading.

f(t) = As + Bt \\ \\ f(t) = P_0s + P_1t\\

f(t) = As^2 + 2Bst + Ct^2 \\ \\ f(t) = P_{00}s^2 + (P_{01}+P_{10})st + P_{11}t^2 \\

f(t) = As^3 + 3Bs^2t + 3Cst^2 + Dt^3 \\ f(t) = P_{000}s^3 + (P_{001}+P_{010}+P_{100})s^2t + (P_{011}+P_{101}+P_{110})st^2 + P_{111}t^3

I think it’s really interesting how in the last equation as an example, “3B” literally becomes 3 values which could have the value of B. In the plain vanilla technique they did have the value of B. In this extension, the only requirement is that they average to B.

It’s also interesting to notice that if you have an N bit number and you count how many permutations have each possible number of bits turned on, the resulting counts is the Pascal’s triangle row. That is nothing new, but it seems like there might be a fun way to convert a set of random numbers (white noise) into a Gaussian distribution, just by counting how many one bits there were in each number. That isn’t new either, and there are better algorithms, but still I think it’s an interesting idea, and may be useful in a pinch since it seems pretty computationally inexpensive.

11. Closing

This extension makes storage efficiency a bit better than the plain vanilla technique, especially if you are interested in C0 continuous curves.

The extension does come at a price though, as you may find yourself in a situation where you need to store a value that is outside of the possible values for common data formats to store (such as needing to store a negative number or a larger than 255 number in a uint8).

Even so, if these three criteria are met:

  1. You are already storing data in textures. (Counter point: compute is usually preferred over texture lookup these days)
  2. You are relying on the texture interpolator to interpolate values between data points. (Counter point: if you don’t want the interpolation, use a buffer instead so you fit more of the data you actually care about in the cache)
  3. You are storing data in 16 or 32 bit real numbers. (Counter point: uint8 is half as large as 16 bit and a quarter as large as 32 bit already)

Then this may be an attractive solution for you, even over the plain vanilla technique.

For future work, I think it would be interesting to see how this line of thinking applies to surfaces.

I also think there is probably some fertile ground looking into what happens when sampling off of the diagonal of the textures. Intuitively it seems you might be able to store some special case higher order curves in lower dimension / storage textures, but I haven’t looked into it yet.

A common usage case when encoding data in a texture would probably include putting curves side by side on the x axis of the texture. It could be interesting to look into whether curves need to be completely separate from each other horizontally (aka 2 pixel of width for each track of curves in the texture), or if you could perhaps fit two curves side by side in a 3 pixel width, or any similar ideas.

Lastly, when looking at these groups of points on these N dimensional hyper-cubes, I can’t help but wonder what kinds of shapes they are. Are they simplices? If so, is there a pattern to the dimensions they are of?

It’s a bit hard to visualize, but taking a look at the first few rows of pascal’s triangle / hyper cubes here’s what I found:

  • Dimension 1 (line) : Row 2 = 1,1. Those are both points, so are simplices of 0 dimension.
  • Dimension 2 (square) : Row 3 = 1,2,1. The 1’s are points, the 2 is a line, so are simplices of dimension 0, 1, 0.
  • Dimension 3 (cube) : Row 4 = 1,3,3,1. The 1’s are still points. The 3’s are in fact triangles, I checked. So, simplices of dimension 0, 2, 2, 0.
  • Dimension 4 (hypercube) : Row 5 = 1,4,6,4,1. The 1’s are points. The 4’s are tetrahedrons. The 6 is a 3 dimensional object. I’m not sure it’s shape but that makes it not be a simplex. Possibly it’s two simplices fused together some how. I don’t really know. So, the dimensions anyways are: 0, 3, 3, 3, 0.
  • Beyond? That’s as far as I looked. If you look further / deeper and find anything interesting please share!

Update: PBS Infinite Series ended up posting a video on the topic of hypercube slices and pascal’s triangle (seriously!). Give it a watch if you are interested in how these things relate: Dissecting Hypercubes with Pascal’s Triangle | Infinite Series

Questions, comments, corrections, etc? Post a comment below or hit me up on twitter at @Atrix256.

If you have a usage case for this or any of the related techniques, I’d love to hear about them.

Thanks for reading!

12. Code

It’s easy to talk about things and claim that everything is correct, when in fact, the moment you try it, everything falls apart.

I made up some simple standalone c++ code that goes through the cases we talked about, doing the math we did, and also verifying that the texture interpolation is equivalent to actually calculating Bezier curves (using Bernstein polynomials).

You can also use this code as a starting point to explore higher curve counts or other storage patterns. It uses only standard includes and no libraries, so it should be real easy to drop this into a compiler and start experimenting.

Here’s some example output, which shows 6 cubic curves stored in a 3d texture using the zig zag sampling pattern.

Here’s the code:

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdlib.h>
#include <array>
#include <algorithm>
#include <unordered_set>
#include <random>
#include <vector>

#define SHOW_MATHJAX_MATRIX() 0
#define SHOW_MATHJAX_EQUATIONS() 0
#define SHOW_EQUATIONS_BEFORE_SOLVE() 0
#define EQUALITY_TEST_SAMPLES 1000

typedef int32_t TINT;

TINT CalculateGCD (TINT smaller, TINT larger);
TINT CalculateLCM (TINT smaller, TINT larger);

// A rational number, to handle fractional numbers without typical floating point issues
struct CRationalNumber
{
	CRationalNumber (TINT numerator = 0, TINT denominator = 1)
		: m_numerator(numerator)
		, m_denominator(denominator)
	{ }

	TINT m_numerator;
	TINT m_denominator;

	CRationalNumber Reciprocal () const
	{
		return CRationalNumber(m_denominator, m_numerator);
	}

	void Reduce ()
	{
		if (m_numerator != 0 && m_denominator != 0)
		{
			TINT div = CalculateGCD(m_numerator, m_denominator);
			m_numerator /= div;
			m_denominator /= div;
		}

		if (m_denominator < 0)
		{
			m_numerator *= -1;
			m_denominator *= -1;
		}
		
		if (m_numerator == 0)
			m_denominator = 1;
	}

	bool IsZero () const
	{
		return m_numerator == 0 && m_denominator != 0;
	}

	// NOTE: the functions below assume Reduce() has happened
	bool IsOne () const
	{
		return m_numerator == 1 && m_denominator == 1;
	}

	bool IsMinusOne () const
	{
		return m_numerator == -1 && m_denominator == 1;
	}

	bool IsWholeNumber () const
	{
		return m_denominator == 1;
	}
};

// Define a vector as an array of rational numbers
template<size_t N>
using TVector = std::array<CRationalNumber, N>;

// Define a matrix as an array of vectors
template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

//===================================================================================================================================
//                                              GCD / LCM
//===================================================================================================================================

// from my blog post: https://blog.demofox.org/2015/01/24/programmatically-calculating-gcd-and-lcm/

TINT CalculateGCD (TINT smaller, TINT larger)
{
	// make sure A <= B before starting
	if (larger < smaller)
		std::swap(smaller, larger);

	// loop
	while (1)
	{
		// if the remainder of larger / smaller is 0, they are the same
		// so return smaller as the GCD
		TINT remainder = larger % smaller;
		if (remainder == 0)
			return smaller;

		// otherwise, the new larger number is the old smaller number, and
		// the new smaller number is the remainder
		larger = smaller;
		smaller = remainder;
	}
}

TINT CalculateLCM (TINT A, TINT B)
{
	// LCM(A,B) = (A/GCD(A,B))*B
	return (A / CalculateGCD(A, B))*B;
}

//===================================================================================================================================
//                                              RATIONAL NUMBER MATH
//===================================================================================================================================

void CommonDenominators (CRationalNumber& a, CRationalNumber& b)
{
	TINT lcm = CalculateLCM(a.m_denominator, b.m_denominator);

	a.m_numerator *= lcm / a.m_denominator;
	b.m_numerator *= lcm / b.m_denominator;

	a.m_denominator = lcm;
	b.m_denominator = lcm;
}

bool operator == (const CRationalNumber& a, const CRationalNumber& b)
{
	CRationalNumber _a(a), _b(b);
	CommonDenominators(_a, _b);
	return _a.m_numerator == _b.m_numerator;
}

void operator *= (CRationalNumber& a, const CRationalNumber& b)
{
	a.m_numerator *= b.m_numerator;
	a.m_denominator *= b.m_denominator;
}

CRationalNumber operator * (const CRationalNumber& a, const CRationalNumber& b)
{
	return CRationalNumber(a.m_numerator * b.m_numerator, a.m_denominator * b.m_denominator);
}

void operator -= (CRationalNumber& a, const CRationalNumber& b)
{
	CRationalNumber _b(b);
	CommonDenominators(a, _b);
	a.m_numerator -= _b.m_numerator;
}

//===================================================================================================================================
//                                              GAUSS-JORDAN ELIMINATION CODE
//===================================================================================================================================

// From my blog post: https://blog.demofox.org/2017/04/10/solving-n-equations-and-n-unknowns-the-fine-print-gauss-jordan-elimination/

// Make a specific row have a 1 in the colIndex, and make all other rows have 0 there
template <size_t M, size_t N>
bool MakeRowClaimVariable (TMatrix<M, N>& matrix, size_t rowIndex, size_t colIndex)
{
    // Find a row that has a non zero value in this column and swap it with this row
    {
        // Find a row that has a non zero value
        size_t nonZeroRowIndex = rowIndex;
        while (nonZeroRowIndex < M && matrix[nonZeroRowIndex][colIndex].IsZero())
            ++nonZeroRowIndex;
 
        // If there isn't one, nothing to do
        if (nonZeroRowIndex == M)
            return false;
 
        // Otherwise, swap the row
        if (rowIndex != nonZeroRowIndex)
            std::swap(matrix[rowIndex], matrix[nonZeroRowIndex]);
    }
 
    // Scale this row so that it has a leading one
    CRationalNumber scale = matrix[rowIndex][colIndex].Reciprocal();
	for (size_t normalizeColIndex = colIndex; normalizeColIndex < N; ++normalizeColIndex)
	{
		matrix[rowIndex][normalizeColIndex] *= scale;
		matrix[rowIndex][normalizeColIndex].Reduce();
	}
 
    // Make sure all rows except this one have a zero in this column.
    // Do this by subtracting this row from other rows, multiplied by a multiple that makes the column disappear.
    for (size_t eliminateRowIndex = 0; eliminateRowIndex < M; ++eliminateRowIndex)
    {
        if (eliminateRowIndex == rowIndex)
            continue;
 
        CRationalNumber scale = matrix[eliminateRowIndex][colIndex];
		for (size_t eliminateColIndex = 0; eliminateColIndex < N; ++eliminateColIndex)
		{
			matrix[eliminateRowIndex][eliminateColIndex] -= matrix[rowIndex][eliminateColIndex] * scale;
			matrix[eliminateRowIndex][eliminateColIndex].Reduce();
		}
    }
 
    return true;
}
 
// make matrix into reduced row echelon form
template <size_t M, size_t N>
void GaussJordanElimination (TMatrix<M, N>& matrix)
{
    size_t rowIndex = 0;
    for (size_t colIndex = 0; colIndex < N; ++colIndex)
    {
        if (MakeRowClaimVariable(matrix, rowIndex, colIndex))
        {
            ++rowIndex;
            if (rowIndex == M)
                return;
        }
    }
}

//===================================================================================================================================
//                                                           Shared Testing Code
//===================================================================================================================================

template <size_t M, size_t N, typename LAMBDA>
void PrintEquations (
	TMatrix<M, N>& augmentedMatrix,
	size_t numPixels,
	LAMBDA& pixelIndexToCoordinates
)
{
	char pixelCoords[10];

#if SHOW_MATHJAX_MATRIX()
	// print the matrix opening stuff
	printf("\left[\begin{array}{");
	for (size_t i = 0; i < N; ++i)
	{
		if (i == numPixels)
			printf("|");
		printf("r");
	}
	printf("}n");
	// print the header row
	for (size_t i = 0; i < numPixels; ++i)
	{
		pixelIndexToCoordinates(i, pixelCoords);
		if (i == 0)
			printf("P_{%s}", pixelCoords);
		else
			printf(" & P_{%s}", pixelCoords);
	}
	for (size_t i = numPixels; i < N; ++i)
	{
		printf(" & C_{%zu}", i-numPixels);
	}
	printf(" \\n");

	// Print the matrix
	for (const TVector<N>& row : augmentedMatrix)
	{
		bool first = true;
		for (const CRationalNumber& n : row)
		{
			if (first)
				first = false;
			else
				printf(" & ");

			if (n.IsWholeNumber())
				printf("%i", n.m_numerator);
			else
				printf("\frac{%i}{%i}", n.m_numerator, n.m_denominator);
		}
		printf(" \\n");
	}

	// print the matrix closing stuff
	printf("\end{array}\right]n");
#endif

	// print equations
	for (const TVector<N>& row : augmentedMatrix)
	{
		// indent
		#if SHOW_MATHJAX_EQUATIONS() == 0
			printf("    ");
		#endif

		// left side of the equation
		bool leftHasATerm = false;
		for (size_t i = 0; i < numPixels; ++i)
		{
			if (!row[i].IsZero())
			{
				if (leftHasATerm)
					printf(" + ");
				pixelIndexToCoordinates(i, pixelCoords);

				#if SHOW_MATHJAX_EQUATIONS()
					if (row[i].IsOne())
						printf("P_{%s}", pixelCoords);
					else if (row[i].IsMinusOne())
						printf("-P_{%s}", pixelCoords);
					else if (row[i].IsWholeNumber())
						printf("%iP_{%s}", row[i].m_numerator, pixelCoords);
					else if (row[i].m_numerator == 1)
						printf("P_{%s}/%i", pixelCoords, row[i].m_denominator);
					else
						printf("P_{%s} * %i/%i", pixelCoords, row[i].m_numerator, row[i].m_denominator);
				#else
					if (row[i].IsOne())
						printf("P%s", pixelCoords);
					else if (row[i].IsMinusOne())
						printf("-P%s", pixelCoords);
					else if (row[i].IsWholeNumber())
						printf("%iP%s", row[i].m_numerator, pixelCoords);
					else if (row[i].m_numerator == 1)
						printf("P%s/%i", pixelCoords, row[i].m_denominator);
					else
						printf("P%s * %i/%i", pixelCoords, row[i].m_numerator, row[i].m_denominator);
				#endif
				leftHasATerm = true;
			}
		}
		if (!leftHasATerm)
			printf("0 = ");
		else
			printf(" = ");

		// right side of the equation
		bool rightHasATerm = false;
		for (size_t i = numPixels; i < N; ++i)
		{
			if (!row[i].IsZero())
			{
				if (rightHasATerm)
					printf(" + ");

				#if SHOW_MATHJAX_EQUATIONS()
					if (row[i].IsOne())
						printf("C_{%zu}", i - numPixels);
					else if (row[i].IsMinusOne())
						printf("-C_{%zu}", i - numPixels);
					else if (row[i].IsWholeNumber())
						printf("%iC_{%zu}", row[i].m_numerator, i - numPixels);
					else if (row[i].m_numerator == 1)
						printf("C_{%zu}/%i", i - numPixels, row[i].m_denominator);
					else
						printf("C_{%zu} * %i/%i", i - numPixels, row[i].m_numerator, row[i].m_denominator);
				#else
					if (row[i].IsOne())
						printf("C%zu", i - numPixels);
					else if (row[i].IsMinusOne())
						printf("-C%zu", i - numPixels);
					else if (row[i].IsWholeNumber())
						printf("%iC%zu", row[i].m_numerator, i - numPixels);
					else if (row[i].m_numerator == 1)
						printf("C%zu/%i", i - numPixels, row[i].m_denominator);
					else
						printf("C%zu * %i/%i", i - numPixels, row[i].m_numerator, row[i].m_denominator);
				#endif
				rightHasATerm = true;
			}
		}

		#if SHOW_MATHJAX_EQUATIONS()
			printf("\\n");
		#else
			printf("n");
		#endif
	}
}

template <size_t M, size_t N, typename LAMBDA>
bool SolveMatrixAndPrintEquations (
	TMatrix<M, N>& augmentedMatrix,
	size_t numPixels,
	std::unordered_set<size_t>& freeVariables,
	LAMBDA& pixelIndexToCoordinates
)
{
	#if SHOW_EQUATIONS_BEFORE_SOLVE()
	printf("   Initial Equations:n");
	PrintEquations(augmentedMatrix, numPixels, pixelIndexToCoordinates);
	printf("   Solved Equations:n");
	#endif

	// put augmented matrix into rref
	GaussJordanElimination(augmentedMatrix);

	// Print equations
	PrintEquations(augmentedMatrix, numPixels, pixelIndexToCoordinates);

	// Get free variables and check for control point constraint
	bool constraintFound = false;
	for (const TVector<N>& row : augmentedMatrix)
	{
		bool leftHasATerm = false;
		for (size_t i = 0; i < numPixels; ++i)
		{
			if (!row[i].IsZero())
			{
				if (leftHasATerm)
					freeVariables.insert(i);
				else
					leftHasATerm = true;
			}
		}

		bool rightHasATerm = false;
		for (size_t i = numPixels; i < N; ++i)
		{
			if (!row[i].IsZero())
				rightHasATerm = true;
		}

		if (!leftHasATerm && rightHasATerm)
			constraintFound = true;
	}

	printf("  %zu free variables.n", freeVariables.size());

	if (constraintFound)
	{
		printf("  Constraint Found.  This configuration doesn't work for the general case!nn");
		return false;
	}

	return true;
}

float lerp (float t, float a, float b)
{
	return a * (1.0f - t) + b * t;
}

template <size_t NUMPIXELS, size_t NUMCONTROLPOINTS, size_t NUMEQUATIONS>
void FillInPixelsAndControlPoints (
	std::array<float, NUMPIXELS>& pixels,
	std::array<float, NUMCONTROLPOINTS>& controlPoints,
	const TMatrix<NUMEQUATIONS, NUMPIXELS+ NUMCONTROLPOINTS>& augmentedMatrix,
	const std::unordered_set<size_t>& freeVariables)
{
	// come up with random values for the control points and free variable pixels
	static std::random_device rd;
	static std::mt19937 mt(rd());
	static std::uniform_real_distribution<float> dist(-10.0f, 10.0f);
	for (float& cp : controlPoints)
		cp = dist(mt);
	for (size_t var : freeVariables)
		pixels[var] = dist(mt);

	// fill in the non free variable pixels per the equations
	for (const TVector<NUMPIXELS + NUMCONTROLPOINTS>& row : augmentedMatrix)
	{
		// the first non zero value is the non free pixel we need to set.
		// all other non zero values are free variables that we previously calculated values for
		bool foundPixel = false;
		size_t pixelIndex = 0;
		for (size_t i = 0; i < NUMPIXELS; ++i)
		{
			if (!row[i].IsZero())
			{
				// we are setting the first pixel we find
				if (!foundPixel)
				{
					pixelIndex = i;
					foundPixel = true;
				}
				// subtract out all free variables which is the same as moving them to the right side of the equation
				else
				{
					pixels[pixelIndex] -= pixels[i] * float(row[i].m_numerator) / float(row[i].m_denominator);
				}
			}
		}

		// if there is no pixel value to set on the left side of the equation, ignore this row
		if (!foundPixel)
			continue;

		// add in the values from the right side of the equation
		for (size_t i = NUMPIXELS; i < NUMPIXELS + NUMCONTROLPOINTS; ++i)
		{
			if (!row[i].IsZero())
				pixels[pixelIndex] += controlPoints[i - NUMPIXELS] * float(row[i].m_numerator) / float(row[i].m_denominator);
		}
	}
}

size_t TextureCoordinateToPixelIndex2d (size_t width, size_t height, size_t y, size_t x)
{
	return y * width + x;
};

void PixelIndexToTextureCoordinate2d (size_t width, size_t height, size_t pixelIndex, size_t& y, size_t& x)
{
	x = pixelIndex % width;
	y = pixelIndex / width;
}

size_t TextureCoordinateToPixelIndex3d (size_t width, size_t height, size_t depth, size_t z, size_t y, size_t x)
{
	return 
		z * width * height + 
		y * width +
		x;
};

void PixelIndexToTextureCoordinate3d (size_t width, size_t height, size_t depth, size_t pixelIndex, size_t& z, size_t& y, size_t& x)
{
	x = pixelIndex % width;

	pixelIndex = pixelIndex / width;

	y = pixelIndex % height;

	pixelIndex = pixelIndex / height;

	z = pixelIndex;
}

void PiecewiseCurveTime (float time, size_t numCurves, size_t& outCurveIndex, float& outTime)
{
	time *= float(numCurves);
	outCurveIndex = size_t(time);

	if (outCurveIndex == numCurves)
	{
		outCurveIndex = numCurves - 1;
		outTime = 1.0f;
	}
	else
	{
		outTime = std::fmodf(time, 1.0f);
	}
}



//===================================================================================================================================
//                                                       2D Textures / Quadratic Curves
//===================================================================================================================================
//
// Find the limitations of this pattern and show equivalence to Bernstein Polynomials (Bezier Curve Equations). Pattern details below.
//
//  --- For first curve, do:
//
//  P00 P01
//  P10 P11
//
//  P00 = C0                        0
//  P01 + P10 = 2 * C1              1 2
//  P11 = C2                        3
//
//  --- For each additional curve, add two points to the end like this:
//
//  P00 P01
//  P10 P11
//  P20 P21
//
//  P00 = C0                        0
//  P01 + P10 = 2 * C1              1 2
//  P11 = C2                        3
//
//  P10 = C3                        1
//  P11 + P20 = 2 * C4              3 4
//  P21 = C5                        5
//
//  and so on...
//  each equation is then multiplied by a value so the right side is identity and left side coefficients add up to 1.
//
//  --- Other details:
//  
//  * 3 control points per curve.
//  * image width it 2
//  * image height is 1 + NumCurves.
//  * there are 3 equations per curve, so 3 rows in the augmented matrix per curve.
//  * augmented matrix columns = num pixels (left columns) + num control points (right columns)
//

template <size_t N>
float EvaluateBernsteinPolynomial2DQuadratic (float totalTime, const std::array<float, N>& coefficients)
{
	const size_t c_numCurves = N / 3;

	float t;
	size_t startCurve;
	PiecewiseCurveTime(totalTime, c_numCurves, startCurve, t);

	size_t offset = startCurve * 3;

	float s = 1.0f - t;
	return
		coefficients[offset + 0] * s * s +
		coefficients[offset + 1] * s * t * 2.0f +
		coefficients[offset + 2] * t * t;
}

template <size_t N>
float EvaluateLinearInterpolation2DQuadratic (float totalTime, const std::array<float, N>& pixels)
{
	const size_t c_numCurves = (N / 2) - 1;

	float t;
	size_t startRow;
	PiecewiseCurveTime(totalTime, c_numCurves, startRow, t);

	float row0 = lerp(t, pixels[startRow * 2], pixels[startRow * 2 + 1]);
	float row1 = lerp(t, pixels[(startRow + 1) * 2], pixels[(startRow + 1) * 2 + 1]);
	return lerp(t, row0, row1);
}

template <size_t NUMCURVES>
void Test2DQuadratic ()
{
	const size_t c_imageWidth = 2;
	const size_t c_imageHeight = NUMCURVES + 1;
	const size_t c_numPixels = c_imageWidth * c_imageHeight;
	const size_t c_numControlPoints = NUMCURVES * 3;
	const size_t c_numEquations = NUMCURVES * 3;

	// report values for this test
	printf("  %zu curves.  %zu control points.  2x%zu texture = %zu pixels.n", NUMCURVES, c_numControlPoints, c_imageHeight, c_numPixels);
	printf("  %f pixels per curve.  %f pixels per control point.n", float(c_numPixels) / float(NUMCURVES), float(c_numPixels) / float(c_numControlPoints));

	// lambdas to convert between pixel index and texture coordinates
	auto TextureCoordinateToPixelIndex = [&](size_t y, size_t x) -> size_t
	{
		return TextureCoordinateToPixelIndex2d(c_imageWidth, c_imageHeight, y, x);
	};
	auto pixelIndexToCoordinates = [&](size_t pixelIndex, char pixelCoords[10])
	{
		size_t y, x;
		PixelIndexToTextureCoordinate2d(c_imageWidth, c_imageHeight, pixelIndex, y, x);
		sprintf(pixelCoords, "%zu%zu", y, x);
	};

	// create the equations
	TMatrix<c_numEquations, c_numPixels + c_numControlPoints> augmentedMatrix;
	for (size_t i = 0; i < c_numEquations; ++i)
	{
		TVector<c_numPixels + c_numControlPoints>& row = augmentedMatrix[i];

		// left side of the equation goes in this yx coordinate pattern:
		//   00 
		//   01 10
		//   11
		// But, curve index is added to the y index.
		// Also, left side coefficients must add up to 1.
		size_t curveIndex = i / 3;
		switch (i % 3)
		{
			case 0:
			{
				row[TextureCoordinateToPixelIndex(curveIndex + 0, 0)] = CRationalNumber(1, 1);
				break;
			}
			case 1:
			{
				row[TextureCoordinateToPixelIndex(curveIndex + 0, 1)] = CRationalNumber(1, 2);
				row[TextureCoordinateToPixelIndex(curveIndex + 1, 0)] = CRationalNumber(1, 2);
				break;
			}
			case 2:
			{
				row[TextureCoordinateToPixelIndex(curveIndex + 1, 1)] = CRationalNumber(1, 1);
				break;
			}
		}

		// right side of the equation is identity
		row[c_numPixels + i] = CRationalNumber(1);
	}

	// solve the matrix if possible and print out the equations
	std::unordered_set<size_t> freeVariables;
	if (!SolveMatrixAndPrintEquations(augmentedMatrix, c_numPixels, freeVariables, pixelIndexToCoordinates))
		return;

	// Next we need to show equality between the N-linear interpolation of our pixels and bernstein polynomials with our control points as coefficients

	// Fill in random values for our control points and free variable pixels, and fill in the other pixels as the equations dictate 
	std::array<float, c_numPixels> pixels = { 0 };
	std::array<float, c_numControlPoints> controlPoints = { 0 };
	FillInPixelsAndControlPoints<c_numPixels, c_numControlPoints, c_numEquations>(pixels, controlPoints, augmentedMatrix, freeVariables);

	// do a number of samples of each method at the same time values, and report the largest difference (error)
	float largestDifference = 0.0f;
	for (size_t i = 0; i < EQUALITY_TEST_SAMPLES; ++i)
	{
		float t = float(i) / float(EQUALITY_TEST_SAMPLES - 1);

		float value1 = EvaluateBernsteinPolynomial2DQuadratic(t, controlPoints);
		float value2 = EvaluateLinearInterpolation2DQuadratic(t, pixels);

		largestDifference = std::max(largestDifference, std::abs(value1 - value2));
	}
	printf("  %i Samples, Largest Error = %fnn", EQUALITY_TEST_SAMPLES, largestDifference);
}

void Test2DQuadratics ()
{
	printf("Testing 2D Textures / Quadratic Curvesnn");

	Test2DQuadratic<1>();
	Test2DQuadratic<2>();
	Test2DQuadratic<3>();

	system("pause");
}

//===================================================================================================================================
//                                    2D Textures / Quadratic Curves With C0 Continuity
//===================================================================================================================================
//
// Find the limitations of this pattern and show equivalence to Bernstein Polynomials (Bezier Curve Equations). Pattern details below.
//
//  --- For first curve, do:
//
//  P00 P01
//  P10 P11
//
//  P00 = C0                        0
//  P01 + P10 = 2 * C1              1 2
//  P11 = C2                        3
//
//  --- For second curve, do:
//
//  P00 P01
//  P10 P11
//  P20 P21
//
//  P00 = C0                        0
//  P01 + P10 = 2 * C1              1 2
//  P11 = C2                        3
//
//  P10 + P21 = 2 * C3              2 5
//  P20 = C4                        4
//
//  --- For third curve, do:
//
//  P00 P01
//  P10 P11
//  P20 P21
//  P30 P31
//
//  P00 = C0
//  P01 + P10 = 2 * C1
//  P11 = C2
//
//  P10 + P21 = 2 * C3
//  P20 = C4
//
//  P21 + P30 = 2 * C5
//  P31 = C6
//
//  and so on...
//  each equation is then multiplied by a value so the right side is identity and left side coefficients add up to 1.
//
//  --- Other details:
//  
//  * control points: 1 + NumCurves*2.
//  * image width it 2
//  * image height is 1 + NumCurves.
//  * equations: 1 + NumCurves*2.  This many rows in the augmented matrix.
//  * augmented matrix columns = num pixels (left columns) + num control points (right columns)
//

template <size_t N>
float EvaluateBernsteinPolynomial2DQuadraticC0 (float totalTime, const std::array<float, N>& coefficients)
{
	const size_t c_numCurves = (N - 1) / 2;

	float t;
	size_t startCurve;
	PiecewiseCurveTime(totalTime, c_numCurves, startCurve, t);

	size_t offset = startCurve * 2;

	float s = 1.0f - t;
	return
		coefficients[offset + 0] * s * s +
		coefficients[offset + 1] * s * t * 2.0f +
		coefficients[offset + 2] * t * t;
}

template <size_t N>
float EvaluateLinearInterpolation2DQuadraticC0 (float totalTime, const std::array<float, N>& pixels)
{
	const size_t c_numCurves = (N / 2) - 1;

	float t;
	size_t startRow;
	PiecewiseCurveTime(totalTime, c_numCurves, startRow, t);

	// Note we flip x axis direction every odd row to get the zig zag
	float horizT = (startRow % 2) == 0 ? t : 1.0f - t;

	float row0 = lerp(horizT, pixels[startRow * 2], pixels[startRow * 2 + 1]);
	++startRow;
	float row1 = lerp(horizT, pixels[startRow * 2], pixels[startRow * 2 + 1]);
	return lerp(t, row0, row1);
}

template <size_t NUMCURVES>
void Test2DQuadraticC0 ()
{
	const size_t c_imageWidth = 2;
	const size_t c_imageHeight = NUMCURVES + 1;
	const size_t c_numPixels = c_imageWidth * c_imageHeight;
	const size_t c_numControlPoints = 1 + NUMCURVES * 2;
	const size_t c_numEquations = 1 + NUMCURVES * 2;

	// report values for this test
	printf("  %zu curves.  %zu control points.  2x%zu texture = %zu pixels.n", NUMCURVES, c_numControlPoints, c_imageHeight, c_numPixels);
	printf("  %f pixels per curve.  %f pixels per control point.n", float(c_numPixels) / float(NUMCURVES), float(c_numPixels) / float(c_numControlPoints));

	// lambdas to convert between pixel index and texture coordinates
	auto TextureCoordinateToPixelIndex = [&] (size_t y, size_t x) -> size_t
	{
		return TextureCoordinateToPixelIndex2d(c_imageWidth, c_imageHeight, y, x);
	};
	auto pixelIndexToCoordinates = [&] (size_t pixelIndex, char pixelCoords[10])
	{
		size_t y, x;
		PixelIndexToTextureCoordinate2d(c_imageWidth, c_imageHeight, pixelIndex, y, x);
		sprintf(pixelCoords, "%zu%zu", y, x);
	};

	// create the equations
	TMatrix<c_numEquations, c_numPixels + c_numControlPoints> augmentedMatrix;
	for (size_t i = 0; i < c_numEquations; ++i)
	{
		TVector<c_numPixels + c_numControlPoints>& row = augmentedMatrix[i];

		// left side of the equation has a pattern like this:
		//   00
		//   01 10
		//
		// But, pattern index is added to the y index.
		// Also, the x coordinates flip from 0 to 1 on those after each pattern.
		// Also, left side coefficients must add up to 1.

		size_t patternIndex = i / 2;
		size_t xoff = patternIndex % 2 == 1;
		size_t xon = patternIndex % 2 == 0;
		switch (i % 2)
		{
			case 0:
			{
				row[TextureCoordinateToPixelIndex(patternIndex + 0, xoff)] = CRationalNumber(1, 1);
				break;
			}
			case 1:
			{
				row[TextureCoordinateToPixelIndex(patternIndex + 0, xon)] = CRationalNumber(1, 2);
				row[TextureCoordinateToPixelIndex(patternIndex + 1, xoff)] = CRationalNumber(1, 2);
				break;
			}
		}

		// right side of the equation is identity
		row[c_numPixels + i] = CRationalNumber(1);
	}
	
	// solve the matrix if possible and print out the equations
	std::unordered_set<size_t> freeVariables;
	if (!SolveMatrixAndPrintEquations(augmentedMatrix, c_numPixels, freeVariables, pixelIndexToCoordinates))
		return;

	// Next we need to show equality between the N-linear interpolation of our pixels and bernstein polynomials with our control points as coefficients

	// Fill in random values for our control points and free variable pixels, and fill in the other pixels as the equations dictate 
	std::array<float, c_numPixels> pixels = { 0 };
	std::array<float, c_numControlPoints> controlPoints = { 0 };
	FillInPixelsAndControlPoints<c_numPixels, c_numControlPoints, c_numEquations>(pixels, controlPoints, augmentedMatrix, freeVariables);

	// do a number of samples of each method at the same time values, and report the largest difference (error)
	float largestDifference = 0.0f;
	for (size_t i = 0; i < EQUALITY_TEST_SAMPLES; ++i)
	{
		float t = float(i) / float(EQUALITY_TEST_SAMPLES - 1);

		float value1 = EvaluateBernsteinPolynomial2DQuadraticC0(t, controlPoints);
		float value2 = EvaluateLinearInterpolation2DQuadraticC0(t, pixels);

		largestDifference = std::max(largestDifference, std::abs(value1 - value2));
	}
	printf("  %i Samples, Largest Error = %fnn", EQUALITY_TEST_SAMPLES, largestDifference);
}

void Test2DQuadraticsC0 ()
{
	printf("nTesting 2D Textures / Quadratic Curves with C0 continuitynn");

	Test2DQuadraticC0<1>();
	Test2DQuadraticC0<2>();
	Test2DQuadraticC0<3>();
	Test2DQuadraticC0<4>();

	system("pause");
}

//===================================================================================================================================
//                                             3D Textures / Cubic Curves
//===================================================================================================================================
//
// Find the limitations of this pattern and show equivalence to Bernstein Polynomials (Bezier Curve Equations). Pattern details below.
//
//  --- For first curve, do:
//
//  P000 P001    P100 P101
//  P010 P011    P110 P111
//
//  P000 = C0                       0
//  P001 + P010 + P100 = 3 * C1     1 2 4
//  P011 + P101 + P110 = 3 * C2     3 5 6
//  P111 = C3                       7
//
//  --- For second curve, do:
//
//  P000 P001    P100 P101
//  P010 P011    P110 P111
//  P020 P021    P120 P121
//
//  P000 = C0                       0
//  P001 + P010 + P100 = 3 * C1     1 2 4
//  P011 + P101 + P110 = 3 * C2     3 7 8
//  P111 = C3                       9
//
//  P010 = C4                       2
//  P011 + P020 + P110 = 3 * C5     3 4 8
//  P021 + P111 + P120 = 3 * C6     5 9 10
//  P121 = C7                       11
//
//  and so on...
//  each equation is then multiplied by a value so the right side is identity and left side coefficients add up to 1.
//
//  --- Other details:
//  
//  * control points: 4 * NumCurves.
//  * image width it 2
//  * image depth is 2
//  * image height is 1 + NumCurves.
//  * equations: 4 * NumCurves.  This many rows in the augmented matrix.
//  * augmented matrix columns = num pixels (left columns) + num control points (right columns)
//

template <size_t N>
float EvaluateBernsteinPolynomial3DCubic (float totalTime, const std::array<float, N>& coefficients)
{
	const size_t c_numCurves = N / 4;

	float t;
	size_t startCurve;
	PiecewiseCurveTime(totalTime, c_numCurves, startCurve, t);

	size_t offset = startCurve * 4;

	float s = 1.0f - t;
	return
		coefficients[offset + 0] * s * s * s +
		coefficients[offset + 1] * s * s * t * 3.0f +
		coefficients[offset + 2] * s * t * t * 3.0f +
		coefficients[offset + 3] * t * t * t;
}

template <size_t N, typename LAMBDA>
float EvaluateLinearInterpolation3DCubic (float totalTime, const std::array<float, N>& pixels, LAMBDA& TextureCoordinateToPixelIndex)
{
	const size_t c_numCurves = (N / 4) - 1;

	float t;
	size_t startRow;
	PiecewiseCurveTime(totalTime, c_numCurves, startRow, t);

	//    rowZYX
	float row00x = lerp(t, pixels[TextureCoordinateToPixelIndex(0, startRow + 0, 0)], pixels[TextureCoordinateToPixelIndex(0, startRow + 0, 1)]);
	float row01x = lerp(t, pixels[TextureCoordinateToPixelIndex(0, startRow + 1, 0)], pixels[TextureCoordinateToPixelIndex(0, startRow + 1, 1)]);
	float row0yx = lerp(t, row00x, row01x);

	float row10x = lerp(t, pixels[TextureCoordinateToPixelIndex(1, startRow + 0, 0)], pixels[TextureCoordinateToPixelIndex(1, startRow + 0, 1)]);
	float row11x = lerp(t, pixels[TextureCoordinateToPixelIndex(1, startRow + 1, 0)], pixels[TextureCoordinateToPixelIndex(1, startRow + 1, 1)]);
	float row1yx = lerp(t, row10x, row11x);

	return lerp(t, row0yx, row1yx);
}

template <size_t NUMCURVES>
void Test3DCubic ()
{
	const size_t c_imageWidth = 2;
	const size_t c_imageHeight = NUMCURVES + 1;
	const size_t c_imageDepth = 2;
	const size_t c_numPixels = c_imageWidth * c_imageHeight * c_imageDepth;
	const size_t c_numControlPoints = NUMCURVES * 4;
	const size_t c_numEquations = NUMCURVES * 4;

	// report values for this test
	printf("  %zu curves.  %zu control points.  2x%zux2 texture = %zu pixels.n", NUMCURVES, c_numControlPoints, c_imageHeight, c_numPixels);
	printf("  %f pixels per curve.  %f pixels per control point.n", float(c_numPixels) / float(NUMCURVES), float(c_numPixels) / float(c_numControlPoints));

	// lambdas to convert between pixel index and texture coordinates
	auto TextureCoordinateToPixelIndex = [&] (size_t z, size_t y, size_t x) -> size_t
	{
		return TextureCoordinateToPixelIndex3d(c_imageWidth, c_imageHeight, c_imageDepth, z, y, x);
	};
	auto pixelIndexToCoordinates = [&] (size_t pixelIndex, char pixelCoords[10])
	{
		size_t z, y, x;
		PixelIndexToTextureCoordinate3d(c_imageWidth, c_imageHeight, c_imageDepth, pixelIndex, z, y, x);
		sprintf(pixelCoords, "%zu%zu%zu", z,y,x);
	};

	// create the equations
	TMatrix<c_numEquations, c_numPixels + c_numControlPoints> augmentedMatrix;
	for (size_t i = 0; i < c_numEquations; ++i)
	{
		TVector<c_numPixels + c_numControlPoints>& row = augmentedMatrix[i];

		// left side of the equation goes in this zyx coordinate pattern:
		//   000 
		//   001 010 100 
		//   011 101 110
		//   111
		// But, curve index is added to the y index.
		// Also, left side coefficients must add up to 1.
		size_t curveIndex = i / 4;
		switch (i % 4)
		{
			case 0:
			{
				row[TextureCoordinateToPixelIndex(0, curveIndex + 0, 0)] = CRationalNumber(1, 1);
				break;
			}
			case 1:
			{
				row[TextureCoordinateToPixelIndex(0, curveIndex + 0, 1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(0, curveIndex + 1, 0)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(1, curveIndex + 0, 0)] = CRationalNumber(1, 3);
				break;
			}
			case 2:
			{
				row[TextureCoordinateToPixelIndex(0, curveIndex + 1, 1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(1, curveIndex + 0, 1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(1, curveIndex + 1, 0)] = CRationalNumber(1, 3);
				break;
			}
			case 3:
			{
				row[TextureCoordinateToPixelIndex(1, curveIndex + 1, 1)] = CRationalNumber(1, 1);
				break;
			}
		}

		// right side of the equation is identity
		row[c_numPixels + i] = CRationalNumber(1);
	}

	// solve the matrix if possible and print out the equations
	std::unordered_set<size_t> freeVariables;
	if (!SolveMatrixAndPrintEquations(augmentedMatrix, c_numPixels, freeVariables, pixelIndexToCoordinates))
		return;

	// Next we need to show equality between the N-linear interpolation of our pixels and bernstein polynomials with our control points as coefficients

	// Fill in random values for our control points and free variable pixels, and fill in the other pixels as the equations dictate 
	std::array<float, c_numPixels> pixels = { 0 };
	std::array<float, c_numControlPoints> controlPoints = { 0 };
	FillInPixelsAndControlPoints<c_numPixels, c_numControlPoints, c_numEquations>(pixels, controlPoints, augmentedMatrix, freeVariables);

	// do a number of samples of each method at the same time values, and report the largest difference (error)
	float largestDifference = 0.0f;
	for (size_t i = 0; i < EQUALITY_TEST_SAMPLES; ++i)
	{
		float t = float(i) / float(EQUALITY_TEST_SAMPLES - 1);

		float value1 = EvaluateBernsteinPolynomial3DCubic(t, controlPoints);
		float value2 = EvaluateLinearInterpolation3DCubic(t, pixels, TextureCoordinateToPixelIndex);

		largestDifference = std::max(largestDifference, std::abs(value1 - value2));
	}
	printf("  %i Samples, Largest Error = %fnn", EQUALITY_TEST_SAMPLES, largestDifference);
}

void Test3DCubics ()
{
	printf("nTesting 3D Textures / Cubic Curvesnn");

	Test3DCubic<1>();
	Test3DCubic<2>();
	Test3DCubic<3>();
	Test3DCubic<4>();

	system("pause");
}

//===================================================================================================================================
//                                         3D Textures / Cubic Curves Multiple Curves
//===================================================================================================================================
//
// Find the limitations of this pattern and show equivalence to Bernstein Polynomials (Bezier Curve Equations). Pattern details below.
//
// This is the same as 3D Textures / Cubic Curves, but there is a second curve stored by flipping x coordinates.
//
//  --- Other details:
//  
//  * control points: 4 * NumCurves.
//  * image width it 2
//  * image depth is 2
//  * image height is 1 + (NumCurves/2).
//  * equations: 4 * NumCurves.  This many rows in the augmented matrix.
//  * augmented matrix columns = num pixels (left columns) + num control points (right columns)
//

template <size_t HALFNUMCURVES>
void Test3DCubicMulti ()
{
	const size_t NUMCURVES = HALFNUMCURVES * 2;
	const size_t c_imageWidth = 2;
	const size_t c_imageHeight = HALFNUMCURVES + 1;
	const size_t c_imageDepth = 2;
	const size_t c_numPixels = c_imageWidth * c_imageHeight * c_imageDepth;
	const size_t c_numControlPoints = NUMCURVES * 4;
	const size_t c_numEquations = NUMCURVES * 4;

	// report values for this test
	printf("  %zu curves.  %zu control points.  2x%zux2 texture = %zu pixels.n", NUMCURVES, c_numControlPoints, c_imageHeight, c_numPixels);
	printf("  %f pixels per curve.  %f pixels per control point.n", float(c_numPixels) / float(NUMCURVES), float(c_numPixels) / float(c_numControlPoints));

	// lambdas to convert between pixel index and texture coordinates
	auto TextureCoordinateToPixelIndex = [&] (size_t z, size_t y, size_t x) -> size_t
	{
		return TextureCoordinateToPixelIndex3d(c_imageWidth, c_imageHeight, c_imageDepth, z, y, x);
	};
	auto pixelIndexToCoordinates = [&] (size_t pixelIndex, char pixelCoords[10])
	{
		size_t z, y, x;
		PixelIndexToTextureCoordinate3d(c_imageWidth, c_imageHeight, c_imageDepth, pixelIndex, z, y, x);
		sprintf(pixelCoords, "%zu%zu%zu", z,y,x);
	};

	// create the first set of equations
	TMatrix<c_numEquations, c_numPixels + c_numControlPoints> augmentedMatrix;
	for (size_t i = 0; i < c_numEquations / 2; ++i)
	{
		TVector<c_numPixels + c_numControlPoints>& row = augmentedMatrix[i];

		// left side of the equation goes in this zyx coordinate pattern:
		//   000 
		//   001 010 100 
		//   011 101 110
		//   111
		// But, curve index is added to the y index.
		// Also, left side coefficients must add up to 1.
		size_t curveIndex = i / 4;
		switch (i % 4)
		{
			case 0:
			{
				row[TextureCoordinateToPixelIndex(0, curveIndex + 0, 0)] = CRationalNumber(1, 1);
				break;
			}
			case 1:
			{
				row[TextureCoordinateToPixelIndex(0, curveIndex + 0, 1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(0, curveIndex + 1, 0)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(1, curveIndex + 0, 0)] = CRationalNumber(1, 3);
				break;
			}
			case 2:
			{
				row[TextureCoordinateToPixelIndex(0, curveIndex + 1, 1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(1, curveIndex + 0, 1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(1, curveIndex + 1, 0)] = CRationalNumber(1, 3);
				break;
			}
			case 3:
			{
				row[TextureCoordinateToPixelIndex(1, curveIndex + 1, 1)] = CRationalNumber(1, 1);
				break;
			}
		}

		// right side of the equation is identity
		row[c_numPixels + i] = CRationalNumber(1);
	}

	// create the second set of equations
	for (size_t i = 0; i < c_numEquations / 2; ++i)
	{
		TVector<c_numPixels + c_numControlPoints>& row = augmentedMatrix[i + c_numEquations / 2];

		// left side of the equation goes in this zyx coordinate pattern, which is the same as above but x axis flipped.
		//   001
		//   000 011 101 
		//   010 100 111
		//   110
		// But, curve index is added to the y index.
		// Also, left side coefficients must add up to 1.
		size_t curveIndex = i / 4;
		switch (i % 4)
		{
			case 0:
			{
				row[TextureCoordinateToPixelIndex(0, curveIndex + 0, 1)] = CRationalNumber(1, 1);
				break;
			}
			case 1:
			{
				row[TextureCoordinateToPixelIndex(0, curveIndex + 0, 0)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(0, curveIndex + 1, 1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(1, curveIndex + 0, 1)] = CRationalNumber(1, 3);
				break;
			}
			case 2:
			{
				row[TextureCoordinateToPixelIndex(0, curveIndex + 1, 0)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(1, curveIndex + 0, 0)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(1, curveIndex + 1, 1)] = CRationalNumber(1, 3);
				break;
			}
			case 3:
			{
				row[TextureCoordinateToPixelIndex(1, curveIndex + 1, 0)] = CRationalNumber(1, 1);
				break;
			}
		}

		// right side of the equation is identity
		row[c_numPixels + i + c_numEquations / 2] = CRationalNumber(1);
	}

	// solve the matrix if possible and print out the equations
	std::unordered_set<size_t> freeVariables;
	SolveMatrixAndPrintEquations(augmentedMatrix, c_numPixels, freeVariables, pixelIndexToCoordinates);
}

void Test3DCubicsMulti ()
{
	printf("nTesting 3D Textures / Cubic Curves with Multiple Curvesnn");

	Test3DCubicMulti<1>();

	system("pause");
}

//===================================================================================================================================
//                                       3D Textures / Cubic Curves With C0 Continuity
//===================================================================================================================================
//
// Find the limitations of this pattern and show equivalence to Bernstein Polynomials (Bezier Curve Equations). Pattern details below.
//
//  --- For first curve, do:
//
//  P000 P001    P100 P101
//  P010 P011    P110 P111
//
//  P000 = C0                       
//  P001 + P010 + P100 = 3 * C1     
//  P011 + P101 + P110 = 3 * C2     
//  P111 = C3                       
//
//  --- For second curve, do:
//
//  P000 P001    P100 P101
//  P010 P011    P110 P111
//  P020 P021    P120 P121
//
//  P000 = C0                       
//  P001 + P010 + P100 = 3 * C1     
//  P011 + P101 + P110 = 3 * C2     
//  P111 = C3                       
//                       
//  P011 + P110 + P121 = 3 * C4     
//  P010 + P021 + P110 = 3 * C5     
//  P020 = C6     
//
//  --- For third curve, do:
//
//  P000 P001    P100 P101
//  P010 P011    P110 P111
//  P020 P021    P120 P121
//  P030 P031    P130 P131
//
//  P000 = C0                       
//  P001 + P010 + P100 = 3 * C1     
//  P011 + P101 + P110 = 3 * C2     
//  P111 = C3                       
//                       
//  P011 + P110 + P121 = 3 * C4     
//  P010 + P021 + P110 = 3 * C5     
//  P020 = C6     
//
//  P021 + P030 + P120 = 3 * C7     
//  P031 + P121 + P130 = 3 * C8     
//  P131 = C9   
//
//  and so on...
//  each equation is then multiplied by a value so the right side is identity and left side coefficients add up to 1.
//
//  --- Other details:
//  
//  * control points: 1 + 3 * NumCurves.
//  * image width it 2
//  * image depth is 2
//  * image height is 1 + NumCurves.
//  * equations: 1 + 3 * NumCurves.  This many rows in the augmented matrix.
//  * augmented matrix columns = num pixels (left columns) + num control points (right columns)
//

template <size_t N>
float EvaluateBernsteinPolynomial3DCubicC0 (float totalTime, const std::array<float, N>& coefficients)
{
	const size_t c_numCurves = (N-1) / 3;

	float t;
	size_t startCurve;
	PiecewiseCurveTime(totalTime, c_numCurves, startCurve, t);

	size_t offset = startCurve * 3;

	float s = 1.0f - t;
	return
		coefficients[offset + 0] * s * s * s +
		coefficients[offset + 1] * s * s * t * 3.0f +
		coefficients[offset + 2] * s * t * t * 3.0f +
		coefficients[offset + 3] * t * t * t;
}

template <size_t N, typename LAMBDA>
float EvaluateLinearInterpolation3DCubicC0 (float totalTime, const std::array<float, N>& pixels, LAMBDA& TextureCoordinateToPixelIndex)
{
	const size_t c_numCurves = (N / 4) - 1;

	float t;
	size_t startRow;
	PiecewiseCurveTime(totalTime, c_numCurves, startRow, t);

	// Note we flip x and z axis direction every odd row to get the zig zag

	//    rowZYX
	float xzT = (startRow % 2) == 0 ? t : 1.0f - t;
	float row00x = lerp(xzT, pixels[TextureCoordinateToPixelIndex(0, startRow + 0, 0)], pixels[TextureCoordinateToPixelIndex(0, startRow + 0, 1)]);
	float row01x = lerp(xzT, pixels[TextureCoordinateToPixelIndex(0, startRow + 1, 0)], pixels[TextureCoordinateToPixelIndex(0, startRow + 1, 1)]);
	float row0yx = lerp(t, row00x, row01x);

	float row10x = lerp(xzT, pixels[TextureCoordinateToPixelIndex(1, startRow + 0, 0)], pixels[TextureCoordinateToPixelIndex(1, startRow + 0, 1)]);
	float row11x = lerp(xzT, pixels[TextureCoordinateToPixelIndex(1, startRow + 1, 0)], pixels[TextureCoordinateToPixelIndex(1, startRow + 1, 1)]);
	float row1yx = lerp(t, row10x, row11x);

	return lerp(xzT, row0yx, row1yx);
}

template <size_t NUMCURVES>
void Test3DCubicC0 ()
{

	const size_t c_imageWidth = 2;
	const size_t c_imageHeight = NUMCURVES + 1;
	const size_t c_imageDepth = 2;
	const size_t c_numPixels = c_imageWidth * c_imageHeight * c_imageDepth;
	const size_t c_numControlPoints = 1 + NUMCURVES * 3;
	const size_t c_numEquations = 1 + NUMCURVES * 3;

	// report values for this test
	printf("  %zu curves.  %zu control points.  2x%zux2 texture = %zu pixels.n", NUMCURVES, c_numControlPoints, c_imageHeight, c_numPixels);
	printf("  %f pixels per curve.  %f pixels per control point.n", float(c_numPixels) / float(NUMCURVES), float(c_numPixels) / float(c_numControlPoints));

	// lambdas to convert between pixel index and texture coordinates
	auto TextureCoordinateToPixelIndex = [&] (size_t z, size_t y, size_t x) -> size_t
	{
		return TextureCoordinateToPixelIndex3d(c_imageWidth, c_imageHeight, c_imageDepth, z, y, x);
	};
	auto pixelIndexToCoordinates = [&] (size_t pixelIndex, char pixelCoords[10])
	{
		size_t z, y, x;
		PixelIndexToTextureCoordinate3d(c_imageWidth, c_imageHeight, c_imageDepth, pixelIndex, z, y, x);
		sprintf(pixelCoords, "%zu%zu%zu", z,y,x);
	};

	// create the equations
	TMatrix<c_numEquations, c_numPixels + c_numControlPoints> augmentedMatrix;
	for (size_t i = 0; i < c_numEquations; ++i)
	{
		TVector<c_numPixels + c_numControlPoints>& row = augmentedMatrix[i];

		// left side of the equation has a pattern like this:
		//   000
		//   001 010 100
		//   011 101 110
		//
		// But, pattern index is added to the y index.
		// Also, the x and z coordinates flip from 0 to 1 on those after each pattern.
		// Also, left side coefficients must add up to 1.
		size_t patternIndex = i / 3;
		size_t xz0 = patternIndex % 2 == 1;
		size_t xz1 = patternIndex % 2 == 0;
		switch (i % 3)
		{
			case 0:
			{
				row[TextureCoordinateToPixelIndex(xz0, patternIndex + 0, xz0)] = CRationalNumber(1, 1);
				break;
			}
			case 1:
			{
				row[TextureCoordinateToPixelIndex(xz0, patternIndex + 0, xz1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(xz0, patternIndex + 1, xz0)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(xz1, patternIndex + 0, xz0)] = CRationalNumber(1, 3);
				break;
			}
			case 2:
			{
				row[TextureCoordinateToPixelIndex(xz0, patternIndex + 1, xz1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(xz1, patternIndex + 0, xz1)] = CRationalNumber(1, 3);
				row[TextureCoordinateToPixelIndex(xz1, patternIndex + 1, xz0)] = CRationalNumber(1, 3);
				break;
			}
		}

		// right side of the equation is identity
		row[c_numPixels + i] = CRationalNumber(1);
	}

	// solve the matrix if possible and print out the equations
	std::unordered_set<size_t> freeVariables;
	if (!SolveMatrixAndPrintEquations(augmentedMatrix, c_numPixels, freeVariables, pixelIndexToCoordinates))
		return;

	// Next we need to show equality between the N-linear interpolation of our pixels and bernstein polynomials with our control points as coefficients

	// Fill in random values for our control points and free variable pixels, and fill in the other pixels as the equations dictate 
	std::array<float, c_numPixels> pixels = { 0 };
	std::array<float, c_numControlPoints> controlPoints = { 0 };
	FillInPixelsAndControlPoints<c_numPixels, c_numControlPoints, c_numEquations>(pixels, controlPoints, augmentedMatrix, freeVariables);

	// do a number of samples of each method at the same time values, and report the largest difference (error)
	float largestDifference = 0.0f;
	for (size_t i = 0; i < EQUALITY_TEST_SAMPLES; ++i)
	{
		float t = float(i) / float(EQUALITY_TEST_SAMPLES - 1);

		float value1 = EvaluateBernsteinPolynomial3DCubicC0(t, controlPoints);
		float value2 = EvaluateLinearInterpolation3DCubicC0(t, pixels, TextureCoordinateToPixelIndex);

		largestDifference = std::max(largestDifference, std::abs(value1 - value2));
	}
	printf("  %i Samples, Largest Error = %fnn", EQUALITY_TEST_SAMPLES, largestDifference);
}

void Test3DCubicsC0 ()
{

	printf("nTesting 3D Textures / Cubic Curves with C0 continuitynn");

	Test3DCubicC0<1>();
	Test3DCubicC0<2>();
	Test3DCubicC0<3>();
	Test3DCubicC0<4>();
	Test3DCubicC0<5>();
	Test3DCubicC0<6>();

	system("pause");
}

//===================================================================================================================================
//                                                                 main
//===================================================================================================================================

int main (int agrc, char **argv)
{
	Test2DQuadratics();
	Test2DQuadraticsC0();
	Test3DCubics();
	Test3DCubicsMulti();
	Test3DCubicsC0();

	return 0;
}

Solving N equations and N unknowns: The Fine Print (Gauss Jordan Elimination)

In basic algebra we were taught that if we have three unknowns (variables), it takes three equations to solve for them.

There’s some fine print though that isn’t talked about until quite a bit later.

Let’s have a look at three unknowns in two equations:

A + B + C = 2 \\ B = 5

If we just need a third equation to solve this, why not just modify the second equation to make a third?

-B = -5

That obviously doesn’t work, because it doesn’t add any new information! If you try it out, you’ll find that adding that equation doesn’t get you any closer to solving for the variables.

So, it takes three equations to solve for three unknowns, but the three equations have to provide unique, meaningful information. That is the fine print.

How can we know if an equation provides unique, meaningful information though?

It turns out that linear algebra gives us a neat technique for simplifying a system of equations. It actually solves for individual variables if it’s able to, and also gets rid of redundant equations that don’t add any new information.

This simplest form is called the Reduced Row Echelon Form (Wikipedia) which you may also see abbreviated as “rref” (perhaps a bit of a confusing term for programmers) and it involves you putting the equations into a matrix and then performing an algorithm, such as Gauss–Jordan elimination (Wikipedia) to get the rref.

Equations as a Matrix

Putting n set of equations into a matrix is really straight forward.

Each row of a matrix is a separate equation, and each column represents the coefficient of a variable.

Let’s see how with this set of equations:

3x + y = 5\\ 2y = 7\\ y + z = 14

Not every equation has every variable in it, so let’s fix that by putting in zero terms for the missing variables, and let’s make the one terms explicit as well:

3x + 1y + 0z = 5\\ 0x + 2y + 0z = 7\\ 0x + 1y + 1z = 14

Putting those equations into a matrix looks like this:

\left[\begin{array}{rrr} 3 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 1 & 1 \end{array}\right]

If you also include the constants on the right side of the equation, you get what is called an augmented matrix, which looks like this:

\left[\begin{array}{rrr|r} 3 & 1 & 0 & 5 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]

Reduced Row Echelon Form

Wikipedia explains the reduced row echelon form this way:

  • all nonzero rows (rows with at least one nonzero element) are above any rows of all zeroes (all zero rows, if any, belong at the bottom of the matrix), and
  • the leading coefficient (the first nonzero number from the left, also called the pivot) of a nonzero row is always strictly to the right of the leading coefficient of the row above it.
  • Every leading coefficient is 1 and is the only nonzero entry in its column.

This is an example of a 3×5 matrix in reduced row echelon form:
\left[\begin{array}{rrrrr} 1 & 0 & a_1 & 0 & b_1 \\ 0 & 1 & a_2 & 0 & b_2 \\ 0 & 0 & 0 & 1 & b_3 \end{array}\right]

Basically, the lower left triangle of the matrix (the part under the diagonal) needs to be zero, and the first number in each row needs to be one.

Looking back at the augmented matrix we made:

\left[\begin{array}{rrr|r} 3 & 1 & 0 & 5 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]

If we put it into reduced row echelon form, we get this:

\left[\begin{array}{rrr|r} 1 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 3.5 \\ 0 & 0 & 1 & 10.5 \end{array}\right]

There’s something really neat about the reduced row echelon form. If we take the above augmented matrix and turn it back into equations, look what we get:

1x + 0y + 0z = 0.5\\ 0x + 1y + 0z = 3.5\\ 0x + 0y + 1z = 10.5

Or if we simplify that:

x = 0.5\\ y = 3.5\\ z = 10.5

Putting it into reduced row echelon form simplified our set of equations so much that it actually solved for our variables. Neat!

How do we put a matrix into rref? We can use Gauss–Jordan elimination.

Gauss–Jordan Elimination

Gauss Jordan Elimination is a way of doing operations on rows to be able to manipulate the matrix to get it into the desired form.

It’s often explained that there are three row operations you can do:

  • Type 1: Swap the positions of two rows.
  • Type 2: Multiply a row by a nonzero scalar.
  • Type 3: Add to one row a scalar multiple of another.

You might notice that the first two rules are technically just cases of using the third rule. I find that easier to remember, maybe you will too.

The algorithm for getting the rref is actually pretty simple.

  1. Starting with the first column of the matrix, find a row which has a non zero in that column, and make that row be the first row by swapping it with the first row.
  2. Multiply the first row by a value so that the first column has a 1 in it.
  3. Subtract a multiple of the first row from every other row in the matrix so that they have a zero in the first column.

You’ve now handled one column (one variable) so move onto the next.

  1. Continuing on, we consider the second column. Find a row which has a non zero in that column and make that row be the second row by swapping it with the second row.
  2. Multiply the second row by a value so that the second column has a 1 in it.
  3. Subtract a multiple of the second row from every other row in the matrix so that they have a zero in the second column.

You repeat this process until you either run out of rows or columns, at which point you are done.

Note that if you ever find a column that has only zeros in it, you just skip that row.

Let’s work through the example augmented matrix to see how we got it into rref. Starting with this:

\left[\begin{array}{rrr|r} 3 & 1 & 0 & 5 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]

We already have a non zero in the first column, so we multiply the top row by 1/3 to get this:

\left[\begin{array}{rrr|r} 1 & 0.3333 & 0 & 1.6666 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]

All the other rows have a zero in the first column so we move to the second row and the second column. The second row already has a non zero in the second column, so we multiply the second row by 1/2 to get this:

\left[\begin{array}{rrr|r} 1 & 0.3333 & 0 & 1.6666 \\ 0 & 1 & 0 & 3.5 \\ 0 & 1 & 1 & 14 \end{array}\right]

To make sure the second row is the only row that has a non zero in the second column, we subtract the second row times 1/3 from the first row. We also subtract the second row from the third row. That gives us this:

\left[\begin{array}{rrr|r} 1 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 3.5 \\ 0 & 0 & 1 & 10.5 \end{array}\right]

Since the third row has a 1 in the third column, and all other rows have a 0 in that column we are done.

That’s all there is to it! We put the matrix into rref, and we also solved the set of equations. Neat huh?

You may notice that the ultimate rref of a matrix is just the identity matrix. This is true unless the equations can’t be fully solved.

Overdetermined, Underdetermined & Inconsistent Equations

Systems of equations are overdetermined when they have more equations than unknowns, like the below which has three equations and two unknowns:

x + y = 3 \\ x = 1 \\ y = 2 \\

Putting that into (augmented) matrix form gives you this:

\left[\begin{array}{rr|r} 1 & 1 & 3 \\ 1 & 0 & 1 \\ 0 & 1 & 2 \end{array}\right]

If you put that into rref, you end up with this:

\left[\begin{array}{rr|r} 1 & 0 & 1 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{array}\right]

The last row became zeroes, which shows us that there was redundant info in the system of equations that disappeared. We can easily see that x = 1 and y = 2, and that satisfies all three equations.

Just like we talked about in the opening of this post, if you have equations that don’t add useful information beyond what the other equations already give, it will disappear when you put it into rref. That made our over-determined system become just a determined system.

What happens though if we change the third row in the overdetermined system to be something else? For instance, we can say y=10 instead of y=2:

x + y = 3 \\ x = 1 \\ y = 10 \\

The augmented matrix for that is this:

\left[\begin{array}{rr|r} 1 & 1 & 3 \\ 1 & 0 & 1 \\ 0 & 1 & 10 \end{array}\right]

If we put that in rref, we get the identity matrix out which seems like everything is ok:

\left[\begin{array}{rr|r} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right]

However, if we turn it back into a set of equations, we can see that we have a problem:

x = 0 \\ x = 0 \\ 0 = 1 \\

The result says that 0 = 1, which is not true. Having a row of “0 = 1” in rref is how you detect that a system of equations is inconsistent, or in other words, that the equations give contradictory information.

A system of equations can also be underderdetermined, meaning there isn’t enough information to solve the equations. Let’s use the example from the beginning of the post:

A + B + C = 2 \\ B = 5 \\

In an augmented matrix, that looks like this:

\left[\begin{array}{rrr|r} 1 & 1 & 1 & 2 \\ 0 & 1 & 0 & 5 \\ \end{array}\right]

Putting that in rref we get this:

\left[\begin{array}{rrr|r} 1 & 0 & 1 & -3 \\ 0 & 1 & 0 & 5 \\ \end{array}\right]

Converting the matrix back into equations we get this:

A + C = -3 \\ B = 5 \\

This says there isn’t enough information to fully solve the equations, and shows how A and C are related, even though B is completely determined.

Note that another way of looking at this is that “A” and “C” are “free variables”. That means that if your equations specify constraints, that you are free to choose a value for either A or C. If you choose a value for one, the other becomes defined. B is not a free variable because it’s value is determined.

Let’s finish the example from the beginning of the post, showing what happens when we “make up” an equation by transforming one of the equations we already have:

A + B + C = 2 \\ B = 5\\ -B = -5

The augmented matrix looks like this:

\left[\begin{array}{rrr|r} 1 & 1 & 1 & 2 \\ 0 & 1 & 0 & 5 \\ 0 & -1 & 0 & -5 \\ \end{array}\right]

Putting it in rref, we get this:

\left[\begin{array}{rrr|r} 1 & 0 & 1 & -3 \\ 0 & 1 & 0 & 5 \\ 0 & 0 & 0 & 0 \\ \end{array}\right]

Which as you can see, our rref matrix is the same as it was without the extra “made up” equation besides the extra row of zeros in the result.

The number of non zero rows in a matrix in rref is known as the rank of the matrix. In these last two examples, the rank of the matrix was two in both cases. That means that you can tell if adding an equation to a system of equations adds any new, meaningful information or not by seeing if it changes the rank of the matrix for the set of equations. If the rank is the same before and after adding the new equation, it doesn’t add anything new. If the rank does change, that means it does add new information.

This concept of “adding new, meaningful information” actually has a formalized term: linear independence. If a new equation is linearly independent from the other equations in the system, it will change the rank of the rref matrix, else it won’t.

The rank of a matrix for a system of equations just tells you the number of linearly independent equations there actually are, and actually gives you what those equations are in their simplest form.

Lastly I wanted to mention that the idea of a system of equations being inconsistent is completely separate from the idea of a system of equations being under determined or over determined. They can be both over determined and inconsistent, under determined and inconsistent, over determined and consistent or under determined and consistent . The two ideas are completely separate, unrelated things.

Inverting a Matrix

Interestingly, Gauss-Jordan elimination is also a common way for efficiently inverting a matrix!

How you do that is make an augmented matrix where on the left side you have the matrix you want to invert, and on the right side you have the identity matrix.

Let’s invert a matrix I made up pretty much at random:

\left[\begin{array}{rrr|rrr} 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 3 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1\\ \end{array}\right]

Putting that matrix in rref, we get this:

\left[\begin{array}{rrr|rrr} 1 & 0 & 0 & 1 & 0 & -1 \\ 0 & 1 & 0 & 0 & 0.3333 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1\\ \end{array}\right]

The equation on the right is the inverse of the original matrix we had on the left!

You can double check by using an online matrix inverse calculator if you want: Inverse Matrix Calculator

Note that not all matrices are invertible though! When you get an inconsistent result, or the result is not the identity matrix, it wasn’t invertible.

Solving Mx = b

Let’s say that you have two vectors x and b, and a matrix M. Let’s say that we know the matrix M and the vector b, and that we are trying to solve for the vector x.

This comes up more often that you might suspect, including when doing “least squares fitting” of an equation to a set of data points (more info on that: Incremental Least Squares Curve Fitting).

One way to solve this equation would be to calculate the inverse matrix of M and multiply that by vector b to get vector x:

Mx = b\\ x = M^{-1} * b

However, Gauss-Jordan elimination can help us here too.

If we make an augmented matrix where on the left we have M, and on the right we have b, we can put the matrix into rref, which will essentially multiply vector b by the inverse of M, leaving us with the vector x.

For instance, on the left is our matrix M that scales x,y,z by 2. On the right is our vector b, which is the matrix M times our unknown vector x:

\left[\begin{array}{rrr|r} 2 & 0 & 0 & 2 \\ 0 & 2 & 0 & 4 \\ 0 & 0 & 2 & 8 \\ \end{array}\right]

Putting that into rref form we get this:

\left[\begin{array}{rrr|r} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 4 \\ \end{array}\right]

From this, we know that the value of vector x is the right side of the augmented matrix: (1,2,4)

This only works when the matrix is invertible (aka when the rref goes to an identity matrix).

Source Code

Here is some C++ source code which does Gauss-Jordan elimination. It’s written mainly to be readable, not performant!

#include <stdio.h>
#include <array>
#include <vector>
#include <assert.h>

// Define a vector as an array of floats
template<size_t N>
using TVector = std::array<float, N>;

// Define a matrix as an array of vectors
template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

// Helper function to fill out a matrix
template <size_t M, size_t N>
TMatrix<M, N> MakeMatrix (std::initializer_list<std::initializer_list<float>> matrixData)
{
    TMatrix<M, N> matrix;

    size_t m = 0;
	assert(matrixData.size() == M);
    for (const std::initializer_list<float>& rowData : matrixData)
    {
		assert(rowData.size() == N);
        size_t n = 0;
        for (float value : rowData)
        {
            matrix[m][n] = value;
            ++n;
        }
        ++m;
    }
   
    return matrix;
}

// Make a specific row have a 1 in the colIndex, and make all other rows have 0 there
template <size_t M, size_t N>
bool MakeRowClaimVariable (TMatrix<M, N>& matrix, size_t rowIndex, size_t colIndex)
{
	// Find a row that has a non zero value in this column and swap it with this row
	{
		// Find a row that has a non zero value
		size_t nonZeroRowIndex = rowIndex;
		while (nonZeroRowIndex < M && matrix[nonZeroRowIndex][colIndex] == 0.0f)
			++nonZeroRowIndex;

		// If there isn't one, nothing to do
		if (nonZeroRowIndex == M)
			return false;

		// Otherwise, swap the row
		if (rowIndex != nonZeroRowIndex)
			std::swap(matrix[rowIndex], matrix[nonZeroRowIndex]);
	}

	// Scale this row so that it has a leading one
	float scale = 1.0f / matrix[rowIndex][colIndex];
	for (size_t normalizeColIndex = colIndex; normalizeColIndex < N; ++normalizeColIndex)
		matrix[rowIndex][normalizeColIndex] *= scale;

	// Make sure all rows except this one have a zero in this column.
	// Do this by subtracting this row from other rows, multiplied by a multiple that makes the column disappear.
	for (size_t eliminateRowIndex = 0; eliminateRowIndex < M; ++eliminateRowIndex)
	{
		if (eliminateRowIndex == rowIndex)
			continue;

		float scale = matrix[eliminateRowIndex][colIndex];
		for (size_t eliminateColIndex = 0; eliminateColIndex < N; ++eliminateColIndex)
			matrix[eliminateRowIndex][eliminateColIndex] -= matrix[rowIndex][eliminateColIndex] * scale;
	}

	return true;
}

// make matrix into reduced row echelon form
template <size_t M, size_t N>
void GaussJordanElimination (TMatrix<M, N>& matrix)
{
	size_t rowIndex = 0;
	for (size_t colIndex = 0; colIndex < N; ++colIndex)
	{
		if (MakeRowClaimVariable(matrix, rowIndex, colIndex))
		{
			++rowIndex;
			if (rowIndex == M)
				return;
		}
	}
}

int main (int argc, char **argv)
{
	auto matrix = MakeMatrix<3, 4>(
	{
		{ 2.0f, 0.0f, 0.0f, 2.0f },
		{ 0.0f, 2.0f, 0.0f, 4.0f },
        { 0.0f, 0.0f, 2.0f, 8.0f },
	});

    GaussJordanElimination(matrix);

    return 0;
}

I hope you enjoyed this post and/or learned something from it. This is a precursor to an interesting (but maybe obscure) topic for my next blog post, which involves a graphics / gamedev thing.

Any comments, questions or corrections, let me know in the comments below or on twitter at @Atrix256

Orthogonal Projection Matrix Plainly Explained

“Scratch a Pixel” has a really nice explanation of perspective and orthogonal projection matrices.

It inspired me to make a very simple / plain explanation of orthogonal projection matrices that hopefully will help them be less opaque for folks and more intuitive.

Original article: Scratch A Pixel: The Perspective and Orthographic Projection Matrix

Let’s Get To It!

The whole purpose of an orthogonal matrix is to take x,y and z as input and output x,y and z such that valid points on the screen will have x,y,z values between -1 and 1.

If we transform a point and get an x,y or z that is outside of that range, we know the point is outside of the screen either because it’s too far left, right, up or down, or because it’s too close or too far on the z axis.

Let’s think about how we’d do this, thinking only about the x coordinate for now.

To map some range of x values from -1 to 1, we’ll need to decide on what x value maps to -1 and what x value maps to 1. We’ll call these “left” and “right”.

Given a left and right value, and an x value we want to map to the range, perhaps the most straight forward way to do it would be this:

XOut = \frac{X-Left}{Right-Left} * 2 - 1

The division calculates the percentage of how far X is between left and right. Multiplying that by 2 and subtracting 1 changes it so instead of valid points being from 0 to 1 (aka from 0% to 100%), they are instead between -1 and 1.

Let’s change this formula so that there is one term that is multiplied by X and another term that has everything else. (Wondering why? It’s because I’m cheating and know the final form. Don’t feel bad if it isn’t intuitive why we’d do this!)

\frac{X-Left}{Right-Left} * 2 - 1 =\\ \\ \frac{2(X-Left)}{Right-Left} - 1 =\\ \\ \frac{2X-2*Left}{Right-Left} - 1 =\\ \\ \frac{2X-2*Left}{Right-Left} - \frac{Right-Left}{Right-Left} =\\ \\ \frac{2X-2*Left-(Right-Left)}{Right-Left} =\\ \\ \frac{2X-2*Left-Right+Left}{Right-Left} =\\ \\ \frac{2X-Left-Right}{Right-Left} =\\ \\ \frac{2X}{Right-Left} - \frac{Right+Left}{Right-Left} =\\ \\ \frac{2}{Right-Left}X - \frac{Right+Left}{Right-Left}\\

Setting up the formula this way allows us to transform the x component of an (x,y,z,1) point using a dot product:

(x,y,z,1) \cdot (\frac{2}{Right-Left},0,0,-\frac{Right+Left}{Right-Left}) = \frac{2}{Right-Left}X - \frac{Right+Left}{Right-Left}

A dot product is what happens during matrix multiplication, so if we put this into a 4×4 matrix, we get the same result. Let’s check that out.

We start with an identity matrix. If we use it to transform an (x,y,z,1) point, we get the same point as output aka nothing happens.

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

Now let’s put the x transform we came up with into the matrix:

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -\frac{Right+Left}{Right-Left} & 0 & 0 & 1 \\ \end{bmatrix}

If we use that matrix to transform an (x,y,z,1) point, it will transform our x component as we described (valid ranges of x that are between left and right will be between -1 and 1), while leaving the other components of the point alone.

As you might imagine, it’s pretty simple to get our formulas for y and z as well. Starting with the x formula, we can just change x with y and z, and right/left with top/bottom and far/near.

XOut = \frac{2}{Right-Left}X - \frac{Right+Left}{Right-Left} \\ \\ YOut = \frac{2}{Top-Bottom}Y - \frac{Top+Bottom}{Top-Bottom} \\ \\ ZOut = \frac{2}{Far-Near}Z - \frac{Far+Near}{Far-Near}

We can put those into our matrix to get a full orthographic projection matrix.

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & \frac{2}{Far-Near} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}

There we go, that’s all there is to making an orthographic projection matrix. It’s whole purpose is to convert x,y,z values to be between -1 and 1 so that the GPU knows whether points are inside our outside the screen – and thus whether they need to be clipped or not.

Variations

While the projection matrix we made is a valid orthographic projection matrix in OpenGL, we actually need a tweak for it to be valid for DirectX. The reason for this is because while in OpenGL the clip space for z is between -1 and 1, it’s actually between 0 and 1 for DirectX!

If you leave off the *2-1 for the z formula, but leave it for x and y, you’ll end up with a matrix like this one:

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & \frac{1}{Near-Far} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Near}{Near-Far} & 1 \\ \end{bmatrix}

Another variation you’ll see is a version where the camera is centered on the origin for the x and y axis. In other words, left = -right, and top = -bottom. When that is true, right+left and top+bottom become zero which simplifies the matrix to this:

\begin{bmatrix} \frac{2}{Width} & 0 & 0 & 0 \\ 0 & \frac{2}{Height} & 0 & 0 \\ 0 & 0 & \frac{2}{Far-Near} & 0 \\ 0 & 0 & -\frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}

Another variation you’ll see is that the matrix is transposed. You’ll see this when switching between pre and post multiplication, or when switching from column major matrices to row matrices. Either is valid and it’s basically just a notation and convention thing. Here is the origional matrix we made transposed.

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & -\frac{Right+Left}{Right-Left}  \\ 0 & \frac{2}{Top-Bottom} & 0 & -\frac{Top+Bottom}{Top-Bottom} \\ 0 & 0 & \frac{2}{Far-Near} & -\frac{Far+Near}{Far-Near} \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

Lastly, the above matrices were all for a “left handed” system. That means that it assumes the positive x axis goes to the right, the positive y axis goes up, and the positive z goes into your screen (aka, the camera is looking down the positive z axis). Positive Z values will map to the valid -1 to 1 range, while negative z values will be outside the valid range.

A variation on the orthographic projection matrix we made that you’ll see is the matrix being a “right handed” matrix which is the same as the left handed matrix, except that the positive z axis goes out from your screen (aka the camera is looking down the negative z axis). Negative Z values will map to the valid -1 to 1 range, while positive z values will be outside the valid range.

To switch the handedness of the matrix, you just flip the sign of the element at (3,3), so here is our original orthographic projection matrix, but converted to right handed instead of left handed.

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & -\frac{2}{Far-Near} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}

You may also just see the denominator changed from Far-Near to Near-Far which has the same effect, and would give you something like this:

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & \frac{2}{Near-Far} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}

Fun trivia: the term “sinister” comes from latin, meaning “left handed”. So, when talking to someone about their graphics engine, you can ask them whether or not they use sinister projection 😛

Links

Scratch A Pixel: The Perspective and Orthographic Projection Matrix

D3DXMatrixOrthoRH (DirectX) – shows the resulting matrix. Also links to left handed and off center variants.

glOrtho (OpenGL) – shows resulting matrix.

Plastic Bag Ban – Semi Reusable Bag Kiosks a Better Solution?

I an in favor of people generating less trash, and have been amazed that where I live (southern California), people have taken a “plastic bag ban” so well in stride. It felt like one of those things where we couldn’t live without it, but it turns out we can quite easily. Maybe there’s a lesson there, but that’s not the point of this post so I’ll get back to it (;

Where I live, you can BUY plastic bags for use from the grocery store for about 10 cents each, which is cost prohibitive enough that people tend to bring their own bags.

Now let me tell you, I’m no dummy. I am pretty sure those re-usable plastic bags everyone has gotten use far more resources to make and create more pollution in the process so seem like they will take some usage before they hit a break even point compared to the disposable bags. Google seems to give mixed results, saying that some bags are better than others: Google: are reusable bags better for the environment.

I think that the sentiment of what’s going on is good for sure though, and I’m hoping it’s a net positive (?), and I think that there is some winning to be had here environmentally – if done right.

However, let’s put the environmental concerns on the back burner for a second.

Grocery store baggers now get handed all sorts of differently shaped bags of various capacity, and folks often want their stuff bagged specific ways to make sure it all fits in however many they brought. It’s also very common for people to forget their bags at home, leave them in the car, etc. This makes things a bit awkward and definitely not as fast and smooth as it used to be with disposable bags.

My idea to address these issues is this:

  • When you check out from the store, there are semi-reusable bags in bails similar to how disposable bags are in bails now. You pay to use the bags, but the cost is mostly a refundable deposit.
  • You take the bags home, unload, etc like normal, but ideally you do not throw the bags away.
  • The next time you come to the store, you feed your empty bags into a kiosk at the front of the store (similar to the coin to cash machines many currently have) and it prints you out a voucher saying how many bags you returned.
  • The machine bails the bags back up for use by the store employees at the register, or perhaps some off site service does this and washes the bags, gets rid of damaged ones etc.
  • You shop as normal, but when you check out, you give them your bag voucher. If the voucher is for fewer bags than you need for this trip, you can pay for some extra ones. If your voucher is for more than you need this time, you get the deposit refunded on the ones you don’t need.

The idea here is that at the end of the day…

  • The baggers have a uniform type of bag to work with, which makes their job easier and allows them to bag more quickly (like the old days!).
  • The bags are ideally as environmentally friendly made as possible (heck, they could be made from corn, hemp, burlap, whatever).
  • It isn’t a big deal if you forgot your bags at home or in the car – you pay a little extra to use new bags, but when you return them, you get most of that back.

There are some obvious issues to work through, including:

  • Getting those kiosks into the stores, and ideally having all stores use compatible bags.
  • Dealing with damaged or dirty bags. It would be nice if the machine or whatever process bails them is able to detect and segregate them somehow, and ideally have some washing / minimal recycling process off site to make new ones (like pulp and re-form?)
  • The plastic bags are a profit center for grocery stores now. They would need sufficient incentive, or pressure from customers to make it happen.

So, that’s my idea. Environmentally friendly semi-reusable bags – but with the convenience we all have come to enjoy from disposable bags. The best of both worlds.

As a video game programmer this is far outside of my interest and ability, so please take this business idea if you have the desire and the means. Let’s make it happen!

Neural Network Recipe: Recognize Handwritten Digits With 95% Accuracy

This post is a recipe for making a neural network which is able to recognize hand written numeric digits (0-9) with 95% accuracy.

The intent is that you can use this recipe (and included simple C++ code, and interactive web demo!) as a starting point for some hands on experimentation.

A recent post of mine talks about all the things used in this recipe so give it a read if you want more info about anything: How to Train Neural Networks With Backpropagation.

This recipe is also taken straight from this amazing website (but coded from scratch in C++ by myself), where it’s implemented in python: Using neural nets to recognize handwritten digits.

Recipe

The neural network takes as input 28×28 greyscale images, so there will be 784 input neurons.

There is one hidden layer that has 30 neurons.

The final layer is the output layer which has 10 neurons.

The output neuron with the highest activation is the digit that was recognized. For instance if output neuron 0 had the highest activation, the network detected a 0. If output neuron 2 was highest, the network detected a 2.

The neurons use the sigmoid activation function, and the cost function used is half mean squared error.

Training uses a learning rate of 3.0 and the training data is processed by the network 30 times (aka 30 training epochs), using a minibatch size of 10.

A minibatch size of 10 just means that we calculate the gradient for 10 training samples at a time and adjust the weights and biases using that gradient. We do that for the entire (shuffled) 60,000 training items and call that a single epoch. 30 epochs mean we do this full process 30 times.

There are 60,000 items in the training data, mapping 28×28 greyscale images to what digit 0-9 they actually represent.

Besides the 60,000 training data items, there are also 10,000 separate items that are the test data. These test data items are items never seen by the network during training and are just used as a way to see how well the network has learned about the problem in general, versus learning about the specific training data items.

The test and training data is the MNIST data set. I have a link to zip file I made with the data in it below, but this is where I got the data from: The MNIST database of handwritten digits.

That is the entire recipe!

Results

The 30 training epochs took 1 minute 22 seconds on my machine in release x64 (with REPORT_ERROR_WHILE_TRAINING() set to 0 to speed things up), but the code could be made to run faster by using SIMD, putting it on the GPU, getting multithreading involved or other things.

Below is a graph of the accuracy during the training epochs.

Notice that most learning happened very early on and then only slowly improved from there. This is due to our neuron activation functions and also our cost function. There are better choices for both, but this is also an ongoing area of research to improve in neural networks.

The end result of my training run is 95.32% accuracy but you may get slightly higher or lower due to random initialization of weights and biases. That sounds pretty high, but if you were actually using this, 4 or 5 numbers wrong out of 100 is a pretty big deal! The record for MNIST is 99.77% accuracy using “a committee of convolutional networks” where they distorted the input data during training to make it learn in a more generalized way (described as “committee of 35 conv. net, 1-20-P-40-P-150-10 [elastic distortions]”).

A better cost function would probably be the cross entropy cost function, a better activation function than sigmoid would probably be an ELU (Exponential Linear Unit). A soft max layer could be used instead of just taking the maximum output neuron as the answer. The weights could be initialized to smarter values. We could also use a convolutional layer to help let the network learn features in a way that didn’t also tie the features to specific locations in the images.

Many of these things are described in good detail at http://neuralnetworksanddeeplearning.com/, particularly in chapter 3 where they make a python implementation of a convolutional neural network which performs better than this one. I highly recommend checking that website out!

HTML5 Demo

You can play with a network created with this recipe here: Recognize Handwritten Digit 95% Accuracy

Here is an example of it correctly detecting that I drew a 4.

The demo works “pretty well” but it does have a little less than 95% accuracy.

The reason for this though is that it isn’t comparing apples to apples.

A handwritten digit isn’t quite the same as a digit drawn with a mouse. Check out the image below to see 100 of the training images and see what i mean.

The demo finds the bounding box of the drawn image and rescales that bounding box to a 20×20 image, preserving the aspect ratio. It then puts that into a 28×28 image, using the center of mass of the pixels to center the smaller image in the larger one. This is how the MNIST data was generated, so makes the demo more accurate, but it also has the nice side effect of making it so you can draw a number of any size, in any part of the box, and it will treat it the same as if you drew it at a difference size, or in a different part of the box.

The code that goes with this post outputs the weights, biases and network structure in a json format that is very easy to drop into the html5 demo. This way, if you want to try different things in the network, it should be fairly low effort to adjust the demo to try your adjustments there as well.

Lastly, it might be interesting to get the derivatives of the inputs and play around with the input you gave it. Some experiments I can think of:

  1. When it misclassifies what number you drew, have it adjust what you drew (the input) to be more like what the network would expect to see for that digit. This could help show why it misclassified your number.
  2. Start with a well classified number and make it morph into something recognized by the network as a different number.
  3. Start with a random static (noise) image and adjust it until the network recognizes it as a digit. It would be interesting to see if it looked anything like a number, or if it was still just static.

Source Code

The source code and mnist data is on github at MNIST1, but is also included below for your convenience.

If grabbing the source code from below instead of github, you will need to extract this zip file into the working directory of the program as well. It contains the test data used for training the network.
mnist.zip

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <random>
#include <array>
#include <vector>
#include <algorithm>
#include <chrono>

typedef uint32_t uint32;
typedef uint16_t uint16;
typedef uint8_t uint8;

// Set to 1 to have it show error after each training and also writes it to an Error.csv file.
// Slows down the process a bit (+~50% time on my machine)
#define REPORT_ERROR_WHILE_TRAINING() 1 

const size_t c_numInputNeurons = 784;
const size_t c_numHiddenNeurons = 30;  // NOTE: setting this to 100 hidden neurons can give better results, but also can be worse other times.
const size_t c_numOutputNeurons = 10;

const size_t c_trainingEpochs = 30;
const size_t c_miniBatchSize = 10;
const float c_learningRate = 3.0f;

// ============================================================================================
//                                     SBlockTimer
// ============================================================================================
// times a block of code
struct SBlockTimer
{
	SBlockTimer (const char* label)
	{
		m_start = std::chrono::high_resolution_clock::now();
		m_label = label;
	}

	~SBlockTimer ()
	{
		std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
		printf("%s%0.2f secondsn", m_label, seconds.count());
	}

	std::chrono::high_resolution_clock::time_point m_start;
	const char* m_label;
};

// ============================================================================================
//                                    MNIST DATA LOADER
// ============================================================================================

inline uint32 EndianSwap (uint32 a)
{
	return (a<<24) | ((a<<8) & 0x00ff0000) |
           ((a>>8) & 0x0000ff00) | (a>>24);
}

// MNIST data and file format description is from http://yann.lecun.com/exdb/mnist/
class CMNISTData
{
public:
	CMNISTData ()
	{
		m_labelData = nullptr;
		m_imageData = nullptr;

		m_imageCount = 0;
		m_labels = nullptr;
		m_pixels = nullptr;
	}

	bool Load (bool training)
	{
		// set the expected image count
		m_imageCount = training ? 60000 : 10000;

		// read labels
		const char* labelsFileName = training ? "train-labels.idx1-ubyte" : "t10k-labels.idx1-ubyte";
		FILE* file = fopen(labelsFileName,"rb");
		if (!file)
		{
			printf("could not open %s for reading.n", labelsFileName);
			return false;
		}
		fseek(file, 0, SEEK_END);
		long fileSize = ftell(file);
		fseek(file, 0, SEEK_SET);
		m_labelData = new uint8[fileSize];
		fread(m_labelData, fileSize, 1, file);
		fclose(file);

		// read images
		const char* imagesFileName = training ? "train-images.idx3-ubyte" : "t10k-images.idx3-ubyte";
		file = fopen(imagesFileName, "rb");
		if (!file)
		{
			printf("could not open %s for reading.n", imagesFileName);
			return false;
		}
		fseek(file, 0, SEEK_END);
		fileSize = ftell(file);
		fseek(file, 0, SEEK_SET);
		m_imageData = new uint8[fileSize];
		fread(m_imageData, fileSize, 1, file);
		fclose(file);

		// endian swap label file if needed, just first two uint32's.  The rest is uint8's.
		uint32* data = (uint32*)m_labelData;
		if (data[0] == 0x01080000)
		{
			data[0] = EndianSwap(data[0]);
			data[1] = EndianSwap(data[1]);
		}

		// verify that the label file has the right header
		if (data[0] != 2049 || data[1] != m_imageCount)
		{
			printf("Label data had unexpected header values.n");
			return false;
		}
		m_labels = (uint8*)&(data[2]);

		// endian swap the image file if needed, just first 4 uint32's. The rest is uint8's.
		data = (uint32*)m_imageData;
		if (data[0] == 0x03080000)
		{
			data[0] = EndianSwap(data[0]);
			data[1] = EndianSwap(data[1]);
			data[2] = EndianSwap(data[2]);
			data[3] = EndianSwap(data[3]);
		}

		// verify that the image file has the right header
		if (data[0] != 2051 || data[1] != m_imageCount || data[2] != 28 || data[3] != 28)
		{
			printf("Label data had unexpected header values.n");
			return false;
		}
		m_pixels = (uint8*)&(data[4]);

		// convert the pixels from uint8 to float
		m_pixelsFloat.resize(28 * 28 * m_imageCount);
		for (size_t i = 0; i < 28 * 28 * m_imageCount; ++i)
			m_pixelsFloat[i] = float(m_pixels[i]) / 255.0f;

		// success!
		return true;
	}

	~CMNISTData ()
	{
		delete[] m_labelData;
		delete[] m_imageData;
	}

	size_t NumImages () const { return m_imageCount; }

	const float* GetImage (size_t index, uint8& label) const
	{
		label = m_labels[index];
		return &m_pixelsFloat[index * 28 * 28];
	}

private:
	void* m_labelData;
	void* m_imageData;

	size_t m_imageCount;
	uint8* m_labels;
	uint8* m_pixels;

	std::vector<float> m_pixelsFloat;
};

// ============================================================================================
//                                    NEURAL NETWORK
// ============================================================================================

template <size_t INPUTS, size_t HIDDEN_NEURONS, size_t OUTPUT_NEURONS>
class CNeuralNetwork
{
public:
	CNeuralNetwork ()
	{
		// initialize weights and biases to a gaussian distribution random number with mean 0, stddev 1.0
		std::random_device rd;
		std::mt19937 e2(rd());
		std::normal_distribution<float> dist(0, 1);

		for (float& f : m_hiddenLayerBiases)
			f = dist(e2);

		for (float& f : m_outputLayerBiases)
			f = dist(e2);

		for (float& f : m_hiddenLayerWeights)
			f = dist(e2);

		for (float& f : m_outputLayerWeights)
			f = dist(e2);
	}

	void Train (const CMNISTData& trainingData, size_t miniBatchSize, float learningRate)
	{
		// shuffle the order of the training data for our mini batches
		if (m_trainingOrder.size() != trainingData.NumImages())
		{
			m_trainingOrder.resize(trainingData.NumImages());
			size_t index = 0;
			for (size_t& v : m_trainingOrder)
			{
				v = index;
				++index;
			}
		}
		static std::random_device rd;
		static std::mt19937 e2(rd());
		std::shuffle(m_trainingOrder.begin(), m_trainingOrder.end(), e2);

		// process all minibatches until we are out of training examples
		size_t trainingIndex = 0;
		while (trainingIndex < trainingData.NumImages())
		{
			// Clear out minibatch derivatives.  We sum them up and then divide at the end of the minimatch
			std::fill(m_miniBatchHiddenLayerBiasesDeltaCost.begin(), m_miniBatchHiddenLayerBiasesDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchOutputLayerBiasesDeltaCost.begin(), m_miniBatchOutputLayerBiasesDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchHiddenLayerWeightsDeltaCost.begin(), m_miniBatchHiddenLayerWeightsDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchOutputLayerWeightsDeltaCost.begin(), m_miniBatchOutputLayerWeightsDeltaCost.end(), 0.0f);

			// process the minibatch
			size_t miniBatchIndex = 0;
			while (miniBatchIndex < miniBatchSize && trainingIndex < trainingData.NumImages())
			{
				// get the training item
				uint8 imageLabel = 0;
				const float* pixels = trainingData.GetImage(m_trainingOrder[trainingIndex], imageLabel);

				// run the forward pass of the network
				uint8 labelDetected = ForwardPass(pixels, imageLabel);

				// run the backward pass to get derivatives of the cost function
				BackwardPass(pixels, imageLabel);

				// add the current derivatives into the minibatch derivative arrays so we can average them at the end of the minibatch via division.
				for (size_t i = 0; i < m_hiddenLayerBiasesDeltaCost.size(); ++i)
					m_miniBatchHiddenLayerBiasesDeltaCost[i] += m_hiddenLayerBiasesDeltaCost[i];
				for (size_t i = 0; i < m_outputLayerBiasesDeltaCost.size(); ++i)
					m_miniBatchOutputLayerBiasesDeltaCost[i] += m_outputLayerBiasesDeltaCost[i];
				for (size_t i = 0; i < m_hiddenLayerWeightsDeltaCost.size(); ++i)
					m_miniBatchHiddenLayerWeightsDeltaCost[i] += m_hiddenLayerWeightsDeltaCost[i];
				for (size_t i = 0; i < m_outputLayerWeightsDeltaCost.size(); ++i)
					m_miniBatchOutputLayerWeightsDeltaCost[i] += m_outputLayerWeightsDeltaCost[i];

				// note that we've added another item to the minibatch, and that we've consumed another training example
				++trainingIndex;
				++miniBatchIndex;
			}

			// divide minibatch derivatives by how many items were in the minibatch, to get the average of the derivatives.
			// NOTE: instead of doing this explicitly like in the commented code below, we'll do it implicitly
			// by dividing the learning rate by miniBatchIndex.
			/*
			for (float& f : m_miniBatchHiddenLayerBiasesDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchOutputLayerBiasesDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchHiddenLayerWeightsDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchOutputLayerWeightsDeltaCost)
				f /= float(miniBatchIndex);
			*/

			float miniBatchLearningRate = learningRate / float(miniBatchIndex);

			// apply training to biases and weights
			for (size_t i = 0; i < m_hiddenLayerBiases.size(); ++i)
				m_hiddenLayerBiases[i] -= m_miniBatchHiddenLayerBiasesDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_outputLayerBiases.size(); ++i)
				m_outputLayerBiases[i] -= m_miniBatchOutputLayerBiasesDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_hiddenLayerWeights.size(); ++i)
				m_hiddenLayerWeights[i] -= m_miniBatchHiddenLayerWeightsDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_outputLayerWeights.size(); ++i)
				m_outputLayerWeights[i] -= m_miniBatchOutputLayerWeightsDeltaCost[i] * miniBatchLearningRate;
		}
	}

	// This function evaluates the network for the given input pixels and returns the label it thinks it is from 0-9
	uint8 ForwardPass (const float* pixels, uint8 correctLabel)
	{
		// first do hidden layer
		for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
		{
			float Z = m_hiddenLayerBiases[neuronIndex];

			for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
				Z += pixels[inputIndex] * m_hiddenLayerWeights[HiddenLayerWeightIndex(inputIndex, neuronIndex)];

			m_hiddenLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
		}

		// then do output layer
		for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			float Z = m_outputLayerBiases[neuronIndex];

			for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
				Z += m_hiddenLayerOutputs[inputIndex] * m_outputLayerWeights[OutputLayerWeightIndex(inputIndex, neuronIndex)];

			m_outputLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
		}

		// calculate error.
		// this is the magnitude of the vector that is Desired - Actual.
		// Commenting out because it's not needed.
		/*
		{
			error = 0.0f;
			for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
			{
				float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;
				float diff = (desiredOutput - m_outputLayerOutputs[neuronIndex]);
				error += diff * diff;
			}
			error = std::sqrt(error);
		}
		*/

		// find the maximum value of the output layer and return that index as the label
		float maxOutput = m_outputLayerOutputs[0];
		uint8 maxLabel = 0;
		for (uint8 neuronIndex = 1; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			if (m_outputLayerOutputs[neuronIndex] > maxOutput)
			{
				maxOutput = m_outputLayerOutputs[neuronIndex];
				maxLabel = neuronIndex;
			}
		}
		return maxLabel;
	}

	// Functions to get weights/bias values. Used to make the JSON file.
	const std::array<float, HIDDEN_NEURONS>& GetHiddenLayerBiases () const { return m_hiddenLayerBiases; }
	const std::array<float, OUTPUT_NEURONS>& GetOutputLayerBiases () const { return m_outputLayerBiases; }
	const std::array<float, INPUTS * HIDDEN_NEURONS>& GetHiddenLayerWeights () const { return m_hiddenLayerWeights; }
	const std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>& GetOutputLayerWeights () const { return m_outputLayerWeights; }

private:

	static size_t HiddenLayerWeightIndex (size_t inputIndex, size_t hiddenLayerNeuronIndex)
	{
		return hiddenLayerNeuronIndex * INPUTS + inputIndex;
	}

	static size_t OutputLayerWeightIndex (size_t hiddenLayerNeuronIndex, size_t outputLayerNeuronIndex)
	{
		return outputLayerNeuronIndex * HIDDEN_NEURONS + hiddenLayerNeuronIndex;
	}

	// this function uses the neuron output values from the forward pass to backpropagate the error
	// of the network to calculate the gradient needed for training.  It figures out what the error
	// is by comparing the label it came up with to the label it should have come up with (correctLabel).
	void BackwardPass (const float* pixels, uint8 correctLabel)
	{
		// since we are going backwards, do the output layer first
		for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			// calculate deltaCost/deltaBias for each output neuron.
			// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
			//
			// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
			//
			// deltaCost/deltaO = O - desiredOutput
			// deltaO/deltaZ = O * (1 - O)
			//
			float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;

			float deltaCost_deltaO = m_outputLayerOutputs[neuronIndex] - desiredOutput;
			float deltaO_deltaZ = m_outputLayerOutputs[neuronIndex] * (1.0f - m_outputLayerOutputs[neuronIndex]);

			m_outputLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

			// calculate deltaCost/deltaWeight for each weight going into the neuron
			//
			// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
			// deltaCost/deltaWeight = deltaCost/deltaBias * input
			//
			for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
				m_outputLayerWeightsDeltaCost[OutputLayerWeightIndex(inputIndex, neuronIndex)] = m_outputLayerBiasesDeltaCost[neuronIndex] * m_hiddenLayerOutputs[inputIndex];
		}

		// then do the hidden layer
		for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
		{
			// calculate deltaCost/deltaBias for each hidden neuron.
			// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
			//
			// deltaCost/deltaO =
			//   Sum for each output of this neuron:
			//     deltaCost/deltaDestinationZ * deltaDestinationZ/deltaSourceO
			//
			// deltaCost/deltaDestinationZ is already calculated and lives in m_outputLayerBiasesDeltaCost[destinationNeuronIndex].
			// deltaTargetZ/deltaSourceO is the value of the weight connecting the source and target neuron.
			//
			// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
			// deltaO/deltaZ = O * (1 - O)
			//
			float deltaCost_deltaO = 0.0f;
			for (size_t destinationNeuronIndex = 0; destinationNeuronIndex < OUTPUT_NEURONS; ++destinationNeuronIndex)
				deltaCost_deltaO += m_outputLayerBiasesDeltaCost[destinationNeuronIndex] * m_outputLayerWeights[OutputLayerWeightIndex(neuronIndex, destinationNeuronIndex)];
			float deltaO_deltaZ = m_hiddenLayerOutputs[neuronIndex] * (1.0f - m_hiddenLayerOutputs[neuronIndex]);
			m_hiddenLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

			// calculate deltaCost/deltaWeight for each weight going into the neuron
			//
			// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
			// deltaCost/deltaWeight = deltaCost/deltaBias * input
			//
			for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
				m_hiddenLayerWeightsDeltaCost[HiddenLayerWeightIndex(inputIndex, neuronIndex)] = m_hiddenLayerBiasesDeltaCost[neuronIndex] * pixels[inputIndex];
		}
	}

private:

	// biases and weights
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiases;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerBiases;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeights;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeights;

	// neuron activation values aka "O" values
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerOutputs;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerOutputs;

	// derivatives of biases and weights for a single training example
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiasesDeltaCost;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerBiasesDeltaCost;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeightsDeltaCost;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeightsDeltaCost;

	// derivatives of biases and weights for the minibatch. Average of all items in minibatch.
	std::array<float, HIDDEN_NEURONS>					m_miniBatchHiddenLayerBiasesDeltaCost;
	std::array<float, OUTPUT_NEURONS>					m_miniBatchOutputLayerBiasesDeltaCost;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_miniBatchHiddenLayerWeightsDeltaCost;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_miniBatchOutputLayerWeightsDeltaCost;

	// used for minibatch generation
	std::vector<size_t>									m_trainingOrder;
};

// ============================================================================================
//                                   DRIVER PROGRAM
// ============================================================================================

// training and test data
CMNISTData g_trainingData;
CMNISTData g_testData;

// neural network
CNeuralNetwork<c_numInputNeurons, c_numHiddenNeurons, c_numOutputNeurons> g_neuralNetwork;

float GetDataAccuracy (const CMNISTData& data)
{
	size_t correctItems = 0;
	for (size_t i = 0, c = data.NumImages(); i < c; ++i)
	{
		uint8 label;
		const float* pixels = data.GetImage(i, label);
		uint8 detectedLabel = g_neuralNetwork.ForwardPass(pixels, label);

		if (detectedLabel == label)
			++correctItems;
	}
	return float(correctItems) / float(data.NumImages());
}

void ShowImage (const CMNISTData& data, size_t imageIndex)
{
	uint8 label = 0;
	const float* pixels = data.GetImage(imageIndex, label);
	printf("showing a %in", label);
	for (int iy = 0; iy < 28; ++iy)
	{
		for (int ix = 0; ix < 28; ++ix)
		{
			if (*pixels < 0.125)
				printf(" ");
			else
				printf("+");
			++pixels;
		}
		printf("n");
	}
}

int main (int argc, char** argv)
{
	// load the MNIST data if we can
	if (!g_trainingData.Load(true) || !g_testData.Load(false))
	{
		printf("Could not load mnist data, aborting!n");
		system("pause");
		return 1;
	}

	#if REPORT_ERROR_WHILE_TRAINING()
	FILE *file = fopen("Error.csv","w+t");
	if (!file)
	{
		printf("Could not open Error.csv for writing, aborting!n");
		system("pause");
		return 2;
	}
	fprintf(file, ""Training Data Accuracy","Testing Data Accuracy"n");
	#endif

	{
		SBlockTimer timer("Training Time:  ");

		// train the network, reporting error before each training
		for (size_t epoch = 0; epoch < c_trainingEpochs; ++epoch)
		{
			#if REPORT_ERROR_WHILE_TRAINING()
				float accuracyTraining = GetDataAccuracy(g_trainingData);
				float accuracyTest = GetDataAccuracy(g_testData);
				printf("Training Data Accuracy: %0.2f%%n", 100.0f*accuracyTraining);
				printf("Test Data Accuracy: %0.2f%%nn", 100.0f*accuracyTest);
				fprintf(file, ""%f","%f"n", accuracyTraining, accuracyTest);
			#endif

			printf("Training epoch %zu / %zu...n", epoch+1, c_trainingEpochs);
			g_neuralNetwork.Train(g_trainingData, c_miniBatchSize, c_learningRate);
			printf("n");
		}
	}

	// report final error
	float accuracyTraining = GetDataAccuracy(g_trainingData);
	float accuracyTest = GetDataAccuracy(g_testData);
	printf("nFinal Training Data Accuracy: %0.2f%%n", 100.0f*accuracyTraining);
	printf("Final Test Data Accuracy: %0.2f%%nn", 100.0f*accuracyTest);

	#if REPORT_ERROR_WHILE_TRAINING()
		fprintf(file, ""%f","%f"n", accuracyTraining, accuracyTest);
		fclose(file);
	#endif

	// Write out the final weights and biases as JSON for use in the web demo
	{
		FILE* file = fopen("WeightsBiasesJSON.txt", "w+t");
		fprintf(file, "{n");

		// network structure
		fprintf(file, "  "InputNeurons":%zu,n", c_numInputNeurons);
		fprintf(file, "  "HiddenNeurons":%zu,n", c_numHiddenNeurons);
		fprintf(file, "  "OutputNeurons":%zu,n", c_numOutputNeurons);

		// HiddenBiases
		auto hiddenBiases = g_neuralNetwork.GetHiddenLayerBiases();
		fprintf(file, "  "HiddenBiases" : [n");
		for (size_t i = 0; i < hiddenBiases.size(); ++i)
		{
			fprintf(file, "    %f", hiddenBiases[i]);
			if (i < hiddenBiases.size() -1)
				fprintf(file, ",");
			fprintf(file, "n");
		}
		fprintf(file, "  ],n");

		// HiddenWeights
		auto hiddenWeights = g_neuralNetwork.GetHiddenLayerWeights();
		fprintf(file, "  "HiddenWeights" : [n");
		for (size_t i = 0; i < hiddenWeights.size(); ++i)
		{
			fprintf(file, "    %f", hiddenWeights[i]);
			if (i < hiddenWeights.size() - 1)
				fprintf(file, ",");
			fprintf(file, "n");
		}
		fprintf(file, "  ],n");

		// OutputBiases
		auto outputBiases = g_neuralNetwork.GetOutputLayerBiases();
		fprintf(file, "  "OutputBiases" : [n");
		for (size_t i = 0; i < outputBiases.size(); ++i)
		{
			fprintf(file, "    %f", outputBiases[i]);
			if (i < outputBiases.size() - 1)
				fprintf(file, ",");
			fprintf(file, "n");
		}
		fprintf(file, "  ],n");

		// OutputWeights
		auto outputWeights = g_neuralNetwork.GetOutputLayerWeights();
		fprintf(file, "  "OutputWeights" : [n");
		for (size_t i = 0; i < outputWeights.size(); ++i)
		{
			fprintf(file, "    %f", outputWeights[i]);
			if (i < outputWeights.size() - 1)
				fprintf(file, ",");
			fprintf(file, "n");
		}
		fprintf(file, "  ]n");

		// done
		fprintf(file, "}n");
		fclose(file);
	}

	// You can use the code like the below to visualize an image if you want to.
	//ShowImage(g_testData, 0);

	system("pause");
	return 0;
}

Thanks for reading, and if you have any questions, comments, or just want to chat, hit me up in the comments below, or on twitter at @Atrix256.

Neural Network Gradients: Backpropagation, Dual Numbers, Finite Differences

In the post How to Train Neural Networks With Backpropagation I said that you could also calculate the gradient of a neural network by using dual numbers or finite differences.

By special request, this is that post!

The post I already linked to explains backpropagation.

If you want an explanation of dual numbers, check out these posts:

  1. Dual Numbers & Automatic Differentiation
  2. Multivariable Dual Numbers & Automatic Differentiation

If you want an explanation of finite differences, check out this post:
Finite Differences

Since the fundamentals are explained in the links above, we’ll go straight to the code.

We’ll be getting the gradient (learning values) for the network in example 4 in the backpropagation post:

Note that I am using “central differences” for the gradient, but it would be more efficient to do a forward or backward difference, at the cost of some accuracy. I’m using an epsilon of 0.001.

I didn’t compare the running times of each method as my code is meant to be readable, not fast, and the code isn’t doing enough work to make a meaningful performance test IMO. If you did do a speed test, the finite differences should be by far the slowest, backpropagation should be the fastest, and dual numbers are probably going to be closer to backpropagation than to finite differences.

The output of the program is below. Both backpropagation and dual numbers get the exact derivatives (within the tolerances of floating point math of course!) because they use the chain rule, whereas finite differences is a numerical approximation. This shows up in the fact that backpropagation and dual numbers agree for all values, where finite differences has some small error in the derivatives.

And here is the code:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>
  
#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

//----------------------------------------------------------------------
// Dual Number Class - CDualNumber
//----------------------------------------------------------------------

template <size_t NUMVARIABLES>
class CDualNumber
{
public:
 
    // constructor to make a constant
    CDualNumber (float f = 0.0f) {
        m_real = f;
        std::fill(m_dual.begin(), m_dual.end(), 0.0f);
    }
 
    // constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
    CDualNumber (float f, size_t variableIndex) {
        m_real = f;
        std::fill(m_dual.begin(), m_dual.end(), 0.0f);
        m_dual[variableIndex] = 1.0f;
    }

	// Set a constant value.
	void Set (float f) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
	}

	// Set a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
	void Set (float f, size_t variableIndex) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
		m_dual[variableIndex] = 1.0f;
	}
 
    // storage for real and dual values
    float                           m_real;
    std::array<float, NUMVARIABLES> m_dual;
};
  
//----------------------------------------------------------------------
// Dual Number Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real + b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
    return ret;
}
  
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real - b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real * b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real / b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
    return ret;
}

// NOTE: the "special functions" below all just use the chain rule, which you can also use to add more functions
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    float sqrtReal = sqrt(a.m_real);
    ret.m_real = sqrtReal;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = pow(a.m_real, y);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
    return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> exp (const CDualNumber<NUMVARIABLES>& a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = exp(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] * exp(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = sin(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = cos(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = tan(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = tan(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
    return ret;
}
 
// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
    return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Driver Program
//----------------------------------------------------------------------

enum EWeightsBiases
{
	e_weight0 = 0,
	e_weight1,
	e_weight2,
	e_weight3,
	e_weight4,
	e_weight5,
	e_weight6,
	e_weight7,
	e_bias0,
	e_bias1,
	e_bias2,
	e_bias3,

	e_count
};

// our dual number needs a "dual" for every value we want a derivative for: aka every weight and bias
typedef CDualNumber<EWeightsBiases::e_count> TDualNumber;

// templated so it can work for both the dual numbers, as well as the float finite differences
template <typename TBaseType>
void ForwardPass (const std::array<TBaseType, 2>& input, const std::array<TBaseType, 2>& desiredOutput, const std::array<TBaseType, EWeightsBiases::e_count>& weightsBiases, TBaseType& cost)
{
	// calculate hidden layer neuron activations
	TBaseType Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
	TBaseType O0 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z0 * TBaseType(-1.0f)));

	TBaseType Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
	TBaseType O1 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z1 * TBaseType(-1.0f)));

	// calculate output layer neuron activations
	TBaseType Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
	TBaseType O2 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z2 * TBaseType(-1.0f)));

	TBaseType Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
	TBaseType O3 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z3 * TBaseType(-1.0f)));

	// calculate the cost: 0.5 * ||target-actual||^2
	// aka cost = half (error squared)
	TBaseType diff1 = TBaseType(desiredOutput[0]) - O2;
	TBaseType diff2 = TBaseType(desiredOutput[1]) - O3;
	cost = TBaseType(0.5f) * (diff1*diff1 + diff2*diff2);
}

// backpropagation
void ForwardPassAndBackpropagation (
    const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, EWeightsBiases::e_count>& weightsBiases,
    float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, EWeightsBiases::e_count>& deltaWeightsBiases
) {
    // calculate Z0 and O0 for neuron0
    float Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
    float O0 = 1.0f / (1.0f + std::exp(-Z0));
 
    // calculate Z1 and O1 for neuron1
    float Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
    float O1 = 1.0f / (1.0f + std::exp(-Z1));
 
    // calculate Z2 and O2 for neuron2
    float Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
    float O2 = 1.0f / (1.0f + std::exp(-Z2));
 
    // calculate Z3 and O3 for neuron3
    float Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
    float O3 = 1.0f / (1.0f + std::exp(-Z3));
 
    // the actual output of the network is the activation of the output layer neurons
    actualOutput[0] = O2;
    actualOutput[1] = O3;
 
    // calculate error
    float diff0 = desiredOutput[0] - actualOutput[0];
    float diff1 = desiredOutput[1] - actualOutput[1];
    error = std::sqrt(diff0*diff0 + diff1*diff1);
 
    // calculate cost
    cost = 0.5f * error * error;
 
    //----- Neuron 2 -----
 
    // calculate how much a change in neuron 2 activation affects the cost function
    // deltaCost/deltaO2 = O2 - target0
    float deltaCost_deltaO2 = O2 - desiredOutput[0];
 
    // calculate how much a change in neuron 2 weighted input affects neuron 2 activation
    // deltaO2/deltaZ2 = O2 * (1 - O2)
    float deltaO2_deltaZ2 = O2 * (1 - O2);
 
    // calculate how much a change in neuron 2 weighted input affects the cost function.
    // This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
    // This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
    float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
	deltaWeightsBiases[e_bias2] = neuron2Error;
 
    // calculate how much a change in weight4 affects the cost function.
    // deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
    // deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
    // deltaCost/deltaWeight4 = neuron2Error * O0
    // similar thing for weight5
	deltaWeightsBiases[e_weight4] = neuron2Error * O0;
	deltaWeightsBiases[e_weight5] = neuron2Error * O1;
 
    //----- Neuron 3 -----
 
    // calculate how much a change in neuron 3 activation affects the cost function
    // deltaCost/deltaO3 = O3 - target1
    float deltaCost_deltaO3 = O3 - desiredOutput[1];
 
    // calculate how much a change in neuron 3 weighted input affects neuron 3 activation
    // deltaO3/deltaZ3 = O3 * (1 - O3)
    float deltaO3_deltaZ3 = O3 * (1 - O3);
 
    // calculate how much a change in neuron 3 weighted input affects the cost function.
    // This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
    // This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
    float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
	deltaWeightsBiases[e_bias3] = neuron3Error;
 
    // calculate how much a change in weight6 affects the cost function.
    // deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
    // deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
    // deltaCost/deltaWeight6 = neuron3Error * O0
    // similar thing for weight7
	deltaWeightsBiases[e_weight6] = neuron3Error * O0;
	deltaWeightsBiases[e_weight7] = neuron3Error * O1;
 
    //----- Neuron 0 -----
 
    // calculate how much a change in neuron 0 activation affects the cost function
    // deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
    // deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
    float deltaCost_deltaO0 = neuron2Error * weightsBiases[e_weight4] + neuron3Error * weightsBiases[e_weight6];
 
    // calculate how much a change in neuron 0 weighted input affects neuron 0 activation
    // deltaO0/deltaZ0 = O0 * (1 - O0)
    float deltaO0_deltaZ0 = O0 * (1 - O0);
 
    // calculate how much a change in neuron 0 weighted input affects the cost function.
    // This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
    // This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
    float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaWeightsBiases[e_bias0] = neuron0Error;
 
    // calculate how much a change in weight0 affects the cost function.
    // deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
    // deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
    // deltaCost/deltaWeight0 = neuron0Error * input0
    // similar thing for weight1
	deltaWeightsBiases[e_weight0] = neuron0Error * input[0];
	deltaWeightsBiases[e_weight1] = neuron0Error * input[1];
 
    //----- Neuron 1 -----
 
    // calculate how much a change in neuron 1 activation affects the cost function
    // deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
    // deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
    float deltaCost_deltaO1 = neuron2Error * weightsBiases[e_weight5] + neuron3Error * weightsBiases[e_weight7];
 
    // calculate how much a change in neuron 1 weighted input affects neuron 1 activation
    // deltaO1/deltaZ1 = O1 * (1 - O1)
    float deltaO1_deltaZ1 = O1 * (1 - O1);
 
    // calculate how much a change in neuron 1 weighted input affects the cost function.
    // This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
    // This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
    float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaWeightsBiases[e_bias1] = neuron1Error;
 
    // calculate how much a change in weight2 affects the cost function.
    // deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
    // deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
    // deltaCost/deltaWeight2 = neuron1Error * input0
    // similar thing for weight3
	deltaWeightsBiases[e_weight2] = neuron1Error * input[0];
	deltaWeightsBiases[e_weight3] = neuron1Error * input[1];
}
 
int main (int argc, char **argv)
{

	// weights and biases, inputs and desired output
	const std::array<float, EWeightsBiases::e_count> weightsBiases =
	{
		0.15f, // e_weight0
		0.2f,  // e_weight1
		0.25f, // e_weight2
		0.3f,  // e_weight3
		0.4f,  // e_weight4
		0.45f, // e_weight5
		0.5f,  // e_weight6
		0.55f, // e_weight7
		0.35f, // e_bias0
		0.35f, // e_bias1
		0.6f,  // e_bias2
		0.6f   // e_bias3
	};

	const std::array<float, 2> inputs = 
	{
		0.05f,
		0.1f
	};

	std::array<float, 2> desiredOutput = {
		0.01f,
		0.99f
	};

	// =====================================================
	// ===== FINITE DIFFERENCES CALCULATED DERIVATIVES =====
	// =====================================================

	std::array<float, EWeightsBiases::e_count> gradientFiniteDifferences;
	std::array<float, EWeightsBiases::e_count> weightsBiasesFloat;
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
		weightsBiasesFloat[i] = weightsBiases[i];

	// use central differences to approximate the gradient
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
	{
		float costSample1 = 0.0f;
		weightsBiasesFloat[i] = weightsBiases[i] - EPSILON;
		ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample1);

		float costSample2 = 0.0f;
		weightsBiasesFloat[i] = weightsBiases[i] + EPSILON;
		ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample2);

		gradientFiniteDifferences[i] = (costSample2 - costSample1) / (EPSILON * 2.0f);

		weightsBiasesFloat[i] = weightsBiases[i];
	}

	// ==============================================
	// ===== DUAL NUMBER CALCULATED DERIVATIVES =====
	// ==============================================

	// dual number weights and biases
	std::array<TDualNumber, EWeightsBiases::e_count> weightsBiasesDual;
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
		weightsBiasesDual[i].Set(weightsBiases[i], i);

	// dual number inputs and desired output
	std::array<TDualNumber, 2> inputsDual;
	for (size_t i = 0; i < inputsDual.size(); ++i)
		inputsDual[i].Set(inputs[i]);

	std::array<TDualNumber, 2> desiredOutputDual;
	for (size_t i = 0; i < desiredOutputDual.size(); ++i)
		desiredOutputDual[i].Set(desiredOutput[i]);

	// dual number derivatives
	TDualNumber gradientDualNumbers;
	ForwardPass(inputsDual, desiredOutputDual, weightsBiasesDual, gradientDualNumbers);

	// ==================================================
	// ===== BACKPROPAGATION CALCULATED DERIVATIVES =====
	// ==================================================

	float error;
	float cost;
	std::array<float, 2> actualOutput;
	std::array<float, EWeightsBiases::e_count> gradientBackPropagation;
	ForwardPassAndBackpropagation(inputs, desiredOutput, weightsBiases, error, cost, actualOutput, gradientBackPropagation);

	// ==========================
	// ===== Report Results =====
	// ==========================

	printf("Neural Network GradientnnBackpropagation     Dual Numbers (Error)       Finite Differences (Error)n");
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
	{
		float diffDual = gradientBackPropagation[i] - gradientDualNumbers.m_dual[i];
		float diffFinitDifferences = gradientBackPropagation[i] - gradientFiniteDifferences[i];
		printf("% 08f,         % 08f (% 08f),     % 08f (% 08f)n",
			gradientBackPropagation[i],
			gradientDualNumbers.m_dual[i], diffDual,
			gradientFiniteDifferences[i], diffFinitDifferences
		);
	}
	printf("n");

	system("pause");
    return 0;
}

How to Train Neural Networks With Backpropagation

This post is an attempt to demystify backpropagation, which is the most common method for training neural networks. This post is broken into a few main sections:

  1. Explanation
  2. Working through examples
  3. Simple sample C++ source code using only standard includes
  4. Links to deeper resources to continue learning

Let’s talk about the basics of neural nets to start out, specifically multi layer perceptrons. This is a common type of neural network, and is the type we will be talking about today. There are other types of neural networks though such as convolutional neural networks, recurrent neural networks, Hopfield networks and more. The good news is that backpropagation applies to most other types of neural networks too, so what you learn here will be applicable to other types of networks.

Basics of Neural Networks

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron. This network takes 3 floating point values as input, passes them through 4 neurons in a hidden layer and outputs two floating point values. The hidden layer neurons and the output layer neurons do processing of the values they are giving, but the input neurons do not.

To calculate the output value of a single neuron, you multiply every input into that neuron by a weight for that input, sum them up, and add a bias that is set for the neuron. This “weighted input” value is fed into an activation function and the result is the output value of that neuron. Here is a diagram for a single neuron:

The code for calculating the output of a single neuron could look like this:

float weightedInput = bias;

for (int i = 0; i < inputs.size(); ++i)
  weightedInput += inputs[i] * weights[i];

float output = Activation(weightedInput);

To evaluate an entire network of neurons, you just repeat this process for all neurons in the network, going from left to right (from input to output).

Neural networks are basically black boxes. We train them to give specific ouputs when we give them specific inputs, but it is often difficult to understand what it is that they’ve learned, or what part of the data they are picking up on.

Training a neural network just means that we adjust the weight and bias values such that when we give specific inputs, we get the desired outputs from the network. Being able to figure out what weights and biases to use can be tricky, especially for networks with lots of layers and lots of neurons per layer. This post talks about how to do just that.

Regarding training, there is a funny story where some people trained a neural network to say whether or not a military tank was in a photograph. It had a very high accuracy rate with the test data they trained it with, but when they used it with new data, it had terrible accuracy. It turns out that the training data was a bit flawed. Pictures of tanks were all taken on a sunny day, and the pictures without tanks were taken on a cloudy day. The network learned how to detect whether a picture was of a sunny day or a cloudy day, not whether there was a tank in the photo or not!

This is one type of pitfall to watch out for when dealing with neural networks – having good training data – but there are many other pitfalls to watch out for too. Architecting and training neural networks is quite literally an art form. If it were painting, this post would be teaching you how to hold a brush and what the primary colors are. There are many, many techniques to learn beyond what is written here to use as tools in your toolbox. The information in this post will allow you to succeed in training neural networks, but there is a lot more to learn to get higher levels of accuracy from your nets!

Neural Networks Learn Using Gradient Descent

Let’s take a look at a simple neural network where we’ve chosen random values for the weights and the bias:

If given two floating point inputs, we’d calculate the output of the network like this:

Output = Activation(Input0 * Weight0 + Input1 * Weight1 + Bias)

Plugging in the specific values for the weights and biases, it looks like this:

Output = Activation(Input0 * 0.23 + Input1 * -0.1 + 0.3)

Let’s say that we want this network to output a zero when we give an input of 1,0, and that we don’t care what it outputs otherwise. We’ll plug 1 and 0 in for Input0 and Input1 respectively and see what the output of the network is right now:

Output = Activation(1* 0.23 + 0 * -0.1 + 0.3) \\ Output = Activation(0.53)

For the activation function, we are going to use a common one called the sigmoid activation function, which is also sometimes called the logistic activation function. It looks like this:

\sigma(x) = \frac{1}{1+e^{-x}}

Without going into too much detail, the reason why sigmoid is so commonly used is because it’s a smoother and differentiable version of the step function.

Applying that activation function to our output neuron, we get this:

Output = Activation(0.53) \\ Output = \sigma(0.53) \\ Output = 0.6295

So, we plugged in 1 and 0, but instead of getting a 0 out, we got 0.6295. Our weights and biases are wrong, but how do we correct them?

The secret to correcting our weights and biases is whichever of these terms seem least scary to you: slopes, derivatives, gradients.

If “slope” was the least scary term to you, you probably remember the line formula y=mx+b and that the m value was the “rise over run” or the slope of the line. Well believe it or not, that’s all a derivative is. A derivative is just the slope of a function at a specific point on that function. Even if a function is curved, you can pick a point on the graph and get a slope at that point. The notation for a derivative is \frac{dy}{dx}, which literally means “change in y divided by change in x”, or “delta y divided by delta x”, which is literally rise over run.

In the case of a linear function (a line), it has the same derivative over the entire thing, so you can take a step size of any size on the x axis and multiply that step size by \frac{dy}{dx} to figure out how much to add or subtract from y to stay on the line.

In the case of a non linear function, the derivative can change from one point to the next, so this slope is only guaranteed to be accurate for an infinitely small step size. In practice, people just often use “small” step sizes and calling it good enough, which is what we’ll be doing momentarily.

Now that you realize you already knew what a derivative is, we have to talk about partial derivatives. There really isn’t anything very scary about them and they still mean the exact same thing – they are the slope! They are even calculated the exact same way, but they use a fancier looking d in their notation: \frac{\partial y}{\partial x}.

The reason partial derivatives even exist is because if you have a function of multiple variables like z=f(x,y)=x^2+3y+2, you have two variables that you can take the derivative of. You can calculate \frac{\partial z}{\partial x} and \frac{\partial z}{\partial y}. The first value tells you how much the z value changes for a change in x, the second value tells you how much the z value changes for a change in y.

By the way, if you are curious, the partial derivatives for that function above are below. When calculating partial derivatives, any variable that isn’t the one you care about, you just treat as a constant and do normal derivation.

\frac{\partial z}{\partial x} = 2x\\ \frac{\partial z}{\partial y} = 3\\

If you put both of those values together into a vector (\frac{\partial z}{\partial x},\frac{\partial z}{\partial y}) you have what is called the gradient vector.

The gradient vector has an interesting property, which is that it points in the direction that makes the function output grow the most. Basically, if you think of your function as a surface, it points up the steepest direction of the surface, from the point you evaluated the function at.

We are going to use that property to train our neural network by doing the following:

  1. Calculate the gradient of a function that describes the error in our network. This means we will have the partial derivatives of all the weights and biases in the network.
  2. Multiply the gradient by a small “learning rate” value, such as 0.05
  3. Subtract these scaled derivatives from the weights and biases to decrease the error a small amount.

This technique is called steepest gradient descent (SGD) and when we do the above, our error will decrease by a small amount. The only exception is that if we use too large of a learning rate, it’s possible that we make the error grow, but usually the error will decrease.

We will do the above over and over, until either the error is small enough, or we’ve decided we’ve tried enough iterations that we think the neural network is never going to learn the things we want to teach it. If the network doesn’t learn, it means it needs to be re-architected with a different structure, different numbers of neurons and layers, different activation functions, etc. This is part of the “art” that I mentioned earlier.

Before moving on, there is one last thing to talk about: global minimums vs local minimums.

Imagine that the function describing the error in our network is visualized as bumpy ground. When we initialize our weights and biases to random numbers we are basically just choosing a random location on the ground to start at. From there, we act like a ball, and just roll down hill from wherever we are. We are definitely going to get to the bottom of SOME bump / hole in the ground, but there is absolutely no reason to except that we’ll get to the bottom of the DEEPEST bump / hole.

The problem is that SGD will find a LOCAL minimum – whatever we are closest too – but it might not find the GLOBAL minimum.

In practice, this doesn’t seem to be too large of a problem, at least for people casually using neural nets like you and me, but it is one of the active areas of research in neural networks: how do we do better at finding more global minimums?

You might notice the strange language I’m using where I say we have a function that describes the error, instead of just saying we use the error itself. The function I’m talking about is called the “cost function” and the reason for this is that different ways of describing the error give us different desirable properties.

For instance, a common cost function is to use mean squared error of the actual output compared to the desired output.

For a single training example, you plug the input into the network and calculate the output. You then plug the actual output and the target output into the function below:

Cost = ||target-output||^2

In other words, you take the vector of the neuron outputs, subtract it from the actual output that we wanted, calculate the length of the resulting vector and square it. This gives you the squared error.

The reason we use squared error in the cost function is because this way error in either direction is a positive number, so when gradient descent does it’s work, we’ll find the smallest magnitude of error, regardless of whether it’s positive or negative amounts. We could use absolute value, but absolute value isn’t differentiable, while squaring is.

To handle calculating the cost of multiple inputs and outputs, you just take the average of the squared error for each piece of training data. This gives you the mean squared error as the cost function across all inputs. You also average the derivatives to get the combined gradient.

More on Training

Before we go into backpropagation, I want to re-iterate this point: Neural Networks Learn Using Gradient Descent.

All you need is the gradient vector of the cost function, aka the partial derivatives of all the weights and the biases for the cost.

Backpropagation gets you the gradient vector, but it isn’t the only way to do so!

Another way to do it is to use dual numbers which you can read about on my post about them: Multivariable Dual Numbers & Automatic Differentiation.

Using dual numbers, you would evaluate the output of the network, using dual numbers instead of floating point numbers, and at the end you’d have your gradient vector. It’s not quite as efficient as backpropagation (or so I’ve heard, I haven’t tried it), but if you know how dual numbers work, it’s super easy to implement.

Another way to get the gradient vector is by doing so numerically using finite differences. You can read about numerical derivatives on my post here: Finite Differences

Basically what you would do is if you were trying to calculate the partial derivative of a weight, like \frac{\partial Cost}{\partial Weight0}, you would first calculate the cost of the network as usual, then you would add a small value to Weight0 and evaluate the cost again. You subtract the new cost from the old cost, and divide by the small value you added to Weight0. This will give you the partial derivative for that weight value. You’d repeat this for all your weights and biases.

Since realistic neural networks often have MANY MANY weights and biases, calculating the gradient numerically is a REALLY REALLY slow process because of how many times you have to run the network to get cost values with adjusted weights. The only upside is that this method is even easier to implement than dual numbers. You can literally stop reading and go do this right now if you want to 😛

Lastly, there is a way to train neural networks which doesn’t use derivatives or the gradient vector, but instead uses the more brute force-ish method of genetic algorithms.

Using genetic algorithms to train neural networks is a huge topic even to summarize, but basically you create a bunch of random networks, see how they do, and try combining features of networks that did well. You also let some of the “losers” reproduce as well, and add in some random mutation to help stay out of local minimums. Repeat this for many many generations, and you can end up with a well trained network!

Here’s a fun video visualizing neural networks being trained by genetic algorithms: Youtube: Learning using a genetic algorithm on a neural network

Backpropagation is Just the Chain Rule!

Going back to our talk of dual numbers for a second, dual numbers are useful for what is called “forward mode automatic differentiation”.

Backpropagation actually uses “reverse mode automatic differentiation”, so the two techniques are pretty closely tied, but they are both made possible by what is known as the chain rule.

The chain rule basically says that if you can write a derivative like this:

dy/dx

That you can also write it like this:

dy/du*du/dx

That might look weird or confusing, but since we know that derivatives are actual values, aka actual ratios, aka actual FRACTIONS, let’s think back to fractions for a moment.

3/2 = 1.5

So far so good? Now let’s choose some number out of the air – say, 5 – and do the same thing we did with the chain rule
3/2 = \\ 3/5 * 5/2 = \\ 15/10 = \\ 3/2 = \\ 1.5

Due to doing the reverse of cross cancellation, we are able to inject multiplicative terms into fractions (and derivatives!) and come up with the same answer.

Ok, but who cares?

Well, when we are evaluating the output of a neural network for given input, we have lots of equations nested in each other. We have neurons feeding into neurons feeding into neurons etc, with the logistic activation function at each step.

Instead of trying to figure out how to calculate the derivatives of the weights and biases for the entire monster equation (it’s common to have hundreds or thousands of neurons or more!), we can instead calculate derivatives for each step we do when evaluating the network and then compose them together.

Basically, we can break the problem into small bites instead of having to deal with the equation in it’s entirety.

Instead of calculating the derivative of how a specific weight affects the cost directly, we can instead calculate these:

  1. dCost/dOutput: The derivative of how a neuron’s output affects cost
  2. dOutput/dWeightedInput: The derivative of how the weighted input of a neuron affects a neuron’s output
  3. dWeightedInput/dWeight: The derivative of how a weight affects the weighted input of a neuron

Then, when we multiply them all together, we get the real value that we want:
dCost/dOutput * dOutput/dWeightedInput * dWeightedInput/dWeight = dCost/dWeight

Now that we understand all the basic parts of back propagation, I think it’d be best to work through some examples of increasing complexity to see how it all actually fits together!

Backpropagation Example 1: Single Neuron, One Training Example

This example takes one input and uses a single neuron to make one output. The neuron is only trained to output a 0 when given a 1 as input, all other behavior is undefined. This is implemented as the Example1() function in the sample code.

Backpropagation Example 2: Single Neuron, Two Training Examples

Let’s start with the same neural network from last time:

This time, we are going to teach it not only that it should output 0 when given a 1, but also that it should output 1 when given a 0.

We have two training examples, and we are training the neuron to act like a NOT gate. This is implemented as the Example2() function in the sample code.

The first thing we do is calculate the derivatives (gradient vector) for each of the inputs.

We already calculated the “input 1, output 0” derivatives in the last example:
\frac{\partial Cost}{\partial Weight} = 0.1476 \\ \frac{\partial Cost}{\partial Bias} = 0.1476

If we follow the same steps with the “input 0, output 1” training example we get these:
\frac{\partial Cost}{\partial Weight} = 0.0 \\ \frac{\partial Cost}{\partial Bias} = -0.0887

To get the actual derivatives to train the network with, we just average them!
\frac{\partial Cost}{\partial Weight} = 0.0738 \\ \frac{\partial Cost}{\partial Bias} = 0.0294

From there, we do the same adjustments as before to the weight and bias values to get a weight of 0.2631 and a bias of 0.4853.

If you are wondering how to calculate the cost, again you just take the cost of each training example and average them. Adjusting the weight and bias values causes the cost to drop from 0.1547 to 0.1515, so we have made progress.

It takes 10 times as many iterations with these two training examples to get the same level of error as it did with only one training example though.

As we saw in the last example, after 10,000 iterations, the error was 0.007176.

In this example, after 100,000 iterations, the error is 0.007141. At that point, weight is -9.879733 and bias is 4.837278

Backpropagation Example 3: Two Neurons in One Layer

Here is the next example, implemented as Example3() in the sample code. Two input neurons feed to two neurons in a single layer giving two outputs.

Let’s look at how we’d calculate the derivatives needed to train this network using the training example that when we give the network 01 as input that it should give out 10 as output.

First comes the forward pass where we calculate the network’s output when we give it 01 as input.

Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0*0.2+1*0.8+0.5 \\ Z0=1.3 \\ \\ O0=\sigma(1.3) \\ O0=0.7858\\ \\ Z1=input0*weight2+input0*weight3+bias1\\ Z1=0*0.6+1*0.4+0.1\\ Z1=0.5\\ \\ O1=\sigma(0.5)\\ O1=0.6225

Next we calculate a cost. We don’t strictly need to do this step since we don’t use this value during backpropagation, but this will be useful to verify that we’ve improved things after an iteration of training.

Cost=0.5*||target-actual||^2\\ Cost=0.5*||(1,0)-(0.7858,0.6225)||^2\\ Cost=0.5*||(0.2142,-0.6225)||^2\\ Cost=0.5*0.6583^2\\ Cost=0.2167

Now we begin the backwards pass to calculate the derivatives that we’ll need for training.

Let’s calculate dCost/dZ0 aka the error in neuron 0. We’ll do this by calculating dCost/dO0, then dO0/dZ0 and then multiplying them together to get dCost/dZ0. Just like before, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias0.

\frac{\partial Cost}{\partial O0}=O0-target0\\ \frac{\partial Cost}{\partial O0}=0.7858-1\\ \frac{\partial Cost}{\partial O0}=-0.2142\\ \\ \frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.7858 * 0.2142\\ \frac{\partial O0}{\partial Z0} = 0.1683\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = -0.2142 * 0.1683\\ \frac{\partial Cost}{\partial Z0} = -0.0360\\ \\ \frac{\partial Cost}{\partial Bias0} = -0.0360

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1 by multiplying it by dZ0/dWeight0 and dZ0/dWeight1, which are input0 and input1 respectively.

\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0} \\ \frac{\partial Cost}{\partial Weight0} = -0.0360 * 0 \\ \frac{\partial Cost}{\partial Weight0} = 0\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1} \\ \frac{\partial Cost}{\partial Weight1} = -0.0360 * 1 \\ \frac{\partial Cost}{\partial Weight1} = -0.0360

Next we need to calculate dCost/dZ1 aka the error in neuron 1. We’ll do this like before. We’ll calculate dCost/dO1, then dO1/dZ1 and then multiplying them together to get dCost/dZ1. Again, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias1.

\frac{\partial Cost}{\partial O1}=O1-target1\\ \frac{\partial Cost}{\partial O1}=0.6225-0\\ \frac{\partial Cost}{\partial O1}=0.6225\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.6225 * 0.3775\\ \frac{\partial O1}{\partial Z1} = 0.235\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.6225 * 0.235\\ \frac{\partial Cost}{\partial Z1} = 0.1463\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.1463

Just like with neuron 0, we can use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3 by multiplying it by dZ1/dWeight2 and dZ1/dWeight2, which are input0 and input1 respectively.

\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2} \\ \frac{\partial Cost}{\partial Weight2} = 0.1463 * 0 \\ \frac{\partial Cost}{\partial Weight2} = 0\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3} \\ \frac{\partial Cost}{\partial Weight3} = 0.1463 * 1 \\ \frac{\partial Cost}{\partial Weight3} = 0.1463

After using these derivatives to update the weights and biases with a learning rate of 0.5, they become:
Weight0 = 0.2
Weight1 = 0.818
Weight2 = 0.6
Weight3 = 0.3269
Bias0 = 0.518
Bias1 = 0.0269

Using these values, the cost becomes 0.1943, which dropped from 0.2167, so we have indeed made progress with our learning!

Interestingly, it takes about twice as many trainings as example 1 to get a similar level of error. In this case, 20,000 iterations of learning results in an error of 0.007142.

If we have the network learn the four patterns below instead:
00 = 00
01 = 10
10 = 10
11 = 11

It takes 520,000 learning iterations to get to an error of 0.007223.

Backpropagation Example 4: Two Layers, Two Neurons Each

This is the last example, implemented as Example4() in the sample code. Two input neurons feed to two neurons in a hidden layer, feeding into two neurons in the output layer giving two outputs. This is the exact same network that is walked through on this page which is also linked to at the end of this post: A Step by Step Backpropagation Example

First comes the forward pass where we calculate the network’s output. We’ll give it 0.05 and 0.1 as input, and we’ll say our desired output is 0.01 and 0.99.

Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0.05*0.15+0.1*0.2+0.35 \\ Z0=0.3775 \\ \\ O0=\sigma(0.3775) \\ O0=0.5933 \\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0.05*0.25+0.1*0.3+0.35\\ Z1=0.3925\\ \\ O1=\sigma(0.3925)\\ O1=0.5969\\ \\ Z2=O0*weight4+O1*weight5+bias2\\ Z2=0.5933*0.4+0.5969*0.45+0.6\\ Z2=1.106\\ \\ O2=\sigma(1.106)\\ O2=0.7514\\ \\ Z3=O0*weight6+O1*weight7+bias3\\ Z3=0.5933*0.5+0.5969*0.55+0.6\\ Z3=1.225\\ \\ O3=\sigma(1.225)\\ O3=0.7729

Next we calculate the cost, taking O2 and O3 as our actual output, and 0.01 and 0.99 as our target (desired) output.

Cost=0.5*||target-actual||^2\\ Cost=0.5*||(0.01,0.99)-(0.7514,0.7729)||^2\\ Cost=0.5*||(-0.7414,-0.2171)||^2\\ Cost=0.5*0.7725^2\\ Cost=0.2984

Now we start the backward pass to calculate the derivatives for training.

Neuron 2

First we’ll calculate dCost/dZ2 aka the error in neuron 2, remembering that the value is also dCost/dBias2.

\frac{\partial Cost}{\partial O2}=O2-target0\\ \frac{\partial Cost}{\partial O2}=0.7514-0.01\\ \frac{\partial Cost}{\partial O2}=0.7414\\ \\ \frac{\partial O2}{\partial Z2} = O2 * (1-O2)\\ \frac{\partial O2}{\partial Z2} = 0.7514 * 0.2486\\ \frac{\partial O2}{\partial Z2} = 0.1868\\ \\ \frac{\partial Cost}{\partial Z2} = \frac{\partial Cost}{\partial O2} * \frac{\partial O2}{\partial Z2}\\ \frac{\partial Cost}{\partial Z2} = 0.7414 * 0.1868\\ \frac{\partial Cost}{\partial Z2} = 0.1385\\ \\ \frac{\partial Cost}{\partial Bias2} = 0.1385

We can use dCost/dZ2 to calculate dCost/dWeight4 and dCost/dWeight5.

\frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight4}\\ \frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * O0\\ \frac{\partial Cost}{\partial Weight4} = 0.1385 * 0.5933\\ \frac{\partial Cost}{\partial Weight4} = 0.0822\\ \\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight5}\\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * O1\\ \frac{\partial Cost}{\partial Weight5} = 0.1385 * 0.5969\\ \frac{\partial Cost}{\partial Weight5} = 0.0827\\

Neuron 3

Next we’ll calculate dCost/dZ3 aka the error in neuron 3, which is also dCost/dBias3.

\frac{\partial Cost}{\partial O3}=O3-target1\\ \frac{\partial Cost}{\partial O3}=0.7729-0.99\\ \frac{\partial Cost}{\partial O3}=-0.2171\\ \\ \frac{\partial O3}{\partial Z3} = O3 * (1-O3)\\ \frac{\partial O3}{\partial Z3} = 0.7729 * 0.2271\\ \frac{\partial O3}{\partial Z3} = 0.1755\\ \\ \frac{\partial Cost}{\partial Z3} = \frac{\partial Cost}{\partial O3} * \frac{\partial O3}{\partial Z3}\\ \frac{\partial Cost}{\partial Z3} = -0.2171 * 0.1755\\ \frac{\partial Cost}{\partial Z3} = -0.0381\\ \\ \frac{\partial Cost}{\partial Bias3} = -0.0381

We can use dCost/dZ3 to calculate dCost/dWeight6 and dCost/dWeight7.

\frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight6}\\ \frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * O0\\ \frac{\partial Cost}{\partial Weight6} = -0.0381 * 0.5933\\ \frac{\partial Cost}{\partial Weight6} = -0.0226\\ \\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight7}\\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * O1\\ \frac{\partial Cost}{\partial Weight7} = -0.0381 * 0.5969\\ \frac{\partial Cost}{\partial Weight7} = -0.0227\\

Neuron 0

Next, we want to calculate dCost/dO0, but doing that requires us to do something new. Neuron 0 affects both neuron 2 and neuron 3, which means that it affects the cost through those two neurons as well. That means our calculation for dCost/dO0 is going to be slightly different, where we add the derivatives of both paths together. Let’s work through it:

\frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O0} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O0}\\ \frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * Weight4 + \frac{\partial Cost}{\partial Z3} * Weight6\\ \frac{\partial Cost}{\partial O0} = 0.1385 * 0.4 - 0.0381 * 0.5\\ \frac{\partial Cost}{\partial O0} = 0.0364

We can then continue and calculate dCost/dZ0, which is also dCost/dBias0, and the error in neuron 0.

\frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.5933 * 0.4067\\ \frac{\partial O0}{\partial Z0} = 0.2413\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = 0.0364 * 0.2413\\ \frac{\partial Cost}{\partial Z0} = 0.0088\\ \\ \frac{\partial Cost}{\partial Bias0} = 0.0088

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1.

\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0}\\ \frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * input0\\ \frac{\partial Cost}{\partial Weight0} = 0.0088 * 0.05\\ \frac{\partial Cost}{\partial Weight0} = 0.0004\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1}\\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * input1\\ \frac{\partial Cost}{\partial Weight1} = 0.0088 * 0.1\\ \frac{\partial Cost}{\partial Weight1} = 0.0009\\

Neuron 1

We are almost done, so hang in there. For our home stretch, we need to calculate dCost/dO1 similarly as we did for dCost/dO0, and then use that to calculate the derivatives of bias1 and weight2 and weight3.

\frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O1} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O1}\\ \frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * Weight5 + \frac{\partial Cost}{\partial Z3} * Weight7\\ \frac{\partial Cost}{\partial O1} = 0.1385 * 0.45 - 0.0381 * 0.55\\ \frac{\partial Cost}{\partial O1} = 0.0414\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.5969 * 0.4031\\ \frac{\partial O1}{\partial Z1} = 0.2406\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.0414 * 0.2406\\ \frac{\partial Cost}{\partial Z1} = 0.01\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.01

Lastly, we will use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3.

\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2}\\ \frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * input0\\ \frac{\partial Cost}{\partial Weight2} = 0.01 * 0.05\\ \frac{\partial Cost}{\partial Weight2} = 0.0005\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3}\\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * input1\\ \frac{\partial Cost}{\partial Weight3} = 0.01 * 0.1\\ \frac{\partial Cost}{\partial Weight3} = 0.001\\

Backpropagation Done

Phew, we have all the derivatives we need now.

Here’s our new weights and biases using a learning rate of 0.5:

Weight0 = 0.15 – (0.5 * 0.0004) = 0.1498
Weight1 = 0.2 – (0.5 * 0.0009) = 0.1996
Weight2 = 0.25 – (0.5 * 0.0005) = 0.2498
Weight3 = 0.3 – (0.5 * 0.001) = 0.2995
Weight4 = 0.4 – (0.5 * 0.0822) = 0.3589
Weight5 = 0.45 – (0.5 * 0.0827) = 0.4087
Weight6 = 0.5 – (0.5 * -0.0226) = 0.5113
Weight7 = 0.55 – (0.5 * -0.0227) = 0.5614
Bias0 = 0.35 – (0.5 * 0.0088) = 0.3456
Bias1 = 0.35 – (0.5 * 0.01) = 0.345
Bias2 = 0.6 – (0.5 * 0.1385) = 0.5308
Bias3 = 0.6 – (0.5 * -0.0381) = 0.6191

Using these new values, the cost function value drops from 0.2984 to 0.2839, so we have made progress!

Interestingly, it only takes 5,000 iterations of learning for this network to reach an error of 0.007157, when it took 10,000 iterations of learning for example 1 to get to 0.007176.

Before moving on, take a look at the weight adjustments above. You might notice that the derivatives for the weights are much smaller for weights 0,1,2,3 compared to weights 4,5,6,7. The reason for this is because weights 0,1,2,3 appear earlier in the network. The problem is that earlier layer neurons don’t learn as fast as later layer neurons and this is caused by the nature of the neuron activation functions – specifically, that the sigmoid function has a long tail near 0 and 1 – and is called the “vanishing gradient problem”. The opposite effect can also happen however, where earlier layer gradients explode to super huge numbers, so the more general term is called the “unstable gradient problem”. This is an active area of research on how to address, and this becomes more and more of a problem the more layers you have in your network.

You can use other activation functions such as tanh, identity, relu and others to try and get around this problem. If trying different activation functions, the forward pass (evaluation of a neural network) as well as the backpropagation of error pass remain the same, but of course the calculation for getting O from Z changes, and of course, calculating the derivative deltaO/deltaZ becomes different. Everything else remains the same.

Check the links at the bottom of the post for more information about this!

Sample Code

Below is the sample code which implements all the back propagation examples we worked through above.

Note that this code is meant to be readable and understandable. The code is not meant to be re-usable or highly efficient.

A more efficient implementation would use SIMD instructions, multithreading, stochastic gradient descent, and other things.

It’s also useful to note that calculating a neuron’s Z value is actually a dot product and an addition and that the addition can be handled within the dot product by adding a “fake input” to each neuron that is a constant of 1. This lets you do a dot product to calculate the Z value of a neuron, which you can take further and combine into matrix operations to calculate multiple neuron values at once. You’ll often see neural networks described in matrix notation because of this, but I have avoided that in this post to try and make things more clear to programmers who may not be as comfortable thinking in strictly matrix notation.

#include <stdio.h>
#include <array>

// Nonzero value enables csv logging.
#define LOG_TO_CSV_NUMSAMPLES() 50

// ===== Example 1 - One Neuron, One training Example =====

void Example1RunNetwork (
	float input, float desiredOutput,
	float weight, float bias,
	float& error, float& cost, float& actualOutput,
	float& deltaCost_deltaWeight, float& deltaCost_deltaBias, float& deltaCost_deltaInput
) {
	// calculate Z (weighted input) and O (activation function of weighted input) for the neuron
	float Z = input * weight + bias;
	float O = 1.0f / (1.0f + std::exp(-Z));

	// the actual output of the network is the activation of the neuron
	actualOutput = O;

	// calculate error
	error = std::abs(desiredOutput - actualOutput);

	// calculate cost
	cost = 0.5f * error * error;

	// calculate how much a change in neuron activation affects the cost function
	// deltaCost/deltaO = O - target
	float deltaCost_deltaO = O - desiredOutput;

	// calculate how much a change in neuron weighted input affects neuron activation
	// deltaO/deltaZ = O * (1 - O)
	float deltaO_deltaZ = O * (1 - O);


	// calculate how much a change in a neuron's weighted input affects the cost function.
	// This is deltaCost/deltaZ, which equals deltaCost/deltaO * deltaO/deltaZ
	// This is also deltaCost/deltaBias and is also refered to as the error of the neuron
	float neuronError = deltaCost_deltaO * deltaO_deltaZ;
	deltaCost_deltaBias = neuronError;

	// calculate how much a change in the weight affects the cost function.
	// deltaCost/deltaWeight = deltaCost/deltaO * deltaO/deltaZ * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * input
	deltaCost_deltaWeight = neuronError * input;


	// As a bonus, calculate how much a change in the input affects the cost function.
	// Follows same logic as deltaCost/deltaWeight, but deltaZ/deltaInput is the weight.
	// deltaCost/deltaInput = neuronError * weight
	deltaCost_deltaInput = neuronError * weight;
}

void Example1 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example1.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 10000;

	// training data
	// input: 1, output: 0
	const std::array<float, 2> c_trainingData = {1.0f, 0.0f};

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float error = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		// run the network to get error and derivatives
		float output = 0.0f;
		float cost = 0.0f;
		float deltaCost_deltaWeight = 0.0f;
		float deltaCost_deltaBias = 0.0f;
		float deltaCost_deltaInput = 0.0f;
		Example1RunNetwork(c_trainingData[0], c_trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, error, cost, weight, bias, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= deltaCost_deltaWeight * c_learningRate;
		bias -= deltaCost_deltaBias * c_learningRate;
	}

	printf("Example1 Final Error: %fn", error);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 2 - One Neuron, Two training Examples =====

void Example2 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example2.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 100000;

	// training data
	// input: 1, output: 0
	// input: 0, output: 1
	const std::array<std::array<float, 2>, 2> c_trainingData = { {
		{1.0f, 0.0f},
		{0.0f, 1.0f}
	} };

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		avgError = 0.0f;
		float avgOutput = 0.0f;
		float avgCost = 0.0f;
		float avgDeltaCost_deltaWeight = 0.0f;
		float avgDeltaCost_deltaBias = 0.0f;
		float avgDeltaCost_deltaInput = 0.0f;

		// run the network to get error and derivatives for each training example
		for (const std::array<float, 2>& trainingData : c_trainingData)
		{
			float error = 0.0f;
			float output = 0.0f;
			float cost = 0.0f;
			float deltaCost_deltaWeight = 0.0f;
			float deltaCost_deltaBias = 0.0f;
			float deltaCost_deltaInput = 0.0f;
			Example1RunNetwork(trainingData[0], trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

			avgError += error;
			avgOutput += output;
			avgCost += cost;
			avgDeltaCost_deltaWeight += deltaCost_deltaWeight;
			avgDeltaCost_deltaBias += deltaCost_deltaBias;
			avgDeltaCost_deltaInput += deltaCost_deltaInput;
		}

		avgError /= (float)c_trainingData.size();
		avgOutput /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		avgDeltaCost_deltaWeight /= (float)c_trainingData.size();
		avgDeltaCost_deltaBias /= (float)c_trainingData.size();
		avgDeltaCost_deltaInput /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, avgError, avgCost, weight, bias, avgDeltaCost_deltaWeight, avgDeltaCost_deltaBias, avgDeltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= avgDeltaCost_deltaWeight * c_learningRate;
		bias -= avgDeltaCost_deltaBias * c_learningRate;
	}

	printf("Example2 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 3 - Two inputs, two neurons in one layer =====

struct SExample3Training
{
	std::array<float, 2> m_input;
	std::array<float, 2> m_output;
};

void Example3RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 4>& weights, const std::array<float, 2>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 4>& deltaCost_deltaWeights, std::array<float, 2>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {

	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// the actual output of the network is the activation of the neurons
	actualOutput[0] = O0;
	actualOutput[1] = O1;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = O0 - target0
	float deltaCost_deltaO0 = O0 - desiredOutput[0];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = O1 - target1
	float deltaCost_deltaO1 = O1 - desiredOutput[1];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO0/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO/deltaZ1 * deltaZ0/deltaWeight1
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ/deltaWeight1
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example3 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example3.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 520000;

	// training data: OR/AND
	// input: 00, output: 00
	// input: 01, output: 10
	// input: 10, output: 10
	// input: 11, output: 11
	const std::array<SExample3Training, 4> c_trainingData = { {
		{{0.0f, 0.0f}, {0.0f, 0.0f}},
		{{0.0f, 1.0f}, {1.0f, 0.0f}},
		{{1.0f, 0.0f}, {1.0f, 0.0f}},
		{{1.0f, 1.0f}, {1.0f, 1.0f}},
	} };

	// starting weight and bias values
	std::array<float, 4> weights = { 0.2f, 0.8f, 0.6f, 0.4f };
	std::array<float, 2> biases = { 0.5f, 0.1f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		//float avgCost = 0.0f;
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaBiases = { 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 4> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaBiases = { 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example3RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example3 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 4 - Two layers with two neurons in each layer =====

void Example4RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 8>& weights, const std::array<float, 4>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 8>& deltaCost_deltaWeights, std::array<float, 4>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {
	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// calculate Z2 and O2 for neuron2
	float Z2 = O0 * weights[4] + O1 * weights[5] + biases[2];
	float O2 = 1.0f / (1.0f + std::exp(-Z2));

	// calculate Z3 and O3 for neuron3
	float Z3 = O0 * weights[6] + O1 * weights[7] + biases[3];
	float O3 = 1.0f / (1.0f + std::exp(-Z3));

	// the actual output of the network is the activation of the output layer neurons
	actualOutput[0] = O2;
	actualOutput[1] = O3;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 2 -----

	// calculate how much a change in neuron 2 activation affects the cost function
	// deltaCost/deltaO2 = O2 - target0
	float deltaCost_deltaO2 = O2 - desiredOutput[0];

	// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
	// deltaO2/deltaZ2 = O2 * (1 - O2)
	float deltaO2_deltaZ2 = O2 * (1 - O2);

	// calculate how much a change in neuron 2 weighted input affects the cost function.
	// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
	// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
	float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
	deltaCost_deltaBiases[2] = neuron2Error;

	// calculate how much a change in weight4 affects the cost function.
	// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * O0
	// similar thing for weight5
	deltaCost_deltaWeights[4] = neuron2Error * O0;
	deltaCost_deltaWeights[5] = neuron2Error * O1;

	//----- Neuron 3 -----

	// calculate how much a change in neuron 3 activation affects the cost function
	// deltaCost/deltaO3 = O3 - target1
	float deltaCost_deltaO3 = O3 - desiredOutput[1];

	// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
	// deltaO3/deltaZ3 = O3 * (1 - O3)
	float deltaO3_deltaZ3 = O3 * (1 - O3);

	// calculate how much a change in neuron 3 weighted input affects the cost function.
	// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
	// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
	float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
	deltaCost_deltaBiases[3] = neuron3Error;

	// calculate how much a change in weight6 affects the cost function.
	// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * O0
	// similar thing for weight7
	deltaCost_deltaWeights[6] = neuron3Error * O0;
	deltaCost_deltaWeights[7] = neuron3Error * O1;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
	// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
	float deltaCost_deltaO0 = neuron2Error * weights[4] + neuron3Error * weights[6];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
	// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
	float deltaCost_deltaO1 = neuron2Error * weights[5] + neuron3Error * weights[7];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO1/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example4 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example4.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 5000;

	// training data: 0.05, 0.1 in = 0.01, 0.99 out
	const std::array<SExample3Training, 1> c_trainingData = { {
		{{0.05f, 0.1f}, {0.01f, 0.99f}},
	} };

	// starting weight and bias values
	std::array<float, 8> weights = { 0.15f, 0.2f, 0.25f, 0.3f, 0.4f, 0.45f, 0.5f, 0.55f};
	std::array<float, 4> biases = { 0.35f, 0.35f, 0.6f, 0.6f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 8> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 8> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 4> deltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example4RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example4 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

int main (int argc, char **argv)
{
	Example1();
	Example2();
	Example3();
	Example4();
	system("pause");
	return 0;
}

Closing & Links

The sample code outputs csv files showing how the values of the networks change over time. One of the reasons for this is because I want to show you error over time.

Below is example 4’s error over time, as we do it’s 5,000 learning iterations.

The other examples show a similarly shaped graph, where there is a lot of learning in the very beginning, and then there is a very long tail of learning very slowly.

When you train neural networks as I’ve described them, you will almost always see this, and sometimes will also see a slow learning time at the BEGINNING of the training.

This issue is also due to the activation function used, just like the unstable gradient problem, and is also an active area of research.

To help fix this issue, there is something called a “cross entropy cost function” which you can use instead of the mean squared error cost function I have been using.

That cost function essentially cancels out the non linearity of the activation function so that you get nicer linear learning progress, and can get networks to learn more quickly and evenly. However, it only cancels out the non linearity for the LAST layer in the network. This means it’s still a problem for networks that have more layers.

Lastly, there is an entirely different thing you can use backpropagation for. We adjusted the weights and biases to get our desired output for the desired inputs. What if instead we adjusted our inputs to give us the desired outputs?

You can do that by using backpropagation to calculate the dCost/dInput derivatives and using those to adjust the input, in the exact same way we adjusted the weights and biases.

You can use this to do some interesting things, including:

  1. finding images that a network will recognize as a familiar object, that a human wouldn’t. Start with static as input to the network, and adjust inputs to give the desired output.
  2. Modifying images that a network recognizes, into images it doesn’t recognize, but a human would. Start with a well recognized image, and adjust inputs using gradient ASCENT (add the derivatives, don’t subtract them) until the network stops recognizing it.

Believe it or not, this is how all those creepy “deep dream” images were made that came out of google as well, like the one below.

Now that you know the basics, you are ready to learn some more if you are interested. If you still have some questions about things I did or didn’t talk about, these resources might help you make sense of it too. I used these resources and they were all very helpful! You can also give me a shout in the comments below, or on twitter at @Atrix256.

A Step by Step Backpropagation Example
Neural Networks and Deep Learning
Backpropogation is Just Steepest Descent with Automatic Differentiation
Chain Rule
Deep Vis

Multivariable Dual Numbers & Automatic Differentiation

In a previous post I showed how to use dual numbers to be able to get both the value and derivative of a function at the same time:
Dual Numbers & Automatic Differentiation

That post mentions that you can extend it to multivariable functions but doesn’t explain how. This post is that explanation, including simple working C++ code!

Extending this to multivariable functions is useful for ray marching, calculating analytical surface normals and also likely useful for training a neural network if you want an alternative to back propagation. I’m not sure about the efficiency comparison of this versus back propagation but I intend on looking into it (:

How Does it Work?

It turns out to be really simple to use dual numbers with multivariable functions. The end result is that you want a partial derivative for each variable in the equation, so to do that, you just have a dual number per variable, and process the entire equation for each of those dual numbers separately.

We’ll work through an example. Let’s find the partial derivatives of x and y of the function 3x^2-2y^3, at input (5,2).

We’ll start by finding the derivative of x, and then the derivative of y.

Example: df/dx

We start by making a dual number for our x value, remembering that the real part is the actual value for x, and the dual part is the derivative of x, which is 1:

5+1\epsilon

or:

5+\epsilon

We multiply that value by itself to get the x^2 value, keeping in mind that \epsilon^2 is zero:
(5+\epsilon)*(5+\epsilon)= \\ 25+10\epsilon+\epsilon^2= \\ 25+10\epsilon \\

Next we need to multiply that by 3 to get the 3x^2 term:

3*(25+10\epsilon) = 75+30\epsilon

Putting that aside for a moment, we need to make the 2y^3 term. We start by making our y value:

2+0\epsilon

or:

2

If you are wondering why it has a zero for the epsilon term, it’s because when we are calculating the partial derivative of x, y is a constant, so has a derivative of zero.

Next, we multiply this y value by itself twice to get the y^3 value:

2*2*2=8

We then multiply it by 2 to get the 2y^3 term:

8*2=16

Now that we have our two terms, we subtract the y term from the x term to get our final result:

75+30\epsilon-16 = \\ 59+30\epsilon

This result says that 3x^2-2y^3 has a value of 59 at location (5,2), and that the derivative of x at that point is 30.

That checks out, let’s move on to the derivative of y!

Example: df/dy

Calculating the derivative of y is very similar to calculating the derivative of x, except that it’s the x term that has an epsilon value (derivative) of 0, instead of the y term. The y term has the epsilon value of 1 this time as well. We’ll work through it to see how it plays out.

First up, we need to make the value for x:

5+0\epsilon

or:

5

Next we square it and multiply it by 3 to get the 3x^2 term:

5*5*3=75

Next we need to make the value for y, remembering that we use an epsilon value of 1 since the derivative of y is 1 this time around.

2+\epsilon

We cube that value and multiply by 2 to get the 2y^3 term:
2*(2+\epsilon)*(2+\epsilon)*(2+\epsilon)= \\ 2*(2+\epsilon)*(4+4\epsilon+\epsilon^2)= \\ 2*(2+\epsilon)*(4+4\epsilon)= \\ 2*(8+12\epsilon+4\epsilon^2)= \\ 2*(8+12\epsilon)= \\ 16+24\epsilon

Now we subtract the y term from the x term to get the final result:

75 - (16+24\epsilon)= \\ 59-24\epsilon

This result says that 3x^2-2y^3 has a value of 59 at location (5,2), and that the derivative of y at that point is -24.

That also checks out, so we got the correct value and partial derivatives for the equation.

Reducing Redundancy

There was quite a bit of redundancy when working through the x and y derivatives wasn’t there? Increasing the number of variables will increase the amount of redundancy too, so it doesn’t scale up well.

Luckily there is a way to address this. Basically, instead of making two dual numbers which have two items, you make them share the real value (since it’s the same for both, as is the work to make it) and append the dual values for x and y to it.

Thus, instead of having:

x'=(a+b\epsilon) \\ y'=(a+b\epsilon)

You have:

(a+b\epsilon_x+c\epsilon_y)

Then, in your math or in your program, you treat it as if it’s two different dual numbers packed into one. This lets you do the work for the real number once instead of twice, but still lets you do your dual number work for each variable independently.

While it’s probably easiest to think of these as two dual numbers packed into one value, there is actually a mathematical basis for it as well, which may or may not surprise you.

Check out what happens when we multiply two of these together, keeping in mind that multiplying ANY two epsilon values together becomes zero, even if they are different epsilons:

(a+b\epsilon_x+c\epsilon_y) * (d+e\epsilon_x+f\epsilon_y)= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + be\epsilon_x^2 + bf\epsilon_x\epsilon_y + cd\epsilon_y + ce\epsilon_x\epsilon_y + cf\epsilon_y^2= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + cd\epsilon_y= \\ ad + (ae+bd)\epsilon_x + (af+cd)\epsilon_y

The interesting thing is that the above result gives you the same values as if you did the same work for two dual numbers individually.

Let’s see this three component dual number in action by re-doing the example again. Note that this pattern scales up to ANY number of variables!

Example: Both Derivatives (Gradient Vector)

Our goal is to calculate the value and partial derivatives of the function 3x^2-2y^3 at location (5,2).

First we make our x value:

5 + 1\epsilon_x + 0\epsilon_y

or:

5 + \epsilon_x

We square that and multiply it by 3 to get our 3x^2 term:

3*(5 + \epsilon_x)*(5 + \epsilon_x)= \\ 3*(25+10\epsilon_x+\epsilon_x^2)= \\ 3*(25+10\epsilon_x)= \\ 75+30\epsilon_x

Next, we make our y value:

2 + 0\epsilon_x + 1\epsilon_y

or:

2 + \epsilon_y

We cube it and multiply it by 2 to get our 2x^3 term:

16+24\epsilon_y

Lastly we subtract the y term from the x term to get our final answer:

(75+30\epsilon_x) - (16+24\epsilon_y)= \\ 59+30\epsilon_x-24\epsilon_y

The result says that 3x^2-2y^3 has a value of 59 at location (5,2), and that the derivative of x at that point is 30, and the derivative of y at that point is -24.

Neat, right?!

Example Code

Here is the example code output:

Here is the code that generated it:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>
 
#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

	// constructor to make a constant
	CDualNumber (float f = 0.0f) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
	}

	// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
	CDualNumber (float f, size_t variableIndex) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
		m_dual[variableIndex] = 1.0f;
	}

	// storage for real and dual values
	float							m_real;
	std::array<float, NUMVARIABLES> m_dual;
};
 
//----------------------------------------------------------------------
// Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real + b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
	return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real - b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real * b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real / b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	float sqrtReal = sqrt(a.m_real);
	ret.m_real = sqrtReal;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = pow(a.m_real, y);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = sin(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = cos(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = tan(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = tan(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
	return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
	return x * x * (T(3.0f) - T(2.0f) * x);
}
 
//----------------------------------------------------------------------
// Test Functions
//----------------------------------------------------------------------
 
void TestSmoothStep (float input)
{
	// create a dual number as the value of x
	CDualNumber<1> x(input, 0);

	// calculate value and derivative using dual numbers
	CDualNumber<1> y = SmoothStep(x);

	// calculate numeric derivative using central differences
	float derivNumeric = (SmoothStep(input + EPSILON) - SmoothStep(input - EPSILON)) / (2.0f * EPSILON);

	// calculate actual derivative
	float derivActual = 6.0f * input - 6.0f * input * input;

	// show value and derivatives
	printf("(smoothstep) y=3x^2-2x^3  (x=%0.4f)n", input);
	printf("  y = %0.4fn", y.m_real);
	printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
	printf("  actual dy/dx = %0.4fn", derivActual);
	printf("  numeric dy/dx = %0.4fnn", derivNumeric);
}

void TestTrig (float input)
{
	// create a dual number as the value of x
	CDualNumber<1> x(input, 0);

	// sin
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = sin(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (sin(input + EPSILON) - sin(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = cos(input);

		// show value and derivatives
		printf("sin(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// cos
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = cos(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (cos(input + EPSILON) - cos(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = -sin(input);

		// show value and derivatives
		printf("cos(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// tan
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = tan(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (tan(input + EPSILON) - tan(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = 1.0f / (cos(input)*cos(input));

		// show value and derivatives
		printf("tan(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// atan
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = atan(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (atan(input + EPSILON) - atan(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = 1.0f / (1.0f + input * input);

		// show value and derivatives
		printf("atan(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}
}

void TestSimple (float input)
{
	// create a dual number as the value of x
	CDualNumber<1> x(input, 0);

	// sqrt
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = CDualNumber<1>(3.0f) / sqrt(x);

		// calculate numeric derivative using central differences
		float derivNumeric = ((3.0f / sqrt(input + EPSILON)) - (3.0f / sqrt(input - EPSILON))) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = -3.0f / (2.0f * pow(input, 3.0f / 2.0f));

		// show value and derivatives
		printf("3/sqrt(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// pow
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = pow(x + CDualNumber<1>(1.0f), 1.337f);

		// calculate numeric derivative using central differences
		float derivNumeric = ((pow(input + 1.0f + EPSILON, 1.337f)) - (pow(input + 1.0f - EPSILON, 1.337f))) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = 1.337f * pow(input + 1.0f, 0.337f);

		// show value and derivatives
		printf("(%0.4f+1)^1.337 = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}
}

void Test2D (float inputx, float inputy)
{
	// create dual numbers as the value of x and y
	CDualNumber<2> x(inputx, 0);
	CDualNumber<2> y(inputy, 1);

	// z = 3x^2 - 2y^3
	{
		// calculate value and partial derivatives using dual numbers
		CDualNumber<2> z = CDualNumber<2>(3.0f) * x * x - CDualNumber<2>(2.0f) * y * y * y;

		// calculate numeric partial derivatives using central differences
		auto f = [] (float x, float y) {
			return 3.0f * x * x - 2.0f * y * y * y;
		};
		float derivNumericX = (f(inputx + EPSILON, inputy) - f(inputx - EPSILON, inputy)) / (2.0f * EPSILON);
		float derivNumericY = (f(inputx, inputy + EPSILON) - f(inputx, inputy - EPSILON)) / (2.0f * EPSILON);

		// calculate actual partial derivatives
		float derivActualX = 6.0f * inputx;
		float derivActualY = -6.0f * inputy * inputy;

		// show value and derivatives
		printf("z=3x^2-2y^3 (x = %0.4f, y = %0.4f)n", inputx, inputy);
		printf("  z = %0.4fn", z.m_real);
		printf("  dual# dz/dx = %0.4fn", z.m_dual[0]);
		printf("  dual# dz/dy = %0.4fn", z.m_dual[1]);
		printf("  actual dz/dx = %0.4fn", derivActualX);
		printf("  actual dz/dy = %0.4fn", derivActualY);
		printf("  numeric dz/dx = %0.4fn", derivNumericX);
		printf("  numeric dz/dy = %0.4fnn", derivNumericY);
	}
}

void Test3D (float inputx, float inputy, float inputz)
{
	// create dual numbers as the value of x and y
	CDualNumber<3> x(inputx, 0);
	CDualNumber<3> y(inputy, 1);
	CDualNumber<3> z(inputz, 2);

	// w = sin(x*cos(2*y)) / tan(z)
	{
		// calculate value and partial derivatives using dual numbers
		CDualNumber<3> w = sin(x * cos(CDualNumber<3>(2.0f)*y)) / tan(z);

		// calculate numeric partial derivatives using central differences
		auto f = [] (float x, float y, float z) {
			return sin(x*cos(2.0f*y)) / tan(z);
		};
		float derivNumericX = (f(inputx + EPSILON, inputy, inputz) - f(inputx - EPSILON, inputy, inputz)) / (2.0f * EPSILON);
		float derivNumericY = (f(inputx, inputy + EPSILON, inputz) - f(inputx, inputy - EPSILON, inputz)) / (2.0f * EPSILON);
		float derivNumericZ = (f(inputx, inputy, inputz + EPSILON) - f(inputx, inputy, inputz - EPSILON)) / (2.0f * EPSILON);

		// calculate actual partial derivatives
		float derivActualX = cos(inputx*cos(2.0f*inputy))*cos(2.0f * inputy) / tan(inputz);
		float derivActualY = cos(inputx*cos(2.0f*inputy)) *-2.0f*inputx*sin(2.0f*inputy) / tan(inputz);
		float derivActualZ = sin(inputx * cos(2.0f * inputy)) / -(sin(inputz) * sin(inputz));

		// show value and derivatives
		printf("w=sin(x*cos(2*y))/tan(z) (x = %0.4f, y = %0.4f, z = %0.4f)n", inputx, inputy, inputz);
		printf("  w = %0.4fn", w.m_real);
		printf("  dual# dw/dx = %0.4fn", w.m_dual[0]);
		printf("  dual# dw/dy = %0.4fn", w.m_dual[1]);
		printf("  dual# dw/dz = %0.4fn", w.m_dual[2]);
		printf("  actual dw/dx = %0.4fn", derivActualX);
		printf("  actual dw/dy = %0.4fn", derivActualY);
		printf("  actual dw/dz = %0.4fn", derivActualZ);
		printf("  numeric dw/dx = %0.4fn", derivNumericX);
		printf("  numeric dw/dy = %0.4fn", derivNumericY);
		printf("  numeric dw/dz = %0.4fnn", derivNumericZ);
	}
}

int main (int argc, char **argv)
{
	TestSmoothStep(0.5f);
	TestSmoothStep(0.75f);
	TestTrig(PI * 0.25f);
	TestSimple(3.0f);
	Test2D(1.5f, 3.28f);
	Test3D(7.12f, 8.93f, 12.01f);
    return 0;
}

Closing

One of the neatest things about dual numbers is that they give precise results. They are not approximations and they are not numerical methods, unlike the central differences method that I compared them to in the example program (More info on numerical derivatives here: Finite Differences). Using dual numbers gives you exact derivatives, within the limitations of (eg) floating point math.

It turns out that backpropagation (the method that is commonly used to train neural networks) is just steepest gradient descent. You can read about that here: Backpropogation is Just Steepest Descent with Automatic Differentiation

That makes me wonder how dual numbers would do in run time speed compared to back propagation as well as numerical methods for getting the gradient to adjust a neural network during training.

If I had to guess, I’d say that dual numbers may be slightly slower than backpropagation, but not as slow as numerical methods which are going to be much, much slower. We’ll see though. It may be much easier to implement neural network learning using dual numbers compared to backpropagation, so may be worth an exploration and write up, even if only to make neural networks a little bit more accessible to people.

Comments, corrections, etc? Let me know in the comments below, or contact me on twitter at @Atrix256

A Geometric Interpretation of Neural Networks

In the 90s before I was a professional programmer / game developer I looked at neural networks and found them interesting but got scared off by things like back propagation, which I wasn’t yet ready to understand.

With all the interesting machine learning things going on in modern times, I decided to have a look again and have been pleasantly surprised at how simple they are to understand. If you have knowledge of partial derivatives and gradients (like, if you’ve done any ray marching), you have the knowledge it takes to understand it.

Here are some really great resources I recomend for learning the nuts and bolts of how modern neural networks actually work:
Learn TensorFlow and deep learning, without a Ph.D.
Neural Networks and Deep Learning
A Neural Network Playground (Web Based NN sand box from google)

This post doesn’t require any understanding of neural networks or partial derivatives, so don’t worry if you don’t have that knowledge. The harder math comes up when training a neural network, but we are only going to be dealing with evaluating neural networks, which is much simpler.

A Geometric Interpretation of a Neuron

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron.

To calculate the output value of a neuron, you multiply every input into that neuron by a weight for that input, and add a bias. This value is fed into an activation function (more on activation functions shortly) and the result is the output value of that neuron. Here is a diagram for a single neuron:

A more formal definition of a neuron’s output is below. b is the bias, w_j is the j’th weight and i_j is the j’th input.
Output = b+\sum_{j=0}^nw_ji_j

You might notice that the above is really just the dot product of the weight vector and the input vector, with a value added on the end. We could re-write it like that:
Output = Dot(w,i)+b

Interestingly, that is the same equation that you use to find the distance of a point to a plane. Let’s say that we have a plane defined by a unit length normal N and a distance to the origin d, and we want to calculate the distance of a point P to that plane. We’d use this formula:
Distance = Dot(N,P)+d

This would give us a signed distance, where the value will be negative if we are in the negative half space defined by the plane, and positive otherwise.

This equation works if you are working in 3 dimensional space, but also works in general for any N dimensional point and plane definition.

What does this mean? Well, this tells us that every neuron in a neural network is essentially deciding what side of a hyperplane a point is on. Each neuron is doing a linear classification, saying if something is on side A or side B, and giving a distance of how far it is into A or B.

This also means that when you combine multiple neurons into a network, that an output neuron of that neural network tells you whether the input point is inside or outside of some shape, and by how much.

I find this interesting for two reason.

Firstly, it means that a neural network can be interpreted as encoding SHAPES, which means it could be used for modeling shapes. I’m interested in seeing what sort of shapes it’s capable of, and any sorts of behaviors this representation might have. I don’t expect it to be useful for, say, main stream game development (bonus if it is useful!) but at minimum it ought to be an interesting investigation to help understand neural networks a bit better.

Secondly, there is another machine learning algorithm called Support Vector Machines which are also based on being able to tell you which side of a separation a data point is on. However, unlike the above, SVM separations are not limited to plane tests and can use arbitrary shapes for separation. Does this mean that we are leaving something on the table for neural networks? Could we do better than we are to make networks with fewer layers and fewer neurons that do better classification by doing non linear separation tests?

Quick side note: besides classification, neural nets can help us with something called regression, which is where you fit a network to some analog data, instead of the more discrete problem of classification, which tells you what group something belongs to.

It turns out that the activation function of a neuron can have a pretty big influence on what sort of shapes are possible, which makes it so we aren’t strictly limited to shapes made up of planes and lines, and also means we aren’t necessarily leaving things on the table compared to SVM’s.

This all sort of gives me the feeling though that modern neural networks may not be the best possible algorithm for the types of things we use them for. I feel like we may need to generalize them beyond biological limitations to allow things like multiplications between weighted inputs instead of only sums. I feel like that sort of setup will be closer to whatever the real ideal “neural computation” model might be. However, the modern main stream neural models do have the benefit that they are evaluated very efficiently via dot products. They are particularly well suited for execution on GPUs which excel at performing homogenous operations on lots and lots of data. So, a more powerful and more general “neuron” may come at the cost of increased computational costs, which may make them less desirable in the end.

As a quick tangent, here is a paper from 2011 which shows a neural network model which does in fact allow for multiplication between neuron inputs. You then will likely be wanting exponents and other things, so while it’s a step in the right direction perhaps, it doesn’t yet seem to be the end all be all!
Sum-Product Networks: A New Deep Architecture

It’s also worth while to note that there are other flavors of neural networks, such as convolutional neural networks, which work quite a bit differently.

Let’s investigate this geometric interpretation of neurons as binary classifiers a bit, focusing on some different activation functions!

Step Activation Function

The Heaviside step function is very simple. If you give it a value greater than zero, it returns a 1, else it returns a 0. That makes our neuron just spit out binary: either a 0 or a 1. The output of a neuron using the step activation function is just the below:

Output = Dot(w,i)+b > 0

The output of a neuron using the step activation function is true if the input point is in the positive half space of the plane that this neuron describes, else it returns false.

Let’s think about this in 2d. Let’s make a neural network that takes x and y as input and spits out a value. We can make an image that visualizes the range from (-1,-1) to (1,1). Negative values can be shown in blue, zero in white, and positive values in orange.

To start out, we’ll make a 2d plane (aka a line) that runs vertically and passes through the origin. That means it is a 2d plane with a normal of (1,0) and a distance from the origin of 0. In other words, our network will have a single neuron that has weights of (1,0) and a bias of 0. This is what the visualization looks like:

You can actually play around with the experiments of this post and do your own using an interactive visualization I made for this post. Click here to see this first experiment: Experiment: Vertical Seperation

We can change the normal (weights) to change the angle of the line, and we can also change the bias to move the line to it’s relative left or right. Here is the same network that has it’s weights adjusted to (1,1) with a bias of 0.1.

Experiment: Diagonal Separation

The normal (1,1) isn’t normalized though, which makes it so the distance from origin (aka the bias) isn’t really 0.1 units. The distance from origin is actually divided by the length of the normal to get the REAL distance to origin, so in the above image, where the normal is a bit more than 1.0, the line is actually less than 0.1 units from the origin.

Below is the visualization if we normalize the weights to (0.707,0.707) but leave the bias at 0.1 units. The result is that the line is actually 0.1 units away from the origin.

Experiment: Normalized Diagonal Separation

Recalling the description of our visualization, white is where the network outputs zero, while orange is where the network outputs a positive number (which is 1 in this case).

If we define three lines such that their negative half spaces don’t completely overlap, we can get a triangle where the network outputs a zero, while everywhere else it outputs a positive value. We do this by having three sibling neurons in the first layer which define three separate lines, and then in the output neuron we give them all a weight of 1. This makes it so the area outside the triangle is always a positive value, which step turns into 1, but inside the triangle, the value remains at 0.


Experiment: Negative Space Triangle

We can turn this negative space triangle into a positive space triangle however by making the output neuron have a weight on the inputs of -1, and adding a bias of 0.1. This makes it so that pixels in the positive space of any of the lines will become a negative value. The negative space of those three lines get a small bias to make it be a positive value, resulting in the step function making the values be 0 outside of the triangle, and 1 inside the triangle. This gives us a positive space triangle:


Experiment: Positive Space Triangle

Taking this a bit further, we can make the first layer define 6 lines, which make up two different triangles – a bigger one and a smaller one. We can then have a second layer which has two neurons – one which makes a positive space larger triangle, and one which makes a positive space smaller triangle. Then, in the output neuron we can give the larger triangle neuron a weight of 1, while giving the smaller triangle neuron a weight of -1. The end result is that we have subtracted the smaller triangle from the larger one:


Experiment: Triangle Cutout

Using the step function we have so far been limited to line based shapes. This has been due to the fact that we can only test our inputs against lines. There is a way around this though: Pass non linear input into the network!

Below is a circle with radius 0.5. The neural network has only a single input which is sqrt(x*x+y*y). The output neuron has a bias of -0.5 and a weight of 1. It’s the bias value which controls the radius of the circle.

You could pass other non linear inputs into the network to get a whole host of other shapes. You could pass sin(x) as an input for example, or x squared.


Experiment: Circle

While the step function is inherently very much limited to linear tests, you can still do quite a lot of interesting non linear shapes (and data separations) by passing non linear input into the network.

Unfortunately though, you as a human would have to know the right non linear inputs to provide. The network can’t learn how to make optimal non linear separations when using the step function. That’s quite a limitation, but as I understand it, that’s how it works with support vector machines as well: a human can define non linear separations, but the human must decide the details of that separation.

BTW it seems like there could be a fun puzzle game here. Something like you have a fixed number of neurons that you can arrange in however many layers you want, and your goal is to separate blue data points from orange ones, or something like that. If you think that’d be a fun game, go make it with my blessing! I don’t have time to pursue it, so have at it (:

Identity and Relu Activation Functions

The identity activation function doesn’t do anything. It’s the same as if no activation function is used. I’ve heard that it can be useful in regression, but it can also be useful for our geometric interpretation.

Below is the same circle experiment as before, but using the identity activation function instead of the step activation function:


Experiment: Identity Circle

Remembering that orange is positive, blue is negative, and white is zero, you can see that outside the circle is orange (positive) and inside the circle is blue (negative), while the outline of the circle itself is white. We are looking at a signed distance field of the circle! Every point on this image is a scalar value that says how far inside or outside that point is from the surface of the shape.

Signed distance fields are a popular way of rendering vector graphics on the GPU. They are often approximated by storing the distance field in a texture and sampling that texture at runtime. Storing them in a texture only requires a single color channel for storage, and as you zoom in to the shape, they preserve their shape a lot better than regular images. You can read more about SDF textures on my post: Distance Field Textures.

Considering the machine learning perspective, having a signed distance field is also an interesting proposition. It would allow you to do classification of input, but also let you know how deeply that input point is classified within it’s group. This could be a confidence level maybe, or could be interpreted in some other way, but it gives a sort of analog value to classification, which definitely seems like it could come in handy sometimes.

If we take our negative space triangle example from the last section and change it from using step activation to identity activation, we find that our technique doesn’t generalize naively though, as we see below. (It doesn’t generalize for the positive space triangle either)


Experiment: Negative Space Triangle Identity

The reason it doesn’t generalize is that the negatives and positives of pixel distances to each of the lines cancel out. Consider a pixel on the edge of the triangle: you are going to have a near zero value for the edge it’s on, and two larger magnitude negative values from the other edges it is in the negative half spaces of. Adding those all together is going to be a negative value.

To help sort this out we can use an activation function called “relu”, which returns 0 if the value it’s given is less than zero, otherwise it returns the value. This means that all our negative values become 0 and don’t affect the distance summation. If we switch all the neurons to using relu activation, we get this:


Experiment: Negative Space Triangle Relu

If you squint a bit, you can see a triangle in the white. If you open the experiment and turn on “discrete output” to force 0 to orange you get a nice image that shows you that the triangle is in fact still there.

Our result with relu is better than identity, but there are two problems with our resulting distance field.

Firstly it isn’t a signed distance field – there is no blue as you might notice. It only gives positive distances, for pixels that are outside the shape. This isn’t actually that big of an issue from a rendering perspective, as unsigned distance fields are still useful. It also doesn’t seem that big of an issue from a machine learning perspective, as it still gives some information about how deeply something is classified, even though it is only from one direction.

I think with some clever operations, you could probably create the internal negative signed distance using different operations, and then could compose it with the external positive distance in the output neuron by adding them together.

The second problem is a bigger deal though: The distance field is no longer accurate!

By summing the distance values, the distance is incorrect for points where the closest feature of the triangle is a vertex, because multiple lines are contributing their distance to the final value.

I can’t think of any good ways to correct that problem in the context of a neural network, but the distance is an approximation, and is correct for the edges, and also gets more correct the closer you get to the object, so this is still useful for rendering, and probably still useful for machine learning despite it being an imperfect measurement.

Sigmoid and Hyperbolic Tangent Activation Function

The sigmoid function is basically a softer version of the step function and gives output between 0 and 1. The hyperbolic tangent activation function is also a softer version of the step function but gives output between -1 and 1.

Sigmoid:

Hyperbolic Tangent:

(images from Wolfram Mathworld)

They have different uses in machine learning, but I’ve found them to be visibly indistinguishable in my experiments after compensating for the different range of output values. It makes me think that smoothstep could probably be a decent activation function too, so long as your input was in the 0 to 1 range (maybe you could clamp input to 0 and 1?).

These activation functions let you get some non linearity in your neural network in a way that the learning algorithms can adjust, which is pretty neat. That puts us more into the realm where a neural net can find a good non linear separation for learned data. For the geometric perspective, this also lets us make more interesting non linear shapes.

Unfortunately, I haven’t been able to get a good understanding of how to use these functions to craft desired results.

It seems like if you add even numbers of hyperbolic tangents together in a neural network that you end up getting a lot of white, like below:


Experiment: tanh1

However, if you add an odd number of them together, it starts to look a bit more interesting, like this:


Experiment: tanh2

Other than that, it’s been difficult seeing a pattern that I can use to craft things. The two examples above were made by modifying the negative space triangle to use tanh instead of step.

Closing

We’ve wandered a bit in the idea of interpreting neural networks geometrically but I feel like we’ve only barely scratched the surface. This also hasn’t been a very rigorous exploration, but more has just been about exploring the problem space to get a feeling for what might be possible.

It would be interesting to look more deeply into some of these areas, particularly for the case of distance field generation of shapes, or combining activation functions to get more diverse results.

While stumbling around, it seems like we may have gained some intuition about how neural networks work as well.

It feels like whenever you add a layer, you are adding the ability for a “logical operation” to happen within the network.

For instance, in the triangle cutout experiment, the first layer after the inputs defines the 6 individual lines of the two triangles and classifies input accordingly. The next layer combines those values into the two different triangle shapes. The layer after that converts them from negative space triangles to positive space triangles. Lastly, the output layer subtracts the smaller triangle’s values from the larger triangle’s values to make the final triangle outline shape.

Each layer has a logical operation it performs, which is based on the steps previous to it.

Another piece of intuition I’ve found is that it seems like adding more neurons to a layer allows it to do more work in parallel.

For instance, in the triangle cutout experiment, we created those two triangles in parallel, reserving some neurons in each layer for each triangle. It was only in the final step that we combined the values into a single output.

Since neurons can only access data from the previous network layer, it seems as though adding more neurons to layers can help push data forward from previous layers, to later layers that might need the data. But, it seems like it is most efficient to process input data as early as possible, so that you don’t have to shuttle it forward and waste layers / neurons / memory and computing power.

Here is some info on other activation functions:
Wikipedia:Activation Function

Here’s a link that talks about how perceptrons (step activated neural networks) relate to SVMs:
Hyperplane based Classification: Perceptron and (Intro to) Support Vector Machines

By the way, did I mention you can visualize neural networks in three dimensions as well?

Experiment: 3d Visualization

Here are the two visualizers of neural networks I made for this post using WebGL2:
Neural Network Visualization 2D
Neural Network Visualization 3D

If you play around with this stuff and find anything interesting, please share!

My Old Master: How to Correct as a Mentor or a Teacher

Preface: I studied martial arts for a little over a decade (shaolin kempo at USSD) and learned a lot while I was there. Our teacher was a great guy who genuinely cared about his students, and in particular, taught my friends and I some really interesting things when we became instructors. I’d like to share some of that information with you in the “My Old Master” posts category. As cliched as it is, many things he taught us apply to all aspects of life, not just martial arts and I’d like other people to benefit from them.

My old master used to say…

“Praise, Correct, Praise. Even if you have to make something up, you need to say something positive.”

Let’s say that I’m teaching you how to punch and you aren’t quite doing it right.

Here are two things I might say to try and correct it:

  1. “Keep your wrist straight when you are punching so you don’t hurt your hand” or…
  2. “I really love how you are keeping your left hand up while you punch with your right. It’s doing a great job of protecting your head against your opponent hitting you back. Now try keeping your wrist straight when you punch so that you don’t hurt your hand.” Then I watch you try again and I say “Great, just like that, keep it up!”

Think about how those two responses make you feel for a second.

The first one likely makes you feel like you are messing up and need to fix it (a negative thing), while the second makes you feel like you were doing well and are now are doing even better.

What’s the difference? Well, like the opening quote says, I praised, corrected, and then praised. First I found something you were doing well, complimented you on it, gave a suggestion for improvement, and then praised you on doing (or attempting!) the correction.

This can be a great way to give people feedback, in a way that makes them feel better about themselves, and feel better about the feedback you are giving them. Instead of being a negative thing, it becomes a positive thing.

Pretty simple stuff, and if you practice this technique it starts to become second nature.

The quote says that if you can’t find anything positive to say, you should make it up. It shouldn’t be your first choice to make something up, because the more genuine you are about the praise, the better it will be. However, if you really can’t find anything nice to say, yes, you should make something up.

A person’s ego and self worth is a measurable quantity that is increased with praise and decreased with corrections or negative feedback (aka “you suck!”). When this tank of self worth gets too low, your student or mentee will feel worthless, get frustrated and/or start to get resentful at you.

This technique is useful because it allows you to give a correction while minimizing hit to the person’s self worth. In the end allows you to give MORE correction and help them more in the long run, just by phrasing your corrections differently. Another term for this is “complement sandwich” which you may have heard of before.

Another thing to be mindful of however is that you can only give so much feedback at any one time. The ego/self worth tank needs to refill after it’s diminished, and frankly, the person needs to absorb and internalize what you’ve taught them before they are ready for more.

Our teacher would say “it’s better having a mediocre black belt, than having a stellar white belt who quits out of frustration” and that’s very true. It’s better for them since a mediocre black belt is FAR SUPERIOR to a stellar white belt and much better able to protect themselves and their loved ones, but also better for the organization, since we are often teaching or mentoring in a “for profit” situation where the person we are trying to help is either a customer or a co-worker which the company is interested in keeping around.

Before wrapping it up, I heard a funny story regarding this topic about a special needs child and his or her parents. Just like everyone else, this child has a concept of self worth, however being disabled makes it very easy to feel depressed when you realize there are so many things you can’t do that other people can do. It’s difficult too for the parents to help the child feel better about themselves if they really can’t find an area he shines in. One day the parents noticed he loved to use tools and it clicked. They started loosening screws in the house and asking him to tighten them for him. “Jimmy, can you come tighten this screw up for us? You are so good at it!”.

I think that’s a cute story but really shows how we work as humans. Your job as (an effective) teacher, mentor, parent, boss or leader, is to teach whatever you need to teach, correct whatever you need to correct, but also to make sure you do so in a way that is least damaging to the person’s ego and self worth. They feel better about themselves, but you are also more effective at getting the job done. It matters!

So go out there and serve some compliment sandwiches, making sure to be as genuine as possible with your praise!

P.S. Yes people can have over inflated egos and feeding them more is only going to make things worse. That’s the topic of another post (;