Orthogonal Projection Matrix Plainly Explained

“Scratch a Pixel” has a really nice explanation of perspective and orthogonal projection matrices.

It inspired me to make a very simple / plain explanation of orthogonal projection matrices that hopefully will help them be less opaque for folks and more intuitive.

Original article: Scratch A Pixel: The Perspective and Orthographic Projection Matrix

Let’s Get To It!

The whole purpose of an orthogonal matrix is to take x,y and z as input and output x,y and z such that valid points on the screen will have x,y,z values between -1 and 1.

If we transform a point and get an x,y or z that is outside of that range, we know the point is outside of the screen either because it’s too far left, right, up or down, or because it’s too close or too far on the z axis.

Let’s think about how we’d do this, thinking only about the x coordinate for now.

To map some range of x values from -1 to 1, we’ll need to decide on what x value maps to -1 and what x value maps to 1. We’ll call these “left” and “right”.

Given a left and right value, and an x value we want to map to the range, perhaps the most straight forward way to do it would be this:

XOut = \frac{X-Left}{Right-Left} * 2 - 1

The division calculates the percentage of how far X is between left and right. Multiplying that by 2 and subtracting 1 changes it so instead of valid points being from 0 to 1 (aka from 0% to 100%), they are instead between -1 and 1.

Let’s change this formula so that there is one term that is multiplied by X and another term that has everything else. (Wondering why? It’s because I’m cheating and know the final form. Don’t feel bad if it isn’t intuitive why we’d do this!)

\frac{X-Left}{Right-Left} * 2 - 1 =\\ \\ \frac{2(X-Left)}{Right-Left} - 1 =\\ \\ \frac{2X-2*Left}{Right-Left} - 1 =\\ \\ \frac{2X-2*Left}{Right-Left} - \frac{Right-Left}{Right-Left} =\\ \\ \frac{2X-2*Left-(Right-Left)}{Right-Left} =\\ \\ \frac{2X-2*Left-Right+Left}{Right-Left} =\\ \\ \frac{2X-Left-Right}{Right-Left} =\\ \\ \frac{2X}{Right-Left} - \frac{Right+Left}{Right-Left} =\\ \\ \frac{2}{Right-Left}X - \frac{Right+Left}{Right-Left}\\

Setting up the formula this way allows us to transform the x component of an (x,y,z,1) point using a dot product:

(x,y,z,1) \cdot (\frac{2}{Right-Left},0,0,-\frac{Right+Left}{Right-Left}) = \frac{2}{Right-Left}X - \frac{Right+Left}{Right-Left}

A dot product is what happens during matrix multiplication, so if we put this into a 4×4 matrix, we get the same result. Let’s check that out.

We start with an identity matrix. If we use it to transform an (x,y,z,1) point, we get the same point as output aka nothing happens.

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

Now let’s put the x transform we came up with into the matrix:

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -\frac{Right+Left}{Right-Left} & 0 & 0 & 1 \\ \end{bmatrix}

If we use that matrix to transform an (x,y,z,1) point, it will transform our x component as we described (valid ranges of x that are between left and right will be between -1 and 1), while leaving the other components of the point alone.

As you might imagine, it’s pretty simple to get our formulas for y and z as well. Starting with the x formula, we can just change x with y and z, and right/left with top/bottom and far/near.

XOut = \frac{2}{Right-Left}X - \frac{Right+Left}{Right-Left} \\ \\ YOut = \frac{2}{Top-Bottom}Y - \frac{Top+Bottom}{Top-Bottom} \\ \\ ZOut = \frac{2}{Far-Near}Z - \frac{Far+Near}{Far-Near}

We can put those into our matrix to get a full orthographic projection matrix.

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & \frac{2}{Far-Near} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}

There we go, that’s all there is to making an orthographic projection matrix. It’s whole purpose is to convert x,y,z values to be between -1 and 1 so that the GPU knows whether points are inside our outside the screen – and thus whether they need to be clipped or not.

Variations

While the projection matrix we made is a valid orthographic projection matrix in OpenGL, we actually need a tweak for it to be valid for DirectX. The reason for this is because while in OpenGL the clip space for z is between -1 and 1, it’s actually between 0 and 1 for DirectX!

If you leave off the *2-1 for the z formula, but leave it for x and y, you’ll end up with a matrix like this one:

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & \frac{1}{Near-Far} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Near}{Near-Far} & 1 \\ \end{bmatrix}

Another variation you’ll see is a version where the camera is centered on the origin for the x and y axis. In other words, left = -right, and top = -bottom. When that is true, right+left and top+bottom become zero which simplifies the matrix to this:

\begin{bmatrix} \frac{2}{Width} & 0 & 0 & 0 \\ 0 & \frac{2}{Height} & 0 & 0 \\ 0 & 0 & \frac{2}{Far-Near} & 0 \\ 0 & 0 & -\frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}

Another variation you’ll see is that the matrix is transposed. You’ll see this when switching between pre and post multiplication, or when switching from column major matrices to row matrices. Either is valid and it’s basically just a notation and convention thing. Here is the origional matrix we made transposed.

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & -\frac{Right+Left}{Right-Left}  \\ 0 & \frac{2}{Top-Bottom} & 0 & -\frac{Top+Bottom}{Top-Bottom} \\ 0 & 0 & \frac{2}{Far-Near} & -\frac{Far+Near}{Far-Near} \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

Lastly, the above matrices were all for a “left handed” system. That means that it assumes the positive x axis goes to the right, the positive y axis goes up, and the positive z goes into your screen (aka, the camera is looking down the positive z axis). Positive Z values will map to the valid -1 to 1 range, while negative z values will be outside the valid range.

A variation on the orthographic projection matrix we made that you’ll see is the matrix being a “right handed” matrix which is the same as the left handed matrix, except that the positive z axis goes out from your screen (aka the camera is looking down the negative z axis). Negative Z values will map to the valid -1 to 1 range, while positive z values will be outside the valid range.

To switch the handedness of the matrix, you just flip the sign of the element at (3,3), so here is our original orthographic projection matrix, but converted to right handed instead of left handed.

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & -\frac{2}{Far-Near} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}

You may also just see the denominator changed from Far-Near to Near-Far which has the same effect, and would give you something like this:

\begin{bmatrix} \frac{2}{Right-Left} & 0 & 0 & 0 \\ 0 & \frac{2}{Top-Bottom} & 0 & 0 \\ 0 & 0 & \frac{2}{Near-Far} & 0 \\ -\frac{Right+Left}{Right-Left} & - \frac{Top+Bottom}{Top-Bottom} & - \frac{Far+Near}{Far-Near} & 1 \\ \end{bmatrix}

Fun trivia: the term “sinister” comes from latin, meaning “left handed”. So, when talking to someone about their graphics engine, you can ask them whether or not they use sinister projection 😛

Links

Scratch A Pixel: The Perspective and Orthographic Projection Matrix

D3DXMatrixOrthoRH (DirectX) – shows the resulting matrix. Also links to left handed and off center variants.

glOrtho (OpenGL) – shows resulting matrix.

Plastic Bag Ban – Semi Reusable Bag Kiosks a Better Solution?

I an in favor of people generating less trash, and have been amazed that where I live (southern California), people have taken a “plastic bag ban” so well in stride. It felt like one of those things where we couldn’t live without it, but it turns out we can quite easily. Maybe there’s a lesson there, but that’s not the point of this post so I’ll get back to it (;

Where I live, you can BUY plastic bags for use from the grocery store for about 10 cents each, which is cost prohibitive enough that people tend to bring their own bags.

Now let me tell you, I’m no dummy. I am pretty sure those re-usable plastic bags everyone has gotten use far more resources to make and create more pollution in the process so seem like they will take some usage before they hit a break even point compared to the disposable bags. Google seems to give mixed results, saying that some bags are better than others: Google: are reusable bags better for the environment.

I think that the sentiment of what’s going on is good for sure though, and I’m hoping it’s a net positive (?), and I think that there is some winning to be had here environmentally – if done right.

However, let’s put the environmental concerns on the back burner for a second.

Grocery store baggers now get handed all sorts of differently shaped bags of various capacity, and folks often want their stuff bagged specific ways to make sure it all fits in however many they brought. It’s also very common for people to forget their bags at home, leave them in the car, etc. This makes things a bit awkward and definitely not as fast and smooth as it used to be with disposable bags.

My idea to address these issues is this:

  • When you check out from the store, there are semi-reusable bags in bails similar to how disposable bags are in bails now. You pay to use the bags, but the cost is mostly a refundable deposit.
  • You take the bags home, unload, etc like normal, but ideally you do not throw the bags away.
  • The next time you come to the store, you feed your empty bags into a kiosk at the front of the store (similar to the coin to cash machines many currently have) and it prints you out a voucher saying how many bags you returned.
  • The machine bails the bags back up for use by the store employees at the register, or perhaps some off site service does this and washes the bags, gets rid of damaged ones etc.
  • You shop as normal, but when you check out, you give them your bag voucher. If the voucher is for fewer bags than you need for this trip, you can pay for some extra ones. If your voucher is for more than you need this time, you get the deposit refunded on the ones you don’t need.

The idea here is that at the end of the day…

  • The baggers have a uniform type of bag to work with, which makes their job easier and allows them to bag more quickly (like the old days!).
  • The bags are ideally as environmentally friendly made as possible (heck, they could be made from corn, hemp, burlap, whatever).
  • It isn’t a big deal if you forgot your bags at home or in the car – you pay a little extra to use new bags, but when you return them, you get most of that back.

There are some obvious issues to work through, including:

  • Getting those kiosks into the stores, and ideally having all stores use compatible bags.
  • Dealing with damaged or dirty bags. It would be nice if the machine or whatever process bails them is able to detect and segregate them somehow, and ideally have some washing / minimal recycling process off site to make new ones (like pulp and re-form?)
  • The plastic bags are a profit center for grocery stores now. They would need sufficient incentive, or pressure from customers to make it happen.

So, that’s my idea. Environmentally friendly semi-reusable bags – but with the convenience we all have come to enjoy from disposable bags. The best of both worlds.

As a video game programmer this is far outside of my interest and ability, so please take this business idea if you have the desire and the means. Let’s make it happen!

Neural Network Recipe: Recognize Handwritten Digits With 95% Accuracy

This post is a recipe for making a neural network which is able to recognize hand written numeric digits (0-9) with 95% accuracy.

The intent is that you can use this recipe (and included simple C++ code, and interactive web demo!) as a starting point for some hands on experimentation.

A recent post of mine talks about all the things used in this recipe so give it a read if you want more info about anything: How to Train Neural Networks With Backpropagation.

This recipe is also taken straight from this amazing website (but coded from scratch in C++ by myself), where it’s implemented in python: Using neural nets to recognize handwritten digits.

Recipe

The neural network takes as input 28×28 greyscale images, so there will be 784 input neurons.

There is one hidden layer that has 30 neurons.

The final layer is the output layer which has 10 neurons.

The output neuron with the highest activation is the digit that was recognized. For instance if output neuron 0 had the highest activation, the network detected a 0. If output neuron 2 was highest, the network detected a 2.

The neurons use the sigmoid activation function, and the cost function used is half mean squared error.

Training uses a learning rate of 3.0 and the training data is processed by the network 30 times (aka 30 training epochs), using a minibatch size of 10.

A minibatch size of 10 just means that we calculate the gradient for 10 training samples at a time and adjust the weights and biases using that gradient. We do that for the entire (shuffled) 60,000 training items and call that a single epoch. 30 epochs mean we do this full process 30 times.

There are 60,000 items in the training data, mapping 28×28 greyscale images to what digit 0-9 they actually represent.

Besides the 60,000 training data items, there are also 10,000 separate items that are the test data. These test data items are items never seen by the network during training and are just used as a way to see how well the network has learned about the problem in general, versus learning about the specific training data items.

The test and training data is the MNIST data set. I have a link to zip file I made with the data in it below, but this is where I got the data from: The MNIST database of handwritten digits.

That is the entire recipe!

Results

The 30 training epochs took 1 minute 22 seconds on my machine in release x64 (with REPORT_ERROR_WHILE_TRAINING() set to 0 to speed things up), but the code could be made to run faster by using SIMD, putting it on the GPU, getting multithreading involved or other things.

Below is a graph of the accuracy during the training epochs.

Notice that most learning happened very early on and then only slowly improved from there. This is due to our neuron activation functions and also our cost function. There are better choices for both, but this is also an ongoing area of research to improve in neural networks.

The end result of my training run is 95.32% accuracy but you may get slightly higher or lower due to random initialization of weights and biases. That sounds pretty high, but if you were actually using this, 4 or 5 numbers wrong out of 100 is a pretty big deal! The record for MNIST is 99.77% accuracy using “a committee of convolutional networks” where they distorted the input data during training to make it learn in a more generalized way (described as “committee of 35 conv. net, 1-20-P-40-P-150-10 [elastic distortions]”).

A better cost function would probably be the cross entropy cost function, a better activation function than sigmoid would probably be an ELU (Exponential Linear Unit). A soft max layer could be used instead of just taking the maximum output neuron as the answer. The weights could be initialized to smarter values. We could also use a convolutional layer to help let the network learn features in a way that didn’t also tie the features to specific locations in the images.

Many of these things are described in good detail at http://neuralnetworksanddeeplearning.com/, particularly in chapter 3 where they make a python implementation of a convolutional neural network which performs better than this one. I highly recommend checking that website out!

HTML5 Demo

You can play with a network created with this recipe here: Recognize Handwritten Digit 95% Accuracy

Here is an example of it correctly detecting that I drew a 4.

The demo works “pretty well” but it does have a little less than 95% accuracy.

The reason for this though is that it isn’t comparing apples to apples.

A handwritten digit isn’t quite the same as a digit drawn with a mouse. Check out the image below to see 100 of the training images and see what i mean.

The demo finds the bounding box of the drawn image and rescales that bounding box to a 20×20 image, preserving the aspect ratio. It then puts that into a 28×28 image, using the center of mass of the pixels to center the smaller image in the larger one. This is how the MNIST data was generated, so makes the demo more accurate, but it also has the nice side effect of making it so you can draw a number of any size, in any part of the box, and it will treat it the same as if you drew it at a difference size, or in a different part of the box.

The code that goes with this post outputs the weights, biases and network structure in a json format that is very easy to drop into the html5 demo. This way, if you want to try different things in the network, it should be fairly low effort to adjust the demo to try your adjustments there as well.

Lastly, it might be interesting to get the derivatives of the inputs and play around with the input you gave it. Some experiments I can think of:

  1. When it misclassifies what number you drew, have it adjust what you drew (the input) to be more like what the network would expect to see for that digit. This could help show why it misclassified your number.
  2. Start with a well classified number and make it morph into something recognized by the network as a different number.
  3. Start with a random static (noise) image and adjust it until the network recognizes it as a digit. It would be interesting to see if it looked anything like a number, or if it was still just static.

Source Code

The source code and mnist data is on github at MNIST1, but is also included below for your convenience.

If grabbing the source code from below instead of github, you will need to extract this zip file into the working directory of the program as well. It contains the test data used for training the network.
mnist.zip

Note: it’s been reported on that some compilers/OSs, std::random_device will use dev/urandom which can result in lower quality random numbers, which can result in the network not learning as quickly or to as high of accuracy. You can force it to use dev/random instead which will fix that problem. Details can be found here: http://en.cppreference.com/w/cpp/numeric/random/random_device/random_device

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <random>
#include <array>
#include <vector>
#include <algorithm>
#include <chrono>

typedef uint32_t uint32;
typedef uint16_t uint16;
typedef uint8_t uint8;

// Set to 1 to have it show error after each training and also writes it to an Error.csv file.
// Slows down the process a bit (+~50% time on my machine)
#define REPORT_ERROR_WHILE_TRAINING() 1 

const size_t c_numInputNeurons = 784;
const size_t c_numHiddenNeurons = 30;  // NOTE: setting this to 100 hidden neurons can give better results, but also can be worse other times.
const size_t c_numOutputNeurons = 10;

const size_t c_trainingEpochs = 30;
const size_t c_miniBatchSize = 10;
const float c_learningRate = 3.0f;

// ============================================================================================
//                                     SBlockTimer
// ============================================================================================
// times a block of code
struct SBlockTimer
{
	SBlockTimer (const char* label)
	{
		m_start = std::chrono::high_resolution_clock::now();
		m_label = label;
	}

	~SBlockTimer ()
	{
		std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
		printf("%s%0.2f seconds\n", m_label, seconds.count());
	}

	std::chrono::high_resolution_clock::time_point m_start;
	const char* m_label;
};

// ============================================================================================
//                                    MNIST DATA LOADER
// ============================================================================================

inline uint32 EndianSwap (uint32 a)
{
	return (a<<24) | ((a<<8) & 0x00ff0000) |
           ((a>>8) & 0x0000ff00) | (a>>24);
}

// MNIST data and file format description is from http://yann.lecun.com/exdb/mnist/
class CMNISTData
{
public:
	CMNISTData ()
	{
		m_labelData = nullptr;
		m_imageData = nullptr;

		m_imageCount = 0;
		m_labels = nullptr;
		m_pixels = nullptr;
	}

	bool Load (bool training)
	{
		// set the expected image count
		m_imageCount = training ? 60000 : 10000;

		// read labels
		const char* labelsFileName = training ? "train-labels.idx1-ubyte" : "t10k-labels.idx1-ubyte";
		FILE* file = fopen(labelsFileName,"rb");
		if (!file)
		{
			printf("could not open %s for reading.\n", labelsFileName);
			return false;
		}
		fseek(file, 0, SEEK_END);
		long fileSize = ftell(file);
		fseek(file, 0, SEEK_SET);
		m_labelData = new uint8[fileSize];
		fread(m_labelData, fileSize, 1, file);
		fclose(file);

		// read images
		const char* imagesFileName = training ? "train-images.idx3-ubyte" : "t10k-images.idx3-ubyte";
		file = fopen(imagesFileName, "rb");
		if (!file)
		{
			printf("could not open %s for reading.\n", imagesFileName);
			return false;
		}
		fseek(file, 0, SEEK_END);
		fileSize = ftell(file);
		fseek(file, 0, SEEK_SET);
		m_imageData = new uint8[fileSize];
		fread(m_imageData, fileSize, 1, file);
		fclose(file);

		// endian swap label file if needed, just first two uint32's.  The rest is uint8's.
		uint32* data = (uint32*)m_labelData;
		if (data[0] == 0x01080000)
		{
			data[0] = EndianSwap(data[0]);
			data[1] = EndianSwap(data[1]);
		}

		// verify that the label file has the right header
		if (data[0] != 2049 || data[1] != m_imageCount)
		{
			printf("Label data had unexpected header values.\n");
			return false;
		}
		m_labels = (uint8*)&(data[2]);

		// endian swap the image file if needed, just first 4 uint32's. The rest is uint8's.
		data = (uint32*)m_imageData;
		if (data[0] == 0x03080000)
		{
			data[0] = EndianSwap(data[0]);
			data[1] = EndianSwap(data[1]);
			data[2] = EndianSwap(data[2]);
			data[3] = EndianSwap(data[3]);
		}

		// verify that the image file has the right header
		if (data[0] != 2051 || data[1] != m_imageCount || data[2] != 28 || data[3] != 28)
		{
			printf("Label data had unexpected header values.\n");
			return false;
		}
		m_pixels = (uint8*)&(data[4]);

		// convert the pixels from uint8 to float
		m_pixelsFloat.resize(28 * 28 * m_imageCount);
		for (size_t i = 0; i < 28 * 28 * m_imageCount; ++i)
			m_pixelsFloat[i] = float(m_pixels[i]) / 255.0f;

		// success!
		return true;
	}

	~CMNISTData ()
	{
		delete[] m_labelData;
		delete[] m_imageData;
	}

	size_t NumImages () const { return m_imageCount; }

	const float* GetImage (size_t index, uint8& label) const
	{
		label = m_labels[index];
		return &m_pixelsFloat[index * 28 * 28];
	}

private:
	void* m_labelData;
	void* m_imageData;

	size_t m_imageCount;
	uint8* m_labels;
	uint8* m_pixels;

	std::vector<float> m_pixelsFloat;
};

// ============================================================================================
//                                    NEURAL NETWORK
// ============================================================================================

template <size_t INPUTS, size_t HIDDEN_NEURONS, size_t OUTPUT_NEURONS>
class CNeuralNetwork
{
public:
	CNeuralNetwork ()
	{
		// initialize weights and biases to a gaussian distribution random number with mean 0, stddev 1.0
		std::random_device rd;
		std::mt19937 e2(rd());
		std::normal_distribution<float> dist(0, 1);

		for (float& f : m_hiddenLayerBiases)
			f = dist(e2);

		for (float& f : m_outputLayerBiases)
			f = dist(e2);

		for (float& f : m_hiddenLayerWeights)
			f = dist(e2);

		for (float& f : m_outputLayerWeights)
			f = dist(e2);
	}

	void Train (const CMNISTData& trainingData, size_t miniBatchSize, float learningRate)
	{
		// shuffle the order of the training data for our mini batches
		if (m_trainingOrder.size() != trainingData.NumImages())
		{
			m_trainingOrder.resize(trainingData.NumImages());
			size_t index = 0;
			for (size_t& v : m_trainingOrder)
			{
				v = index;
				++index;
			}
		}
		static std::random_device rd;
		static std::mt19937 e2(rd());
		std::shuffle(m_trainingOrder.begin(), m_trainingOrder.end(), e2);

		// process all minibatches until we are out of training examples
		size_t trainingIndex = 0;
		while (trainingIndex < trainingData.NumImages())
		{
			// Clear out minibatch derivatives.  We sum them up and then divide at the end of the minimatch
			std::fill(m_miniBatchHiddenLayerBiasesDeltaCost.begin(), m_miniBatchHiddenLayerBiasesDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchOutputLayerBiasesDeltaCost.begin(), m_miniBatchOutputLayerBiasesDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchHiddenLayerWeightsDeltaCost.begin(), m_miniBatchHiddenLayerWeightsDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchOutputLayerWeightsDeltaCost.begin(), m_miniBatchOutputLayerWeightsDeltaCost.end(), 0.0f);

			// process the minibatch
			size_t miniBatchIndex = 0;
			while (miniBatchIndex < miniBatchSize && trainingIndex < trainingData.NumImages())
			{
				// get the training item
				uint8 imageLabel = 0;
				const float* pixels = trainingData.GetImage(m_trainingOrder[trainingIndex], imageLabel);

				// run the forward pass of the network
				uint8 labelDetected = ForwardPass(pixels, imageLabel);

				// run the backward pass to get derivatives of the cost function
				BackwardPass(pixels, imageLabel);

				// add the current derivatives into the minibatch derivative arrays so we can average them at the end of the minibatch via division.
				for (size_t i = 0; i < m_hiddenLayerBiasesDeltaCost.size(); ++i)
					m_miniBatchHiddenLayerBiasesDeltaCost[i] += m_hiddenLayerBiasesDeltaCost[i];
				for (size_t i = 0; i < m_outputLayerBiasesDeltaCost.size(); ++i)
					m_miniBatchOutputLayerBiasesDeltaCost[i] += m_outputLayerBiasesDeltaCost[i];
				for (size_t i = 0; i < m_hiddenLayerWeightsDeltaCost.size(); ++i)
					m_miniBatchHiddenLayerWeightsDeltaCost[i] += m_hiddenLayerWeightsDeltaCost[i];
				for (size_t i = 0; i < m_outputLayerWeightsDeltaCost.size(); ++i)
					m_miniBatchOutputLayerWeightsDeltaCost[i] += m_outputLayerWeightsDeltaCost[i];

				// note that we've added another item to the minibatch, and that we've consumed another training example
				++trainingIndex;
				++miniBatchIndex;
			}

			// divide minibatch derivatives by how many items were in the minibatch, to get the average of the derivatives.
			// NOTE: instead of doing this explicitly like in the commented code below, we'll do it implicitly
			// by dividing the learning rate by miniBatchIndex.
			/*
			for (float& f : m_miniBatchHiddenLayerBiasesDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchOutputLayerBiasesDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchHiddenLayerWeightsDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchOutputLayerWeightsDeltaCost)
				f /= float(miniBatchIndex);
			*/

			float miniBatchLearningRate = learningRate / float(miniBatchIndex);

			// apply training to biases and weights
			for (size_t i = 0; i < m_hiddenLayerBiases.size(); ++i)
				m_hiddenLayerBiases[i] -= m_miniBatchHiddenLayerBiasesDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_outputLayerBiases.size(); ++i)
				m_outputLayerBiases[i] -= m_miniBatchOutputLayerBiasesDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_hiddenLayerWeights.size(); ++i)
				m_hiddenLayerWeights[i] -= m_miniBatchHiddenLayerWeightsDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_outputLayerWeights.size(); ++i)
				m_outputLayerWeights[i] -= m_miniBatchOutputLayerWeightsDeltaCost[i] * miniBatchLearningRate;
		}
	}

	// This function evaluates the network for the given input pixels and returns the label it thinks it is from 0-9
	uint8 ForwardPass (const float* pixels, uint8 correctLabel)
	{
		// first do hidden layer
		for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
		{
			float Z = m_hiddenLayerBiases[neuronIndex];

			for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
				Z += pixels[inputIndex] * m_hiddenLayerWeights[HiddenLayerWeightIndex(inputIndex, neuronIndex)];

			m_hiddenLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
		}

		// then do output layer
		for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			float Z = m_outputLayerBiases[neuronIndex];

			for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
				Z += m_hiddenLayerOutputs[inputIndex] * m_outputLayerWeights[OutputLayerWeightIndex(inputIndex, neuronIndex)];

			m_outputLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
		}

		// calculate error.
		// this is the magnitude of the vector that is Desired - Actual.
		// Commenting out because it's not needed.
		/*
		{
			error = 0.0f;
			for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
			{
				float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;
				float diff = (desiredOutput - m_outputLayerOutputs[neuronIndex]);
				error += diff * diff;
			}
			error = std::sqrt(error);
		}
		*/

		// find the maximum value of the output layer and return that index as the label
		float maxOutput = m_outputLayerOutputs[0];
		uint8 maxLabel = 0;
		for (uint8 neuronIndex = 1; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			if (m_outputLayerOutputs[neuronIndex] > maxOutput)
			{
				maxOutput = m_outputLayerOutputs[neuronIndex];
				maxLabel = neuronIndex;
			}
		}
		return maxLabel;
	}

	// Functions to get weights/bias values. Used to make the JSON file.
	const std::array<float, HIDDEN_NEURONS>& GetHiddenLayerBiases () const { return m_hiddenLayerBiases; }
	const std::array<float, OUTPUT_NEURONS>& GetOutputLayerBiases () const { return m_outputLayerBiases; }
	const std::array<float, INPUTS * HIDDEN_NEURONS>& GetHiddenLayerWeights () const { return m_hiddenLayerWeights; }
	const std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>& GetOutputLayerWeights () const { return m_outputLayerWeights; }

private:

	static size_t HiddenLayerWeightIndex (size_t inputIndex, size_t hiddenLayerNeuronIndex)
	{
		return hiddenLayerNeuronIndex * INPUTS + inputIndex;
	}

	static size_t OutputLayerWeightIndex (size_t hiddenLayerNeuronIndex, size_t outputLayerNeuronIndex)
	{
		return outputLayerNeuronIndex * HIDDEN_NEURONS + hiddenLayerNeuronIndex;
	}

	// this function uses the neuron output values from the forward pass to backpropagate the error
	// of the network to calculate the gradient needed for training.  It figures out what the error
	// is by comparing the label it came up with to the label it should have come up with (correctLabel).
	void BackwardPass (const float* pixels, uint8 correctLabel)
	{
		// since we are going backwards, do the output layer first
		for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			// calculate deltaCost/deltaBias for each output neuron.
			// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
			//
			// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
			//
			// deltaCost/deltaO = O - desiredOutput
			// deltaO/deltaZ = O * (1 - O)
			//
			float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;

			float deltaCost_deltaO = m_outputLayerOutputs[neuronIndex] - desiredOutput;
			float deltaO_deltaZ = m_outputLayerOutputs[neuronIndex] * (1.0f - m_outputLayerOutputs[neuronIndex]);

			m_outputLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

			// calculate deltaCost/deltaWeight for each weight going into the neuron
			//
			// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
			// deltaCost/deltaWeight = deltaCost/deltaBias * input
			//
			for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
				m_outputLayerWeightsDeltaCost[OutputLayerWeightIndex(inputIndex, neuronIndex)] = m_outputLayerBiasesDeltaCost[neuronIndex] * m_hiddenLayerOutputs[inputIndex];
		}

		// then do the hidden layer
		for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
		{
			// calculate deltaCost/deltaBias for each hidden neuron.
			// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
			//
			// deltaCost/deltaO =
			//   Sum for each output of this neuron:
			//     deltaCost/deltaDestinationZ * deltaDestinationZ/deltaSourceO
			//
			// deltaCost/deltaDestinationZ is already calculated and lives in m_outputLayerBiasesDeltaCost[destinationNeuronIndex].
			// deltaTargetZ/deltaSourceO is the value of the weight connecting the source and target neuron.
			//
			// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
			// deltaO/deltaZ = O * (1 - O)
			//
			float deltaCost_deltaO = 0.0f;
			for (size_t destinationNeuronIndex = 0; destinationNeuronIndex < OUTPUT_NEURONS; ++destinationNeuronIndex)
				deltaCost_deltaO += m_outputLayerBiasesDeltaCost[destinationNeuronIndex] * m_outputLayerWeights[OutputLayerWeightIndex(neuronIndex, destinationNeuronIndex)];
			float deltaO_deltaZ = m_hiddenLayerOutputs[neuronIndex] * (1.0f - m_hiddenLayerOutputs[neuronIndex]);
			m_hiddenLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

			// calculate deltaCost/deltaWeight for each weight going into the neuron
			//
			// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
			// deltaCost/deltaWeight = deltaCost/deltaBias * input
			//
			for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
				m_hiddenLayerWeightsDeltaCost[HiddenLayerWeightIndex(inputIndex, neuronIndex)] = m_hiddenLayerBiasesDeltaCost[neuronIndex] * pixels[inputIndex];
		}
	}

private:

	// biases and weights
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiases;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerBiases;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeights;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeights;

	// neuron activation values aka "O" values
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerOutputs;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerOutputs;

	// derivatives of biases and weights for a single training example
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiasesDeltaCost;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerBiasesDeltaCost;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeightsDeltaCost;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeightsDeltaCost;

	// derivatives of biases and weights for the minibatch. Average of all items in minibatch.
	std::array<float, HIDDEN_NEURONS>					m_miniBatchHiddenLayerBiasesDeltaCost;
	std::array<float, OUTPUT_NEURONS>					m_miniBatchOutputLayerBiasesDeltaCost;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_miniBatchHiddenLayerWeightsDeltaCost;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_miniBatchOutputLayerWeightsDeltaCost;

	// used for minibatch generation
	std::vector<size_t>									m_trainingOrder;
};

// ============================================================================================
//                                   DRIVER PROGRAM
// ============================================================================================

// training and test data
CMNISTData g_trainingData;
CMNISTData g_testData;

// neural network
CNeuralNetwork<c_numInputNeurons, c_numHiddenNeurons, c_numOutputNeurons> g_neuralNetwork;

float GetDataAccuracy (const CMNISTData& data)
{
	size_t correctItems = 0;
	for (size_t i = 0, c = data.NumImages(); i < c; ++i)
	{
		uint8 label;
		const float* pixels = data.GetImage(i, label);
		uint8 detectedLabel = g_neuralNetwork.ForwardPass(pixels, label);

		if (detectedLabel == label)
			++correctItems;
	}
	return float(correctItems) / float(data.NumImages());
}

void ShowImage (const CMNISTData& data, size_t imageIndex)
{
	uint8 label = 0;
	const float* pixels = data.GetImage(imageIndex, label);
	printf("showing a %i\n", label);
	for (int iy = 0; iy < 28; ++iy)
	{
		for (int ix = 0; ix < 28; ++ix)
		{
			if (*pixels < 0.125)
				printf(" ");
			else
				printf("+");
			++pixels;
		}
		printf("\n");
	}
}

int main (int argc, char** argv)
{
	// load the MNIST data if we can
	if (!g_trainingData.Load(true) || !g_testData.Load(false))
	{
		printf("Could not load mnist data, aborting!\n");
		system("pause");
		return 1;
	}

	#if REPORT_ERROR_WHILE_TRAINING()
	FILE *file = fopen("Error.csv","w+t");
	if (!file)
	{
		printf("Could not open Error.csv for writing, aborting!\n");
		system("pause");
		return 2;
	}
	fprintf(file, "\"Training Data Accuracy\",\"Testing Data Accuracy\"\n");
	#endif

	{
		SBlockTimer timer("Training Time:  ");

		// train the network, reporting error before each training
		for (size_t epoch = 0; epoch < c_trainingEpochs; ++epoch)
		{
			#if REPORT_ERROR_WHILE_TRAINING()
				float accuracyTraining = GetDataAccuracy(g_trainingData);
				float accuracyTest = GetDataAccuracy(g_testData);
				printf("Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
				printf("Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);
				fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
			#endif

			printf("Training epoch %zu / %zu...\n", epoch+1, c_trainingEpochs);
			g_neuralNetwork.Train(g_trainingData, c_miniBatchSize, c_learningRate);
			printf("\n");
		}
	}

	// report final error
	float accuracyTraining = GetDataAccuracy(g_trainingData);
	float accuracyTest = GetDataAccuracy(g_testData);
	printf("\nFinal Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
	printf("Final Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);

	#if REPORT_ERROR_WHILE_TRAINING()
		fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
		fclose(file);
	#endif

	// Write out the final weights and biases as JSON for use in the web demo
	{
		FILE* file = fopen("WeightsBiasesJSON.txt", "w+t");
		fprintf(file, "{\n");

		// network structure
		fprintf(file, "  \"InputNeurons\":%zu,\n", c_numInputNeurons);
		fprintf(file, "  \"HiddenNeurons\":%zu,\n", c_numHiddenNeurons);
		fprintf(file, "  \"OutputNeurons\":%zu,\n", c_numOutputNeurons);

		// HiddenBiases
		auto hiddenBiases = g_neuralNetwork.GetHiddenLayerBiases();
		fprintf(file, "  \"HiddenBiases\" : [\n");
		for (size_t i = 0; i < hiddenBiases.size(); ++i)
		{
			fprintf(file, "    %f", hiddenBiases[i]);
			if (i < hiddenBiases.size() -1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ],\n");

		// HiddenWeights
		auto hiddenWeights = g_neuralNetwork.GetHiddenLayerWeights();
		fprintf(file, "  \"HiddenWeights\" : [\n");
		for (size_t i = 0; i < hiddenWeights.size(); ++i)
		{
			fprintf(file, "    %f", hiddenWeights[i]);
			if (i < hiddenWeights.size() - 1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ],\n");

		// OutputBiases
		auto outputBiases = g_neuralNetwork.GetOutputLayerBiases();
		fprintf(file, "  \"OutputBiases\" : [\n");
		for (size_t i = 0; i < outputBiases.size(); ++i)
		{
			fprintf(file, "    %f", outputBiases[i]);
			if (i < outputBiases.size() - 1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ],\n");

		// OutputWeights
		auto outputWeights = g_neuralNetwork.GetOutputLayerWeights();
		fprintf(file, "  \"OutputWeights\" : [\n");
		for (size_t i = 0; i < outputWeights.size(); ++i)
		{
			fprintf(file, "    %f", outputWeights[i]);
			if (i < outputWeights.size() - 1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ]\n");

		// done
		fprintf(file, "}\n");
		fclose(file);
	}

	// You can use the code like the below to visualize an image if you want to.
	//ShowImage(g_testData, 0);

	system("pause");
	return 0;
}

Thanks for reading, and if you have any questions, comments, or just want to chat, hit me up in the comments below, or on twitter at @Atrix256.

Neural Network Gradients: Backpropagation, Dual Numbers, Finite Differences

In the post How to Train Neural Networks With Backpropagation I said that you could also calculate the gradient of a neural network by using dual numbers or finite differences.

By special request, this is that post!

The post I already linked to explains backpropagation.

If you want an explanation of dual numbers, check out these posts:

  1. Dual Numbers & Automatic Differentiation
  2. Multivariable Dual Numbers & Automatic Differentiation

If you want an explanation of finite differences, check out this post:
Finite Differences

Since the fundamentals are explained in the links above, we’ll go straight to the code.

We’ll be getting the gradient (learning values) for the network in example 4 in the backpropagation post:

Note that I am using “central differences” for the gradient, but it would be more efficient to do a forward or backward difference, at the cost of some accuracy. I’m using an epsilon of 0.001.

I didn’t compare the running times of each method as my code is meant to be readable, not fast, and the code isn’t doing enough work to make a meaningful performance test IMO. If you did do a speed test, the finite differences should be by far the slowest, backpropagation should be the fastest, and dual numbers are probably going to be closer to backpropagation than to finite differences.

The output of the program is below. Both backpropagation and dual numbers get the exact derivatives (within the tolerances of floating point math of course!) because they use the chain rule, whereas finite differences is a numerical approximation. This shows up in the fact that backpropagation and dual numbers agree for all values, where finite differences has some small error in the derivatives.

And here is the code:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>
  
#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

//----------------------------------------------------------------------
// Dual Number Class - CDualNumber
//----------------------------------------------------------------------

template <size_t NUMVARIABLES>
class CDualNumber
{
public:
 
    // constructor to make a constant
    CDualNumber (float f = 0.0f) {
        m_real = f;
        std::fill(m_dual.begin(), m_dual.end(), 0.0f);
    }
 
    // constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
    CDualNumber (float f, size_t variableIndex) {
        m_real = f;
        std::fill(m_dual.begin(), m_dual.end(), 0.0f);
        m_dual[variableIndex] = 1.0f;
    }

	// Set a constant value.
	void Set (float f) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
	}

	// Set a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
	void Set (float f, size_t variableIndex) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
		m_dual[variableIndex] = 1.0f;
	}
 
    // storage for real and dual values
    float                           m_real;
    std::array<float, NUMVARIABLES> m_dual;
};
  
//----------------------------------------------------------------------
// Dual Number Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real + b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
    return ret;
}
  
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real - b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real * b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real / b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
    return ret;
}

// NOTE: the "special functions" below all just use the chain rule, which you can also use to add more functions
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    float sqrtReal = sqrt(a.m_real);
    ret.m_real = sqrtReal;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = pow(a.m_real, y);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
    return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> exp (const CDualNumber<NUMVARIABLES>& a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = exp(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] * exp(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = sin(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = cos(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = tan(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = tan(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
    return ret;
}
 
// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
    return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Driver Program
//----------------------------------------------------------------------

enum EWeightsBiases
{
	e_weight0 = 0,
	e_weight1,
	e_weight2,
	e_weight3,
	e_weight4,
	e_weight5,
	e_weight6,
	e_weight7,
	e_bias0,
	e_bias1,
	e_bias2,
	e_bias3,

	e_count
};

// our dual number needs a "dual" for every value we want a derivative for: aka every weight and bias
typedef CDualNumber<EWeightsBiases::e_count> TDualNumber;

// templated so it can work for both the dual numbers, as well as the float finite differences
template <typename TBaseType>
void ForwardPass (const std::array<TBaseType, 2>& input, const std::array<TBaseType, 2>& desiredOutput, const std::array<TBaseType, EWeightsBiases::e_count>& weightsBiases, TBaseType& cost)
{
	// calculate hidden layer neuron activations
	TBaseType Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
	TBaseType O0 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z0 * TBaseType(-1.0f)));

	TBaseType Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
	TBaseType O1 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z1 * TBaseType(-1.0f)));

	// calculate output layer neuron activations
	TBaseType Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
	TBaseType O2 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z2 * TBaseType(-1.0f)));

	TBaseType Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
	TBaseType O3 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z3 * TBaseType(-1.0f)));

	// calculate the cost: 0.5 * ||target-actual||^2
	// aka cost = half (error squared)
	TBaseType diff1 = TBaseType(desiredOutput[0]) - O2;
	TBaseType diff2 = TBaseType(desiredOutput[1]) - O3;
	cost = TBaseType(0.5f) * (diff1*diff1 + diff2*diff2);
}

// backpropagation
void ForwardPassAndBackpropagation (
    const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, EWeightsBiases::e_count>& weightsBiases,
    float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, EWeightsBiases::e_count>& deltaWeightsBiases
) {
    // calculate Z0 and O0 for neuron0
    float Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
    float O0 = 1.0f / (1.0f + std::exp(-Z0));
 
    // calculate Z1 and O1 for neuron1
    float Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
    float O1 = 1.0f / (1.0f + std::exp(-Z1));
 
    // calculate Z2 and O2 for neuron2
    float Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
    float O2 = 1.0f / (1.0f + std::exp(-Z2));
 
    // calculate Z3 and O3 for neuron3
    float Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
    float O3 = 1.0f / (1.0f + std::exp(-Z3));
 
    // the actual output of the network is the activation of the output layer neurons
    actualOutput[0] = O2;
    actualOutput[1] = O3;
 
    // calculate error
    float diff0 = desiredOutput[0] - actualOutput[0];
    float diff1 = desiredOutput[1] - actualOutput[1];
    error = std::sqrt(diff0*diff0 + diff1*diff1);
 
    // calculate cost
    cost = 0.5f * error * error;
 
    //----- Neuron 2 -----
 
    // calculate how much a change in neuron 2 activation affects the cost function
    // deltaCost/deltaO2 = O2 - target0
    float deltaCost_deltaO2 = O2 - desiredOutput[0];
 
    // calculate how much a change in neuron 2 weighted input affects neuron 2 activation
    // deltaO2/deltaZ2 = O2 * (1 - O2)
    float deltaO2_deltaZ2 = O2 * (1 - O2);
 
    // calculate how much a change in neuron 2 weighted input affects the cost function.
    // This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
    // This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
    float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
	deltaWeightsBiases[e_bias2] = neuron2Error;
 
    // calculate how much a change in weight4 affects the cost function.
    // deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
    // deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
    // deltaCost/deltaWeight4 = neuron2Error * O0
    // similar thing for weight5
	deltaWeightsBiases[e_weight4] = neuron2Error * O0;
	deltaWeightsBiases[e_weight5] = neuron2Error * O1;
 
    //----- Neuron 3 -----
 
    // calculate how much a change in neuron 3 activation affects the cost function
    // deltaCost/deltaO3 = O3 - target1
    float deltaCost_deltaO3 = O3 - desiredOutput[1];
 
    // calculate how much a change in neuron 3 weighted input affects neuron 3 activation
    // deltaO3/deltaZ3 = O3 * (1 - O3)
    float deltaO3_deltaZ3 = O3 * (1 - O3);
 
    // calculate how much a change in neuron 3 weighted input affects the cost function.
    // This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
    // This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
    float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
	deltaWeightsBiases[e_bias3] = neuron3Error;
 
    // calculate how much a change in weight6 affects the cost function.
    // deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
    // deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
    // deltaCost/deltaWeight6 = neuron3Error * O0
    // similar thing for weight7
	deltaWeightsBiases[e_weight6] = neuron3Error * O0;
	deltaWeightsBiases[e_weight7] = neuron3Error * O1;
 
    //----- Neuron 0 -----
 
    // calculate how much a change in neuron 0 activation affects the cost function
    // deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
    // deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
    float deltaCost_deltaO0 = neuron2Error * weightsBiases[e_weight4] + neuron3Error * weightsBiases[e_weight6];
 
    // calculate how much a change in neuron 0 weighted input affects neuron 0 activation
    // deltaO0/deltaZ0 = O0 * (1 - O0)
    float deltaO0_deltaZ0 = O0 * (1 - O0);
 
    // calculate how much a change in neuron 0 weighted input affects the cost function.
    // This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
    // This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
    float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaWeightsBiases[e_bias0] = neuron0Error;
 
    // calculate how much a change in weight0 affects the cost function.
    // deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
    // deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
    // deltaCost/deltaWeight0 = neuron0Error * input0
    // similar thing for weight1
	deltaWeightsBiases[e_weight0] = neuron0Error * input[0];
	deltaWeightsBiases[e_weight1] = neuron0Error * input[1];
 
    //----- Neuron 1 -----
 
    // calculate how much a change in neuron 1 activation affects the cost function
    // deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
    // deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
    float deltaCost_deltaO1 = neuron2Error * weightsBiases[e_weight5] + neuron3Error * weightsBiases[e_weight7];
 
    // calculate how much a change in neuron 1 weighted input affects neuron 1 activation
    // deltaO1/deltaZ1 = O1 * (1 - O1)
    float deltaO1_deltaZ1 = O1 * (1 - O1);
 
    // calculate how much a change in neuron 1 weighted input affects the cost function.
    // This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
    // This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
    float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaWeightsBiases[e_bias1] = neuron1Error;
 
    // calculate how much a change in weight2 affects the cost function.
    // deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
    // deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
    // deltaCost/deltaWeight2 = neuron1Error * input0
    // similar thing for weight3
	deltaWeightsBiases[e_weight2] = neuron1Error * input[0];
	deltaWeightsBiases[e_weight3] = neuron1Error * input[1];
}
 
int main (int argc, char **argv)
{

	// weights and biases, inputs and desired output
	const std::array<float, EWeightsBiases::e_count> weightsBiases =
	{
		0.15f, // e_weight0
		0.2f,  // e_weight1
		0.25f, // e_weight2
		0.3f,  // e_weight3
		0.4f,  // e_weight4
		0.45f, // e_weight5
		0.5f,  // e_weight6
		0.55f, // e_weight7
		0.35f, // e_bias0
		0.35f, // e_bias1
		0.6f,  // e_bias2
		0.6f   // e_bias3
	};

	const std::array<float, 2> inputs = 
	{
		0.05f,
		0.1f
	};

	std::array<float, 2> desiredOutput = {
		0.01f,
		0.99f
	};

	// =====================================================
	// ===== FINITE DIFFERENCES CALCULATED DERIVATIVES =====
	// =====================================================

	std::array<float, EWeightsBiases::e_count> gradientFiniteDifferences;
	std::array<float, EWeightsBiases::e_count> weightsBiasesFloat;
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
		weightsBiasesFloat[i] = weightsBiases[i];

	// use central differences to approximate the gradient
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
	{
		float costSample1 = 0.0f;
		weightsBiasesFloat[i] = weightsBiases[i] - EPSILON;
		ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample1);

		float costSample2 = 0.0f;
		weightsBiasesFloat[i] = weightsBiases[i] + EPSILON;
		ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample2);

		gradientFiniteDifferences[i] = (costSample2 - costSample1) / (EPSILON * 2.0f);

		weightsBiasesFloat[i] = weightsBiases[i];
	}

	// ==============================================
	// ===== DUAL NUMBER CALCULATED DERIVATIVES =====
	// ==============================================

	// dual number weights and biases
	std::array<TDualNumber, EWeightsBiases::e_count> weightsBiasesDual;
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
		weightsBiasesDual[i].Set(weightsBiases[i], i);

	// dual number inputs and desired output
	std::array<TDualNumber, 2> inputsDual;
	for (size_t i = 0; i < inputsDual.size(); ++i)
		inputsDual[i].Set(inputs[i]);

	std::array<TDualNumber, 2> desiredOutputDual;
	for (size_t i = 0; i < desiredOutputDual.size(); ++i)
		desiredOutputDual[i].Set(desiredOutput[i]);

	// dual number derivatives
	TDualNumber gradientDualNumbers;
	ForwardPass(inputsDual, desiredOutputDual, weightsBiasesDual, gradientDualNumbers);

	// ==================================================
	// ===== BACKPROPAGATION CALCULATED DERIVATIVES =====
	// ==================================================

	float error;
	float cost;
	std::array<float, 2> actualOutput;
	std::array<float, EWeightsBiases::e_count> gradientBackPropagation;
	ForwardPassAndBackpropagation(inputs, desiredOutput, weightsBiases, error, cost, actualOutput, gradientBackPropagation);

	// ==========================
	// ===== Report Results =====
	// ==========================

	printf("Neural Network GradientnnBackpropagation     Dual Numbers (Error)       Finite Differences (Error)n");
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
	{
		float diffDual = gradientBackPropagation[i] - gradientDualNumbers.m_dual[i];
		float diffFinitDifferences = gradientBackPropagation[i] - gradientFiniteDifferences[i];
		printf("% 08f,         % 08f (% 08f),     % 08f (% 08f)n",
			gradientBackPropagation[i],
			gradientDualNumbers.m_dual[i], diffDual,
			gradientFiniteDifferences[i], diffFinitDifferences
		);
	}
	printf("n");

	system("pause");
    return 0;
}

How to Train Neural Networks With Backpropagation

This post is an attempt to demystify backpropagation, which is the most common method for training neural networks. This post is broken into a few main sections:

  1. Explanation
  2. Working through examples
  3. Simple sample C++ source code using only standard includes
  4. Links to deeper resources to continue learning

Let’s talk about the basics of neural nets to start out, specifically multi layer perceptrons. This is a common type of neural network, and is the type we will be talking about today. There are other types of neural networks though such as convolutional neural networks, recurrent neural networks, Hopfield networks and more. The good news is that backpropagation applies to most other types of neural networks too, so what you learn here will be applicable to other types of networks.

Basics of Neural Networks

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron. This network takes 3 floating point values as input, passes them through 4 neurons in a hidden layer and outputs two floating point values. The hidden layer neurons and the output layer neurons do processing of the values they are giving, but the input neurons do not.

To calculate the output value of a single neuron, you multiply every input into that neuron by a weight for that input, sum them up, and add a bias that is set for the neuron. This “weighted input” value is fed into an activation function and the result is the output value of that neuron. Here is a diagram for a single neuron:

The code for calculating the output of a single neuron could look like this:

float weightedInput = bias;

for (int i = 0; i < inputs.size(); ++i)
  weightedInput += inputs[i] * weights[i];

float output = Activation(weightedInput);

To evaluate an entire network of neurons, you just repeat this process for all neurons in the network, going from left to right (from input to output).

Neural networks are basically black boxes. We train them to give specific ouputs when we give them specific inputs, but it is often difficult to understand what it is that they’ve learned, or what part of the data they are picking up on.

Training a neural network just means that we adjust the weight and bias values such that when we give specific inputs, we get the desired outputs from the network. Being able to figure out what weights and biases to use can be tricky, especially for networks with lots of layers and lots of neurons per layer. This post talks about how to do just that.

Regarding training, there is a funny story where some people trained a neural network to say whether or not a military tank was in a photograph. It had a very high accuracy rate with the test data they trained it with, but when they used it with new data, it had terrible accuracy. It turns out that the training data was a bit flawed. Pictures of tanks were all taken on a sunny day, and the pictures without tanks were taken on a cloudy day. The network learned how to detect whether a picture was of a sunny day or a cloudy day, not whether there was a tank in the photo or not!

This is one type of pitfall to watch out for when dealing with neural networks – having good training data – but there are many other pitfalls to watch out for too. Architecting and training neural networks is quite literally an art form. If it were painting, this post would be teaching you how to hold a brush and what the primary colors are. There are many, many techniques to learn beyond what is written here to use as tools in your toolbox. The information in this post will allow you to succeed in training neural networks, but there is a lot more to learn to get higher levels of accuracy from your nets!

Neural Networks Learn Using Gradient Descent

Let’s take a look at a simple neural network where we’ve chosen random values for the weights and the bias:

If given two floating point inputs, we’d calculate the output of the network like this:

Output = Activation(Input0 * Weight0 + Input1 * Weight1 + Bias)

Plugging in the specific values for the weights and biases, it looks like this:

Output = Activation(Input0 * 0.23 + Input1 * -0.1 + 0.3)

Let’s say that we want this network to output a zero when we give an input of 1,0, and that we don’t care what it outputs otherwise. We’ll plug 1 and 0 in for Input0 and Input1 respectively and see what the output of the network is right now:

Output = Activation(1* 0.23 + 0 * -0.1 + 0.3) \\ Output = Activation(0.53)

For the activation function, we are going to use a common one called the sigmoid activation function, which is also sometimes called the logistic activation function. It looks like this:

\sigma(x) = \frac{1}{1+e^{-x}}

Without going into too much detail, the reason why sigmoid is so commonly used is because it’s a smoother and differentiable version of the step function.

Applying that activation function to our output neuron, we get this:

Output = Activation(0.53) \\ Output = \sigma(0.53) \\ Output = 0.6295

So, we plugged in 1 and 0, but instead of getting a 0 out, we got 0.6295. Our weights and biases are wrong, but how do we correct them?

The secret to correcting our weights and biases is whichever of these terms seem least scary to you: slopes, derivatives, gradients.

If “slope” was the least scary term to you, you probably remember the line formula y=mx+b and that the m value was the “rise over run” or the slope of the line. Well believe it or not, that’s all a derivative is. A derivative is just the slope of a function at a specific point on that function. Even if a function is curved, you can pick a point on the graph and get a slope at that point. The notation for a derivative is \frac{dy}{dx}, which literally means “change in y divided by change in x”, or “delta y divided by delta x”, which is literally rise over run.

In the case of a linear function (a line), it has the same derivative over the entire thing, so you can take a step size of any size on the x axis and multiply that step size by \frac{dy}{dx} to figure out how much to add or subtract from y to stay on the line.

In the case of a non linear function, the derivative can change from one point to the next, so this slope is only guaranteed to be accurate for an infinitely small step size. In practice, people just often use “small” step sizes and calling it good enough, which is what we’ll be doing momentarily.

Now that you realize you already knew what a derivative is, we have to talk about partial derivatives. There really isn’t anything very scary about them and they still mean the exact same thing – they are the slope! They are even calculated the exact same way, but they use a fancier looking d in their notation: \frac{\partial y}{\partial x}.

The reason partial derivatives even exist is because if you have a function of multiple variables like z=f(x,y)=x^2+3y+2, you have two variables that you can take the derivative of. You can calculate \frac{\partial z}{\partial x} and \frac{\partial z}{\partial y}. The first value tells you how much the z value changes for a change in x, the second value tells you how much the z value changes for a change in y.

By the way, if you are curious, the partial derivatives for that function above are below. When calculating partial derivatives, any variable that isn’t the one you care about, you just treat as a constant and do normal derivation.

\frac{\partial z}{\partial x} = 2x\\ \frac{\partial z}{\partial y} = 3\\

If you put both of those values together into a vector (\frac{\partial z}{\partial x},\frac{\partial z}{\partial y}) you have what is called the gradient vector.

The gradient vector has an interesting property, which is that it points in the direction that makes the function output grow the most. Basically, if you think of your function as a surface, it points up the steepest direction of the surface, from the point you evaluated the function at.

We are going to use that property to train our neural network by doing the following:

  1. Calculate the gradient of a function that describes the error in our network. This means we will have the partial derivatives of all the weights and biases in the network.
  2. Multiply the gradient by a small “learning rate” value, such as 0.05
  3. Subtract these scaled derivatives from the weights and biases to decrease the error a small amount.

This technique is called steepest gradient descent (SGD) and when we do the above, our error will decrease by a small amount. The only exception is that if we use too large of a learning rate, it’s possible that we make the error grow, but usually the error will decrease.

We will do the above over and over, until either the error is small enough, or we’ve decided we’ve tried enough iterations that we think the neural network is never going to learn the things we want to teach it. If the network doesn’t learn, it means it needs to be re-architected with a different structure, different numbers of neurons and layers, different activation functions, etc. This is part of the “art” that I mentioned earlier.

Before moving on, there is one last thing to talk about: global minimums vs local minimums.

Imagine that the function describing the error in our network is visualized as bumpy ground. When we initialize our weights and biases to random numbers we are basically just choosing a random location on the ground to start at. From there, we act like a ball, and just roll down hill from wherever we are. We are definitely going to get to the bottom of SOME bump / hole in the ground, but there is absolutely no reason to except that we’ll get to the bottom of the DEEPEST bump / hole.

The problem is that SGD will find a LOCAL minimum – whatever we are closest too – but it might not find the GLOBAL minimum.

In practice, this doesn’t seem to be too large of a problem, at least for people casually using neural nets like you and me, but it is one of the active areas of research in neural networks: how do we do better at finding more global minimums?

You might notice the strange language I’m using where I say we have a function that describes the error, instead of just saying we use the error itself. The function I’m talking about is called the “cost function” and the reason for this is that different ways of describing the error give us different desirable properties.

For instance, a common cost function is to use mean squared error of the actual output compared to the desired output.

For a single training example, you plug the input into the network and calculate the output. You then plug the actual output and the target output into the function below:

Cost = ||target-output||^2

In other words, you take the vector of the neuron outputs, subtract it from the actual output that we wanted, calculate the length of the resulting vector and square it. This gives you the squared error.

The reason we use squared error in the cost function is because this way error in either direction is a positive number, so when gradient descent does it’s work, we’ll find the smallest magnitude of error, regardless of whether it’s positive or negative amounts. We could use absolute value, but absolute value isn’t differentiable, while squaring is.

To handle calculating the cost of multiple inputs and outputs, you just take the average of the squared error for each piece of training data. This gives you the mean squared error as the cost function across all inputs. You also average the derivatives to get the combined gradient.

More on Training

Before we go into backpropagation, I want to re-iterate this point: Neural Networks Learn Using Gradient Descent.

All you need is the gradient vector of the cost function, aka the partial derivatives of all the weights and the biases for the cost.

Backpropagation gets you the gradient vector, but it isn’t the only way to do so!

Another way to do it is to use dual numbers which you can read about on my post about them: Multivariable Dual Numbers & Automatic Differentiation.

Using dual numbers, you would evaluate the output of the network, using dual numbers instead of floating point numbers, and at the end you’d have your gradient vector. It’s not quite as efficient as backpropagation (or so I’ve heard, I haven’t tried it), but if you know how dual numbers work, it’s super easy to implement.

Another way to get the gradient vector is by doing so numerically using finite differences. You can read about numerical derivatives on my post here: Finite Differences

Basically what you would do is if you were trying to calculate the partial derivative of a weight, like \frac{\partial Cost}{\partial Weight0}, you would first calculate the cost of the network as usual, then you would add a small value to Weight0 and evaluate the cost again. You subtract the new cost from the old cost, and divide by the small value you added to Weight0. This will give you the partial derivative for that weight value. You’d repeat this for all your weights and biases.

Since realistic neural networks often have MANY MANY weights and biases, calculating the gradient numerically is a REALLY REALLY slow process because of how many times you have to run the network to get cost values with adjusted weights. The only upside is that this method is even easier to implement than dual numbers. You can literally stop reading and go do this right now if you want to 😛

Lastly, there is a way to train neural networks which doesn’t use derivatives or the gradient vector, but instead uses the more brute force-ish method of genetic algorithms.

Using genetic algorithms to train neural networks is a huge topic even to summarize, but basically you create a bunch of random networks, see how they do, and try combining features of networks that did well. You also let some of the “losers” reproduce as well, and add in some random mutation to help stay out of local minimums. Repeat this for many many generations, and you can end up with a well trained network!

Here’s a fun video visualizing neural networks being trained by genetic algorithms: Youtube: Learning using a genetic algorithm on a neural network

Backpropagation is Just the Chain Rule!

Going back to our talk of dual numbers for a second, dual numbers are useful for what is called “forward mode automatic differentiation”.

Backpropagation actually uses “reverse mode automatic differentiation”, so the two techniques are pretty closely tied, but they are both made possible by what is known as the chain rule.

The chain rule basically says that if you can write a derivative like this:

dy/dx

That you can also write it like this:

dy/du*du/dx

That might look weird or confusing, but since we know that derivatives are actual values, aka actual ratios, aka actual FRACTIONS, let’s think back to fractions for a moment.

3/2 = 1.5

So far so good? Now let’s choose some number out of the air – say, 5 – and do the same thing we did with the chain rule
3/2 = \\ 3/5 * 5/2 = \\ 15/10 = \\ 3/2 = \\ 1.5

Due to doing the reverse of cross cancellation, we are able to inject multiplicative terms into fractions (and derivatives!) and come up with the same answer.

Ok, but who cares?

Well, when we are evaluating the output of a neural network for given input, we have lots of equations nested in each other. We have neurons feeding into neurons feeding into neurons etc, with the logistic activation function at each step.

Instead of trying to figure out how to calculate the derivatives of the weights and biases for the entire monster equation (it’s common to have hundreds or thousands of neurons or more!), we can instead calculate derivatives for each step we do when evaluating the network and then compose them together.

Basically, we can break the problem into small bites instead of having to deal with the equation in it’s entirety.

Instead of calculating the derivative of how a specific weight affects the cost directly, we can instead calculate these:

  1. dCost/dOutput: The derivative of how a neuron’s output affects cost
  2. dOutput/dWeightedInput: The derivative of how the weighted input of a neuron affects a neuron’s output
  3. dWeightedInput/dWeight: The derivative of how a weight affects the weighted input of a neuron

Then, when we multiply them all together, we get the real value that we want:
dCost/dOutput * dOutput/dWeightedInput * dWeightedInput/dWeight = dCost/dWeight

Now that we understand all the basic parts of back propagation, I think it’d be best to work through some examples of increasing complexity to see how it all actually fits together!

Backpropagation Example 1: Single Neuron, One Training Example

This example takes one input and uses a single neuron to make one output. The neuron is only trained to output a 0 when given a 1 as input, all other behavior is undefined. This is implemented as the Example1() function in the sample code.

Backpropagation Example 2: Single Neuron, Two Training Examples

Let’s start with the same neural network from last time:

This time, we are going to teach it not only that it should output 0 when given a 1, but also that it should output 1 when given a 0.

We have two training examples, and we are training the neuron to act like a NOT gate. This is implemented as the Example2() function in the sample code.

The first thing we do is calculate the derivatives (gradient vector) for each of the inputs.

We already calculated the “input 1, output 0” derivatives in the last example:
\frac{\partial Cost}{\partial Weight} = 0.1476 \\ \frac{\partial Cost}{\partial Bias} = 0.1476

If we follow the same steps with the “input 0, output 1” training example we get these:
\frac{\partial Cost}{\partial Weight} = 0.0 \\ \frac{\partial Cost}{\partial Bias} = -0.0887

To get the actual derivatives to train the network with, we just average them!
\frac{\partial Cost}{\partial Weight} = 0.0738 \\ \frac{\partial Cost}{\partial Bias} = 0.0294

From there, we do the same adjustments as before to the weight and bias values to get a weight of 0.2631 and a bias of 0.4853.

If you are wondering how to calculate the cost, again you just take the cost of each training example and average them. Adjusting the weight and bias values causes the cost to drop from 0.1547 to 0.1515, so we have made progress.

It takes 10 times as many iterations with these two training examples to get the same level of error as it did with only one training example though.

As we saw in the last example, after 10,000 iterations, the error was 0.007176.

In this example, after 100,000 iterations, the error is 0.007141. At that point, weight is -9.879733 and bias is 4.837278

Backpropagation Example 3: Two Neurons in One Layer

Here is the next example, implemented as Example3() in the sample code. Two input neurons feed to two neurons in a single layer giving two outputs.

Let’s look at how we’d calculate the derivatives needed to train this network using the training example that when we give the network 01 as input that it should give out 10 as output.

First comes the forward pass where we calculate the network’s output when we give it 01 as input.

Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0*0.2+1*0.8+0.5 \\ Z0=1.3 \\ \\ O0=\sigma(1.3) \\ O0=0.7858\\ \\ Z1=input0*weight2+input0*weight3+bias1\\ Z1=0*0.6+1*0.4+0.1\\ Z1=0.5\\ \\ O1=\sigma(0.5)\\ O1=0.6225

Next we calculate a cost. We don’t strictly need to do this step since we don’t use this value during backpropagation, but this will be useful to verify that we’ve improved things after an iteration of training.

Cost=0.5*||target-actual||^2\\ Cost=0.5*||(1,0)-(0.7858,0.6225)||^2\\ Cost=0.5*||(0.2142,-0.6225)||^2\\ Cost=0.5*0.6583^2\\ Cost=0.2167

Now we begin the backwards pass to calculate the derivatives that we’ll need for training.

Let’s calculate dCost/dZ0 aka the error in neuron 0. We’ll do this by calculating dCost/dO0, then dO0/dZ0 and then multiplying them together to get dCost/dZ0. Just like before, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias0.

\frac{\partial Cost}{\partial O0}=O0-target0\\ \frac{\partial Cost}{\partial O0}=0.7858-1\\ \frac{\partial Cost}{\partial O0}=-0.2142\\ \\ \frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.7858 * 0.2142\\ \frac{\partial O0}{\partial Z0} = 0.1683\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = -0.2142 * 0.1683\\ \frac{\partial Cost}{\partial Z0} = -0.0360\\ \\ \frac{\partial Cost}{\partial Bias0} = -0.0360

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1 by multiplying it by dZ0/dWeight0 and dZ0/dWeight1, which are input0 and input1 respectively.

\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0} \\ \frac{\partial Cost}{\partial Weight0} = -0.0360 * 0 \\ \frac{\partial Cost}{\partial Weight0} = 0\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1} \\ \frac{\partial Cost}{\partial Weight1} = -0.0360 * 1 \\ \frac{\partial Cost}{\partial Weight1} = -0.0360

Next we need to calculate dCost/dZ1 aka the error in neuron 1. We’ll do this like before. We’ll calculate dCost/dO1, then dO1/dZ1 and then multiplying them together to get dCost/dZ1. Again, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias1.

\frac{\partial Cost}{\partial O1}=O1-target1\\ \frac{\partial Cost}{\partial O1}=0.6225-0\\ \frac{\partial Cost}{\partial O1}=0.6225\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.6225 * 0.3775\\ \frac{\partial O1}{\partial Z1} = 0.235\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.6225 * 0.235\\ \frac{\partial Cost}{\partial Z1} = 0.1463\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.1463

Just like with neuron 0, we can use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3 by multiplying it by dZ1/dWeight2 and dZ1/dWeight2, which are input0 and input1 respectively.

\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2} \\ \frac{\partial Cost}{\partial Weight2} = 0.1463 * 0 \\ \frac{\partial Cost}{\partial Weight2} = 0\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3} \\ \frac{\partial Cost}{\partial Weight3} = 0.1463 * 1 \\ \frac{\partial Cost}{\partial Weight3} = 0.1463

After using these derivatives to update the weights and biases with a learning rate of 0.5, they become:
Weight0 = 0.2
Weight1 = 0.818
Weight2 = 0.6
Weight3 = 0.3269
Bias0 = 0.518
Bias1 = 0.0269

Using these values, the cost becomes 0.1943, which dropped from 0.2167, so we have indeed made progress with our learning!

Interestingly, it takes about twice as many trainings as example 1 to get a similar level of error. In this case, 20,000 iterations of learning results in an error of 0.007142.

If we have the network learn the four patterns below instead:
00 = 00
01 = 10
10 = 10
11 = 11

It takes 520,000 learning iterations to get to an error of 0.007223.

Backpropagation Example 4: Two Layers, Two Neurons Each

This is the last example, implemented as Example4() in the sample code. Two input neurons feed to two neurons in a hidden layer, feeding into two neurons in the output layer giving two outputs. This is the exact same network that is walked through on this page which is also linked to at the end of this post: A Step by Step Backpropagation Example

First comes the forward pass where we calculate the network’s output. We’ll give it 0.05 and 0.1 as input, and we’ll say our desired output is 0.01 and 0.99.

Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0.05*0.15+0.1*0.2+0.35 \\ Z0=0.3775 \\ \\ O0=\sigma(0.3775) \\ O0=0.5933 \\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0.05*0.25+0.1*0.3+0.35\\ Z1=0.3925\\ \\ O1=\sigma(0.3925)\\ O1=0.5969\\ \\ Z2=O0*weight4+O1*weight5+bias2\\ Z2=0.5933*0.4+0.5969*0.45+0.6\\ Z2=1.106\\ \\ O2=\sigma(1.106)\\ O2=0.7514\\ \\ Z3=O0*weight6+O1*weight7+bias3\\ Z3=0.5933*0.5+0.5969*0.55+0.6\\ Z3=1.225\\ \\ O3=\sigma(1.225)\\ O3=0.7729

Next we calculate the cost, taking O2 and O3 as our actual output, and 0.01 and 0.99 as our target (desired) output.

Cost=0.5*||target-actual||^2\\ Cost=0.5*||(0.01,0.99)-(0.7514,0.7729)||^2\\ Cost=0.5*||(-0.7414,-0.2171)||^2\\ Cost=0.5*0.7725^2\\ Cost=0.2984

Now we start the backward pass to calculate the derivatives for training.

Neuron 2

First we’ll calculate dCost/dZ2 aka the error in neuron 2, remembering that the value is also dCost/dBias2.

\frac{\partial Cost}{\partial O2}=O2-target0\\ \frac{\partial Cost}{\partial O2}=0.7514-0.01\\ \frac{\partial Cost}{\partial O2}=0.7414\\ \\ \frac{\partial O2}{\partial Z2} = O2 * (1-O2)\\ \frac{\partial O2}{\partial Z2} = 0.7514 * 0.2486\\ \frac{\partial O2}{\partial Z2} = 0.1868\\ \\ \frac{\partial Cost}{\partial Z2} = \frac{\partial Cost}{\partial O2} * \frac{\partial O2}{\partial Z2}\\ \frac{\partial Cost}{\partial Z2} = 0.7414 * 0.1868\\ \frac{\partial Cost}{\partial Z2} = 0.1385\\ \\ \frac{\partial Cost}{\partial Bias2} = 0.1385

We can use dCost/dZ2 to calculate dCost/dWeight4 and dCost/dWeight5.

\frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight4}\\ \frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * O0\\ \frac{\partial Cost}{\partial Weight4} = 0.1385 * 0.5933\\ \frac{\partial Cost}{\partial Weight4} = 0.0822\\ \\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight5}\\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * O1\\ \frac{\partial Cost}{\partial Weight5} = 0.1385 * 0.5969\\ \frac{\partial Cost}{\partial Weight5} = 0.0827\\

Neuron 3

Next we’ll calculate dCost/dZ3 aka the error in neuron 3, which is also dCost/dBias3.

\frac{\partial Cost}{\partial O3}=O3-target1\\ \frac{\partial Cost}{\partial O3}=0.7729-0.99\\ \frac{\partial Cost}{\partial O3}=-0.2171\\ \\ \frac{\partial O3}{\partial Z3} = O3 * (1-O3)\\ \frac{\partial O3}{\partial Z3} = 0.7729 * 0.2271\\ \frac{\partial O3}{\partial Z3} = 0.1755\\ \\ \frac{\partial Cost}{\partial Z3} = \frac{\partial Cost}{\partial O3} * \frac{\partial O3}{\partial Z3}\\ \frac{\partial Cost}{\partial Z3} = -0.2171 * 0.1755\\ \frac{\partial Cost}{\partial Z3} = -0.0381\\ \\ \frac{\partial Cost}{\partial Bias3} = -0.0381

We can use dCost/dZ3 to calculate dCost/dWeight6 and dCost/dWeight7.

\frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight6}\\ \frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * O0\\ \frac{\partial Cost}{\partial Weight6} = -0.0381 * 0.5933\\ \frac{\partial Cost}{\partial Weight6} = -0.0226\\ \\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight7}\\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * O1\\ \frac{\partial Cost}{\partial Weight7} = -0.0381 * 0.5969\\ \frac{\partial Cost}{\partial Weight7} = -0.0227\\

Neuron 0

Next, we want to calculate dCost/dO0, but doing that requires us to do something new. Neuron 0 affects both neuron 2 and neuron 3, which means that it affects the cost through those two neurons as well. That means our calculation for dCost/dO0 is going to be slightly different, where we add the derivatives of both paths together. Let’s work through it:

\frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O0} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O0}\\ \frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * Weight4 + \frac{\partial Cost}{\partial Z3} * Weight6\\ \frac{\partial Cost}{\partial O0} = 0.1385 * 0.4 - 0.0381 * 0.5\\ \frac{\partial Cost}{\partial O0} = 0.0364

We can then continue and calculate dCost/dZ0, which is also dCost/dBias0, and the error in neuron 0.

\frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.5933 * 0.4067\\ \frac{\partial O0}{\partial Z0} = 0.2413\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = 0.0364 * 0.2413\\ \frac{\partial Cost}{\partial Z0} = 0.0088\\ \\ \frac{\partial Cost}{\partial Bias0} = 0.0088

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1.

\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0}\\ \frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * input0\\ \frac{\partial Cost}{\partial Weight0} = 0.0088 * 0.05\\ \frac{\partial Cost}{\partial Weight0} = 0.0004\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1}\\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * input1\\ \frac{\partial Cost}{\partial Weight1} = 0.0088 * 0.1\\ \frac{\partial Cost}{\partial Weight1} = 0.0009\\

Neuron 1

We are almost done, so hang in there. For our home stretch, we need to calculate dCost/dO1 similarly as we did for dCost/dO0, and then use that to calculate the derivatives of bias1 and weight2 and weight3.

\frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O1} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O1}\\ \frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * Weight5 + \frac{\partial Cost}{\partial Z3} * Weight7\\ \frac{\partial Cost}{\partial O1} = 0.1385 * 0.45 - 0.0381 * 0.55\\ \frac{\partial Cost}{\partial O1} = 0.0414\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.5969 * 0.4031\\ \frac{\partial O1}{\partial Z1} = 0.2406\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.0414 * 0.2406\\ \frac{\partial Cost}{\partial Z1} = 0.01\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.01

Lastly, we will use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3.

\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2}\\ \frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * input0\\ \frac{\partial Cost}{\partial Weight2} = 0.01 * 0.05\\ \frac{\partial Cost}{\partial Weight2} = 0.0005\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3}\\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * input1\\ \frac{\partial Cost}{\partial Weight3} = 0.01 * 0.1\\ \frac{\partial Cost}{\partial Weight3} = 0.001\\

Backpropagation Done

Phew, we have all the derivatives we need now.

Here’s our new weights and biases using a learning rate of 0.5:

Weight0 = 0.15 – (0.5 * 0.0004) = 0.1498
Weight1 = 0.2 – (0.5 * 0.0009) = 0.1996
Weight2 = 0.25 – (0.5 * 0.0005) = 0.2498
Weight3 = 0.3 – (0.5 * 0.001) = 0.2995
Weight4 = 0.4 – (0.5 * 0.0822) = 0.3589
Weight5 = 0.45 – (0.5 * 0.0827) = 0.4087
Weight6 = 0.5 – (0.5 * -0.0226) = 0.5113
Weight7 = 0.55 – (0.5 * -0.0227) = 0.5614
Bias0 = 0.35 – (0.5 * 0.0088) = 0.3456
Bias1 = 0.35 – (0.5 * 0.01) = 0.345
Bias2 = 0.6 – (0.5 * 0.1385) = 0.5308
Bias3 = 0.6 – (0.5 * -0.0381) = 0.6191

Using these new values, the cost function value drops from 0.2984 to 0.2839, so we have made progress!

Interestingly, it only takes 5,000 iterations of learning for this network to reach an error of 0.007157, when it took 10,000 iterations of learning for example 1 to get to 0.007176.

Before moving on, take a look at the weight adjustments above. You might notice that the derivatives for the weights are much smaller for weights 0,1,2,3 compared to weights 4,5,6,7. The reason for this is because weights 0,1,2,3 appear earlier in the network. The problem is that earlier layer neurons don’t learn as fast as later layer neurons and this is caused by the nature of the neuron activation functions – specifically, that the sigmoid function has a long tail near 0 and 1 – and is called the “vanishing gradient problem”. The opposite effect can also happen however, where earlier layer gradients explode to super huge numbers, so the more general term is called the “unstable gradient problem”. This is an active area of research on how to address, and this becomes more and more of a problem the more layers you have in your network.

You can use other activation functions such as tanh, identity, relu and others to try and get around this problem. If trying different activation functions, the forward pass (evaluation of a neural network) as well as the backpropagation of error pass remain the same, but of course the calculation for getting O from Z changes, and of course, calculating the derivative deltaO/deltaZ becomes different. Everything else remains the same.

Check the links at the bottom of the post for more information about this!

Sample Code

Below is the sample code which implements all the back propagation examples we worked through above.

Note that this code is meant to be readable and understandable. The code is not meant to be re-usable or highly efficient.

A more efficient implementation would use SIMD instructions, multithreading, stochastic gradient descent, and other things.

It’s also useful to note that calculating a neuron’s Z value is actually a dot product and an addition and that the addition can be handled within the dot product by adding a “fake input” to each neuron that is a constant of 1. This lets you do a dot product to calculate the Z value of a neuron, which you can take further and combine into matrix operations to calculate multiple neuron values at once. You’ll often see neural networks described in matrix notation because of this, but I have avoided that in this post to try and make things more clear to programmers who may not be as comfortable thinking in strictly matrix notation.

#include <stdio.h>
#include <array>

// Nonzero value enables csv logging.
#define LOG_TO_CSV_NUMSAMPLES() 50

// ===== Example 1 - One Neuron, One training Example =====

void Example1RunNetwork (
	float input, float desiredOutput,
	float weight, float bias,
	float& error, float& cost, float& actualOutput,
	float& deltaCost_deltaWeight, float& deltaCost_deltaBias, float& deltaCost_deltaInput
) {
	// calculate Z (weighted input) and O (activation function of weighted input) for the neuron
	float Z = input * weight + bias;
	float O = 1.0f / (1.0f + std::exp(-Z));

	// the actual output of the network is the activation of the neuron
	actualOutput = O;

	// calculate error
	error = std::abs(desiredOutput - actualOutput);

	// calculate cost
	cost = 0.5f * error * error;

	// calculate how much a change in neuron activation affects the cost function
	// deltaCost/deltaO = O - target
	float deltaCost_deltaO = O - desiredOutput;

	// calculate how much a change in neuron weighted input affects neuron activation
	// deltaO/deltaZ = O * (1 - O)
	float deltaO_deltaZ = O * (1 - O);


	// calculate how much a change in a neuron's weighted input affects the cost function.
	// This is deltaCost/deltaZ, which equals deltaCost/deltaO * deltaO/deltaZ
	// This is also deltaCost/deltaBias and is also refered to as the error of the neuron
	float neuronError = deltaCost_deltaO * deltaO_deltaZ;
	deltaCost_deltaBias = neuronError;

	// calculate how much a change in the weight affects the cost function.
	// deltaCost/deltaWeight = deltaCost/deltaO * deltaO/deltaZ * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * input
	deltaCost_deltaWeight = neuronError * input;


	// As a bonus, calculate how much a change in the input affects the cost function.
	// Follows same logic as deltaCost/deltaWeight, but deltaZ/deltaInput is the weight.
	// deltaCost/deltaInput = neuronError * weight
	deltaCost_deltaInput = neuronError * weight;
}

void Example1 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example1.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 10000;

	// training data
	// input: 1, output: 0
	const std::array<float, 2> c_trainingData = {1.0f, 0.0f};

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float error = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		// run the network to get error and derivatives
		float output = 0.0f;
		float cost = 0.0f;
		float deltaCost_deltaWeight = 0.0f;
		float deltaCost_deltaBias = 0.0f;
		float deltaCost_deltaInput = 0.0f;
		Example1RunNetwork(c_trainingData[0], c_trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, error, cost, weight, bias, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= deltaCost_deltaWeight * c_learningRate;
		bias -= deltaCost_deltaBias * c_learningRate;
	}

	printf("Example1 Final Error: %fn", error);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 2 - One Neuron, Two training Examples =====

void Example2 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example2.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 100000;

	// training data
	// input: 1, output: 0
	// input: 0, output: 1
	const std::array<std::array<float, 2>, 2> c_trainingData = { {
		{1.0f, 0.0f},
		{0.0f, 1.0f}
	} };

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		avgError = 0.0f;
		float avgOutput = 0.0f;
		float avgCost = 0.0f;
		float avgDeltaCost_deltaWeight = 0.0f;
		float avgDeltaCost_deltaBias = 0.0f;
		float avgDeltaCost_deltaInput = 0.0f;

		// run the network to get error and derivatives for each training example
		for (const std::array<float, 2>& trainingData : c_trainingData)
		{
			float error = 0.0f;
			float output = 0.0f;
			float cost = 0.0f;
			float deltaCost_deltaWeight = 0.0f;
			float deltaCost_deltaBias = 0.0f;
			float deltaCost_deltaInput = 0.0f;
			Example1RunNetwork(trainingData[0], trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

			avgError += error;
			avgOutput += output;
			avgCost += cost;
			avgDeltaCost_deltaWeight += deltaCost_deltaWeight;
			avgDeltaCost_deltaBias += deltaCost_deltaBias;
			avgDeltaCost_deltaInput += deltaCost_deltaInput;
		}

		avgError /= (float)c_trainingData.size();
		avgOutput /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		avgDeltaCost_deltaWeight /= (float)c_trainingData.size();
		avgDeltaCost_deltaBias /= (float)c_trainingData.size();
		avgDeltaCost_deltaInput /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, avgError, avgCost, weight, bias, avgDeltaCost_deltaWeight, avgDeltaCost_deltaBias, avgDeltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= avgDeltaCost_deltaWeight * c_learningRate;
		bias -= avgDeltaCost_deltaBias * c_learningRate;
	}

	printf("Example2 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 3 - Two inputs, two neurons in one layer =====

struct SExample3Training
{
	std::array<float, 2> m_input;
	std::array<float, 2> m_output;
};

void Example3RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 4>& weights, const std::array<float, 2>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 4>& deltaCost_deltaWeights, std::array<float, 2>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {

	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// the actual output of the network is the activation of the neurons
	actualOutput[0] = O0;
	actualOutput[1] = O1;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = O0 - target0
	float deltaCost_deltaO0 = O0 - desiredOutput[0];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = O1 - target1
	float deltaCost_deltaO1 = O1 - desiredOutput[1];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO0/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO/deltaZ1 * deltaZ0/deltaWeight1
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ/deltaWeight1
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example3 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example3.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 520000;

	// training data: OR/AND
	// input: 00, output: 00
	// input: 01, output: 10
	// input: 10, output: 10
	// input: 11, output: 11
	const std::array<SExample3Training, 4> c_trainingData = { {
		{{0.0f, 0.0f}, {0.0f, 0.0f}},
		{{0.0f, 1.0f}, {1.0f, 0.0f}},
		{{1.0f, 0.0f}, {1.0f, 0.0f}},
		{{1.0f, 1.0f}, {1.0f, 1.0f}},
	} };

	// starting weight and bias values
	std::array<float, 4> weights = { 0.2f, 0.8f, 0.6f, 0.4f };
	std::array<float, 2> biases = { 0.5f, 0.1f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		//float avgCost = 0.0f;
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaBiases = { 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 4> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaBiases = { 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example3RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example3 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 4 - Two layers with two neurons in each layer =====

void Example4RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 8>& weights, const std::array<float, 4>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 8>& deltaCost_deltaWeights, std::array<float, 4>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {
	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// calculate Z2 and O2 for neuron2
	float Z2 = O0 * weights[4] + O1 * weights[5] + biases[2];
	float O2 = 1.0f / (1.0f + std::exp(-Z2));

	// calculate Z3 and O3 for neuron3
	float Z3 = O0 * weights[6] + O1 * weights[7] + biases[3];
	float O3 = 1.0f / (1.0f + std::exp(-Z3));

	// the actual output of the network is the activation of the output layer neurons
	actualOutput[0] = O2;
	actualOutput[1] = O3;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 2 -----

	// calculate how much a change in neuron 2 activation affects the cost function
	// deltaCost/deltaO2 = O2 - target0
	float deltaCost_deltaO2 = O2 - desiredOutput[0];

	// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
	// deltaO2/deltaZ2 = O2 * (1 - O2)
	float deltaO2_deltaZ2 = O2 * (1 - O2);

	// calculate how much a change in neuron 2 weighted input affects the cost function.
	// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
	// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
	float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
	deltaCost_deltaBiases[2] = neuron2Error;

	// calculate how much a change in weight4 affects the cost function.
	// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * O0
	// similar thing for weight5
	deltaCost_deltaWeights[4] = neuron2Error * O0;
	deltaCost_deltaWeights[5] = neuron2Error * O1;

	//----- Neuron 3 -----

	// calculate how much a change in neuron 3 activation affects the cost function
	// deltaCost/deltaO3 = O3 - target1
	float deltaCost_deltaO3 = O3 - desiredOutput[1];

	// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
	// deltaO3/deltaZ3 = O3 * (1 - O3)
	float deltaO3_deltaZ3 = O3 * (1 - O3);

	// calculate how much a change in neuron 3 weighted input affects the cost function.
	// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
	// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
	float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
	deltaCost_deltaBiases[3] = neuron3Error;

	// calculate how much a change in weight6 affects the cost function.
	// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * O0
	// similar thing for weight7
	deltaCost_deltaWeights[6] = neuron3Error * O0;
	deltaCost_deltaWeights[7] = neuron3Error * O1;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
	// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
	float deltaCost_deltaO0 = neuron2Error * weights[4] + neuron3Error * weights[6];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
	// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
	float deltaCost_deltaO1 = neuron2Error * weights[5] + neuron3Error * weights[7];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO1/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example4 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example4.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 5000;

	// training data: 0.05, 0.1 in = 0.01, 0.99 out
	const std::array<SExample3Training, 1> c_trainingData = { {
		{{0.05f, 0.1f}, {0.01f, 0.99f}},
	} };

	// starting weight and bias values
	std::array<float, 8> weights = { 0.15f, 0.2f, 0.25f, 0.3f, 0.4f, 0.45f, 0.5f, 0.55f};
	std::array<float, 4> biases = { 0.35f, 0.35f, 0.6f, 0.6f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 8> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 8> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 4> deltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example4RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example4 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

int main (int argc, char **argv)
{
	Example1();
	Example2();
	Example3();
	Example4();
	system("pause");
	return 0;
}

Closing & Links

The sample code outputs csv files showing how the values of the networks change over time. One of the reasons for this is because I want to show you error over time.

Below is example 4’s error over time, as we do it’s 5,000 learning iterations.

The other examples show a similarly shaped graph, where there is a lot of learning in the very beginning, and then there is a very long tail of learning very slowly.

When you train neural networks as I’ve described them, you will almost always see this, and sometimes will also see a slow learning time at the BEGINNING of the training.

This issue is also due to the activation function used, just like the unstable gradient problem, and is also an active area of research.

To help fix this issue, there is something called a “cross entropy cost function” which you can use instead of the mean squared error cost function I have been using.

That cost function essentially cancels out the non linearity of the activation function so that you get nicer linear learning progress, and can get networks to learn more quickly and evenly. However, it only cancels out the non linearity for the LAST layer in the network. This means it’s still a problem for networks that have more layers.

Lastly, there is an entirely different thing you can use backpropagation for. We adjusted the weights and biases to get our desired output for the desired inputs. What if instead we adjusted our inputs to give us the desired outputs?

You can do that by using backpropagation to calculate the dCost/dInput derivatives and using those to adjust the input, in the exact same way we adjusted the weights and biases.

You can use this to do some interesting things, including:

  1. finding images that a network will recognize as a familiar object, that a human wouldn’t. Start with static as input to the network, and adjust inputs to give the desired output.
  2. Modifying images that a network recognizes, into images it doesn’t recognize, but a human would. Start with a well recognized image, and adjust inputs using gradient ASCENT (add the derivatives, don’t subtract them) until the network stops recognizing it.

Believe it or not, this is how all those creepy “deep dream” images were made that came out of google as well, like the one below.

Now that you know the basics, you are ready to learn some more if you are interested. If you still have some questions about things I did or didn’t talk about, these resources might help you make sense of it too. I used these resources and they were all very helpful! You can also give me a shout in the comments below, or on twitter at @Atrix256.

A Step by Step Backpropagation Example
Neural Networks and Deep Learning
Backpropogation is Just Steepest Descent with Automatic Differentiation
Chain Rule
Deep Vis