Neural Network Recipe: Recognize Handwritten Digits With 95% Accuracy

This post is a recipe for making a neural network that can recognize handwritten numeric digits (0-9) with 95% accuracy.

The intent is that you can use this recipe (and the included simple C++ code and interactive web demo!) as a starting point for some hands-on experimentation.

A recent post of mine talks about all of the things used in this recipe, so give it a read if you want more info about anything: How to Train Neural Networks With Backpropagation.

This recipe is also taken straight from this amazing website, where it’s implemented in Python (though I coded mine from scratch in C++): Using neural nets to recognize handwritten digits.

Recipe

The neural network takes as input 28×28 greyscale images, so there will be 784 input neurons.

There is one hidden layer that has 30 neurons.

The final layer is the output layer which has 10 neurons.

The output neuron with the highest activation is the digit that was recognized. For instance if output neuron 0 had the highest activation, the network detected a 0. If output neuron 2 was highest, the network detected a 2.

The neurons use the sigmoid activation function, and the cost function used is half mean squared error.
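
For reference, the sigmoid function and the half mean squared error cost of a single training example look like this:

\sigma(x) = \frac{1}{1+e^{-x}}

Cost = \frac{1}{2}||target-output||^2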

Training uses a learning rate of 3.0 and the training data is processed by the network 30 times (aka 30 training epochs), using a minibatch size of 10.

A minibatch size of 10 just means that we calculate the gradient for 10 training samples at a time and adjust the weights and biases using that gradient. We do that for the entire (shuffled) 60,000 training items and call that a single epoch. 30 epochs mean we do this full process 30 times.
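
Here is a minimal sketch of what one training epoch looks like in code. The helper names are made up for illustration; the real implementation is the Train function in the source code below.

// one epoch of minibatch gradient descent (illustrative sketch; helper names are made up)
Shuffle(trainingOrder);
for (size_t start = 0; start < numTrainingItems; start += miniBatchSize)
{
    size_t end = std::min(start + miniBatchSize, numTrainingItems);
    ClearGradientSums();
    for (size_t i = start; i < end; ++i)
        AccumulateGradient(trainingOrder[i]); // forward pass + backpropagation for one item
    // take a step against the average gradient of the minibatch
    float stepSize = learningRate / float(end - start);
    for (size_t w = 0; w < numWeightsAndBiases; ++w)
        weightsAndBiases[w] -= gradientSums[w] * stepSize;
}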

There are 60,000 items in the training data, mapping 28×28 greyscale images to what digit 0-9 they actually represent.

Besides the 60,000 training data items, there are also 10,000 separate items that are the test data. These test data items are items never seen by the network during training and are just used as a way to see how well the network has learned about the problem in general, versus learning about the specific training data items.

The test and training data is the MNIST data set. I have a link to a zip file I made with the data in it below, but this is where I got the data from: The MNIST database of handwritten digits.

That is the entire recipe!

Results

The 30 training epochs took 1 minute 22 seconds on my machine in release x64 (with REPORT_ERROR_WHILE_TRAINING() set to 0 to speed things up), but the code could be made to run faster by using SIMD, putting it on the GPU, getting multithreading involved or other things.

Below is a graph of the accuracy during the training epochs.

Notice that most of the learning happened very early on, and accuracy only slowly improved from there. This is due to our choice of neuron activation function and cost function. There are better choices for both, and improving them is an ongoing area of neural network research.

The end result of my training run is 95.32% accuracy, but you may get slightly higher or lower due to the random initialization of weights and biases. That sounds pretty high, but if you were actually using this, getting 4 or 5 digits wrong out of every 100 is a pretty big deal! The record for MNIST is 99.77% accuracy, achieved by “a committee of convolutional networks” where the input data was distorted during training to make the network learn in a more generalized way (described as “committee of 35 conv. net, 1-20-P-40-P-150-10 [elastic distortions]”).

A better cost function would probably be the cross entropy cost function, and a better activation function than sigmoid would probably be an ELU (Exponential Linear Unit). A softmax layer could be used instead of just taking the maximum output neuron as the answer. The weights could be initialized to smarter values. We could also use a convolutional layer to let the network learn features in a way that doesn’t tie those features to specific locations in the images.

Many of these things are described in good detail at http://neuralnetworksanddeeplearning.com/, particularly in chapter 3, where they make a Python implementation of a convolutional neural network which performs better than this one. I highly recommend checking that website out!

HTML5 Demo

You can play with a network created with this recipe here: Recognize Handwritten Digit 95% Accuracy

Here is an example of it correctly detecting that I drew a 4.

The demo works “pretty well” but it does have a little less than 95% accuracy.

The reason for this, though, is that it isn’t an apples-to-apples comparison.

A handwritten digit isn’t quite the same as a digit drawn with a mouse. Check out the image below, which shows 100 of the training images, to see what I mean.

The demo finds the bounding box of the drawn image and rescales that bounding box to a 20×20 image, preserving the aspect ratio. It then puts that into a 28×28 image, using the center of mass of the pixels to center the smaller image in the larger one. This is how the MNIST data was generated, so it makes the demo more accurate, and it also has the nice side effect that you can draw a number of any size, in any part of the box, and it will be treated the same as if you drew it at a different size, or in a different part of the box.
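
If you want to replicate that preprocessing yourself, here is a rough sketch of the center of mass step. It assumes a 20×20 float image called small that has already been rescaled from the bounding box; small and large are made-up names.

// center a 20x20 image inside a 28x28 image using the center of mass of its pixels
float totalMass = 0.0f, massX = 0.0f, massY = 0.0f;
for (int y = 0; y < 20; ++y)
    for (int x = 0; x < 20; ++x)
    {
        float p = small[y * 20 + x];
        totalMass += p;
        massX += p * float(x);
        massY += p * float(y);
    }
float comX = massX / totalMass; // center of mass of the small image
float comY = massY / totalMass;
// offset so the center of mass lands at the center of the 28x28 image, clamped so the paste stays in bounds
int offsetX = std::max(0, std::min(8, int(14.0f - comX + 0.5f)));
int offsetY = std::max(0, std::min(8, int(14.0f - comY + 0.5f)));
for (int y = 0; y < 20; ++y)
    for (int x = 0; x < 20; ++x)
        large[(y + offsetY) * 28 + (x + offsetX)] = small[y * 20 + x];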

The code that goes with this post outputs the weights, biases and network structure in a JSON format that is very easy to drop into the HTML5 demo. This way, if you want to try different things in the network, it should be fairly low effort to try your changes in the demo as well.

Lastly, it might be interesting to get the derivatives of the inputs and play around with the input you gave it. Some experiments I can think of (a rough code sketch for the third one follows the list):

  1. When it misclassifies what number you drew, have it adjust what you drew (the input) to be more like what the network would expect to see for that digit. This could help show why it misclassified your number.
  2. Start with a well classified number and make it morph into something recognized by the network as a different number.
  3. Start with a random static (noise) image and adjust it until the network recognizes it as a digit. It would be interesting to see if it looked anything like a number, or if it was still just static.
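
As a starting point for the third experiment, here is a rough sketch that nudges the input image using finite differences instead of analytic input derivatives. It is slow, but it needs no changes to the network code. NetworkOutputForDigit is a hypothetical helper that runs the forward pass and returns the activation of the output neuron for the digit you are targeting.

// hill climb a noise image so that a chosen output neuron activates more strongly
const float epsilon = 0.001f;
const float stepSize = 0.1f;
for (size_t iteration = 0; iteration < 1000; ++iteration)
{
    for (size_t i = 0; i < 28 * 28; ++i)
    {
        // estimate dOutput/dPixel with a central difference
        float oldValue = pixels[i];
        pixels[i] = oldValue + epsilon;
        float plus = NetworkOutputForDigit(pixels, targetDigit);  // hypothetical helper
        pixels[i] = oldValue - epsilon;
        float minus = NetworkOutputForDigit(pixels, targetDigit);
        float derivative = (plus - minus) / (2.0f * epsilon);
        // move the pixel in the direction that increases the activation, staying in [0,1]
        pixels[i] = std::min(1.0f, std::max(0.0f, oldValue + derivative * stepSize));
    }
}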

Source Code

The source code and MNIST data is on github at MNIST1, but is also included below for your convenience.

If grabbing the source code from below instead of github, you will need to extract this zip file into the working directory of the program as well. It contains the training and test data used by the network.
mnist.zip

Note: it’s been reported that on some compilers/OSs, std::random_device will use /dev/urandom, which can result in lower quality random numbers, and in turn a network that learns more slowly or to a lower accuracy. You can force it to use /dev/random instead, which will fix that problem. Details can be found here: http://en.cppreference.com/w/cpp/numeric/random/random_device/random_device
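
On standard libraries that support it (the cppreference page above has the details), the device can be passed as a constructor token, something like this:

std::random_device rd("/dev/random"); // token meaning is implementation-defined; not all standard libraries honor it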

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <cmath>     // for std::exp
#include <random>
#include <array>
#include <vector>
#include <algorithm>
#include <chrono>

typedef uint32_t uint32;
typedef uint16_t uint16;
typedef uint8_t uint8;

// Set to 1 to have it report error after each training epoch, and also write it to an Error.csv file.
// Slows down the process a bit (+~50% time on my machine).
#define REPORT_ERROR_WHILE_TRAINING() 1 

const size_t c_numInputNeurons = 784;
const size_t c_numHiddenNeurons = 30;  // NOTE: setting this to 100 hidden neurons can give better results, but it can also be worse at other times.
const size_t c_numOutputNeurons = 10;

const size_t c_trainingEpochs = 30;
const size_t c_miniBatchSize = 10;
const float c_learningRate = 3.0f;

// ============================================================================================
//                                     SBlockTimer
// ============================================================================================
// times a block of code
struct SBlockTimer
{
	SBlockTimer (const char* label)
	{
		m_start = std::chrono::high_resolution_clock::now();
		m_label = label;
	}

	~SBlockTimer ()
	{
		std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
		printf("%s%0.2f seconds\n", m_label, seconds.count());
	}

	std::chrono::high_resolution_clock::time_point m_start;
	const char* m_label;
};

// ============================================================================================
//                                    MNIST DATA LOADER
// ============================================================================================

inline uint32 EndianSwap (uint32 a)
{
	return (a<<24) | ((a<<8) & 0x00ff0000) |
           ((a>>8) & 0x0000ff00) | (a>>24);
}

// MNIST data and file format description is from http://yann.lecun.com/exdb/mnist/
class CMNISTData
{
public:
	CMNISTData ()
	{
		m_labelData = nullptr;
		m_imageData = nullptr;

		m_imageCount = 0;
		m_labels = nullptr;
		m_pixels = nullptr;
	}

	bool Load (bool training)
	{
		// set the expected image count
		m_imageCount = training ? 60000 : 10000;

		// read labels
		const char* labelsFileName = training ? "train-labels.idx1-ubyte" : "t10k-labels.idx1-ubyte";
		FILE* file = fopen(labelsFileName,"rb");
		if (!file)
		{
			printf("could not open %s for reading.\n", labelsFileName);
			return false;
		}
		fseek(file, 0, SEEK_END);
		long fileSize = ftell(file);
		fseek(file, 0, SEEK_SET);
		m_labelData = new uint8[fileSize];
		fread(m_labelData, fileSize, 1, file);
		fclose(file);

		// read images
		const char* imagesFileName = training ? "train-images.idx3-ubyte" : "t10k-images.idx3-ubyte";
		file = fopen(imagesFileName, "rb");
		if (!file)
		{
			printf("could not open %s for reading.\n", imagesFileName);
			return false;
		}
		fseek(file, 0, SEEK_END);
		fileSize = ftell(file);
		fseek(file, 0, SEEK_SET);
		m_imageData = new uint8[fileSize];
		fread(m_imageData, fileSize, 1, file);
		fclose(file);

		// endian swap label file if needed, just first two uint32's.  The rest is uint8's.
		uint32* data = (uint32*)m_labelData;
		if (data[0] == 0x01080000)
		{
			data[0] = EndianSwap(data[0]);
			data[1] = EndianSwap(data[1]);
		}

		// verify that the label file has the right header
		if (data[0] != 2049 || data[1] != m_imageCount)
		{
			printf("Label data had unexpected header values.\n");
			return false;
		}
		m_labels = (uint8*)&(data[2]);

		// endian swap the image file if needed, just first 4 uint32's. The rest is uint8's.
		data = (uint32*)m_imageData;
		if (data[0] == 0x03080000)
		{
			data[0] = EndianSwap(data[0]);
			data[1] = EndianSwap(data[1]);
			data[2] = EndianSwap(data[2]);
			data[3] = EndianSwap(data[3]);
		}

		// verify that the image file has the right header
		if (data[0] != 2051 || data[1] != m_imageCount || data[2] != 28 || data[3] != 28)
		{
			printf("Label data had unexpected header values.\n");
			return false;
		}
		m_pixels = (uint8*)&(data[4]);

		// convert the pixels from uint8 to float
		m_pixelsFloat.resize(28 * 28 * m_imageCount);
		for (size_t i = 0; i < 28 * 28 * m_imageCount; ++i)
			m_pixelsFloat[i] = float(m_pixels[i]) / 255.0f;

		// success!
		return true;
	}

	~CMNISTData ()
	{
		delete[] m_labelData;
		delete[] m_imageData;
	}

	size_t NumImages () const { return m_imageCount; }

	const float* GetImage (size_t index, uint8& label) const
	{
		label = m_labels[index];
		return &m_pixelsFloat[index * 28 * 28];
	}

private:
	uint8* m_labelData;  // uint8* (not void*) so delete[] in the destructor is well defined
	uint8* m_imageData;

	size_t m_imageCount;
	uint8* m_labels;
	uint8* m_pixels;

	std::vector<float> m_pixelsFloat;
};

// ============================================================================================
//                                    NEURAL NETWORK
// ============================================================================================

template <size_t INPUTS, size_t HIDDEN_NEURONS, size_t OUTPUT_NEURONS>
class CNeuralNetwork
{
public:
	CNeuralNetwork ()
	{
		// initialize weights and biases to a gaussian distribution random number with mean 0, stddev 1.0
		std::random_device rd;
		std::mt19937 e2(rd());
		std::normal_distribution<float> dist(0, 1);

		for (float& f : m_hiddenLayerBiases)
			f = dist(e2);

		for (float& f : m_outputLayerBiases)
			f = dist(e2);

		for (float& f : m_hiddenLayerWeights)
			f = dist(e2);

		for (float& f : m_outputLayerWeights)
			f = dist(e2);
	}

	void Train (const CMNISTData& trainingData, size_t miniBatchSize, float learningRate)
	{
		// shuffle the order of the training data for our mini batches
		if (m_trainingOrder.size() != trainingData.NumImages())
		{
			m_trainingOrder.resize(trainingData.NumImages());
			size_t index = 0;
			for (size_t& v : m_trainingOrder)
			{
				v = index;
				++index;
			}
		}
		static std::random_device rd;
		static std::mt19937 e2(rd());
		std::shuffle(m_trainingOrder.begin(), m_trainingOrder.end(), e2);

		// process all minibatches until we are out of training examples
		size_t trainingIndex = 0;
		while (trainingIndex < trainingData.NumImages())
		{
			// Clear out the minibatch derivatives. We sum them up and then divide at the end of the minibatch.
			std::fill(m_miniBatchHiddenLayerBiasesDeltaCost.begin(), m_miniBatchHiddenLayerBiasesDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchOutputLayerBiasesDeltaCost.begin(), m_miniBatchOutputLayerBiasesDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchHiddenLayerWeightsDeltaCost.begin(), m_miniBatchHiddenLayerWeightsDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchOutputLayerWeightsDeltaCost.begin(), m_miniBatchOutputLayerWeightsDeltaCost.end(), 0.0f);

			// process the minibatch
			size_t miniBatchIndex = 0;
			while (miniBatchIndex < miniBatchSize && trainingIndex < trainingData.NumImages())
			{
				// get the training item
				uint8 imageLabel = 0;
				const float* pixels = trainingData.GetImage(m_trainingOrder[trainingIndex], imageLabel);

				// run the forward pass of the network
				uint8 labelDetected = ForwardPass(pixels, imageLabel);

				// run the backward pass to get derivatives of the cost function
				BackwardPass(pixels, imageLabel);

				// add the current derivatives into the minibatch derivative arrays so we can average them at the end of the minibatch via division.
				for (size_t i = 0; i < m_hiddenLayerBiasesDeltaCost.size(); ++i)
					m_miniBatchHiddenLayerBiasesDeltaCost[i] += m_hiddenLayerBiasesDeltaCost[i];
				for (size_t i = 0; i < m_outputLayerBiasesDeltaCost.size(); ++i)
					m_miniBatchOutputLayerBiasesDeltaCost[i] += m_outputLayerBiasesDeltaCost[i];
				for (size_t i = 0; i < m_hiddenLayerWeightsDeltaCost.size(); ++i)
					m_miniBatchHiddenLayerWeightsDeltaCost[i] += m_hiddenLayerWeightsDeltaCost[i];
				for (size_t i = 0; i < m_outputLayerWeightsDeltaCost.size(); ++i)
					m_miniBatchOutputLayerWeightsDeltaCost[i] += m_outputLayerWeightsDeltaCost[i];

				// note that we've added another item to the minibatch, and that we've consumed another training example
				++trainingIndex;
				++miniBatchIndex;
			}

			// divide minibatch derivatives by how many items were in the minibatch, to get the average of the derivatives.
			// NOTE: instead of doing this explicitly like in the commented code below, we'll do it implicitly
			// by dividing the learning rate by miniBatchIndex.
			/*
			for (float& f : m_miniBatchHiddenLayerBiasesDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchOutputLayerBiasesDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchHiddenLayerWeightsDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchOutputLayerWeightsDeltaCost)
				f /= float(miniBatchIndex);
			*/

			float miniBatchLearningRate = learningRate / float(miniBatchIndex);

			// apply training to biases and weights
			for (size_t i = 0; i < m_hiddenLayerBiases.size(); ++i)
				m_hiddenLayerBiases[i] -= m_miniBatchHiddenLayerBiasesDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_outputLayerBiases.size(); ++i)
				m_outputLayerBiases[i] -= m_miniBatchOutputLayerBiasesDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_hiddenLayerWeights.size(); ++i)
				m_hiddenLayerWeights[i] -= m_miniBatchHiddenLayerWeightsDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_outputLayerWeights.size(); ++i)
				m_outputLayerWeights[i] -= m_miniBatchOutputLayerWeightsDeltaCost[i] * miniBatchLearningRate;
		}
	}

	// This function evaluates the network for the given input pixels and returns the label it thinks it is from 0-9
	uint8 ForwardPass (const float* pixels, uint8 correctLabel)
	{
		// first do hidden layer
		for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
		{
			float Z = m_hiddenLayerBiases[neuronIndex];

			for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
				Z += pixels[inputIndex] * m_hiddenLayerWeights[HiddenLayerWeightIndex(inputIndex, neuronIndex)];

			m_hiddenLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
		}

		// then do output layer
		for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			float Z = m_outputLayerBiases[neuronIndex];

			for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
				Z += m_hiddenLayerOutputs[inputIndex] * m_outputLayerWeights[OutputLayerWeightIndex(inputIndex, neuronIndex)];

			m_outputLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
		}

		// calculate error.
		// this is the magnitude of the vector that is Desired - Actual.
		// Commenting out because it's not needed.
		/*
		{
			error = 0.0f;
			for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
			{
				float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;
				float diff = (desiredOutput - m_outputLayerOutputs[neuronIndex]);
				error += diff * diff;
			}
			error = std::sqrt(error);
		}
		*/

		// find the maximum value of the output layer and return that index as the label
		float maxOutput = m_outputLayerOutputs[0];
		uint8 maxLabel = 0;
		for (uint8 neuronIndex = 1; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			if (m_outputLayerOutputs[neuronIndex] > maxOutput)
			{
				maxOutput = m_outputLayerOutputs[neuronIndex];
				maxLabel = neuronIndex;
			}
		}
		return maxLabel;
	}

	// Functions to get weights/bias values. Used to make the JSON file.
	const std::array<float, HIDDEN_NEURONS>& GetHiddenLayerBiases () const { return m_hiddenLayerBiases; }
	const std::array<float, OUTPUT_NEURONS>& GetOutputLayerBiases () const { return m_outputLayerBiases; }
	const std::array<float, INPUTS * HIDDEN_NEURONS>& GetHiddenLayerWeights () const { return m_hiddenLayerWeights; }
	const std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>& GetOutputLayerWeights () const { return m_outputLayerWeights; }

private:

	static size_t HiddenLayerWeightIndex (size_t inputIndex, size_t hiddenLayerNeuronIndex)
	{
		return hiddenLayerNeuronIndex * INPUTS + inputIndex;
	}

	static size_t OutputLayerWeightIndex (size_t hiddenLayerNeuronIndex, size_t outputLayerNeuronIndex)
	{
		return outputLayerNeuronIndex * HIDDEN_NEURONS + hiddenLayerNeuronIndex;
	}

	// this function uses the neuron output values from the forward pass to backpropagate the error
	// of the network to calculate the gradient needed for training.  It figures out what the error
	// is by comparing the label it came up with to the label it should have come up with (correctLabel).
	void BackwardPass (const float* pixels, uint8 correctLabel)
	{
		// since we are going backwards, do the output layer first
		for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			// calculate deltaCost/deltaBias for each output neuron.
			// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
			//
			// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
			//
			// deltaCost/deltaO = O - desiredOutput
			// deltaO/deltaZ = O * (1 - O)
			//
			float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;

			float deltaCost_deltaO = m_outputLayerOutputs[neuronIndex] - desiredOutput;
			float deltaO_deltaZ = m_outputLayerOutputs[neuronIndex] * (1.0f - m_outputLayerOutputs[neuronIndex]);

			m_outputLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

			// calculate deltaCost/deltaWeight for each weight going into the neuron
			//
			// deltaCost/deltaWeight = deltaCost/deltaZ * deltaZ/deltaWeight
			// deltaCost/deltaWeight = deltaCost/deltaBias * input
			//
			for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
				m_outputLayerWeightsDeltaCost[OutputLayerWeightIndex(inputIndex, neuronIndex)] = m_outputLayerBiasesDeltaCost[neuronIndex] * m_hiddenLayerOutputs[inputIndex];
		}

		// then do the hidden layer
		for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
		{
			// calculate deltaCost/deltaBias for each hidden neuron.
			// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
			//
			// deltaCost/deltaO =
			//   Sum for each output of this neuron:
			//     deltaCost/deltaDestinationZ * deltaDestinationZ/deltaSourceO
			//
			// deltaCost/deltaDestinationZ is already calculated and lives in m_outputLayerBiasesDeltaCost[destinationNeuronIndex].
			// deltaDestinationZ/deltaSourceO is the value of the weight connecting the source and destination neuron.
			//
			// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
			// deltaO/deltaZ = O * (1 - O)
			//
			float deltaCost_deltaO = 0.0f;
			for (size_t destinationNeuronIndex = 0; destinationNeuronIndex < OUTPUT_NEURONS; ++destinationNeuronIndex)
				deltaCost_deltaO += m_outputLayerBiasesDeltaCost[destinationNeuronIndex] * m_outputLayerWeights[OutputLayerWeightIndex(neuronIndex, destinationNeuronIndex)];
			float deltaO_deltaZ = m_hiddenLayerOutputs[neuronIndex] * (1.0f - m_hiddenLayerOutputs[neuronIndex]);
			m_hiddenLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

			// calculate deltaCost/deltaWeight for each weight going into the neuron
			//
			// deltaCost/deltaWeight = deltaCost/deltaZ * deltaZ/deltaWeight
			// deltaCost/deltaWeight = deltaCost/deltaBias * input
			//
			for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
				m_hiddenLayerWeightsDeltaCost[HiddenLayerWeightIndex(inputIndex, neuronIndex)] = m_hiddenLayerBiasesDeltaCost[neuronIndex] * pixels[inputIndex];
		}
	}

private:

	// biases and weights
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiases;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerBiases;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeights;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeights;

	// neuron activation values aka "O" values
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerOutputs;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerOutputs;

	// derivatives of biases and weights for a single training example
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiasesDeltaCost;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerBiasesDeltaCost;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeightsDeltaCost;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeightsDeltaCost;

	// derivatives of biases and weights for the minibatch. Average of all items in minibatch.
	std::array<float, HIDDEN_NEURONS>					m_miniBatchHiddenLayerBiasesDeltaCost;
	std::array<float, OUTPUT_NEURONS>					m_miniBatchOutputLayerBiasesDeltaCost;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_miniBatchHiddenLayerWeightsDeltaCost;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_miniBatchOutputLayerWeightsDeltaCost;

	// used for minibatch generation
	std::vector<size_t>									m_trainingOrder;
};

// ============================================================================================
//                                   DRIVER PROGRAM
// ============================================================================================

// training and test data
CMNISTData g_trainingData;
CMNISTData g_testData;

// neural network
CNeuralNetwork<c_numInputNeurons, c_numHiddenNeurons, c_numOutputNeurons> g_neuralNetwork;

float GetDataAccuracy (const CMNISTData& data)
{
	size_t correctItems = 0;
	for (size_t i = 0, c = data.NumImages(); i < c; ++i)
	{
		uint8 label;
		const float* pixels = data.GetImage(i, label);
		uint8 detectedLabel = g_neuralNetwork.ForwardPass(pixels, label);

		if (detectedLabel == label)
			++correctItems;
	}
	return float(correctItems) / float(data.NumImages());
}

void ShowImage (const CMNISTData& data, size_t imageIndex)
{
	uint8 label = 0;
	const float* pixels = data.GetImage(imageIndex, label);
	printf("showing a %i\n", label);
	for (int iy = 0; iy < 28; ++iy)
	{
		for (int ix = 0; ix < 28; ++ix)
		{
			if (*pixels < 0.125)
				printf(" ");
			else
				printf("+");
			++pixels;
		}
		printf("\n");
	}
}

int main (int argc, char** argv)
{
	// load the MNIST data if we can
	if (!g_trainingData.Load(true) || !g_testData.Load(false))
	{
		printf("Could not load mnist data, aborting!\n");
		system("pause");
		return 1;
	}

	#if REPORT_ERROR_WHILE_TRAINING()
	FILE *file = fopen("Error.csv","w+t");
	if (!file)
	{
		printf("Could not open Error.csv for writing, aborting!\n");
		system("pause");
		return 2;
	}
	fprintf(file, "\"Training Data Accuracy\",\"Testing Data Accuracy\"\n");
	#endif

	{
		SBlockTimer timer("Training Time:  ");

		// train the network, reporting error before each training
		for (size_t epoch = 0; epoch < c_trainingEpochs; ++epoch)
		{
			#if REPORT_ERROR_WHILE_TRAINING()
				float accuracyTraining = GetDataAccuracy(g_trainingData);
				float accuracyTest = GetDataAccuracy(g_testData);
				printf("Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
				printf("Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);
				fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
			#endif

			printf("Training epoch %zu / %zu...\n", epoch+1, c_trainingEpochs);
			g_neuralNetwork.Train(g_trainingData, c_miniBatchSize, c_learningRate);
			printf("\n");
		}
	}

	// report final error
	float accuracyTraining = GetDataAccuracy(g_trainingData);
	float accuracyTest = GetDataAccuracy(g_testData);
	printf("\nFinal Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
	printf("Final Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);

	#if REPORT_ERROR_WHILE_TRAINING()
		fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
		fclose(file);
	#endif

	// Write out the final weights and biases as JSON for use in the web demo
	{
		FILE* file = fopen("WeightsBiasesJSON.txt", "w+t");
		fprintf(file, "{\n");

		// network structure
		fprintf(file, "  \"InputNeurons\":%zu,\n", c_numInputNeurons);
		fprintf(file, "  \"HiddenNeurons\":%zu,\n", c_numHiddenNeurons);
		fprintf(file, "  \"OutputNeurons\":%zu,\n", c_numOutputNeurons);

		// HiddenBiases
		auto hiddenBiases = g_neuralNetwork.GetHiddenLayerBiases();
		fprintf(file, "  \"HiddenBiases\" : [\n");
		for (size_t i = 0; i < hiddenBiases.size(); ++i)
		{
			fprintf(file, "    %f", hiddenBiases[i]);
			if (i < hiddenBiases.size() -1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ],\n");

		// HiddenWeights
		auto hiddenWeights = g_neuralNetwork.GetHiddenLayerWeights();
		fprintf(file, "  \"HiddenWeights\" : [\n");
		for (size_t i = 0; i < hiddenWeights.size(); ++i)
		{
			fprintf(file, "    %f", hiddenWeights[i]);
			if (i < hiddenWeights.size() - 1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ],\n");

		// OutputBiases
		auto outputBiases = g_neuralNetwork.GetOutputLayerBiases();
		fprintf(file, "  \"OutputBiases\" : [\n");
		for (size_t i = 0; i < outputBiases.size(); ++i)
		{
			fprintf(file, "    %f", outputBiases[i]);
			if (i < outputBiases.size() - 1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ],\n");

		// OutputWeights
		auto outputWeights = g_neuralNetwork.GetOutputLayerWeights();
		fprintf(file, "  \"OutputWeights\" : [\n");
		for (size_t i = 0; i < outputWeights.size(); ++i)
		{
			fprintf(file, "    %f", outputWeights[i]);
			if (i < outputWeights.size() - 1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ]\n");

		// done
		fprintf(file, "}\n");
		fclose(file);
	}

	// You can use the code like the below to visualize an image if you want to.
	//ShowImage(g_testData, 0);

	system("pause");
	return 0;
}

Thanks for reading, and if you have any questions, comments, or just want to chat, hit me up in the comments below, or on twitter at @Atrix256.

Neural Network Gradients: Backpropagation, Dual Numbers, Finite Differences

In the post How to Train Neural Networks With Backpropagation I said that you could also calculate the gradient of a neural network by using dual numbers or finite differences.

By special request, this is that post!

The post I already linked to explains backpropagation.

If you want an explanation of dual numbers, check out these posts:

  1. Dual Numbers & Automatic Differentiation
  2. Multivariable Dual Numbers & Automatic Differentiation

If you want an explanation of finite differences, check out this post:
Finite Differences

Since the fundamentals are explained in the links above, we’ll go straight to the code.

We’ll be getting the gradient (learning values) for the network in example 4 in the backpropagation post:

Note that I am using “central differences” for the gradient, but it would be more efficient to do a forward or backward difference, at the cost of some accuracy. I’m using an epsilon of 0.001.
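
For a single parameter x of a function f, the central difference approximation of the derivative is:

\frac{df}{dx} \approx \frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}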

I didn’t compare the running times of each method as my code is meant to be readable, not fast, and the code isn’t doing enough work to make a meaningful performance test IMO. If you did do a speed test, the finite differences should be by far the slowest, backpropagation should be the fastest, and dual numbers are probably going to be closer to backpropagation than to finite differences.

The output of the program is below. Both backpropagation and dual numbers get the exact derivatives (within the tolerances of floating point math, of course!) because they use the chain rule, whereas finite differences is a numerical approximation. This shows up in the fact that backpropagation and dual numbers agree for all values, whereas finite differences has some small error in the derivatives.

And here is the code:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>
  
#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

//----------------------------------------------------------------------
// Dual Number Class - CDualNumber
//----------------------------------------------------------------------

template <size_t NUMVARIABLES>
class CDualNumber
{
public:
 
    // constructor to make a constant
    CDualNumber (float f = 0.0f) {
        m_real = f;
        std::fill(m_dual.begin(), m_dual.end(), 0.0f);
    }
 
    // constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
    CDualNumber (float f, size_t variableIndex) {
        m_real = f;
        std::fill(m_dual.begin(), m_dual.end(), 0.0f);
        m_dual[variableIndex] = 1.0f;
    }

	// Set a constant value.
	void Set (float f) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
	}

	// Set a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
	void Set (float f, size_t variableIndex) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
		m_dual[variableIndex] = 1.0f;
	}
 
    // storage for real and dual values
    float                           m_real;
    std::array<float, NUMVARIABLES> m_dual;
};
  
//----------------------------------------------------------------------
// Dual Number Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real + b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
    return ret;
}
  
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real - b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real * b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real / b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
    return ret;
}

// NOTE: the "special functions" below all just use the chain rule, which you can also use to add more functions
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    float sqrtReal = std::sqrt(a.m_real);
    ret.m_real = sqrtReal;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = std::pow(a.m_real, y);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = y * a.m_dual[i] * std::pow(a.m_real, y - 1.0f);
    return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> exp (const CDualNumber<NUMVARIABLES>& a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = std::exp(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] * std::exp(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = std::sin(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] * std::cos(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = std::cos(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = -a.m_dual[i] * std::sin(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = std::tan(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] / (std::cos(a.m_real) * std::cos(a.m_real));
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = std::atan(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
    return ret;
}
 
// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
    return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Driver Program
//----------------------------------------------------------------------

enum EWeightsBiases
{
	e_weight0 = 0,
	e_weight1,
	e_weight2,
	e_weight3,
	e_weight4,
	e_weight5,
	e_weight6,
	e_weight7,
	e_bias0,
	e_bias1,
	e_bias2,
	e_bias3,

	e_count
};

// our dual number needs a "dual" for every value we want a derivative for: aka every weight and bias
typedef CDualNumber<EWeightsBiases::e_count> TDualNumber;

// templated so it can work for both the dual numbers, as well as the float finite differences
template <typename TBaseType>
void ForwardPass (const std::array<TBaseType, 2>& input, const std::array<TBaseType, 2>& desiredOutput, const std::array<TBaseType, EWeightsBiases::e_count>& weightsBiases, TBaseType& cost)
{
	// calculate hidden layer neuron activations
	TBaseType Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
	TBaseType O0 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z0 * TBaseType(-1.0f)));

	TBaseType Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
	TBaseType O1 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z1 * TBaseType(-1.0f)));

	// calculate output layer neuron activations
	TBaseType Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
	TBaseType O2 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z2 * TBaseType(-1.0f)));

	TBaseType Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
	TBaseType O3 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z3 * TBaseType(-1.0f)));

	// calculate the cost: 0.5 * ||target-actual||^2
	// aka cost = half (error squared)
	TBaseType diff1 = TBaseType(desiredOutput[0]) - O2;
	TBaseType diff2 = TBaseType(desiredOutput[1]) - O3;
	cost = TBaseType(0.5f) * (diff1*diff1 + diff2*diff2);
}

// backpropagation
void ForwardPassAndBackpropagation (
    const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, EWeightsBiases::e_count>& weightsBiases,
    float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, EWeightsBiases::e_count>& deltaWeightsBiases
) {
    // calculate Z0 and O0 for neuron0
    float Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
    float O0 = 1.0f / (1.0f + std::exp(-Z0));
 
    // calculate Z1 and O1 for neuron1
    float Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
    float O1 = 1.0f / (1.0f + std::exp(-Z1));
 
    // calculate Z2 and O2 for neuron2
    float Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
    float O2 = 1.0f / (1.0f + std::exp(-Z2));
 
    // calculate Z3 and O3 for neuron3
    float Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
    float O3 = 1.0f / (1.0f + std::exp(-Z3));
 
    // the actual output of the network is the activation of the output layer neurons
    actualOutput[0] = O2;
    actualOutput[1] = O3;
 
    // calculate error
    float diff0 = desiredOutput[0] - actualOutput[0];
    float diff1 = desiredOutput[1] - actualOutput[1];
    error = std::sqrt(diff0*diff0 + diff1*diff1);
 
    // calculate cost
    cost = 0.5f * error * error;
 
    //----- Neuron 2 -----
 
    // calculate how much a change in neuron 2 activation affects the cost function
    // deltaCost/deltaO2 = O2 - target0
    float deltaCost_deltaO2 = O2 - desiredOutput[0];
 
    // calculate how much a change in neuron 2 weighted input affects neuron 2 activation
    // deltaO2/deltaZ2 = O2 * (1 - O2)
    float deltaO2_deltaZ2 = O2 * (1 - O2);
 
    // calculate how much a change in neuron 2 weighted input affects the cost function.
    // This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
    // This is also deltaCost/deltaBias2 and is also referred to as the error of neuron 2
    float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
	deltaWeightsBiases[e_bias2] = neuron2Error;
 
    // calculate how much a change in weight4 affects the cost function.
    // deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
    // deltaCost/deltaWeight4 = neuron2Error * deltaZ2/deltaWeight4
    // deltaCost/deltaWeight4 = neuron2Error * O0
    // similar thing for weight5
	deltaWeightsBiases[e_weight4] = neuron2Error * O0;
	deltaWeightsBiases[e_weight5] = neuron2Error * O1;
 
    //----- Neuron 3 -----
 
    // calculate how much a change in neuron 3 activation affects the cost function
    // deltaCost/deltaO3 = O3 - target1
    float deltaCost_deltaO3 = O3 - desiredOutput[1];
 
    // calculate how much a change in neuron 3 weighted input affects neuron 3 activation
    // deltaO3/deltaZ3 = O3 * (1 - O3)
    float deltaO3_deltaZ3 = O3 * (1 - O3);
 
    // calculate how much a change in neuron 3 weighted input affects the cost function.
    // This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
    // This is also deltaCost/deltaBias3 and is also referred to as the error of neuron 3
    float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
	deltaWeightsBiases[e_bias3] = neuron3Error;
 
    // calculate how much a change in weight6 affects the cost function.
    // deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
    // deltaCost/deltaWeight6 = neuron3Error * deltaZ3/deltaWeight6
    // deltaCost/deltaWeight6 = neuron3Error * O0
    // similar thing for weight7
	deltaWeightsBiases[e_weight6] = neuron3Error * O0;
	deltaWeightsBiases[e_weight7] = neuron3Error * O1;
 
    //----- Neuron 0 -----
 
    // calculate how much a change in neuron 0 activation affects the cost function
    // deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
    // deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
    float deltaCost_deltaO0 = neuron2Error * weightsBiases[e_weight4] + neuron3Error * weightsBiases[e_weight6];
 
    // calculate how much a change in neuron 0 weighted input affects neuron 0 activation
    // deltaO0/deltaZ0 = O0 * (1 - O0)
    float deltaO0_deltaZ0 = O0 * (1 - O0);
 
    // calculate how much a change in neuron 0 weighted input affects the cost function.
    // This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
    // This is also deltaCost/deltaBias0 and is also referred to as the error of neuron 0
    float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaWeightsBiases[e_bias0] = neuron0Error;
 
    // calculate how much a change in weight0 affects the cost function.
    // deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
    // deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
    // deltaCost/deltaWeight0 = neuron0Error * input0
    // similar thing for weight1
	deltaWeightsBiases[e_weight0] = neuron0Error * input[0];
	deltaWeightsBiases[e_weight1] = neuron0Error * input[1];
 
    //----- Neuron 1 -----
 
    // calculate how much a change in neuron 1 activation affects the cost function
    // deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
    // deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
    float deltaCost_deltaO1 = neuron2Error * weightsBiases[e_weight5] + neuron3Error * weightsBiases[e_weight7];
 
    // calculate how much a change in neuron 1 weighted input affects neuron 1 activation
    // deltaO1/deltaZ1 = O1 * (1 - O1)
    float deltaO1_deltaZ1 = O1 * (1 - O1);
 
    // calculate how much a change in neuron 1 weighted input affects the cost function.
    // This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
    // This is also deltaCost/deltaBias1 and is also referred to as the error of neuron 1
    float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaWeightsBiases[e_bias1] = neuron1Error;
 
    // calculate how much a change in weight2 affects the cost function.
    // deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
    // deltaCost/deltaWeight2 = neuron1Error * deltaZ1/deltaWeight2
    // deltaCost/deltaWeight2 = neuron1Error * input0
    // similar thing for weight3
	deltaWeightsBiases[e_weight2] = neuron1Error * input[0];
	deltaWeightsBiases[e_weight3] = neuron1Error * input[1];
}
 
int main (int argc, char **argv)
{

	// weights and biases, inputs and desired output
	const std::array<float, EWeightsBiases::e_count> weightsBiases =
	{
		0.15f, // e_weight0
		0.2f,  // e_weight1
		0.25f, // e_weight2
		0.3f,  // e_weight3
		0.4f,  // e_weight4
		0.45f, // e_weight5
		0.5f,  // e_weight6
		0.55f, // e_weight7
		0.35f, // e_bias0
		0.35f, // e_bias1
		0.6f,  // e_bias2
		0.6f   // e_bias3
	};

	const std::array<float, 2> inputs = 
	{
		0.05f,
		0.1f
	};

	std::array<float, 2> desiredOutput = {
		0.01f,
		0.99f
	};

	// =====================================================
	// ===== FINITE DIFFERENCES CALCULATED DERIVATIVES =====
	// =====================================================

	std::array<float, EWeightsBiases::e_count> gradientFiniteDifferences;
	std::array<float, EWeightsBiases::e_count> weightsBiasesFloat;
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
		weightsBiasesFloat[i] = weightsBiases[i];

	// use central differences to approximate the gradient
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
	{
		float costSample1 = 0.0f;
		weightsBiasesFloat[i] = weightsBiases[i] - EPSILON;
		ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample1);

		float costSample2 = 0.0f;
		weightsBiasesFloat[i] = weightsBiases[i] + EPSILON;
		ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample2);

		gradientFiniteDifferences[i] = (costSample2 - costSample1) / (EPSILON * 2.0f);

		weightsBiasesFloat[i] = weightsBiases[i];
	}

	// ==============================================
	// ===== DUAL NUMBER CALCULATED DERIVATIVES =====
	// ==============================================

	// dual number weights and biases
	std::array<TDualNumber, EWeightsBiases::e_count> weightsBiasesDual;
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
		weightsBiasesDual[i].Set(weightsBiases[i], i);

	// dual number inputs and desired output
	std::array<TDualNumber, 2> inputsDual;
	for (size_t i = 0; i < inputsDual.size(); ++i)
		inputsDual[i].Set(inputs[i]);

	std::array<TDualNumber, 2> desiredOutputDual;
	for (size_t i = 0; i < desiredOutputDual.size(); ++i)
		desiredOutputDual[i].Set(desiredOutput[i]);

	// dual number cost; its dual components hold the derivatives (the gradient)
	TDualNumber gradientDualNumbers;
	ForwardPass(inputsDual, desiredOutputDual, weightsBiasesDual, gradientDualNumbers);

	// ==================================================
	// ===== BACKPROPAGATION CALCULATED DERIVATIVES =====
	// ==================================================

	float error;
	float cost;
	std::array<float, 2> actualOutput;
	std::array<float, EWeightsBiases::e_count> gradientBackPropagation;
	ForwardPassAndBackpropagation(inputs, desiredOutput, weightsBiases, error, cost, actualOutput, gradientBackPropagation);

	// ==========================
	// ===== Report Results =====
	// ==========================

	printf("Neural Network GradientnnBackpropagation     Dual Numbers (Error)       Finite Differences (Error)n");
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
	{
		float diffDual = gradientBackPropagation[i] - gradientDualNumbers.m_dual[i];
		float diffFinitDifferences = gradientBackPropagation[i] - gradientFiniteDifferences[i];
		printf("% 08f,         % 08f (% 08f),     % 08f (% 08f)n",
			gradientBackPropagation[i],
			gradientDualNumbers.m_dual[i], diffDual,
			gradientFiniteDifferences[i], diffFinitDifferences
		);
	}
	printf("n");

	system("pause");
    return 0;
}

How to Train Neural Networks With Backpropagation

This post is an attempt to demystify backpropagation, which is the most common method for training neural networks. This post is broken into a few main sections:

  1. Explanation
  2. Working through examples
  3. Simple sample C++ source code using only standard includes
  4. Links to deeper resources to continue learning

Let’s talk about the basics of neural nets to start out, specifically multilayer perceptrons. This is a common type of neural network, and is the type we will be talking about today. There are other types of neural networks though, such as convolutional neural networks, recurrent neural networks, Hopfield networks and more. The good news is that backpropagation applies to most other types of neural networks too, so what you learn here will be applicable to other types of networks.

Basics of Neural Networks

A neural network is made up of layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron. This network takes 3 floating point values as input, passes them through 4 neurons in a hidden layer and outputs two floating point values. The hidden layer neurons and the output layer neurons process the values they are given, but the input neurons do not.

To calculate the output value of a single neuron, you multiply every input into that neuron by a weight for that input, sum them up, and add a bias that is set for the neuron. This “weighted input” value is fed into an activation function and the result is the output value of that neuron. Here is a diagram for a single neuron:

The code for calculating the output of a single neuron could look like this:

// start with the neuron's bias, then add each input times its weight
float weightedInput = bias;

for (size_t i = 0; i < inputs.size(); ++i)
  weightedInput += inputs[i] * weights[i];

// the neuron's output is the activation function applied to the weighted input
float output = Activation(weightedInput);

To evaluate an entire network of neurons, you just repeat this process for all neurons in the network, going from left to right (from input to output).

Neural networks are basically black boxes. We train them to give specific outputs when we give them specific inputs, but it is often difficult to understand what it is that they’ve learned, or what part of the data they are picking up on.

Training a neural network just means that we adjust the weight and bias values such that when we give specific inputs, we get the desired outputs from the network. Being able to figure out what weights and biases to use can be tricky, especially for networks with lots of layers and lots of neurons per layer. This post talks about how to do just that.

Regarding training, there is a funny story where some people trained a neural network to say whether or not a military tank was in a photograph. It had a very high accuracy rate with the test data they trained it with, but when they used it with new data, it had terrible accuracy. It turns out that the training data was a bit flawed. Pictures of tanks were all taken on a sunny day, and the pictures without tanks were taken on a cloudy day. The network learned how to detect whether a picture was of a sunny day or a cloudy day, not whether there was a tank in the photo or not!

This is one type of pitfall to watch out for when dealing with neural networks – having good training data – but there are many other pitfalls to watch out for too. Architecting and training neural networks is quite literally an art form. If it were painting, this post would be teaching you how to hold a brush and what the primary colors are. There are many, many techniques to learn beyond what is written here to use as tools in your toolbox. The information in this post will allow you to succeed in training neural networks, but there is a lot more to learn to get higher levels of accuracy from your nets!

Neural Networks Learn Using Gradient Descent

Let’s take a look at a simple neural network where we’ve chosen random values for the weights and the bias:

If given two floating point inputs, we’d calculate the output of the network like this:

Output = Activation(Input0 * Weight0 + Input1 * Weight1 + Bias)

Plugging in the specific values for the weights and biases, it looks like this:

Output = Activation(Input0 * 0.23 + Input1 * -0.1 + 0.3)

Let’s say that we want this network to output a zero when we give an input of 1,0, and that we don’t care what it outputs otherwise. We’ll plug 1 and 0 in for Input0 and Input1 respectively and see what the output of the network is right now:

Output = Activation(1 * 0.23 + 0 * -0.1 + 0.3)
Output = Activation(0.53)

For the activation function, we are going to use a common one called the sigmoid activation function, which is also sometimes called the logistic activation function. It looks like this:

\sigma(x) = \frac{1}{1+e^{-x}}

Without going into too much detail, the reason why sigmoid is so commonly used is because it’s a smoother and differentiable version of the step function.
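
It also has a convenient derivative that can be written in terms of its own output, which comes in handy when doing backpropagation:

\frac{d\sigma}{dx} = \sigma(x)(1-\sigma(x))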

Applying that activation function to our output neuron, we get this:

Output = Activation(0.53)
Output = \sigma(0.53)
Output = 0.6295

So, we plugged in 1 and 0, but instead of getting a 0 out, we got 0.6295. Our weights and biases are wrong, but how do we correct them?

The secret to correcting our weights and biases is whichever of these terms seem least scary to you: slopes, derivatives, gradients.

If “slope” was the least scary term to you, you probably remember the line formula y=mx+b and that the m value was the “rise over run” or the slope of the line. Well believe it or not, that’s all a derivative is. A derivative is just the slope of a function at a specific point on that function. Even if a function is curved, you can pick a point on the graph and get a slope at that point. The notation for a derivative is \frac{dy}{dx}, which literally means “change in y divided by change in x”, or “delta y divided by delta x”, which is literally rise over run.

In the case of a linear function (a line), it has the same derivative over the entire thing, so you can take a step size of any size on the x axis and multiply that step size by \frac{dy}{dx} to figure out how much to add or subtract from y to stay on the line.

In the case of a non linear function, the derivative can change from one point to the next, so this slope is only guaranteed to be accurate for an infinitely small step size. In practice, people often just use “small” step sizes and call it good enough, which is what we’ll be doing momentarily.

Now that you realize you already knew what a derivative is, we have to talk about partial derivatives. There really isn’t anything very scary about them and they still mean the exact same thing – they are the slope! They are even calculated the exact same way, but they use a fancier looking d in their notation: \frac{\partial y}{\partial x}.

The reason partial derivatives even exist is because if you have a function of multiple variables like z=f(x,y)=x^2+3y+2, you have two variables that you can take the derivative of. You can calculate \frac{\partial z}{\partial x} and \frac{\partial z}{\partial y}. The first value tells you how much the z value changes for a change in x, the second value tells you how much the z value changes for a change in y.

By the way, if you are curious, the partial derivatives for that function above are below. When calculating partial derivatives, you just treat any variable that isn’t the one you care about as a constant and differentiate normally.

\frac{\partial z}{\partial x} = 2x
\frac{\partial z}{\partial y} = 3

If you put both of those values together into a vector (\frac{\partial z}{\partial x},\frac{\partial z}{\partial y}) you have what is called the gradient vector.

The gradient vector has an interesting property, which is that it points in the direction that makes the function output grow the most. Basically, if you think of your function as a surface, it points up the steepest direction of the surface, from the point you evaluated the function at.

We are going to use that property to train our neural network by doing the following (a small code sketch of the update follows the list):

  1. Calculate the gradient of a function that describes the error in our network. This means we will have the partial derivatives of all the weights and biases in the network.
  2. Multiply the gradient by a small “learning rate” value, such as 0.05
  3. Subtract these scaled derivatives from the weights and biases to decrease the error a small amount.
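
Putting those steps into code, a single update is just this (a minimal sketch, where weightsBiases and gradient are illustrative names):

// one step of gradient descent: move every weight and bias a small amount against its partial derivative
const float learningRate = 0.05f;
for (size_t i = 0; i < weightsBiases.size(); ++i)
    weightsBiases[i] -= gradient[i] * learningRate;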

This technique is called steepest gradient descent (SGD), and when we do the above, our error will decrease by a small amount. The one exception is if we use too large of a learning rate; it’s possible to make the error grow instead, but usually the error will decrease.

We will do the above over and over, until either the error is small enough, or we’ve decided we’ve tried enough iterations that we think the neural network is never going to learn the things we want to teach it. If the network doesn’t learn, it means it needs to be re-architected with a different structure, different numbers of neurons and layers, different activation functions, etc. This is part of the “art” that I mentioned earlier.

Before moving on, there is one last thing to talk about: global minimums vs local minimums.

Imagine that the function describing the error in our network is visualized as bumpy ground. When we initialize our weights and biases to random numbers we are basically just choosing a random location on the ground to start at. From there, we act like a ball, and just roll down hill from wherever we are. We are definitely going to get to the bottom of SOME bump / hole in the ground, but there is absolutely no reason to expect that we’ll get to the bottom of the DEEPEST bump / hole.

The problem is that SGD will find a LOCAL minimum (whichever one we are closest to), but it might not find the GLOBAL minimum.

In practice, this doesn’t seem to be too large of a problem, at least for people casually using neural nets like you and me, but it is one of the active areas of research in neural networks: how do we do a better job of finding the global minimum?

You might notice the strange language I’m using where I say we have a function that describes the error, instead of just saying we use the error itself. The function I’m talking about is called the “cost function” and the reason for this is that different ways of describing the error give us different desirable properties.

For instance, a common cost function is to use mean squared error of the actual output compared to the desired output.

For a single training example, you plug the input into the network and calculate the output. You then plug the actual output and the target output into the function below:

Cost = ||target-output||^2

In other words, you subtract the vector of actual neuron outputs from the vector of target outputs, calculate the length of the resulting vector, and square it. This gives you the squared error.

The reason we square the error in the cost function is so that error in either direction is a positive number, which means that when gradient descent does its work, we’ll find the smallest magnitude of error, regardless of whether it’s positive or negative. We could use absolute value instead, but absolute value isn’t differentiable at zero, while squaring is differentiable everywhere. As a side note, the examples later in this post use HALF of the squared error; the 0.5 cancels the 2 that falls out when differentiating the square, which leaves us with the cleaner derivative (output - target).

To handle calculating the cost of multiple inputs and outputs, you just take the average of the squared error for each piece of training data. This gives you the mean squared error as the cost function across all inputs. You also average the derivatives to get the combined gradient.
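As a sketch of what that might look like in code, here is a mean squared error calculation over a whole training set. The single sigmoid neuron from example 1 (further below) is used here as a stand-in network; in real usage EvaluateNetwork() would run your actual network:

#include <array>
#include <cmath>
#include <vector>

// stand-in network: a single sigmoid neuron with fixed weight and bias values.
float EvaluateNetwork (float input)
{
	const float weight = 0.3f;
	const float bias = 0.5f;
	return 1.0f / (1.0f + std::exp(-(input * weight + bias)));
}

// calculate the mean squared error cost across a whole training set.
// each training item is { input, target output }.
float CalculateCost (const std::vector<std::array<float, 2>>& trainingData)
{
	float totalCost = 0.0f;
	for (const std::array<float, 2>& item : trainingData)
	{
		float error = item[1] - EvaluateNetwork(item[0]); // target minus actual
		totalCost += error * error;                       // squared error
	}
	return totalCost / (float)trainingData.size();        // mean over all items
}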

More on Training

Before we go into backpropagation, I want to re-iterate this point: Neural Networks Learn Using Gradient Descent.

All you need is the gradient vector of the cost function, aka the partial derivatives of all the weights and the biases for the cost.

Backpropagation gets you the gradient vector, but it isn’t the only way to do so!

Another way to do it is to use dual numbers which you can read about on my post about them: Multivariable Dual Numbers & Automatic Differentiation.

Using dual numbers, you would evaluate the output of the network, using dual numbers instead of floating point numbers, and at the end you’d have your gradient vector. It’s not quite as efficient as backpropagation (or so I’ve heard, I haven’t tried it), but if you know how dual numbers work, it’s super easy to implement.

Another way to get the gradient vector is by doing so numerically using finite differences. You can read about numerical derivatives on my post here: Finite Differences

Basically, if you were trying to calculate the partial derivative of a weight, like \frac{\partial Cost}{\partial Weight0}, you would first calculate the cost of the network as usual, then you would add a small value to Weight0 and evaluate the cost again. You subtract the old cost from the new cost, and divide by the small value you added to Weight0. This gives you the partial derivative for that weight value. You’d repeat this for all your weights and biases.

Since realistic neural networks often have MANY MANY weights and biases, calculating the gradient numerically is a REALLY REALLY slow process because of how many times you have to run the network to get cost values with adjusted weights. The only upside is that this method is even easier to implement than dual numbers. You can literally stop reading and go do this right now if you want to 😛
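If you want a starting point, here’s a minimal sketch. The costFunction parameter stands in for evaluating the network with the given weights and returning the cost:

#include <functional>
#include <vector>

// numerically estimate the gradient of a cost function via finite differences.
// costFunction evaluates the network with the given weights and returns the cost.
std::vector<float> NumericGradient (std::vector<float> weights,
	const std::function<float(const std::vector<float>&)>& costFunction)
{
	const float c_epsilon = 0.001f;
	const float baseCost = costFunction(weights);
	std::vector<float> gradient(weights.size());
	for (size_t i = 0; i < weights.size(); ++i)
	{
		const float oldWeight = weights[i];
		weights[i] += c_epsilon;                                      // nudge one weight
		gradient[i] = (costFunction(weights) - baseCost) / c_epsilon; // rise over run
		weights[i] = oldWeight;                                       // put it back
	}
	return gradient;
}

Note that the network gets fully re-evaluated once per weight, which is exactly where all the time goes.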

Lastly, there is a way to train neural networks which doesn’t use derivatives or the gradient vector, but instead uses the more brute force-ish method of genetic algorithms.

Using genetic algorithms to train neural networks is a huge topic even to summarize, but basically you create a bunch of random networks, see how they do, and try combining features of networks that did well. You also let some of the “losers” reproduce as well, and add in some random mutation to help stay out of local minimums. Repeat this for many many generations, and you can end up with a well trained network!

Here’s a fun video visualizing neural networks being trained by genetic algorithms: Youtube: Learning using a genetic algorithm on a neural network

Backpropagation is Just the Chain Rule!

Going back to our talk of dual numbers for a second, dual numbers are useful for what is called “forward mode automatic differentiation”.

Backpropagation actually uses “reverse mode automatic differentiation”, so the two techniques are pretty closely tied, but they are both made possible by what is known as the chain rule.

The chain rule basically says that if you can write a derivative like this:

dy/dx

That you can also write it like this:

dy/du * du/dx

That might look weird or confusing, but since we know that derivatives are actual values, aka actual ratios, aka actual FRACTIONS, let’s think back to fractions for a moment.

3/2 = 1.5

So far so good? Now let’s choose some number out of the air – say, 5 – and do the same thing we did with the chain rule
3/2 = \\ 3/5 * 5/2 = \\ 15/10 = \\ 3/2 = \\ 1.5

Due to doing the reverse of cross cancellation, we are able to inject multiplicative terms into fractions (and derivatives!) and come up with the same answer.

Ok, but who cares?

Well, when we are evaluating the output of a neural network for given input, we have lots of equations nested in each other. We have neurons feeding into neurons feeding into neurons etc, with the logistic activation function at each step.

Instead of trying to figure out how to calculate the derivatives of the weights and biases for the entire monster equation (it’s common to have hundreds or thousands of neurons or more!), we can instead calculate derivatives for each step we do when evaluating the network and then compose them together.

Basically, we can break the problem into small bites instead of having to deal with the equation in its entirety.

Instead of calculating the derivative of how a specific weight affects the cost directly, we can instead calculate these:

  1. dCost/dOutput: The derivative of how a neuron’s output affects cost
  2. dOutput/dWeightedInput: The derivative of how the weighted input of a neuron affects a neuron’s output
  3. dWeightedInput/dWeight: The derivative of how a weight affects the weighted input of a neuron

Then, when we multiply them all together, we get the real value that we want:
dCost/dOutput * dOutput/dWeightedInput * dWeightedInput/dWeight = dCost/dWeight
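For the sigmoid activation and the half squared error cost used in this post, each of those three pieces has a simple closed form, and these are exactly the formulas the sample code at the end of the post uses:

\frac{\partial Cost}{\partial Output} = Output - Target \\ \frac{\partial Output}{\partial WeightedInput} = Output * (1 - Output) \\ \frac{\partial WeightedInput}{\partial Weight} = Input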

Now that we understand all the basic parts of back propagation, I think it’d be best to work through some examples of increasing complexity to see how it all actually fits together!

Backpropagation Example 1: Single Neuron, One Training Example

This example takes one input and uses a single neuron to make one output. The neuron is only trained to output a 0 when given a 1 as input; all other behavior is undefined. This is implemented as the Example1() function in the sample code.
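If you work through one training iteration by hand, using the starting weight of 0.3 and bias of 0.5 that the sample code uses, the forward and backward passes look like this:

Z = input * weight + bias = 1 * 0.3 + 0.5 = 0.8 \\ O = \sigma(0.8) = 0.6900 \\ \\ \frac{\partial Cost}{\partial O} = O - target = 0.6900 \\ \frac{\partial O}{\partial Z} = O * (1 - O) = 0.2139 \\ \frac{\partial Cost}{\partial Z} = 0.6900 * 0.2139 = 0.1476 \\ \\ \frac{\partial Cost}{\partial Bias} = 0.1476 \\ \frac{\partial Cost}{\partial Weight} = \frac{\partial Cost}{\partial Z} * input = 0.1476 * 1 = 0.1476

Subtracting these derivatives (scaled by a 0.5 learning rate) from the weight and bias, and repeating for 10,000 iterations, brings the error down to 0.007176.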

Backpropagation Example 2: Single Neuron, Two Training Examples

Let’s start with the same neural network from last time.

This time, we are going to teach it not only that it should output 0 when given a 1, but also that it should output 1 when given a 0.

We have two training examples, and we are training the neuron to act like a NOT gate. This is implemented as the Example2() function in the sample code.

The first thing we do is calculate the derivatives (gradient vector) for each of the two training examples.

We already calculated the “input 1, output 0” derivatives in the last example:
\frac{\partial Cost}{\partial Weight} = 0.1476 \\ \frac{\partial Cost}{\partial Bias} = 0.1476

If we follow the same steps with the “input 0, output 1” training example we get these:
\frac{\partial Cost}{\partial Weight} = 0.0 \\ \frac{\partial Cost}{\partial Bias} = -0.0887
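If you want to verify those numbers, the forward and backward pass for this second training example (starting from the same weight of 0.3 and bias of 0.5) works out like this:

Z = 0 * 0.3 + 0.5 = 0.5 \\ O = \sigma(0.5) = 0.6225 \\ \\ \frac{\partial Cost}{\partial O} = O - target = -0.3775 \\ \frac{\partial O}{\partial Z} = O * (1 - O) = 0.2350 \\ \frac{\partial Cost}{\partial Z} = -0.3775 * 0.2350 = -0.0887 \\ \\ \frac{\partial Cost}{\partial Bias} = -0.0887 \\ \frac{\partial Cost}{\partial Weight} = -0.0887 * 0 = 0

The weight derivative is zero because the input is zero: a weight can’t affect the output when the value it multiplies is zero.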

To get the actual derivatives to train the network with, we just average them!
\frac{\partial Cost}{\partial Weight} = 0.0738 \\ \frac{\partial Cost}{\partial Bias} = 0.0294

From there, we do the same adjustments as before to the weight and bias values to get a weight of 0.2631 and a bias of 0.4853.

If you are wondering how to calculate the cost, again you just take the cost of each training example and average them. Adjusting the weight and bias values causes the cost to drop from 0.1547 to 0.1515, so we have made progress.

It takes 10 times as many iterations with these two training examples to get the same level of error as it did with only one training example though.

As we saw in the last example, after 10,000 iterations, the error was 0.007176.

In this example, after 100,000 iterations, the error is 0.007141. At that point, weight is -9.879733 and bias is 4.837278

Backpropagation Example 3: Two Neurons in One Layer

Here is the next example, implemented as Example3() in the sample code. Two input neurons feed to two neurons in a single layer giving two outputs.

Let’s look at how we’d calculate the derivatives needed to train this network, using the training example that says when we give the network 01 as input, it should give 10 as output.

First comes the forward pass where we calculate the network’s output when we give it 01 as input.

Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0*0.2+1*0.8+0.5 \\ Z0=1.3 \\ \\ O0=\sigma(1.3) \\ O0=0.7858\\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0*0.6+1*0.4+0.1\\ Z1=0.5\\ \\ O1=\sigma(0.5)\\ O1=0.6225

Next we calculate a cost. We don’t strictly need to do this step since we don’t use this value during backpropagation, but this will be useful to verify that we’ve improved things after an iteration of training.

Cost=0.5*||target-actual||^2\\ Cost=0.5*||(1,0)-(0.7858,0.6225)||^2\\ Cost=0.5*||(0.2142,-0.6225)||^2\\ Cost=0.5*0.6583^2\\ Cost=0.2167

Now we begin the backwards pass to calculate the derivatives that we’ll need for training.

Let’s calculate dCost/dZ0 aka the error in neuron 0. We’ll do this by calculating dCost/dO0, then dO0/dZ0 and then multiplying them together to get dCost/dZ0. Just like before, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias0.

\frac{\partial Cost}{\partial O0}=O0-target0\\ \frac{\partial Cost}{\partial O0}=0.7858-1\\ \frac{\partial Cost}{\partial O0}=-0.2142\\ \\ \frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.7858 * 0.2142\\ \frac{\partial O0}{\partial Z0} = 0.1683\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = -0.2142 * 0.1683\\ \frac{\partial Cost}{\partial Z0} = -0.0360\\ \\ \frac{\partial Cost}{\partial Bias0} = -0.0360

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1 by multiplying it by dZ0/dWeight0 and dZ0/dWeight1, which are input0 and input1 respectively.

\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0} \\ \frac{\partial Cost}{\partial Weight0} = -0.0360 * 0 \\ \frac{\partial Cost}{\partial Weight0} = 0\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1} \\ \frac{\partial Cost}{\partial Weight1} = -0.0360 * 1 \\ \frac{\partial Cost}{\partial Weight1} = -0.0360

Next we need to calculate dCost/dZ1 aka the error in neuron 1. We’ll do this like before: we’ll calculate dCost/dO1, then dO1/dZ1, and then multiply them together to get dCost/dZ1. Again, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias1.

\frac{\partial Cost}{\partial O1}=O1-target1\\ \frac{\partial Cost}{\partial O1}=0.6225-0\\ \frac{\partial Cost}{\partial O1}=0.6225\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.6225 * 0.3775\\ \frac{\partial O1}{\partial Z1} = 0.235\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.6225 * 0.235\\ \frac{\partial Cost}{\partial Z1} = 0.1463\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.1463

Just like with neuron 0, we can use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3 by multiplying it by dZ1/dWeight2 and dZ1/dWeight3, which are input0 and input1 respectively.

\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2} \\ \frac{\partial Cost}{\partial Weight2} = 0.1463 * 0 \\ \frac{\partial Cost}{\partial Weight2} = 0\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3} \\ \frac{\partial Cost}{\partial Weight3} = 0.1463 * 1 \\ \frac{\partial Cost}{\partial Weight3} = 0.1463

After using these derivatives to update the weights and biases with a learning rate of 0.5, they become:
Weight0 = 0.2
Weight1 = 0.818
Weight2 = 0.6
Weight3 = 0.3269
Bias0 = 0.518
Bias1 = 0.0269

Using these values, the cost becomes 0.1943, which dropped from 0.2167, so we have indeed made progress with our learning!

Interestingly, it takes about twice as many training iterations as example 1 to get a similar level of error. In this case, 20,000 iterations of learning results in an error of 0.007142.

If we have the network learn the four patterns below instead (the first output bit is the OR of the inputs, and the second output bit is the AND):
00 = 00
01 = 10
10 = 10
11 = 11

It takes 520,000 learning iterations to get to an error of 0.007223.

Backpropagation Example 4: Two Layers, Two Neurons Each

This is the last example, implemented as Example4() in the sample code. Two input neurons feed to two neurons in a hidden layer, feeding into two neurons in the output layer giving two outputs. This is the exact same network that is walked through on this page which is also linked to at the end of this post: A Step by Step Backpropagation Example

First comes the forward pass where we calculate the network’s output. We’ll give it 0.05 and 0.1 as input, and we’ll say our desired output is 0.01 and 0.99.

Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0.05*0.15+0.1*0.2+0.35 \\ Z0=0.3775 \\ \\ O0=\sigma(0.3775) \\ O0=0.5933 \\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0.05*0.25+0.1*0.3+0.35\\ Z1=0.3925\\ \\ O1=\sigma(0.3925)\\ O1=0.5969\\ \\ Z2=O0*weight4+O1*weight5+bias2\\ Z2=0.5933*0.4+0.5969*0.45+0.6\\ Z2=1.106\\ \\ O2=\sigma(1.106)\\ O2=0.7514\\ \\ Z3=O0*weight6+O1*weight7+bias3\\ Z3=0.5933*0.5+0.5969*0.55+0.6\\ Z3=1.225\\ \\ O3=\sigma(1.225)\\ O3=0.7729

Next we calculate the cost, taking O2 and O3 as our actual output, and 0.01 and 0.99 as our target (desired) output.

Cost=0.5*||target-actual||^2\\ Cost=0.5*||(0.01,0.99)-(0.7514,0.7729)||^2\\ Cost=0.5*||(-0.7414,-0.2171)||^2\\ Cost=0.5*0.7725^2\\ Cost=0.2984

Now we start the backward pass to calculate the derivatives for training.

Neuron 2

First we’ll calculate dCost/dZ2 aka the error in neuron 2, remembering that the value is also dCost/dBias2.

\frac{\partial Cost}{\partial O2}=O2-target0\\ \frac{\partial Cost}{\partial O2}=0.7514-0.01\\ \frac{\partial Cost}{\partial O2}=0.7414\\ \\ \frac{\partial O2}{\partial Z2} = O2 * (1-O2)\\ \frac{\partial O2}{\partial Z2} = 0.7514 * 0.2486\\ \frac{\partial O2}{\partial Z2} = 0.1868\\ \\ \frac{\partial Cost}{\partial Z2} = \frac{\partial Cost}{\partial O2} * \frac{\partial O2}{\partial Z2}\\ \frac{\partial Cost}{\partial Z2} = 0.7414 * 0.1868\\ \frac{\partial Cost}{\partial Z2} = 0.1385\\ \\ \frac{\partial Cost}{\partial Bias2} = 0.1385

We can use dCost/dZ2 to calculate dCost/dWeight4 and dCost/dWeight5.

\frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight4}\\ \frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * O0\\ \frac{\partial Cost}{\partial Weight4} = 0.1385 * 0.5933\\ \frac{\partial Cost}{\partial Weight4} = 0.0822\\ \\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight5}\\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * O1\\ \frac{\partial Cost}{\partial Weight5} = 0.1385 * 0.5969\\ \frac{\partial Cost}{\partial Weight5} = 0.0827\\

Neuron 3

Next we’ll calculate dCost/dZ3 aka the error in neuron 3, which is also dCost/dBias3.

\frac{\partial Cost}{\partial O3}=O3-target1\\ \frac{\partial Cost}{\partial O3}=0.7729-0.99\\ \frac{\partial Cost}{\partial O3}=-0.2171\\ \\ \frac{\partial O3}{\partial Z3} = O3 * (1-O3)\\ \frac{\partial O3}{\partial Z3} = 0.7729 * 0.2271\\ \frac{\partial O3}{\partial Z3} = 0.1755\\ \\ \frac{\partial Cost}{\partial Z3} = \frac{\partial Cost}{\partial O3} * \frac{\partial O3}{\partial Z3}\\ \frac{\partial Cost}{\partial Z3} = -0.2171 * 0.1755\\ \frac{\partial Cost}{\partial Z3} = -0.0381\\ \\ \frac{\partial Cost}{\partial Bias3} = -0.0381

We can use dCost/dZ3 to calculate dCost/dWeight6 and dCost/dWeight7.

\frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight6}\\ \frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * O0\\ \frac{\partial Cost}{\partial Weight6} = -0.0381 * 0.5933\\ \frac{\partial Cost}{\partial Weight6} = -0.0226\\ \\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight7}\\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * O1\\ \frac{\partial Cost}{\partial Weight7} = -0.0381 * 0.5969\\ \frac{\partial Cost}{\partial Weight7} = -0.0227\\

Neuron 0

Next, we want to calculate dCost/dO0, but doing that requires us to do something new. Neuron 0 affects both neuron 2 and neuron 3, which means that it affects the cost through those two neurons as well. That means our calculation for dCost/dO0 is going to be slightly different, where we add the derivatives of both paths together. Let’s work through it:

\frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O0} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O0}\\ \frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * Weight4 + \frac{\partial Cost}{\partial Z3} * Weight6\\ \frac{\partial Cost}{\partial O0} = 0.1385 * 0.4 - 0.0381 * 0.5\\ \frac{\partial Cost}{\partial O0} = 0.0364

We can then continue and calculate dCost/dZ0, which is also dCost/dBias0, and the error in neuron 0.

\frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.5933 * 0.4067\\ \frac{\partial O0}{\partial Z0} = 0.2413\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = 0.0364 * 0.2413\\ \frac{\partial Cost}{\partial Z0} = 0.0088\\ \\ \frac{\partial Cost}{\partial Bias0} = 0.0088

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1.

\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0}\\ \frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * input0\\ \frac{\partial Cost}{\partial Weight0} = 0.0088 * 0.05\\ \frac{\partial Cost}{\partial Weight0} = 0.0004\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1}\\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * input1\\ \frac{\partial Cost}{\partial Weight1} = 0.0088 * 0.1\\ \frac{\partial Cost}{\partial Weight1} = 0.0009\\

Neuron 1

We are almost done, so hang in there. For our home stretch, we need to calculate dCost/dO1 in the same way we did for dCost/dO0, and then use that to calculate the derivatives of bias1, weight2 and weight3.

\frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O1} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O1}\\ \frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * Weight5 + \frac{\partial Cost}{\partial Z3} * Weight7\\ \frac{\partial Cost}{\partial O1} = 0.1385 * 0.45 - 0.0381 * 0.55\\ \frac{\partial Cost}{\partial O1} = 0.0414\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.5969 * 0.4031\\ \frac{\partial O1}{\partial Z1} = 0.2406\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.0414 * 0.2406\\ \frac{\partial Cost}{\partial Z1} = 0.01\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.01

Lastly, we will use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3.

\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2}\\ \frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * input0\\ \frac{\partial Cost}{\partial Weight2} = 0.01 * 0.05\\ \frac{\partial Cost}{\partial Weight2} = 0.0005\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3}\\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * input1\\ \frac{\partial Cost}{\partial Weight3} = 0.01 * 0.1\\ \frac{\partial Cost}{\partial Weight3} = 0.001\\

Backpropagation Done

Phew, we have all the derivatives we need now.

Here’s our new weights and biases using a learning rate of 0.5:

Weight0 = 0.15 - (0.5 * 0.0004) = 0.1498
Weight1 = 0.2 - (0.5 * 0.0009) = 0.1996
Weight2 = 0.25 - (0.5 * 0.0005) = 0.2498
Weight3 = 0.3 - (0.5 * 0.001) = 0.2995
Weight4 = 0.4 - (0.5 * 0.0822) = 0.3589
Weight5 = 0.45 - (0.5 * 0.0827) = 0.4087
Weight6 = 0.5 - (0.5 * -0.0226) = 0.5113
Weight7 = 0.55 - (0.5 * -0.0227) = 0.5614
Bias0 = 0.35 - (0.5 * 0.0088) = 0.3456
Bias1 = 0.35 - (0.5 * 0.01) = 0.345
Bias2 = 0.6 - (0.5 * 0.1385) = 0.5308
Bias3 = 0.6 - (0.5 * -0.0381) = 0.6191

Using these new values, the cost function value drops from 0.2984 to 0.2839, so we have made progress!

Interestingly, it only takes 5,000 iterations of learning for this network to reach an error of 0.007157, when it took 10,000 iterations of learning for example 1 to get to 0.007176.

Before moving on, take a look at the weight adjustments above. You might notice that the derivatives for weights 0,1,2,3 are much smaller than the derivatives for weights 4,5,6,7. The reason for this is that weights 0,1,2,3 appear earlier in the network, and earlier layer neurons don’t learn as fast as later layer neurons. That is caused by the nature of the neuron activation functions (specifically, the sigmoid function is nearly flat for outputs near 0 and 1) and is called the “vanishing gradient problem”. The opposite effect can also happen, where earlier layer gradients explode to super huge numbers, so the more general term is the “unstable gradient problem”. How to address this is an active area of research, and it becomes more and more of a problem the more layers you have in your network.

You can use other activation functions such as tanh, identity, relu and others to try to get around this problem. If you try a different activation function, the forward pass (the evaluation of the network) and the backpropagation of error both keep the same structure; only the calculation for getting O from Z changes, along with the matching derivative dO/dZ. Everything else remains the same.
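As a sketch of how small that change is, here’s what swapping sigmoid for tanh might look like. This isn’t in the sample code below (which uses sigmoid throughout), and Activate / ActivationDerivative are just illustrative names:

#include <cmath>

// a sketch of swapping the activation function: only these two calculations
// change, the rest of the forward and backward passes stay the same.
float Activate (float Z)
{
	return std::tanh(Z);       // sigmoid was: 1.0f / (1.0f + std::exp(-Z))
}

float ActivationDerivative (float O)
{
	return 1.0f - O * O;       // sigmoid was: O * (1.0f - O)
}

Note that like sigmoid, tanh’s derivative can be written in terms of the output O, which is convenient during backpropagation since O was already calculated during the forward pass.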

Check the links at the bottom of the post for more information about this!

Sample Code

Below is the sample code which implements all the back propagation examples we worked through above.

Note that this code is meant to be readable and understandable. The code is not meant to be re-usable or highly efficient.

A more efficient implementation would use SIMD instructions, multithreading, stochastic gradient descent, and other things.

It’s also useful to note that calculating a neuron’s Z value is really a dot product plus an addition, and that the addition can be folded into the dot product by giving each neuron a “fake input” that is a constant 1. The bias then becomes just another weight, so a single dot product calculates the Z value of a neuron. You can take that further and combine the dot products into matrix operations that calculate many neuron values at once, which is why you’ll often see neural networks described in matrix notation. I have avoided that in this post to try and make things more clear to programmers who may not be as comfortable thinking in strictly matrix notation.
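Here’s a small sketch of that “fake input” idea. It isn’t how the sample code below does it (the sample code keeps the bias separate); it just shows the bias being treated as one more weight:

#include <array>

// calculate a neuron's Z value as a single dot product, by treating the bias
// as one more weight whose input is a constant 1.
float NeuronZ (const std::array<float, 3>& weightsAndBias, const std::array<float, 2>& input)
{
	const std::array<float, 3> augmentedInput = { input[0], input[1], 1.0f };
	float Z = 0.0f;
	for (size_t i = 0; i < augmentedInput.size(); ++i)
		Z += weightsAndBias[i] * augmentedInput[i];
	return Z;
}

With that out of the way, here is the full sample code: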

#include <stdio.h>
#include <cmath>   // for std::exp, std::abs and std::sqrt
#include <cstdlib> // for system()
#include <array>

// Set to 0 to disable csv logging. Otherwise, this is how many samples get logged per csv.
#define LOG_TO_CSV_NUMSAMPLES() 50

// ===== Example 1 - One Neuron, One training Example =====

void Example1RunNetwork (
	float input, float desiredOutput,
	float weight, float bias,
	float& error, float& cost, float& actualOutput,
	float& deltaCost_deltaWeight, float& deltaCost_deltaBias, float& deltaCost_deltaInput
) {
	// calculate Z (weighted input) and O (activation function of weighted input) for the neuron
	float Z = input * weight + bias;
	float O = 1.0f / (1.0f + std::exp(-Z));

	// the actual output of the network is the activation of the neuron
	actualOutput = O;

	// calculate error
	error = std::abs(desiredOutput - actualOutput);

	// calculate cost
	cost = 0.5f * error * error;

	// calculate how much a change in neuron activation affects the cost function
	// deltaCost/deltaO = O - target
	float deltaCost_deltaO = O - desiredOutput;

	// calculate how much a change in neuron weighted input affects neuron activation
	// deltaO/deltaZ = O * (1 - O)
	float deltaO_deltaZ = O * (1 - O);


	// calculate how much a change in a neuron's weighted input affects the cost function.
	// This is deltaCost/deltaZ, which equals deltaCost/deltaO * deltaO/deltaZ
	// This is also deltaCost/deltaBias and is also referred to as the error of the neuron
	float neuronError = deltaCost_deltaO * deltaO_deltaZ;
	deltaCost_deltaBias = neuronError;

	// calculate how much a change in the weight affects the cost function.
	// deltaCost/deltaWeight = deltaCost/deltaO * deltaO/deltaZ * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * input
	deltaCost_deltaWeight = neuronError * input;


	// As a bonus, calculate how much a change in the input affects the cost function.
	// Follows same logic as deltaCost/deltaWeight, but deltaZ/deltaInput is the weight.
	// deltaCost/deltaInput = neuronError * weight
	deltaCost_deltaInput = neuronError * weight;
}

void Example1 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example1.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 10000;

	// training data
	// input: 1, output: 0
	const std::array<float, 2> c_trainingData = {1.0f, 0.0f};

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float error = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		// run the network to get error and derivatives
		float output = 0.0f;
		float cost = 0.0f;
		float deltaCost_deltaWeight = 0.0f;
		float deltaCost_deltaBias = 0.0f;
		float deltaCost_deltaInput = 0.0f;
		Example1RunNetwork(c_trainingData[0], c_trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, error, cost, weight, bias, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= deltaCost_deltaWeight * c_learningRate;
		bias -= deltaCost_deltaBias * c_learningRate;
	}

	printf("Example1 Final Error: %fn", error);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 2 - One Neuron, Two training Examples =====

void Example2 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example2.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 100000;

	// training data
	// input: 1, output: 0
	// input: 0, output: 1
	const std::array<std::array<float, 2>, 2> c_trainingData = { {
		{1.0f, 0.0f},
		{0.0f, 1.0f}
	} };

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		avgError = 0.0f;
		float avgOutput = 0.0f;
		float avgCost = 0.0f;
		float avgDeltaCost_deltaWeight = 0.0f;
		float avgDeltaCost_deltaBias = 0.0f;
		float avgDeltaCost_deltaInput = 0.0f;

		// run the network to get error and derivatives for each training example
		for (const std::array<float, 2>& trainingData : c_trainingData)
		{
			float error = 0.0f;
			float output = 0.0f;
			float cost = 0.0f;
			float deltaCost_deltaWeight = 0.0f;
			float deltaCost_deltaBias = 0.0f;
			float deltaCost_deltaInput = 0.0f;
			Example1RunNetwork(trainingData[0], trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

			avgError += error;
			avgOutput += output;
			avgCost += cost;
			avgDeltaCost_deltaWeight += deltaCost_deltaWeight;
			avgDeltaCost_deltaBias += deltaCost_deltaBias;
			avgDeltaCost_deltaInput += deltaCost_deltaInput;
		}

		avgError /= (float)c_trainingData.size();
		avgOutput /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		avgDeltaCost_deltaWeight /= (float)c_trainingData.size();
		avgDeltaCost_deltaBias /= (float)c_trainingData.size();
		avgDeltaCost_deltaInput /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, avgError, avgCost, weight, bias, avgDeltaCost_deltaWeight, avgDeltaCost_deltaBias, avgDeltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= avgDeltaCost_deltaWeight * c_learningRate;
		bias -= avgDeltaCost_deltaBias * c_learningRate;
	}

	printf("Example2 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 3 - Two inputs, two neurons in one layer =====

struct SExample3Training
{
	std::array<float, 2> m_input;
	std::array<float, 2> m_output;
};

void Example3RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 4>& weights, const std::array<float, 2>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 4>& deltaCost_deltaWeights, std::array<float, 2>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {

	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// the actual output of the network is the activation of the neurons
	actualOutput[0] = O0;
	actualOutput[1] = O1;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = O0 - target0
	float deltaCost_deltaO0 = O0 - desiredOutput[0];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also referred to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = O1 - target1
	float deltaCost_deltaO1 = O1 - desiredOutput[1];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO1/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also referred to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ1/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example3 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example3.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 520000;

	// training data: OR/AND
	// input: 00, output: 00
	// input: 01, output: 10
	// input: 10, output: 10
	// input: 11, output: 11
	const std::array<SExample3Training, 4> c_trainingData = { {
		{{0.0f, 0.0f}, {0.0f, 0.0f}},
		{{0.0f, 1.0f}, {1.0f, 0.0f}},
		{{1.0f, 0.0f}, {1.0f, 0.0f}},
		{{1.0f, 1.0f}, {1.0f, 1.0f}},
	} };

	// starting weight and bias values
	std::array<float, 4> weights = { 0.2f, 0.8f, 0.6f, 0.4f };
	std::array<float, 2> biases = { 0.5f, 0.1f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaBiases = { 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0f;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 4> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaBiases = { 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example3RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example3 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 4 - Two layers with two neurons in each layer =====

void Example4RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 8>& weights, const std::array<float, 4>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 8>& deltaCost_deltaWeights, std::array<float, 4>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {
	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// calculate Z2 and O2 for neuron2
	float Z2 = O0 * weights[4] + O1 * weights[5] + biases[2];
	float O2 = 1.0f / (1.0f + std::exp(-Z2));

	// calculate Z3 and O3 for neuron3
	float Z3 = O0 * weights[6] + O1 * weights[7] + biases[3];
	float O3 = 1.0f / (1.0f + std::exp(-Z3));

	// the actual output of the network is the activation of the output layer neurons
	actualOutput[0] = O2;
	actualOutput[1] = O3;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 2 -----

	// calculate how much a change in neuron 2 activation affects the cost function
	// deltaCost/deltaO2 = O2 - target0
	float deltaCost_deltaO2 = O2 - desiredOutput[0];

	// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
	// deltaO2/deltaZ2 = O2 * (1 - O2)
	float deltaO2_deltaZ2 = O2 * (1 - O2);

	// calculate how much a change in neuron 2 weighted input affects the cost function.
	// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
	// This is also deltaCost/deltaBias2 and is also referred to as the error of neuron 2
	float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
	deltaCost_deltaBiases[2] = neuron2Error;

	// calculate how much a change in weight4 affects the cost function.
	// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * deltaZ2/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * O0
	// similar thing for weight5
	deltaCost_deltaWeights[4] = neuron2Error * O0;
	deltaCost_deltaWeights[5] = neuron2Error * O1;

	//----- Neuron 3 -----

	// calculate how much a change in neuron 3 activation affects the cost function
	// deltaCost/deltaO3 = O3 - target1
	float deltaCost_deltaO3 = O3 - desiredOutput[1];

	// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
	// deltaO3/deltaZ3 = O3 * (1 - O3)
	float deltaO3_deltaZ3 = O3 * (1 - O3);

	// calculate how much a change in neuron 3 weighted input affects the cost function.
	// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
	// This is also deltaCost/deltaBias3 and is also referred to as the error of neuron 3
	float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
	deltaCost_deltaBiases[3] = neuron3Error;

	// calculate how much a change in weight6 affects the cost function.
	// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * deltaZ3/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * O0
	// similar thing for weight7
	deltaCost_deltaWeights[6] = neuron3Error * O0;
	deltaCost_deltaWeights[7] = neuron3Error * O1;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
	// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
	float deltaCost_deltaO0 = neuron2Error * weights[4] + neuron3Error * weights[6];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also referred to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
	// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
	float deltaCost_deltaO1 = neuron2Error * weights[5] + neuron3Error * weights[7];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO1/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also referred to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ1/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example4 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example4.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 5000;

	// training data: 0.05, 0.1 in = 0.01, 0.99 out
	const std::array<SExample3Training, 1> c_trainingData = { {
		{{0.05f, 0.1f}, {0.01f, 0.99f}},
	} };

	// starting weight and bias values
	std::array<float, 8> weights = { 0.15f, 0.2f, 0.25f, 0.3f, 0.4f, 0.45f, 0.5f, 0.55f};
	std::array<float, 4> biases = { 0.35f, 0.35f, 0.6f, 0.6f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 8> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0f;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 8> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 4> deltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example4RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example4 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

int main (int argc, char **argv)
{
	Example1();
	Example2();
	Example3();
	Example4();
	system("pause");
	return 0;
}

Closing & Links

The sample code outputs csv files showing how the values of the networks change over time. One of the reasons for this is that I want to show you error over time.

Below is example 4’s error over time, over its 5,000 learning iterations.

The other examples show a similarly shaped graph, where there is a lot of learning in the very beginning, and then there is a very long tail of learning very slowly.

When you train neural networks as I’ve described them, you will almost always see this, and sometimes will also see a slow learning time at the BEGINNING of the training.

This issue is also due to the activation function used, just like the unstable gradient problem, and is also an active area of research.

To help fix this issue, there is something called a “cross entropy cost function” which you can use instead of the mean squared error cost function I have been using.

That cost function essentially cancels out the non linearity of the activation function so that you get nicer linear learning progress, and can get networks to learn more quickly and evenly. However, it only cancels out the non linearity for the LAST layer in the network. This means it’s still a problem for networks that have more layers.
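For the curious, the cross entropy cost of a single output neuron looks like this, where y is the target value and a is the neuron’s actual output:

Cost = -[y \ln(a) + (1-y) \ln(1-a)]

If you push that through the chain rule for a sigmoid neuron, the O * (1-O) factor coming from the sigmoid’s derivative cancels out, and the error of the output neuron becomes simply a - y. That is why learning doesn’t flatten out when an output neuron is saturated near 0 or 1, like it does with squared error.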

Lastly, there is an entirely different thing you can use backpropagation for. We adjusted the weights and biases to get our desired output for the desired inputs. What if instead we adjusted our inputs to give us the desired outputs?

You can do that by using backpropagation to calculate the dCost/dInput derivatives and using those to adjust the input, in the exact same way we adjusted the weights and biases.

You can use this to do some interesting things, including:

  1. Finding images that a network recognizes as a familiar object, but that a human wouldn’t. Start with static as input to the network, and adjust the inputs to give the desired output.
  2. Modifying images that a network recognizes into images it doesn’t recognize, but a human still would. Start with a well recognized image, and adjust the inputs using gradient ASCENT (add the derivatives, don’t subtract them) until the network stops recognizing it.

Believe it or not, this is also how all those creepy “deep dream” images that came out of Google were made, like the one below.

Now that you know the basics, you are ready to learn some more if you are interested. If you still have some questions about things I did or didn’t talk about, these resources might help you make sense of it too. I used these resources and they were all very helpful! You can also give me a shout in the comments below, or on twitter at @Atrix256.

A Step by Step Backpropagation Example
Neural Networks and Deep Learning
Backpropagation is Just Steepest Descent with Automatic Differentiation
Chain Rule
Deep Vis

Multivariable Dual Numbers & Automatic Differentiation

In a previous post I showed how to use dual numbers to be able to get both the value and derivative of a function at the same time:
Dual Numbers & Automatic Differentiation

That post mentions that you can extend it to multivariable functions but doesn’t explain how. This post is that explanation, including simple working C++ code!

Extending this to multivariable functions is useful for ray marching, calculating analytical surface normals and also likely useful for training a neural network if you want an alternative to back propagation. I’m not sure about the efficiency comparison of this versus back propagation but I intend on looking into it (:

How Does it Work?

It turns out to be really simple to use dual numbers with multivariable functions. The end result is that you want a partial derivative for each variable in the equation, so to do that, you just have a dual number per variable, and process the entire equation for each of those dual numbers separately.

We’ll work through an example. Let’s find the partial derivatives of x and y of the function 3x^2-2y^3, at input (5,2).

We’ll start by finding the derivative of x, and then the derivative of y.

Example: df/dx

We start by making a dual number for our x value, remembering that the real part is the actual value for x, and the dual part is the derivative of x, which is 1:

5+1\epsilon

or:

5+\epsilon

We multiply that value by itself to get the x^2 value, keeping in mind that \epsilon^2 is zero:
(5+\epsilon)*(5+\epsilon)= \\ 25+10\epsilon+\epsilon^2= \\ 25+10\epsilon \\

Next we need to multiply that by 3 to get the 3x^2 term:

3*(25+10\epsilon) = 75+30\epsilon

Putting that aside for a moment, we need to make the 2y^3 term. We start by making our y value:

2+0\epsilon

or:

2

If you are wondering why it has a zero for the epsilon term, it’s because when we are calculating the partial derivative of x, y is a constant, so has a derivative of zero.

Next, we multiply this y value by itself twice to get the y^3 value:

2*2*2=8

We then multiply it by 2 to get the 2y^3 term:

8*2=16

Now that we have our two terms, we subtract the y term from the x term to get our final result:

75+30\epsilon-16 = \\ 59+30\epsilon

This result says that 3x^2-2y^3 has a value of 59 at location (5,2), and that the derivative of x at that point is 30.

That checks out, let’s move on to the derivative of y!

Example: df/dy

Calculating the derivative of y is very similar to calculating the derivative of x, except that it’s the x term that has an epsilon value (derivative) of 0, instead of the y term. The y term has the epsilon value of 1 this time as well. We’ll work through it to see how it plays out.

First up, we need to make the value for x:

5+0\epsilon

or:

5

Next we square it and multiply it by 3 to get the 3x^2 term:

5*5*3=75

Next we need to make the value for y, remembering that we use an epsilon value of 1 since the derivative of y is 1 this time around.

2+\epsilon

We cube that value and multiply by 2 to get the 2y^3 term:
2*(2+\epsilon)*(2+\epsilon)*(2+\epsilon)= \\ 2*(2+\epsilon)*(4+4\epsilon+\epsilon^2)= \\ 2*(2+\epsilon)*(4+4\epsilon)= \\ 2*(8+12\epsilon+4\epsilon^2)= \\ 2*(8+12\epsilon)= \\ 16+24\epsilon

Now we subtract the y term from the x term to get the final result:

75 - (16+24\epsilon)= \\ 59-24\epsilon

This result says that 3x^2-2y^3 has a value of 59 at location (5,2), and that the derivative of y at that point is -24.

That also checks out, so we got the correct value and partial derivatives for the equation.

Reducing Redundancy

There was quite a bit of redundancy when working through the x and y derivatives, wasn’t there? Increasing the number of variables increases the amount of redundancy too, so this approach doesn’t scale up well.

Luckily there is a way to address this. Basically, instead of making two separate dual numbers, you make them share the real value (since it’s the same for both, as is the work to compute it) and append a dual value for each of x and y to it.

Thus, instead of having:

x'=(a+b\epsilon)
y'=(a+c\epsilon)

You have:

(a+b\epsilon_x+c\epsilon_y)

Then, in your math or in your program, you treat it as if it’s two different dual numbers packed into one. This lets you do the work for the real number once instead of twice, but still lets you do your dual number work for each variable independently.

While it’s probably easiest to think of these as two dual numbers packed into one value, there is actually a mathematical basis for it as well, which may or may not surprise you.

Check out what happens when we multiply two of these together, keeping in mind that multiplying ANY two epsilon values together becomes zero, even if they are different epsilons:

(a+b\epsilon_x+c\epsilon_y) * (d+e\epsilon_x+f\epsilon_y)=
ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + be\epsilon_x^2 + bf\epsilon_x\epsilon_y + cd\epsilon_y + ce\epsilon_x\epsilon_y + cf\epsilon_y^2=
ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + cd\epsilon_y=
ad + (ae+bd)\epsilon_x + (af+cd)\epsilon_y

The interesting thing is that the above result gives you the same values as if you did the same work for two dual numbers individually.
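To make that concrete, here is a minimal sketch of the packed multiply in C++. This is my own illustration (the struct and function names are made up); the full program later in this post does the same thing with a template so it works for any number of variables:

// one shared real part, plus one dual part per variable.
// dual[0] holds the epsilon_x coefficient, dual[1] holds epsilon_y.
struct Dual2
{
	float real;
	float dual[2];
};

// (a+b*ex+c*ey) * (d+e*ex+f*ey) = ad + (ae+bd)*ex + (af+cd)*ey,
// since any product of two epsilons is zero.
Dual2 Mul (const Dual2& lhs, const Dual2& rhs)
{
	Dual2 ret;
	ret.real = lhs.real * rhs.real;
	for (int i = 0; i < 2; ++i)
		ret.dual[i] = lhs.real * rhs.dual[i] + lhs.dual[i] * rhs.real;
	return ret;
}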

Let’s see this three component dual number in action by re-doing the example. Note that this pattern scales up to ANY number of variables!

Example: Both Derivatives (Gradient Vector)

Our goal is to calculate the value and partial derivatives of the function 3x^2-2y^3 at location (5,2).

First we make our x value:

5 + 1\epsilon_x + 0\epsilon_y

or:

5 + \epsilon_x

We square that and multiply it by 3 to get our 3x^2 term:

3*(5 + \epsilon_x)*(5 + \epsilon_x)=
3*(25+10\epsilon_x+\epsilon_x^2)=
3*(25+10\epsilon_x)=
75+30\epsilon_x

Next, we make our y value:

2 + 0\epsilon_x + 1\epsilon_y

or:

2 + \epsilon_y

We cube it and multiply it by 2 to get our 2y^3 term (the same arithmetic as in the df/dy example, just with \epsilon_y in place of \epsilon):

16+24\epsilon_y

Lastly we subtract the y term from the x term to get our final answer:

(75+30\epsilon_x) - (16+24\epsilon_y)=
59+30\epsilon_x-24\epsilon_y

The result says that 3x^2-2y^3 has a value of 59 at location (5,2), that the partial derivative with respect to x at that point is 30, and that the partial derivative with respect to y at that point is -24.

Neat, right?!

Example Code

The program’s output is shown as an image in the original post. Here is the code that generated it:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>
 
#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

	// constructor to make a constant
	CDualNumber (float f = 0.0f) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
	}

	// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
	CDualNumber (float f, size_t variableIndex) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
		m_dual[variableIndex] = 1.0f;
	}

	// storage for real and dual values
	float							m_real;
	std::array<float, NUMVARIABLES> m_dual;
};
 
//----------------------------------------------------------------------
// Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real + b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
	return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real - b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real * b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real / b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	float sqrtReal = sqrt(a.m_real);
	ret.m_real = sqrtReal;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = pow(a.m_real, y);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = sin(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = cos(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = tan(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = atan(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
	return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
	return x * x * (T(3.0f) - T(2.0f) * x);
}
 
//----------------------------------------------------------------------
// Test Functions
//----------------------------------------------------------------------
 
void TestSmoothStep (float input)
{
	// create a dual number as the value of x
	CDualNumber<1> x(input, 0);

	// calculate value and derivative using dual numbers
	CDualNumber<1> y = SmoothStep(x);

	// calculate numeric derivative using central differences
	float derivNumeric = (SmoothStep(input + EPSILON) - SmoothStep(input - EPSILON)) / (2.0f * EPSILON);

	// calculate actual derivative
	float derivActual = 6.0f * input - 6.0f * input * input;

	// show value and derivatives
	printf("(smoothstep) y=3x^2-2x^3  (x=%0.4f)n", input);
	printf("  y = %0.4fn", y.m_real);
	printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
	printf("  actual dy/dx = %0.4fn", derivActual);
	printf("  numeric dy/dx = %0.4fnn", derivNumeric);
}

void TestTrig (float input)
{
	// create a dual number as the value of x
	CDualNumber<1> x(input, 0);

	// sin
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = sin(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (sin(input + EPSILON) - sin(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = cos(input);

		// show value and derivatives
		printf("sin(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// cos
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = cos(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (cos(input + EPSILON) - cos(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = -sin(input);

		// show value and derivatives
		printf("cos(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// tan
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = tan(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (tan(input + EPSILON) - tan(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = 1.0f / (cos(input)*cos(input));

		// show value and derivatives
		printf("tan(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// atan
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = atan(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (atan(input + EPSILON) - atan(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = 1.0f / (1.0f + input * input);

		// show value and derivatives
		printf("atan(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}
}

void TestSimple (float input)
{
	// create a dual number as the value of x
	CDualNumber<1> x(input, 0);

	// sqrt
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = CDualNumber<1>(3.0f) / sqrt(x);

		// calculate numeric derivative using central differences
		float derivNumeric = ((3.0f / sqrt(input + EPSILON)) - (3.0f / sqrt(input - EPSILON))) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = -3.0f / (2.0f * pow(input, 3.0f / 2.0f));

		// show value and derivatives
		printf("3/sqrt(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// pow
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = pow(x + CDualNumber<1>(1.0f), 1.337f);

		// calculate numeric derivative using central differences
		float derivNumeric = ((pow(input + 1.0f + EPSILON, 1.337f)) - (pow(input + 1.0f - EPSILON, 1.337f))) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = 1.337f * pow(input + 1.0f, 0.337f);

		// show value and derivatives
		printf("(%0.4f+1)^1.337 = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}
}

void Test2D (float inputx, float inputy)
{
	// create dual numbers as the value of x and y
	CDualNumber<2> x(inputx, 0);
	CDualNumber<2> y(inputy, 1);

	// z = 3x^2 - 2y^3
	{
		// calculate value and partial derivatives using dual numbers
		CDualNumber<2> z = CDualNumber<2>(3.0f) * x * x - CDualNumber<2>(2.0f) * y * y * y;

		// calculate numeric partial derivatives using central differences
		auto f = [] (float x, float y) {
			return 3.0f * x * x - 2.0f * y * y * y;
		};
		float derivNumericX = (f(inputx + EPSILON, inputy) - f(inputx - EPSILON, inputy)) / (2.0f * EPSILON);
		float derivNumericY = (f(inputx, inputy + EPSILON) - f(inputx, inputy - EPSILON)) / (2.0f * EPSILON);

		// calculate actual partial derivatives
		float derivActualX = 6.0f * inputx;
		float derivActualY = -6.0f * inputy * inputy;

		// show value and derivatives
		printf("z=3x^2-2y^3 (x = %0.4f, y = %0.4f)n", inputx, inputy);
		printf("  z = %0.4fn", z.m_real);
		printf("  dual# dz/dx = %0.4fn", z.m_dual[0]);
		printf("  dual# dz/dy = %0.4fn", z.m_dual[1]);
		printf("  actual dz/dx = %0.4fn", derivActualX);
		printf("  actual dz/dy = %0.4fn", derivActualY);
		printf("  numeric dz/dx = %0.4fn", derivNumericX);
		printf("  numeric dz/dy = %0.4fnn", derivNumericY);
	}
}

void Test3D (float inputx, float inputy, float inputz)
{
	// create dual numbers as the values of x, y and z
	CDualNumber<3> x(inputx, 0);
	CDualNumber<3> y(inputy, 1);
	CDualNumber<3> z(inputz, 2);

	// w = sin(x*cos(2*y)) / tan(z)
	{
		// calculate value and partial derivatives using dual numbers
		CDualNumber<3> w = sin(x * cos(CDualNumber<3>(2.0f)*y)) / tan(z);

		// calculate numeric partial derivatives using central differences
		auto f = [] (float x, float y, float z) {
			return sin(x*cos(2.0f*y)) / tan(z);
		};
		float derivNumericX = (f(inputx + EPSILON, inputy, inputz) - f(inputx - EPSILON, inputy, inputz)) / (2.0f * EPSILON);
		float derivNumericY = (f(inputx, inputy + EPSILON, inputz) - f(inputx, inputy - EPSILON, inputz)) / (2.0f * EPSILON);
		float derivNumericZ = (f(inputx, inputy, inputz + EPSILON) - f(inputx, inputy, inputz - EPSILON)) / (2.0f * EPSILON);

		// calculate actual partial derivatives
		float derivActualX = cos(inputx*cos(2.0f*inputy))*cos(2.0f * inputy) / tan(inputz);
		float derivActualY = cos(inputx*cos(2.0f*inputy)) *-2.0f*inputx*sin(2.0f*inputy) / tan(inputz);
		float derivActualZ = sin(inputx * cos(2.0f * inputy)) / -(sin(inputz) * sin(inputz));

		// show value and derivatives
		printf("w=sin(x*cos(2*y))/tan(z) (x = %0.4f, y = %0.4f, z = %0.4f)n", inputx, inputy, inputz);
		printf("  w = %0.4fn", w.m_real);
		printf("  dual# dw/dx = %0.4fn", w.m_dual[0]);
		printf("  dual# dw/dy = %0.4fn", w.m_dual[1]);
		printf("  dual# dw/dz = %0.4fn", w.m_dual[2]);
		printf("  actual dw/dx = %0.4fn", derivActualX);
		printf("  actual dw/dy = %0.4fn", derivActualY);
		printf("  actual dw/dz = %0.4fn", derivActualZ);
		printf("  numeric dw/dx = %0.4fn", derivNumericX);
		printf("  numeric dw/dy = %0.4fn", derivNumericY);
		printf("  numeric dw/dz = %0.4fnn", derivNumericZ);
	}
}

int main (int argc, char **argv)
{
	TestSmoothStep(0.5f);
	TestSmoothStep(0.75f);
	TestTrig(PI * 0.25f);
	TestSimple(3.0f);
	Test2D(1.5f, 3.28f);
	Test3D(7.12f, 8.93f, 12.01f);
    return 0;
}

Closing

One of the neatest things about dual numbers is that they give precise results. They are not approximations and they are not numerical methods, unlike the central differences method that I compared them to in the example program (More info on numerical derivatives here: Finite Differences). Using dual numbers gives you exact derivatives, within the limitations of (eg) floating point math.

It turns out that backpropagation (the method that is commonly used to train neural networks) is just steepest descent. You can read about that here: Backpropagation is Just Steepest Descent with Automatic Differentiation

That makes me wonder how dual numbers would do in run time speed compared to back propagation as well as numerical methods for getting the gradient to adjust a neural network during training.

If I had to guess, I’d say that dual numbers may be slightly slower than backpropagation, but not nearly as slow as numerical methods, which are going to be much, much slower. We’ll see though. It may be much easier to implement neural network learning using dual numbers compared to backpropagation, so it may be worth an exploration and write up, even if only to make neural networks a little bit more accessible to people.

Comments, corrections, etc? Let me know in the comments below, or contact me on twitter at @Atrix256

A Geometric Interpretation of Neural Networks

In the 90s, before I was a professional programmer / game developer, I looked at neural networks and found them interesting, but got scared off by things like back propagation, which I wasn’t yet ready to understand.

With all the interesting machine learning things going on in modern times, I decided to have a look again and have been pleasantly surprised at how simple they are to understand. If you have knowledge of partial derivatives and gradients (like, if you’ve done any ray marching), you have the knowledge it takes to understand it.

Here are some really great resources I recommend for learning the nuts and bolts of how modern neural networks actually work:
Learn TensorFlow and deep learning, without a Ph.D.
Neural Networks and Deep Learning
A Neural Network Playground (Web Based NN sand box from google)

This post doesn’t require any understanding of neural networks or partial derivatives, so don’t worry if you don’t have that knowledge. The harder math comes up when training a neural network, but we are only going to be dealing with evaluating neural networks, which is much simpler.

A Geometric Interpretation of a Neuron

A neural network is made up of layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron.

To calculate the output value of a neuron, you multiply every input into that neuron by a weight for that input, and add a bias. This value is fed into an activation function (more on activation functions shortly) and the result is the output value of that neuron. Here is a diagram for a single neuron:

A more formal definition of a neuron’s output is below. b is the bias, w_j is the j’th weight and i_j is the j’th input.
Output = b+\sum_{j=0}^nw_ji_j

You might notice that the above is really just the dot product of the weight vector and the input vector, with a value added on the end. We could re-write it like this:
Output = Dot(w,i)+b
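As a quick sketch (my own illustration, with made up names, not from the original post), that formula in C++ looks like this:

#include <vector>

// Output = Dot(w,i)+b : multiply each input by its weight, sum them all, add the bias.
float NeuronOutput (const std::vector<float>& weights, const std::vector<float>& inputs, float bias)
{
	float output = bias;
	for (size_t j = 0; j < weights.size(); ++j)
		output += weights[j] * inputs[j];
	return output;
}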

Interestingly, that is the same equation that you use to find the distance of a point to a plane. Let’s say that we have a plane defined by a unit length normal N and a distance to the origin d, and we want to calculate the distance of a point P to that plane. We’d use this formula:
Distance = Dot(N,P)+d

This would give us a signed distance, where the value will be negative if we are in the negative half space defined by the plane, and positive otherwise.

This equation works if you are working in 3 dimensional space, but also works in general for any N dimensional point and plane definition.

What does this mean? Well, this tells us that every neuron in a neural network is essentially deciding what side of a hyperplane a point is on. Each neuron is doing a linear classification, saying if something is on side A or side B, and giving a distance of how far it is into A or B.

This also means that when you combine multiple neurons into a network, an output neuron of that neural network tells you whether the input point is inside or outside of some shape, and by how much.

I find this interesting for two reasons.

Firstly, it means that a neural network can be interpreted as encoding SHAPES, which means it could be used for modeling shapes. I’m interested in seeing what sort of shapes it’s capable of, and what sorts of behaviors this representation might have. I don’t expect it to be useful for, say, mainstream game development (bonus if it is useful!) but at minimum it ought to be an interesting investigation to help understand neural networks a bit better.

Secondly, there is another machine learning algorithm called Support Vector Machines which is also based on being able to tell you which side of a separation a data point is on. However, unlike the above, SVM separations are not limited to plane tests and can use arbitrary shapes for separation. Does this mean that we are leaving something on the table for neural networks? Could we make networks with fewer layers and fewer neurons that classify better by doing non linear separation tests?

Quick side note: besides classification, neural nets can help us with something called regression, which is where you fit a network to some analog data, instead of the more discrete problem of classification, which tells you what group something belongs to.

It turns out that the activation function of a neuron can have a pretty big influence on what sort of shapes are possible, which makes it so we aren’t strictly limited to shapes made up of planes and lines, and also means we aren’t necessarily leaving things on the table compared to SVMs.

This all sort of gives me the feeling though that modern neural networks may not be the best possible algorithm for the types of things we use them for. I feel like we may need to generalize them beyond biological limitations to allow things like multiplications between weighted inputs instead of only sums. I feel like that sort of setup will be closer to whatever the real ideal “neural computation” model might be. However, modern mainstream neural models do have the benefit that they are evaluated very efficiently via dot products. They are particularly well suited for execution on GPUs, which excel at performing homogeneous operations on lots and lots of data. So, a more powerful and more general “neuron” may come at the cost of increased computational costs, which may make them less desirable in the end.

As a quick tangent, here is a paper from 2011 which shows a neural network model that does in fact allow for multiplication between neuron inputs. You would then likely want exponents and other things too, so while it’s perhaps a step in the right direction, it doesn’t yet seem to be the end all be all!
Sum-Product Networks: A New Deep Architecture

It’s also worthwhile to note that there are other flavors of neural networks, such as convolutional neural networks, which work quite a bit differently.

Let’s investigate this geometric interpretation of neurons as binary classifiers a bit, focusing on some different activation functions!

Step Activation Function

The Heaviside step function is very simple. If you give it a value greater than zero, it returns a 1, else it returns a 0. That makes our neuron just spit out binary: either a 0 or a 1. The output of a neuron using the step activation function is just the below:

Output = Dot(w,i)+b > 0

A neuron using the step activation function outputs 1 if the input point is in the positive half space of the plane that the neuron describes, else it outputs 0.

Let’s think about this in 2d. Let’s make a neural network that takes x and y as input and spits out a value. We can make an image that visualizes the range from (-1,-1) to (1,1). Negative values can be shown in blue, zero in white, and positive values in orange.

To start out, we’ll make a 2d plane (aka a line) that runs vertically and passes through the origin. That means it is a 2d plane with a normal of (1,0) and a distance from the origin of 0. In other words, our network will have a single neuron that has weights of (1,0) and a bias of 0. This is what the visualization looks like:

You can actually play around with the experiments of this post, and do your own, using an interactive visualization I made. Click here to see this first experiment: Experiment: Vertical Separation

We can change the normal (weights) to change the angle of the line, and we can also change the bias to move the line to its relative left or right. Here is the same network with its weights adjusted to (1,1) and a bias of 0.1.

Experiment: Diagonal Separation

The normal (1,1) isn’t normalized though, which makes it so the distance from the origin (aka the bias) isn’t really 0.1 units. The bias is actually divided by the length of the normal to get the REAL distance to the origin, so in the above image, where the normal’s length is a bit more than 1.0, the line is actually less than 0.1 units from the origin.

Below is the visualization if we normalize the weights to (0.707,0.707) but leave the bias at 0.1 units. The result is that the line is actually 0.1 units away from the origin.

Experiment: Normalized Diagonal Separation

Recalling the description of our visualization, white is where the network outputs zero, while orange is where the network outputs a positive number (which is 1 in this case).

If we define three lines such that their negative half spaces don’t completely overlap, we can get a triangle where the network outputs a zero, while everywhere else it outputs a positive value. We do this by having three sibling neurons in the first layer which define three separate lines, and then in the output neuron we give them all a weight of 1. This makes it so the area outside the triangle is always a positive value, which step turns into 1, but inside the triangle, the value remains at 0.


Experiment: Negative Space Triangle
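Here is a sketch of that construction in C++. The three line equations are hypothetical values of mine that happen to enclose a triangle around the origin; the interactive experiment uses its own values:

float Step (float x) { return x > 0.0f ? 1.0f : 0.0f; }

float NegativeSpaceTriangle (float x, float y)
{
	// first layer: three line tests, with normals pointing out of the triangle
	float lineA = Step( 0.0f   * x + 1.0f * y - 0.5f);
	float lineB = Step( 0.866f * x - 0.5f * y - 0.5f);
	float lineC = Step(-0.866f * x - 0.5f * y - 0.5f);

	// output neuron: a weight of 1 per input and a bias of 0. inside the
	// triangle all three lines report 0, so step(0) = 0. outside, at least
	// one line reports 1, so the output is 1.
	return Step(lineA + lineB + lineC);
}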

We can turn this negative space triangle into a positive space triangle however, by making the output neuron have a weight of -1 on each input, and a bias of 0.1. This makes it so that pixels in the positive half space of any of the lines become a negative value, while pixels in the negative half space of all three lines get only the small bias, making them a positive value. The result is that the step function makes the values 0 outside of the triangle, and 1 inside the triangle. This gives us a positive space triangle:


Experiment: Positive Space Triangle

Taking this a bit further, we can make the first layer define 6 lines, which make up two different triangles – a bigger one and a smaller one. We can then have a second layer which has two neurons – one which makes a positive space larger triangle, and one which makes a positive space smaller triangle. Then, in the output neuron we can give the larger triangle neuron a weight of 1, while giving the smaller triangle neuron a weight of -1. The end result is that we have subtracted the smaller triangle from the larger one:


Experiment: Triangle Cutout

Using the step function we have so far been limited to line based shapes. This has been due to the fact that we can only test our inputs against lines. There is a way around this though: Pass non linear input into the network!

Below is a circle with radius 0.5. The neural network has only a single input which is sqrt(x*x+y*y). The output neuron has a bias of -0.5 and a weight of 1. It’s the bias value which controls the radius of the circle.

You could pass other non linear inputs into the network to get a whole host of other shapes. You could pass sin(x) as an input for example, or x squared.


Experiment: Circle
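As a sketch, the entire circle network in C++ is tiny (again with my own made up names; the sqrt is computed outside the network and handed to it as its single input):

#include <cmath>

float Step (float x) { return x > 0.0f ? 1.0f : 0.0f; }

float CircleNetwork (float x, float y)
{
	float input = std::sqrt(x * x + y * y); // the non linear input
	return Step(1.0f * input - 0.5f);       // weight of 1, bias of -0.5
}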

While the step function is inherently very much limited to linear tests, you can still do quite a lot of interesting non linear shapes (and data separations) by passing non linear input into the network.

Unfortunately though, you as a human would have to know the right non linear inputs to provide. The network can’t learn how to make optimal non linear separations when using the step function. That’s quite a limitation, but as I understand it, that’s how it works with support vector machines as well: a human can define non linear separations, but the human must decide the details of that separation.

BTW it seems like there could be a fun puzzle game here. Something like you have a fixed number of neurons that you can arrange in however many layers you want, and your goal is to separate blue data points from orange ones, or something like that. If you think that’d be a fun game, go make it with my blessing! I don’t have time to pursue it, so have at it (:

Identity and Relu Activation Functions

The identity activation function doesn’t do anything. It’s the same as if no activation function is used. I’ve heard that it can be useful in regression, but it can also be useful for our geometric interpretation.

Below is the same circle experiment as before, but using the identity activation function instead of the step activation function:


Experiment: Identity Circle

Remembering that orange is positive, blue is negative, and white is zero, you can see that outside the circle is orange (positive) and inside the circle is blue (negative), while the outline of the circle itself is white. We are looking at a signed distance field of the circle! Every point on this image is a scalar value that says how far inside or outside that point is from the surface of the shape.

Signed distance fields are a popular way of rendering vector graphics on the GPU. They are often approximated by storing the distance field in a texture and sampling that texture at runtime. Storing them in a texture only requires a single color channel for storage, and as you zoom in to the shape, they preserve their shape a lot better than regular images. You can read more about SDF textures on my post: Distance Field Textures.

Considering the machine learning perspective, having a signed distance field is also an interesting proposition. It would allow you to do classification of input, but also let you know how deeply that input point is classified within its group. This could be a confidence level maybe, or could be interpreted in some other way, but it gives a sort of analog value to classification, which definitely seems like it could come in handy sometimes.

If we take our negative space triangle example from the last section and change it from using step activation to identity activation, we find that our technique doesn’t generalize naively though, as we see below. (It doesn’t generalize for the positive space triangle either)


Experiment: Negative Space Triangle Identity

The reason it doesn’t generalize is that the negatives and positives of pixel distances to each of the lines cancel out. Consider a pixel on the edge of the triangle: you are going to have a near zero value for the edge it’s on, and two larger magnitude negative values from the other edges it is in the negative half spaces of. Adding those all together is going to be a negative value.

To help sort this out we can use an activation function called “relu”, which returns 0 if the value it’s given is less than zero, otherwise it returns the value. This means that all our negative values become 0 and don’t affect the distance summation. If we switch all the neurons to using relu activation, we get this:


Experiment: Negative Space Triangle Relu
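For reference, relu itself is a one-liner in code:

float Relu (float x) { return x < 0.0f ? 0.0f : x; }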

If you squint a bit, you can see a triangle in the white. If you open the experiment and turn on “discrete output” to force 0 to blue you get a nice image that shows you that the triangle is in fact still there.

Our result with relu is better than identity, but there are two problems with our resulting distance field.

Firstly it isn’t a signed distance field – there is no blue as you might notice. It only gives positive distances, for pixels that are outside the shape. This isn’t actually that big of an issue from a rendering perspective, as unsigned distance fields are still useful. It also doesn’t seem that big of an issue from a machine learning perspective, as it still gives some information about how deeply something is classified, even though it is only from one direction.

I think with some clever operations, you could probably create the internal negative signed distance using different operations, and then could compose it with the external positive distance in the output neuron by adding them together.

The second problem is a bigger deal though: The distance field is no longer accurate!

By summing the distance values, the distance is incorrect for points where the closest feature of the triangle is a vertex, because multiple lines are contributing their distance to the final value.

I can’t think of any good ways to correct that problem in the context of a neural network. However, the distance is an approximation that is correct at the edges and gets more accurate the closer you get to the object, so this is still useful for rendering, and probably still useful for machine learning despite being an imperfect measurement.

Sigmoid and Hyperbolic Tangent Activation Functions

The sigmoid function is basically a softer version of the step function and gives output between 0 and 1. The hyperbolic tangent activation function is also a softer version of the step function but gives output between -1 and 1.

Sigmoid: \sigma(x) = 1 / (1 + e^{-x})

Hyperbolic Tangent: tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})

(The original post shows plots of both, with images from Wolfram MathWorld.)
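In code, both are one-liners (standard definitions, sketched here):

#include <cmath>

float Sigmoid (float x) { return 1.0f / (1.0f + std::exp(-x)); }
float TanH (float x) { return std::tanh(x); }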

They have different uses in machine learning, but I’ve found them to be visibly indistinguishable in my experiments after compensating for the different range of output values. It makes me think that smoothstep could probably be a decent activation function too, so long as your input was in the 0 to 1 range (maybe you could clamp input to 0 and 1?).

These activation functions let you get some non linearity in your neural network in a way that the learning algorithms can adjust, which is pretty neat. That puts us more into the realm where a neural net can find a good non linear separation for learned data. For the geometric perspective, this also lets us make more interesting non linear shapes.

Unfortunately, I haven’t been able to get a good understanding of how to use these functions to craft desired results.

It seems like if you add an even number of hyperbolic tangents together in a neural network, you end up getting a lot of white, like below:


Experiment: tanh1

However, if you add an odd number of them together, it starts to look a bit more interesting, like this:


Experiment: tanh2

Other than that, it’s been difficult seeing a pattern that I can use to craft things. The two examples above were made by modifying the negative space triangle to use tanh instead of step.

Closing

We’ve wandered a bit in the idea of interpreting neural networks geometrically but I feel like we’ve only barely scratched the surface. This also hasn’t been a very rigorous exploration, but more has just been about exploring the problem space to get a feeling for what might be possible.

It would be interesting to look more deeply into some of these areas, particularly for the case of distance field generation of shapes, or combining activation functions to get more diverse results.

While stumbling around, it seems like we may have gained some intuition about how neural networks work as well.

It feels like whenever you add a layer, you are adding the ability for a “logical operation” to happen within the network.

For instance, in the triangle cutout experiment, the first layer after the inputs defines the 6 individual lines of the two triangles and classifies input accordingly. The next layer combines those values into the two different triangle shapes. The layer after that converts them from negative space triangles to positive space triangles. Lastly, the output layer subtracts the smaller triangle’s values from the larger triangle’s values to make the final triangle outline shape.

Each layer has a logical operation it performs, which is based on the steps previous to it.

Another piece of intuition I’ve found is that it seems like adding more neurons to a layer allows it to do more work in parallel.

For instance, in the triangle cutout experiment, we created those two triangles in parallel, reserving some neurons in each layer for each triangle. It was only in the final step that we combined the values into a single output.

Since neurons can only access data from the previous network layer, it seems as though adding more neurons to layers can help push data forward from previous layers, to later layers that might need the data. But, it seems like it is most efficient to process input data as early as possible, so that you don’t have to shuttle it forward and waste layers / neurons / memory and computing power.

Here is some info on other activation functions:
Wikipedia:Activation Function

Here’s a link that talks about how perceptrons (step activated neural networks) relate to SVMs:
Hyperplane based Classification: Perceptron and (Intro to) Support Vector Machines

By the way, did I mention you can visualize neural networks in three dimensions as well?

Experiment: 3d Visualization

Here are the two visualizers of neural networks I made for this post using WebGL2:
Neural Network Visualization 2D
Neural Network Visualization 3D

If you play around with this stuff and find anything interesting, please share!

B.A.M. Neural Networks

Neural networks (officially “Artificial Neural Networks”) are computer simulations of neurons.  Simulating neurons in software allows programs to do things that you would normally need a human brain to do, such as recognizing patterns, learning over time, or making non-obvious decisions based on complex data.

Simulating neurons is not enough to create human levels of intelligence however.  Last I heard, someone could make toddler-level intelligence via neural networks, but even that is somewhat misleading, since toddlers can walk, use vocal cords, swim and understand complex emotions, while the neural network could do none of those things.

Despite the limitations of neural networks, there are quite a number of practical uses of artificial neural networks in the real world.  These uses include…

  • Helping missiles identify enemy tanks or combatants on the battlefield
  • Helping to predict stock market trends (It’s rumored that several top traders have proprietary neural networks which help them perform better)
  • OCR (turning scanned images into text documents based on the text in the image)
  • General machine vision (like, for robots or security systems)
  • Controlling complex machinery at speeds a human wouldn’t be able to keep up with
  • Facial recognition in computers
  • Diagnosing medical conditions

I was reading an article in Scientific American recently about how a girl in high school trained a neural network to recognize certain types of cancer with 99% accuracy.   She trained a neural network to analyze the results of a non-invasive cancer test which until then had been too unreliable to use in any realistic situation.  Her neural network learned some hidden pattern in the data that we have not yet discovered or understood.

Just like in that example, you can feed a network complex data for it to look for patterns in, but unfortunately it won’t be able to explain to you what it learned, or what it looks for when trying to recognize patterns.  It can learn, but it can’t tell you what it learned.

My friend Doug often tells a funny story where this didn’t work out so well.  I’m not sure if it’s true exactly as told or not, but it definitely is plausible.  Apparently the US army for whatever reason was training a neural network to recognize tanks in the battlefield (surely for a missile or perhaps some kind of recon drone).

They fed the network hundreds or thousands of photos which either contained a tank or did not.  While they fed each picture to the neural network, they also told it “yes this photo contains a tank” or “no, this photo does not contain a tank” so that it was able to infer what it was that people were trying to teach it.

The network learned well, and with their sample photos, it was getting a very high rate of correct answers about whether a tank was there or not.

However, when they deployed this to the battle field (or perhaps, just a realistic live test run), it failed miserably and was getting near 0% accuracy.

Since the network wasn’t able to explain its thought process, the people involved were forced to try and deduce what the problem was, and eventually they noticed a pattern….

The photoshoots apparently happened on two separate days… on the first day they took pictures of tanks, and on the second day they took pictures without tanks.  The unfortunate truth is that it was overcast on the first day, but clear skies and sunny on the second, which means…. the neural network learned to distinguish between a sunny day and an overcast day, not whether or not there were tanks present!

A pretty funny story, but it shows the importance of giving good, realistic data to your network to learn from, or else you can run into silly problems like that!

Types of Neural Networks

There are a lot of different types of neural networks that have different properties and are thus useful in different situations.

Some of the main flavors are…

  • Supervised learning – Just like in the tank example above, you give information to the neural network to learn from, and you also tell it what it should learn from each piece of data (i.e. “here’s a picture, it DOES contain a tank”).
  • Unsupervised learning  – These work by finding natural groupings of input data.  You can then look at the groupings (or ask it to group more data) and can gain information about the nature of the data itself.  This is often used for data mining, having the neural network pick out interesting correlations in the data that a human might not figure out.
  • Some networks are static once they are created and are unable to learn further.
  • Other networks are able to continuously learn as they get more and more data.

Local Minima vs Global Minimum

Just like a human brain, a neural network is not infallible.  It can often think that it has found an answer, or something of interest, when in fact it hasn’t.

Similarly, we as humans can sometimes think we’ve found a solution to a problem, and then someone comes along and says “You forgot to consider this part of it!” and suddenly you realize your solution is not the right answer and it’s back to the drawing board.

A neural network can have the same issue believe it or not.  This can come from the fact that it wasn’t provided with good enough data to learn from, or, just because!  Just like humans, it can either just learn something wrong, or be incorrect about an answer.

If you think of a problem space as a graph, you can think of the lowest point on the graph as being the optimal solution.  Neural networks work by starting at some point on the graph (often chosen randomly) and then traveling downhill until they find the bottom of a dip.

This works great if you happen to find the lowest dip in the graph (also called the global minimum), but if you find a dip that isn’t the lowest, you have effectively become trapped in a local minimum, and end up with the wrong answer, or an imperfect understanding of the problem space.


There are ways of dealing with this problem luckily.  One way is that if you find a minimum, remember where it was, but then choose another random point on the graph and try again to see if you find a deeper minimum.  Rinse and repeat until you are reasonably satisfied with the results.

Have you ever been stuck trying to figure something out and forgotten about it (or gone to sleep), only to come back to it later and find the answer?  I personally attribute that phenomenon to this “randomization” effect.  I don’t know the results of studies to this effect, but if you stop thinking about something for a while, then come back to it, you often see it from a different angle (essentially starting at a different spot on the problem space graph) and can sometimes figure out a better solution, or a deeper understanding of the problem (pun intended!)

Anyways, for normal problem spaces, they probably aren’t going to be just 2d like the above, but are perhaps 3d, 4d, or even higher dimensions.  In the end though, the neural network still is just trying to find the deepest valley it can find by essentially traveling downhill.

Now that you know the basics of neural networks, let’s get onto the implementation of one type of neural network!

Bidirectional Associative Memory (AKA BAM)

Bidirectional associative memory is perhaps the easiest useful neural network to create.  All you need is the ability to multiply vectors by other vectors, multiply vectors by matrices, and add matrices together.  If you know how to do those 3 things, you will be able to program your own neural network very quickly and easily.

In fact, I learned about this guy when I was 14 or so, and was able to implement a simple OCR system using Microsoft Excel (seriously!) 😛

The main point of BAM is to act as memory, where you can teach it to associate several patterns together.  This way, if you teach it to associate a pattern A with a pattern B, when you give it A again, it will spit out B.  Since it’s bidirectional, you can also give it pattern B and it will spit out pattern A in response.  You can teach it several pattern pairs to associate, and can even corrupt some of the data, and it will give you what it thinks is the best match for the data you gave it.  Just like a human brain, it sort of uses its best judgement and can say “umm… I think you meant pattern D but I’m not quite sure”.

Besides associating different patterns, you can also associate patterns with themselves (such as associating pattern A with pattern A and pattern B with pattern B).  If you do this, you are able to put in possibly corrupted data and it will give you what it thinks the data really is.  This way, if you have data that is noisy because it came over the radio, or because you scanned a document with a low quality scanner, it will be able to see through the noise and pick out the correct data (hopefully) in the same way a human could.  There are limits to this of course, just like we sometimes can’t make out messy handwriting.

While BAM is useful, even in some realistic uses of neural networks, it also is a little bit limited compared to more sophisticated neural network implementations.

  • BAM is fairly limited in how many patterns you can teach it
  • You have to teach it in advance via supervised learning.  No further learning happens after it’s created.
  • It’s fairly strict in its mapping from input to output.  This means if you use it to recognize written or typed letters, it will be thrown off by variations in handwriting or different fonts.

That being said, it’s still pretty cool, and lots of fun to play with.

Creating a BAM Network

Creating a BAM network is pretty straight forward.  It has M input bits (you decide how many that is) and N output bits (again, you decide how many that is).

Once you have all of your input / output data pairs, the first step is to convert all the zeros in your pattern pairs to -1’s.   Where 0’s and 1’s are called binary, this form of -1’s and 1’s is called “bipolar”.

Then, for each pattern pair you multiply the input pattern vector (as a column) by the transpose of the output pattern vector (turned on its side, into a row), so that when you multiply them together you get a matrix that is MxN in size.

Continue this for each data pair so that you end up with one matrix per data pair.

Then, add all the matrices together so that you end up with a final matrix.  This is your trained neural network!
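Here is a sketch of that training procedure in C++ (the function and variable names are my own; it just follows the steps above):

#include <vector>

// train a BAM network: convert each 0/1 bit to bipolar (-1/+1), take the outer
// product of each input/output pattern pair, and sum the resulting matrices.
std::vector<std::vector<int>> TrainBAM (
	const std::vector<std::vector<int>>& inputs,   // each entry is M bits (0 or 1)
	const std::vector<std::vector<int>>& outputs)  // each entry is N bits (0 or 1)
{
	size_t M = inputs[0].size();
	size_t N = outputs[0].size();
	std::vector<std::vector<int>> matrix(M, std::vector<int>(N, 0));
	for (size_t pair = 0; pair < inputs.size(); ++pair)
		for (size_t i = 0; i < M; ++i)
			for (size_t j = 0; j < N; ++j)
			{
				int inBit  = inputs[pair][i]  ? 1 : -1;  // 0 becomes -1
				int outBit = outputs[pair][j] ? 1 : -1;
				matrix[i][j] += inBit * outBit;          // outer product, summed per pair
			}
	return matrix;
}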

In the BAM neural network, the topology is that there are M input neurons and N output neurons, with no neurons in between.    The more neurons you have in your network, the more data the neural network is able to store, and the more distinctions between different types of data it’s able to make.  The fact that there are only 2 layers of neurons in BAM is part of the reason for its limitations.

For more information about why 2 layers of neurons are limited in their learning, I recommend searching for information on the perceptron XOR problem and linear separability.  A 2 layer network is inherently incapable of performing (and perhaps “understanding”) XOR!

Using a BAM Network

To use a BAM neural network, you take an input vector (in binary) of size M and multiply it by the matrix. This will result in a vector of size N that is made up of numbers which may be positive, negative, or zero. Convert all positive numbers to 1 and all negative numbers to 0, and you’ll end up with the N sized output pattern that the neural network associated with your input.

Dealing with zeros is sort of up to your own discretion unfortunately. I’ve seen some people say that zeros should be treated as 1’s, other people say that zeros should be treated as 0’s, and other people have other rules, such as “if it’s zero, set it to whatever that bit was the last time it output something”, which IMO is a pretty odd way to deal with it.  I think this is an unfortunate flaw in how the BAM network works, but you can also chalk this up to the network being uncertain of the result, which it basically is.

Since BAM networks are bidirectional, you can also take the transpose of your matrix (turn it on its side) and then multiply it by a vector of size N to get a vector of size M as output, which is the vector associated with your N sized input vector.  So, it works both ways; you can put in an input pattern and get an output pattern out, or you can put in an output pattern and get an input pattern back.
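And here is a sketch of recall, using the same made up conventions as the training sketch. It multiplies the binary input vector by the matrix and thresholds the result, reporting zeros as uncertain bits:

#include <vector>

// recall: output[j] = sum over i of input[i] * matrix[i][j], then threshold.
// positive -> 1, negative -> 0, zero -> -1 to mark an "uncertain" bit.
std::vector<int> RecallBAM (
	const std::vector<std::vector<int>>& matrix,  // the M x N trained matrix
	const std::vector<int>& input)                // M bits (0 or 1)
{
	size_t M = matrix.size();
	size_t N = matrix[0].size();
	std::vector<int> output(N);
	for (size_t j = 0; j < N; ++j)
	{
		int sum = 0;
		for (size_t i = 0; i < M; ++i)
			sum += input[i] * matrix[i][j];
		output[j] = (sum > 0) ? 1 : ((sum < 0) ? 0 : -1);
	}
	return output;
}

To go the other direction, you’d multiply by the transposed matrix, which here just means summing over j instead of i.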

Don’t forget… you can also associate patterns with themselves if you want it to do “pattern recognition” instead of “pattern association”.

Example

Let’s say we want to have an input size of 6 bits and an output size of 4 bits and that we have these 2 data pairs that we want to associate in the neural network:

  1. 101011 <-> 0010
  2. 110010 <-> 0101

The first step is to convert all zeros to -1’s.  Doing so our data pairs become this:

  1. 1 -1 1 -1 1 1 <-> -1 -1 1 -1
  2. 1 1 -1 -1 1 -1 <-> -1 1 -1 1

The next step is to multiply each input pattern by its corresponding output pattern to make a 6×4 matrix for each pattern pair.

First Pair:

[ 1]
[-1]
[ 1] * [-1 -1 1 -1] =
[-1]
[ 1]
[ 1]

[-1 -1  1 -1]
[ 1  1 -1  1]
[-1 -1  1 -1]
[ 1  1 -1  1]
[-1 -1  1 -1]
[-1 -1  1 -1]

Second Pair:

[ 1]
[ 1]
[-1] * [-1 1 -1 1] =
[-1]
[ 1]
[-1]

[-1  1 -1  1]
[-1  1 -1  1]
[ 1 -1  1 -1]
[ 1 -1  1 -1]
[-1  1 -1  1]
[ 1 -1  1 -1]

Next up, you add all the matrices together to get the final trained neural network:

[-1 -1  1 -1]   [-1  1 -1  1]   [-2  0  0  0]
[ 1  1 -1  1]   [-1  1 -1  1]   [ 0  2 -2  2]
[-1 -1  1 -1] + [ 1 -1  1 -1] = [ 0 -2  2 -2]
[ 1  1 -1  1]   [ 1 -1  1 -1]   [ 2  0  0  0]
[-1 -1  1 -1]   [-1  1 -1  1]   [-2  0  0  0]
[-1 -1  1 -1]   [ 1 -1  1 -1]   [ 0 -2  2 -2]

Now that we have our trained neural network, let’s plug in our first input pattern to make sure we get our first output pattern back.  We multiply the input pattern, as a row vector, by the matrix:

                [-2  0  0  0]
                [ 0  2 -2  2]
[1 0 1 0 1 1] * [ 0 -2  2 -2] = [-4 -4 4 -4]
                [ 2  0  0  0]
                [-2  0  0  0]
                [ 0 -2  2 -2]

When we convert negatives to zeros, and positives to ones we get:

0 0 1 0

Which is the first output pattern.  It recalled our pattern correctly!

Next, let’s put the second output pattern into the transposed matrix to see if we can go the opposite direction and recall the second input pattern.  Again we multiply the pattern, as a row vector, by the (transposed) matrix:

            [-2  0  0  2 -2  0]
[0 1 0 1] * [ 0  2 -2  0  0 -2] = [0 4 -4 0 0 -4]
            [ 0 -2  2  0  0  2]
            [ 0  2 -2  0  0 -2]

Converting that to binary by turning negatives into zeros, positives into ones, and zeros into question marks, we get:

? 1 0 ? ? 0

The pattern it was supposed to recall is:

1 1 0 0 1 0

The three bits that it did recall are correct, but as you can see it only recalled 3 of the 6 bits. Not very good!

With just two patterns, the network was unable to recall some of the info it was trained with.

Normally BAM isn’t this bad; it looks like I just chose some unfortunate input / output pairs. If you encounter problems with a network recalling data, sometimes adding more neurons (larger input or output patterns) can help, but sometimes that will be ineffective too. Like I mentioned earlier, a neural network that has only 2 neuron layers – like BAM does – is incapable of learning XOR, no matter how many input or output neurons you have, so these types of networks are somewhat limited.

Using for OCR

If you wanted to use a BAM network for being able to recognize drawn or written letters, one way to do so would be to say “we are going to store our letters in an 8×8 black and white grid”.

That means that you have a grid of binary (black / white) that is 8×8.  Another way to represent the grid of binary would be just to have 64 bits in a row.

So, for the letters you want to train the network to recognize, you would just draw out your letters in an 8×8 grid, take each letter as its 64 bits, and associate each letter with itself.

Your network matrix will be 64×64 but will be able to do simple OCR on 8×8 black and white images.

Oftentimes, images will come to you in color, or not in 8×8 resolution, but what neural network engineers often do in this situation is process the images in advance to make them black and white and 8×8 before feeding them into the neural network.

Now, you are able to feed characters into your neural network and it will attempt to correct any corruption in the image, and return to you an 8×8 image of what it thinks you entered.

Instead of associating a letter’s image with itself, you can also associate it with a number (say, the ASCII code?) so that when you put in the image of a character, it will spit out the number corresponding to the closest match it can find instead of the raw character image itself.

That’s All!

BAM is a nice introductory neural network that nearly anyone can implement.  It may be limited in some ways, but it actually is used in the real world for some applications.

In the future I’ll write about some more advanced neural networks, but until then, I hope you found this informative, or at least interesting! (: