Neural Network Recipe: Recognize Handwritten Digits With 95% Accuracy

This post is a recipe for making a neural network which is able to recognize hand written numeric digits (0-9) with 95% accuracy.

The intent is that you can use this recipe (and included simple C++ code, and interactive web demo!) as a starting point for some hands on experimentation.

A recent post of mine talks about all the things used in this recipe so give it a read if you want more info about anything: How to Train Neural Networks With Backpropagation.

This recipe is also taken straight from this amazing website (but coded from scratch in C++ by myself), where it’s implemented in python: Using neural nets to recognize handwritten digits.

Recipe

The neural network takes as input 28×28 greyscale images, so there will be 784 input neurons.

There is one hidden layer that has 30 neurons.

The final layer is the output layer which has 10 neurons.

The output neuron with the highest activation is the digit that was recognized. For instance if output neuron 0 had the highest activation, the network detected a 0. If output neuron 2 was highest, the network detected a 2.

The neurons use the sigmoid activation function, and the cost function used is half mean squared error.

Training uses a learning rate of 3.0 and the training data is processed by the network 30 times (aka 30 training epochs), using a minibatch size of 10.

A minibatch size of 10 just means that we calculate the gradient for 10 training samples at a time and adjust the weights and biases using that gradient. We do that for the entire (shuffled) 60,000 training items and call that a single epoch. 30 epochs mean we do this full process 30 times.

There are 60,000 items in the training data, mapping 28×28 greyscale images to what digit 0-9 they actually represent.

Besides the 60,000 training data items, there are also 10,000 separate items that are the test data. These test data items are items never seen by the network during training and are just used as a way to see how well the network has learned about the problem in general, versus learning about the specific training data items.

The test and training data is the MNIST data set. I have a link to zip file I made with the data in it below, but this is where I got the data from: The MNIST database of handwritten digits.

That is the entire recipe!

Results

The 30 training epochs took 1 minute 22 seconds on my machine in release x64 (with REPORT_ERROR_WHILE_TRAINING() set to 0 to speed things up), but the code could be made to run faster by using SIMD, putting it on the GPU, getting multithreading involved or other things.

Below is a graph of the accuracy during the training epochs.

Notice that most learning happened very early on and then only slowly improved from there. This is due to our neuron activation functions and also our cost function. There are better choices for both, but this is also an ongoing area of research to improve in neural networks.

The end result of my training run is 95.32% accuracy but you may get slightly higher or lower due to random initialization of weights and biases. That sounds pretty high, but if you were actually using this, 4 or 5 numbers wrong out of 100 is a pretty big deal! The record for MNIST is 99.77% accuracy using “a committee of convolutional networks” where they distorted the input data during training to make it learn in a more generalized way (described as “committee of 35 conv. net, 1-20-P-40-P-150-10 [elastic distortions]”).

A better cost function would probably be the cross entropy cost function, a better activation function than sigmoid would probably be an ELU (Exponential Linear Unit). A soft max layer could be used instead of just taking the maximum output neuron as the answer. The weights could be initialized to smarter values. We could also use a convolutional layer to help let the network learn features in a way that didn’t also tie the features to specific locations in the images.

Many of these things are described in good detail at http://neuralnetworksanddeeplearning.com/, particularly in chapter 3 where they make a python implementation of a convolutional neural network which performs better than this one. I highly recommend checking that website out!

HTML5 Demo

You can play with a network created with this recipe here: Recognize Handwritten Digit 95% Accuracy

Here is an example of it correctly detecting that I drew a 4.

The demo works “pretty well” but it does have a little less than 95% accuracy.

The reason for this though is that it isn’t comparing apples to apples.

A handwritten digit isn’t quite the same as a digit drawn with a mouse. Check out the image below to see 100 of the training images and see what i mean.

The demo finds the bounding box of the drawn image and rescales that bounding box to a 20×20 image, preserving the aspect ratio. It then puts that into a 28×28 image, using the center of mass of the pixels to center the smaller image in the larger one. This is how the MNIST data was generated, so makes the demo more accurate, but it also has the nice side effect of making it so you can draw a number of any size, in any part of the box, and it will treat it the same as if you drew it at a difference size, or in a different part of the box.

The code that goes with this post outputs the weights, biases and network structure in a json format that is very easy to drop into the html5 demo. This way, if you want to try different things in the network, it should be fairly low effort to adjust the demo to try your adjustments there as well.

Lastly, it might be interesting to get the derivatives of the inputs and play around with the input you gave it. Some experiments I can think of:

When it misclassifies what number you drew, have it adjust what you drew (the input) to be more like what the network would expect to see for that digit. This could help show why it misclassified your number.
Start with a well classified number and make it morph into something recognized by the network as a different number.
Start with a random static (noise) image and adjust it until the network recognizes it as a digit. It would be interesting to see if it looked anything like a number, or if it was still just static.

Source Code

The source code and mnist data is on github at MNIST1, but is also included below for your convenience.

If grabbing the source code from below instead of github, you will need to extract this zip file into the working directory of the program as well. It contains the test data used for training the network.
mnist.zip

Note: it’s been reported on that some compilers/OSs, std::random_device will use dev/urandom which can result in lower quality random numbers, which can result in the network not learning as quickly or to as high of accuracy. You can force it to use dev/random instead which will fix that problem. Details can be found here: http://en.cppreference.com/w/cpp/numeric/random/random_device/random_device

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <random>
#include <array>
#include <vector>
#include <algorithm>
#include <chrono>

typedef uint32_t uint32;
typedef uint16_t uint16;
typedef uint8_t uint8;

// Set to 1 to have it show error after each training and also writes it to an Error.csv file.
// Slows down the process a bit (+~50% time on my machine)
#define REPORT_ERROR_WHILE_TRAINING() 1 

const size_t c_numInputNeurons = 784;
const size_t c_numHiddenNeurons = 30;  // NOTE: setting this to 100 hidden neurons can give better results, but also can be worse other times.
const size_t c_numOutputNeurons = 10;

const size_t c_trainingEpochs = 30;
const size_t c_miniBatchSize = 10;
const float c_learningRate = 3.0f;

// ============================================================================================
//                                     SBlockTimer
// ============================================================================================
// times a block of code
struct SBlockTimer
{
	SBlockTimer (const char* label)
	{
		m_start = std::chrono::high_resolution_clock::now();
		m_label = label;
	}

	~SBlockTimer ()
	{
		std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
		printf("%s%0.2f seconds\n", m_label, seconds.count());
	}

	std::chrono::high_resolution_clock::time_point m_start;
	const char* m_label;
};

// ============================================================================================
//                                    MNIST DATA LOADER
// ============================================================================================

inline uint32 EndianSwap (uint32 a)
{
	return (a<<24) | ((a<<8) & 0x00ff0000) |
           ((a>>8) & 0x0000ff00) | (a>>24);
}

// MNIST data and file format description is from http://yann.lecun.com/exdb/mnist/
class CMNISTData
{
public:
	CMNISTData ()
	{
		m_labelData = nullptr;
		m_imageData = nullptr;

		m_imageCount = 0;
		m_labels = nullptr;
		m_pixels = nullptr;
	}

	bool Load (bool training)
	{
		// set the expected image count
		m_imageCount = training ? 60000 : 10000;

		// read labels
		const char* labelsFileName = training ? "train-labels.idx1-ubyte" : "t10k-labels.idx1-ubyte";
		FILE* file = fopen(labelsFileName,"rb");
		if (!file)
		{
			printf("could not open %s for reading.\n", labelsFileName);
			return false;
		}
		fseek(file, 0, SEEK_END);
		long fileSize = ftell(file);
		fseek(file, 0, SEEK_SET);
		m_labelData = new uint8[fileSize];
		fread(m_labelData, fileSize, 1, file);
		fclose(file);

		// read images
		const char* imagesFileName = training ? "train-images.idx3-ubyte" : "t10k-images.idx3-ubyte";
		file = fopen(imagesFileName, "rb");
		if (!file)
		{
			printf("could not open %s for reading.\n", imagesFileName);
			return false;
		}
		fseek(file, 0, SEEK_END);
		fileSize = ftell(file);
		fseek(file, 0, SEEK_SET);
		m_imageData = new uint8[fileSize];
		fread(m_imageData, fileSize, 1, file);
		fclose(file);

		// endian swap label file if needed, just first two uint32's.  The rest is uint8's.
		uint32* data = (uint32*)m_labelData;
		if (data[0] == 0x01080000)
		{
			data[0] = EndianSwap(data[0]);
			data[1] = EndianSwap(data[1]);
		}

		// verify that the label file has the right header
		if (data[0] != 2049 || data[1] != m_imageCount)
		{
			printf("Label data had unexpected header values.\n");
			return false;
		}
		m_labels = (uint8*)&(data[2]);

		// endian swap the image file if needed, just first 4 uint32's. The rest is uint8's.
		data = (uint32*)m_imageData;
		if (data[0] == 0x03080000)
		{
			data[0] = EndianSwap(data[0]);
			data[1] = EndianSwap(data[1]);
			data[2] = EndianSwap(data[2]);
			data[3] = EndianSwap(data[3]);
		}

		// verify that the image file has the right header
		if (data[0] != 2051 || data[1] != m_imageCount || data[2] != 28 || data[3] != 28)
		{
			printf("Label data had unexpected header values.\n");
			return false;
		}
		m_pixels = (uint8*)&(data[4]);

		// convert the pixels from uint8 to float
		m_pixelsFloat.resize(28 * 28 * m_imageCount);
		for (size_t i = 0; i < 28 * 28 * m_imageCount; ++i)
			m_pixelsFloat[i] = float(m_pixels[i]) / 255.0f;

		// success!
		return true;
	}

	~CMNISTData ()
	{
		delete[] m_labelData;
		delete[] m_imageData;
	}

	size_t NumImages () const { return m_imageCount; }

	const float* GetImage (size_t index, uint8& label) const
	{
		label = m_labels[index];
		return &m_pixelsFloat[index * 28 * 28];
	}

private:
	void* m_labelData;
	void* m_imageData;

	size_t m_imageCount;
	uint8* m_labels;
	uint8* m_pixels;

	std::vector<float> m_pixelsFloat;
};

// ============================================================================================
//                                    NEURAL NETWORK
// ============================================================================================

template <size_t INPUTS, size_t HIDDEN_NEURONS, size_t OUTPUT_NEURONS>
class CNeuralNetwork
{
public:
	CNeuralNetwork ()
	{
		// initialize weights and biases to a gaussian distribution random number with mean 0, stddev 1.0
		std::random_device rd;
		std::mt19937 e2(rd());
		std::normal_distribution<float> dist(0, 1);

		for (float& f : m_hiddenLayerBiases)
			f = dist(e2);

		for (float& f : m_outputLayerBiases)
			f = dist(e2);

		for (float& f : m_hiddenLayerWeights)
			f = dist(e2);

		for (float& f : m_outputLayerWeights)
			f = dist(e2);
	}

	void Train (const CMNISTData& trainingData, size_t miniBatchSize, float learningRate)
	{
		// shuffle the order of the training data for our mini batches
		if (m_trainingOrder.size() != trainingData.NumImages())
		{
			m_trainingOrder.resize(trainingData.NumImages());
			size_t index = 0;
			for (size_t& v : m_trainingOrder)
			{
				v = index;
				++index;
			}
		}
		static std::random_device rd;
		static std::mt19937 e2(rd());
		std::shuffle(m_trainingOrder.begin(), m_trainingOrder.end(), e2);

		// process all minibatches until we are out of training examples
		size_t trainingIndex = 0;
		while (trainingIndex < trainingData.NumImages())
		{
			// Clear out minibatch derivatives.  We sum them up and then divide at the end of the minimatch
			std::fill(m_miniBatchHiddenLayerBiasesDeltaCost.begin(), m_miniBatchHiddenLayerBiasesDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchOutputLayerBiasesDeltaCost.begin(), m_miniBatchOutputLayerBiasesDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchHiddenLayerWeightsDeltaCost.begin(), m_miniBatchHiddenLayerWeightsDeltaCost.end(), 0.0f);
			std::fill(m_miniBatchOutputLayerWeightsDeltaCost.begin(), m_miniBatchOutputLayerWeightsDeltaCost.end(), 0.0f);

			// process the minibatch
			size_t miniBatchIndex = 0;
			while (miniBatchIndex < miniBatchSize && trainingIndex < trainingData.NumImages())
			{
				// get the training item
				uint8 imageLabel = 0;
				const float* pixels = trainingData.GetImage(m_trainingOrder[trainingIndex], imageLabel);

				// run the forward pass of the network
				uint8 labelDetected = ForwardPass(pixels, imageLabel);

				// run the backward pass to get derivatives of the cost function
				BackwardPass(pixels, imageLabel);

				// add the current derivatives into the minibatch derivative arrays so we can average them at the end of the minibatch via division.
				for (size_t i = 0; i < m_hiddenLayerBiasesDeltaCost.size(); ++i)
					m_miniBatchHiddenLayerBiasesDeltaCost[i] += m_hiddenLayerBiasesDeltaCost[i];
				for (size_t i = 0; i < m_outputLayerBiasesDeltaCost.size(); ++i)
					m_miniBatchOutputLayerBiasesDeltaCost[i] += m_outputLayerBiasesDeltaCost[i];
				for (size_t i = 0; i < m_hiddenLayerWeightsDeltaCost.size(); ++i)
					m_miniBatchHiddenLayerWeightsDeltaCost[i] += m_hiddenLayerWeightsDeltaCost[i];
				for (size_t i = 0; i < m_outputLayerWeightsDeltaCost.size(); ++i)
					m_miniBatchOutputLayerWeightsDeltaCost[i] += m_outputLayerWeightsDeltaCost[i];

				// note that we've added another item to the minibatch, and that we've consumed another training example
				++trainingIndex;
				++miniBatchIndex;
			}

			// divide minibatch derivatives by how many items were in the minibatch, to get the average of the derivatives.
			// NOTE: instead of doing this explicitly like in the commented code below, we'll do it implicitly
			// by dividing the learning rate by miniBatchIndex.
			/*
			for (float& f : m_miniBatchHiddenLayerBiasesDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchOutputLayerBiasesDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchHiddenLayerWeightsDeltaCost)
				f /= float(miniBatchIndex);
			for (float& f : m_miniBatchOutputLayerWeightsDeltaCost)
				f /= float(miniBatchIndex);
			*/

			float miniBatchLearningRate = learningRate / float(miniBatchIndex);

			// apply training to biases and weights
			for (size_t i = 0; i < m_hiddenLayerBiases.size(); ++i)
				m_hiddenLayerBiases[i] -= m_miniBatchHiddenLayerBiasesDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_outputLayerBiases.size(); ++i)
				m_outputLayerBiases[i] -= m_miniBatchOutputLayerBiasesDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_hiddenLayerWeights.size(); ++i)
				m_hiddenLayerWeights[i] -= m_miniBatchHiddenLayerWeightsDeltaCost[i] * miniBatchLearningRate;
			for (size_t i = 0; i < m_outputLayerWeights.size(); ++i)
				m_outputLayerWeights[i] -= m_miniBatchOutputLayerWeightsDeltaCost[i] * miniBatchLearningRate;
		}
	}

	// This function evaluates the network for the given input pixels and returns the label it thinks it is from 0-9
	uint8 ForwardPass (const float* pixels, uint8 correctLabel)
	{
		// first do hidden layer
		for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
		{
			float Z = m_hiddenLayerBiases[neuronIndex];

			for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
				Z += pixels[inputIndex] * m_hiddenLayerWeights[HiddenLayerWeightIndex(inputIndex, neuronIndex)];

			m_hiddenLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
		}

		// then do output layer
		for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			float Z = m_outputLayerBiases[neuronIndex];

			for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
				Z += m_hiddenLayerOutputs[inputIndex] * m_outputLayerWeights[OutputLayerWeightIndex(inputIndex, neuronIndex)];

			m_outputLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
		}

		// calculate error.
		// this is the magnitude of the vector that is Desired - Actual.
		// Commenting out because it's not needed.
		/*
		{
			error = 0.0f;
			for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
			{
				float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;
				float diff = (desiredOutput - m_outputLayerOutputs[neuronIndex]);
				error += diff * diff;
			}
			error = std::sqrt(error);
		}
		*/

		// find the maximum value of the output layer and return that index as the label
		float maxOutput = m_outputLayerOutputs[0];
		uint8 maxLabel = 0;
		for (uint8 neuronIndex = 1; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			if (m_outputLayerOutputs[neuronIndex] > maxOutput)
			{
				maxOutput = m_outputLayerOutputs[neuronIndex];
				maxLabel = neuronIndex;
			}
		}
		return maxLabel;
	}

	// Functions to get weights/bias values. Used to make the JSON file.
	const std::array<float, HIDDEN_NEURONS>& GetHiddenLayerBiases () const { return m_hiddenLayerBiases; }
	const std::array<float, OUTPUT_NEURONS>& GetOutputLayerBiases () const { return m_outputLayerBiases; }
	const std::array<float, INPUTS * HIDDEN_NEURONS>& GetHiddenLayerWeights () const { return m_hiddenLayerWeights; }
	const std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>& GetOutputLayerWeights () const { return m_outputLayerWeights; }

private:

	static size_t HiddenLayerWeightIndex (size_t inputIndex, size_t hiddenLayerNeuronIndex)
	{
		return hiddenLayerNeuronIndex * INPUTS + inputIndex;
	}

	static size_t OutputLayerWeightIndex (size_t hiddenLayerNeuronIndex, size_t outputLayerNeuronIndex)
	{
		return outputLayerNeuronIndex * HIDDEN_NEURONS + hiddenLayerNeuronIndex;
	}

	// this function uses the neuron output values from the forward pass to backpropagate the error
	// of the network to calculate the gradient needed for training.  It figures out what the error
	// is by comparing the label it came up with to the label it should have come up with (correctLabel).
	void BackwardPass (const float* pixels, uint8 correctLabel)
	{
		// since we are going backwards, do the output layer first
		for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
		{
			// calculate deltaCost/deltaBias for each output neuron.
			// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
			//
			// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
			//
			// deltaCost/deltaO = O - desiredOutput
			// deltaO/deltaZ = O * (1 - O)
			//
			float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;

			float deltaCost_deltaO = m_outputLayerOutputs[neuronIndex] - desiredOutput;
			float deltaO_deltaZ = m_outputLayerOutputs[neuronIndex] * (1.0f - m_outputLayerOutputs[neuronIndex]);

			m_outputLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

			// calculate deltaCost/deltaWeight for each weight going into the neuron
			//
			// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
			// deltaCost/deltaWeight = deltaCost/deltaBias * input
			//
			for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
				m_outputLayerWeightsDeltaCost[OutputLayerWeightIndex(inputIndex, neuronIndex)] = m_outputLayerBiasesDeltaCost[neuronIndex] * m_hiddenLayerOutputs[inputIndex];
		}

		// then do the hidden layer
		for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
		{
			// calculate deltaCost/deltaBias for each hidden neuron.
			// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
			//
			// deltaCost/deltaO =
			//   Sum for each output of this neuron:
			//     deltaCost/deltaDestinationZ * deltaDestinationZ/deltaSourceO
			//
			// deltaCost/deltaDestinationZ is already calculated and lives in m_outputLayerBiasesDeltaCost[destinationNeuronIndex].
			// deltaTargetZ/deltaSourceO is the value of the weight connecting the source and target neuron.
			//
			// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
			// deltaO/deltaZ = O * (1 - O)
			//
			float deltaCost_deltaO = 0.0f;
			for (size_t destinationNeuronIndex = 0; destinationNeuronIndex < OUTPUT_NEURONS; ++destinationNeuronIndex)
				deltaCost_deltaO += m_outputLayerBiasesDeltaCost[destinationNeuronIndex] * m_outputLayerWeights[OutputLayerWeightIndex(neuronIndex, destinationNeuronIndex)];
			float deltaO_deltaZ = m_hiddenLayerOutputs[neuronIndex] * (1.0f - m_hiddenLayerOutputs[neuronIndex]);
			m_hiddenLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

			// calculate deltaCost/deltaWeight for each weight going into the neuron
			//
			// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
			// deltaCost/deltaWeight = deltaCost/deltaBias * input
			//
			for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
				m_hiddenLayerWeightsDeltaCost[HiddenLayerWeightIndex(inputIndex, neuronIndex)] = m_hiddenLayerBiasesDeltaCost[neuronIndex] * pixels[inputIndex];
		}
	}

private:

	// biases and weights
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiases;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerBiases;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeights;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeights;

	// neuron activation values aka "O" values
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerOutputs;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerOutputs;

	// derivatives of biases and weights for a single training example
	std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiasesDeltaCost;
	std::array<float, OUTPUT_NEURONS>					m_outputLayerBiasesDeltaCost;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeightsDeltaCost;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeightsDeltaCost;

	// derivatives of biases and weights for the minibatch. Average of all items in minibatch.
	std::array<float, HIDDEN_NEURONS>					m_miniBatchHiddenLayerBiasesDeltaCost;
	std::array<float, OUTPUT_NEURONS>					m_miniBatchOutputLayerBiasesDeltaCost;

	std::array<float, INPUTS * HIDDEN_NEURONS>			m_miniBatchHiddenLayerWeightsDeltaCost;
	std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_miniBatchOutputLayerWeightsDeltaCost;

	// used for minibatch generation
	std::vector<size_t>									m_trainingOrder;
};

// ============================================================================================
//                                   DRIVER PROGRAM
// ============================================================================================

// training and test data
CMNISTData g_trainingData;
CMNISTData g_testData;

// neural network
CNeuralNetwork<c_numInputNeurons, c_numHiddenNeurons, c_numOutputNeurons> g_neuralNetwork;

float GetDataAccuracy (const CMNISTData& data)
{
	size_t correctItems = 0;
	for (size_t i = 0, c = data.NumImages(); i < c; ++i)
	{
		uint8 label;
		const float* pixels = data.GetImage(i, label);
		uint8 detectedLabel = g_neuralNetwork.ForwardPass(pixels, label);

		if (detectedLabel == label)
			++correctItems;
	}
	return float(correctItems) / float(data.NumImages());
}

void ShowImage (const CMNISTData& data, size_t imageIndex)
{
	uint8 label = 0;
	const float* pixels = data.GetImage(imageIndex, label);
	printf("showing a %i\n", label);
	for (int iy = 0; iy < 28; ++iy)
	{
		for (int ix = 0; ix < 28; ++ix)
		{
			if (*pixels < 0.125)
				printf(" ");
			else
				printf("+");
			++pixels;
		}
		printf("\n");
	}
}

int main (int argc, char** argv)
{
	// load the MNIST data if we can
	if (!g_trainingData.Load(true) || !g_testData.Load(false))
	{
		printf("Could not load mnist data, aborting!\n");
		system("pause");
		return 1;
	}

	#if REPORT_ERROR_WHILE_TRAINING()
	FILE *file = fopen("Error.csv","w+t");
	if (!file)
	{
		printf("Could not open Error.csv for writing, aborting!\n");
		system("pause");
		return 2;
	}
	fprintf(file, "\"Training Data Accuracy\",\"Testing Data Accuracy\"\n");
	#endif

	{
		SBlockTimer timer("Training Time:  ");

		// train the network, reporting error before each training
		for (size_t epoch = 0; epoch < c_trainingEpochs; ++epoch)
		{
			#if REPORT_ERROR_WHILE_TRAINING()
				float accuracyTraining = GetDataAccuracy(g_trainingData);
				float accuracyTest = GetDataAccuracy(g_testData);
				printf("Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
				printf("Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);
				fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
			#endif

			printf("Training epoch %zu / %zu...\n", epoch+1, c_trainingEpochs);
			g_neuralNetwork.Train(g_trainingData, c_miniBatchSize, c_learningRate);
			printf("\n");
		}
	}

	// report final error
	float accuracyTraining = GetDataAccuracy(g_trainingData);
	float accuracyTest = GetDataAccuracy(g_testData);
	printf("\nFinal Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
	printf("Final Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);

	#if REPORT_ERROR_WHILE_TRAINING()
		fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
		fclose(file);
	#endif

	// Write out the final weights and biases as JSON for use in the web demo
	{
		FILE* file = fopen("WeightsBiasesJSON.txt", "w+t");
		fprintf(file, "{\n");

		// network structure
		fprintf(file, "  \"InputNeurons\":%zu,\n", c_numInputNeurons);
		fprintf(file, "  \"HiddenNeurons\":%zu,\n", c_numHiddenNeurons);
		fprintf(file, "  \"OutputNeurons\":%zu,\n", c_numOutputNeurons);

		// HiddenBiases
		auto hiddenBiases = g_neuralNetwork.GetHiddenLayerBiases();
		fprintf(file, "  \"HiddenBiases\" : [\n");
		for (size_t i = 0; i < hiddenBiases.size(); ++i)
		{
			fprintf(file, "    %f", hiddenBiases[i]);
			if (i < hiddenBiases.size() -1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ],\n");

		// HiddenWeights
		auto hiddenWeights = g_neuralNetwork.GetHiddenLayerWeights();
		fprintf(file, "  \"HiddenWeights\" : [\n");
		for (size_t i = 0; i < hiddenWeights.size(); ++i)
		{
			fprintf(file, "    %f", hiddenWeights[i]);
			if (i < hiddenWeights.size() - 1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ],\n");

		// OutputBiases
		auto outputBiases = g_neuralNetwork.GetOutputLayerBiases();
		fprintf(file, "  \"OutputBiases\" : [\n");
		for (size_t i = 0; i < outputBiases.size(); ++i)
		{
			fprintf(file, "    %f", outputBiases[i]);
			if (i < outputBiases.size() - 1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ],\n");

		// OutputWeights
		auto outputWeights = g_neuralNetwork.GetOutputLayerWeights();
		fprintf(file, "  \"OutputWeights\" : [\n");
		for (size_t i = 0; i < outputWeights.size(); ++i)
		{
			fprintf(file, "    %f", outputWeights[i]);
			if (i < outputWeights.size() - 1)
				fprintf(file, ",");
			fprintf(file, "\n");
		}
		fprintf(file, "  ]\n");

		// done
		fprintf(file, "}\n");
		fclose(file);
	}

	// You can use the code like the below to visualize an image if you want to.
	//ShowImage(g_testData, 0);

	system("pause");
	return 0;
}

Thanks for reading, and if you have any questions, comments, or just want to chat, hit me up in the comments below, or on twitter at @Atrix256.

Neural Network Gradients: Backpropagation, Dual Numbers, Finite Differences

In the post How to Train Neural Networks With Backpropagation I said that you could also calculate the gradient of a neural network by using dual numbers or finite differences.

By special request, this is that post!

The post I already linked to explains backpropagation.

If you want an explanation of dual numbers, check out these posts:

If you want an explanation of finite differences, check out this post:
Finite Differences

Since the fundamentals are explained in the links above, we’ll go straight to the code.

We’ll be getting the gradient (learning values) for the network in example 4 in the backpropagation post:

Note that I am using “central differences” for the gradient, but it would be more efficient to do a forward or backward difference, at the cost of some accuracy. I’m using an epsilon of 0.001.

I didn’t compare the running times of each method as my code is meant to be readable, not fast, and the code isn’t doing enough work to make a meaningful performance test IMO. If you did do a speed test, the finite differences should be by far the slowest, backpropagation should be the fastest, and dual numbers are probably going to be closer to backpropagation than to finite differences.

The output of the program is below. Both backpropagation and dual numbers get the exact derivatives (within the tolerances of floating point math of course!) because they use the chain rule, whereas finite differences is a numerical approximation. This shows up in the fact that backpropagation and dual numbers agree for all values, where finite differences has some small error in the derivatives.

And here is the code:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>
  
#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

//----------------------------------------------------------------------
// Dual Number Class - CDualNumber
//----------------------------------------------------------------------

template <size_t NUMVARIABLES>
class CDualNumber
{
public:
 
    // constructor to make a constant
    CDualNumber (float f = 0.0f) {
        m_real = f;
        std::fill(m_dual.begin(), m_dual.end(), 0.0f);
    }
 
    // constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
    CDualNumber (float f, size_t variableIndex) {
        m_real = f;
        std::fill(m_dual.begin(), m_dual.end(), 0.0f);
        m_dual[variableIndex] = 1.0f;
    }

	// Set a constant value.
	void Set (float f) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
	}

	// Set a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
	void Set (float f, size_t variableIndex) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
		m_dual[variableIndex] = 1.0f;
	}
 
    // storage for real and dual values
    float                           m_real;
    std::array<float, NUMVARIABLES> m_dual;
};
  
//----------------------------------------------------------------------
// Dual Number Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real + b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
    return ret;
}
  
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real - b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real * b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = a.m_real / b.m_real;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
    return ret;
}

// NOTE: the "special functions" below all just use the chain rule, which you can also use to add more functions
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    float sqrtReal = sqrt(a.m_real);
    ret.m_real = sqrtReal;
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = pow(a.m_real, y);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
    return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> exp (const CDualNumber<NUMVARIABLES>& a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = exp(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] * exp(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = sin(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = cos(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = tan(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
    return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
    CDualNumber<NUMVARIABLES> ret;
    ret.m_real = tan(a.m_real);
    for (size_t i = 0; i < NUMVARIABLES; ++i)
        ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
    return ret;
}
 
// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
    return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Driver Program
//----------------------------------------------------------------------

enum EWeightsBiases
{
	e_weight0 = 0,
	e_weight1,
	e_weight2,
	e_weight3,
	e_weight4,
	e_weight5,
	e_weight6,
	e_weight7,
	e_bias0,
	e_bias1,
	e_bias2,
	e_bias3,

	e_count
};

// our dual number needs a "dual" for every value we want a derivative for: aka every weight and bias
typedef CDualNumber<EWeightsBiases::e_count> TDualNumber;

// templated so it can work for both the dual numbers, as well as the float finite differences
template <typename TBaseType>
void ForwardPass (const std::array<TBaseType, 2>& input, const std::array<TBaseType, 2>& desiredOutput, const std::array<TBaseType, EWeightsBiases::e_count>& weightsBiases, TBaseType& cost)
{
	// calculate hidden layer neuron activations
	TBaseType Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
	TBaseType O0 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z0 * TBaseType(-1.0f)));

	TBaseType Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
	TBaseType O1 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z1 * TBaseType(-1.0f)));

	// calculate output layer neuron activations
	TBaseType Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
	TBaseType O2 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z2 * TBaseType(-1.0f)));

	TBaseType Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
	TBaseType O3 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z3 * TBaseType(-1.0f)));

	// calculate the cost: 0.5 * ||target-actual||^2
	// aka cost = half (error squared)
	TBaseType diff1 = TBaseType(desiredOutput[0]) - O2;
	TBaseType diff2 = TBaseType(desiredOutput[1]) - O3;
	cost = TBaseType(0.5f) * (diff1*diff1 + diff2*diff2);
}

// backpropagation
void ForwardPassAndBackpropagation (
    const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, EWeightsBiases::e_count>& weightsBiases,
    float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, EWeightsBiases::e_count>& deltaWeightsBiases
) {
    // calculate Z0 and O0 for neuron0
    float Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
    float O0 = 1.0f / (1.0f + std::exp(-Z0));
 
    // calculate Z1 and O1 for neuron1
    float Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
    float O1 = 1.0f / (1.0f + std::exp(-Z1));
 
    // calculate Z2 and O2 for neuron2
    float Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
    float O2 = 1.0f / (1.0f + std::exp(-Z2));
 
    // calculate Z3 and O3 for neuron3
    float Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
    float O3 = 1.0f / (1.0f + std::exp(-Z3));
 
    // the actual output of the network is the activation of the output layer neurons
    actualOutput[0] = O2;
    actualOutput[1] = O3;
 
    // calculate error
    float diff0 = desiredOutput[0] - actualOutput[0];
    float diff1 = desiredOutput[1] - actualOutput[1];
    error = std::sqrt(diff0*diff0 + diff1*diff1);
 
    // calculate cost
    cost = 0.5f * error * error;
 
    //----- Neuron 2 -----
 
    // calculate how much a change in neuron 2 activation affects the cost function
    // deltaCost/deltaO2 = O2 - target0
    float deltaCost_deltaO2 = O2 - desiredOutput[0];
 
    // calculate how much a change in neuron 2 weighted input affects neuron 2 activation
    // deltaO2/deltaZ2 = O2 * (1 - O2)
    float deltaO2_deltaZ2 = O2 * (1 - O2);
 
    // calculate how much a change in neuron 2 weighted input affects the cost function.
    // This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
    // This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
    float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
	deltaWeightsBiases[e_bias2] = neuron2Error;
 
    // calculate how much a change in weight4 affects the cost function.
    // deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
    // deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
    // deltaCost/deltaWeight4 = neuron2Error * O0
    // similar thing for weight5
	deltaWeightsBiases[e_weight4] = neuron2Error * O0;
	deltaWeightsBiases[e_weight5] = neuron2Error * O1;
 
    //----- Neuron 3 -----
 
    // calculate how much a change in neuron 3 activation affects the cost function
    // deltaCost/deltaO3 = O3 - target1
    float deltaCost_deltaO3 = O3 - desiredOutput[1];
 
    // calculate how much a change in neuron 3 weighted input affects neuron 3 activation
    // deltaO3/deltaZ3 = O3 * (1 - O3)
    float deltaO3_deltaZ3 = O3 * (1 - O3);
 
    // calculate how much a change in neuron 3 weighted input affects the cost function.
    // This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
    // This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
    float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
	deltaWeightsBiases[e_bias3] = neuron3Error;
 
    // calculate how much a change in weight6 affects the cost function.
    // deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
    // deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
    // deltaCost/deltaWeight6 = neuron3Error * O0
    // similar thing for weight7
	deltaWeightsBiases[e_weight6] = neuron3Error * O0;
	deltaWeightsBiases[e_weight7] = neuron3Error * O1;
 
    //----- Neuron 0 -----
 
    // calculate how much a change in neuron 0 activation affects the cost function
    // deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
    // deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
    float deltaCost_deltaO0 = neuron2Error * weightsBiases[e_weight4] + neuron3Error * weightsBiases[e_weight6];
 
    // calculate how much a change in neuron 0 weighted input affects neuron 0 activation
    // deltaO0/deltaZ0 = O0 * (1 - O0)
    float deltaO0_deltaZ0 = O0 * (1 - O0);
 
    // calculate how much a change in neuron 0 weighted input affects the cost function.
    // This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
    // This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
    float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaWeightsBiases[e_bias0] = neuron0Error;
 
    // calculate how much a change in weight0 affects the cost function.
    // deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
    // deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
    // deltaCost/deltaWeight0 = neuron0Error * input0
    // similar thing for weight1
	deltaWeightsBiases[e_weight0] = neuron0Error * input[0];
	deltaWeightsBiases[e_weight1] = neuron0Error * input[1];
 
    //----- Neuron 1 -----
 
    // calculate how much a change in neuron 1 activation affects the cost function
    // deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
    // deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
    float deltaCost_deltaO1 = neuron2Error * weightsBiases[e_weight5] + neuron3Error * weightsBiases[e_weight7];
 
    // calculate how much a change in neuron 1 weighted input affects neuron 1 activation
    // deltaO1/deltaZ1 = O1 * (1 - O1)
    float deltaO1_deltaZ1 = O1 * (1 - O1);
 
    // calculate how much a change in neuron 1 weighted input affects the cost function.
    // This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
    // This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
    float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaWeightsBiases[e_bias1] = neuron1Error;
 
    // calculate how much a change in weight2 affects the cost function.
    // deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
    // deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
    // deltaCost/deltaWeight2 = neuron1Error * input0
    // similar thing for weight3
	deltaWeightsBiases[e_weight2] = neuron1Error * input[0];
	deltaWeightsBiases[e_weight3] = neuron1Error * input[1];
}
 
int main (int argc, char **argv)
{

	// weights and biases, inputs and desired output
	const std::array<float, EWeightsBiases::e_count> weightsBiases =
	{
		0.15f, // e_weight0
		0.2f,  // e_weight1
		0.25f, // e_weight2
		0.3f,  // e_weight3
		0.4f,  // e_weight4
		0.45f, // e_weight5
		0.5f,  // e_weight6
		0.55f, // e_weight7
		0.35f, // e_bias0
		0.35f, // e_bias1
		0.6f,  // e_bias2
		0.6f   // e_bias3
	};

	const std::array<float, 2> inputs = 
	{
		0.05f,
		0.1f
	};

	std::array<float, 2> desiredOutput = {
		0.01f,
		0.99f
	};

	// =====================================================
	// ===== FINITE DIFFERENCES CALCULATED DERIVATIVES =====
	// =====================================================

	std::array<float, EWeightsBiases::e_count> gradientFiniteDifferences;
	std::array<float, EWeightsBiases::e_count> weightsBiasesFloat;
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
		weightsBiasesFloat[i] = weightsBiases[i];

	// use central differences to approximate the gradient
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
	{
		float costSample1 = 0.0f;
		weightsBiasesFloat[i] = weightsBiases[i] - EPSILON;
		ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample1);

		float costSample2 = 0.0f;
		weightsBiasesFloat[i] = weightsBiases[i] + EPSILON;
		ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample2);

		gradientFiniteDifferences[i] = (costSample2 - costSample1) / (EPSILON * 2.0f);

		weightsBiasesFloat[i] = weightsBiases[i];
	}

	// ==============================================
	// ===== DUAL NUMBER CALCULATED DERIVATIVES =====
	// ==============================================

	// dual number weights and biases
	std::array<TDualNumber, EWeightsBiases::e_count> weightsBiasesDual;
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
		weightsBiasesDual[i].Set(weightsBiases[i], i);

	// dual number inputs and desired output
	std::array<TDualNumber, 2> inputsDual;
	for (size_t i = 0; i < inputsDual.size(); ++i)
		inputsDual[i].Set(inputs[i]);

	std::array<TDualNumber, 2> desiredOutputDual;
	for (size_t i = 0; i < desiredOutputDual.size(); ++i)
		desiredOutputDual[i].Set(desiredOutput[i]);

	// dual number derivatives
	TDualNumber gradientDualNumbers;
	ForwardPass(inputsDual, desiredOutputDual, weightsBiasesDual, gradientDualNumbers);

	// ==================================================
	// ===== BACKPROPAGATION CALCULATED DERIVATIVES =====
	// ==================================================

	float error;
	float cost;
	std::array<float, 2> actualOutput;
	std::array<float, EWeightsBiases::e_count> gradientBackPropagation;
	ForwardPassAndBackpropagation(inputs, desiredOutput, weightsBiases, error, cost, actualOutput, gradientBackPropagation);

	// ==========================
	// ===== Report Results =====
	// ==========================

	printf("Neural Network GradientnnBackpropagation     Dual Numbers (Error)       Finite Differences (Error)n");
	for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
	{
		float diffDual = gradientBackPropagation[i] - gradientDualNumbers.m_dual[i];
		float diffFinitDifferences = gradientBackPropagation[i] - gradientFiniteDifferences[i];
		printf("% 08f,         % 08f (% 08f),     % 08f (% 08f)n",
			gradientBackPropagation[i],
			gradientDualNumbers.m_dual[i], diffDual,
			gradientFiniteDifferences[i], diffFinitDifferences
		);
	}
	printf("n");

	system("pause");
    return 0;
}

How to Train Neural Networks With Backpropagation

This post is an attempt to demystify backpropagation, which is the most common method for training neural networks. This post is broken into a few main sections:

Explanation
Working through examples
Simple sample C++ source code using only standard includes
Links to deeper resources to continue learning

Let’s talk about the basics of neural nets to start out, specifically multi layer perceptrons. This is a common type of neural network, and is the type we will be talking about today. There are other types of neural networks though such as convolutional neural networks, recurrent neural networks, Hopfield networks and more. The good news is that backpropagation applies to most other types of neural networks too, so what you learn here will be applicable to other types of networks.

Basics of Neural Networks

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron. This network takes 3 floating point values as input, passes them through 4 neurons in a hidden layer and outputs two floating point values. The hidden layer neurons and the output layer neurons do processing of the values they are giving, but the input neurons do not.

To calculate the output value of a single neuron, you multiply every input into that neuron by a weight for that input, sum them up, and add a bias that is set for the neuron. This “weighted input” value is fed into an activation function and the result is the output value of that neuron. Here is a diagram for a single neuron:

The code for calculating the output of a single neuron could look like this:

float weightedInput = bias;

for (int i = 0; i < inputs.size(); ++i)
  weightedInput += inputs[i] * weights[i];

float output = Activation(weightedInput);

To evaluate an entire network of neurons, you just repeat this process for all neurons in the network, going from left to right (from input to output).

Neural networks are basically black boxes. We train them to give specific ouputs when we give them specific inputs, but it is often difficult to understand what it is that they’ve learned, or what part of the data they are picking up on.

Training a neural network just means that we adjust the weight and bias values such that when we give specific inputs, we get the desired outputs from the network. Being able to figure out what weights and biases to use can be tricky, especially for networks with lots of layers and lots of neurons per layer. This post talks about how to do just that.

Regarding training, there is a funny story where some people trained a neural network to say whether or not a military tank was in a photograph. It had a very high accuracy rate with the test data they trained it with, but when they used it with new data, it had terrible accuracy. It turns out that the training data was a bit flawed. Pictures of tanks were all taken on a sunny day, and the pictures without tanks were taken on a cloudy day. The network learned how to detect whether a picture was of a sunny day or a cloudy day, not whether there was a tank in the photo or not!

This is one type of pitfall to watch out for when dealing with neural networks – having good training data – but there are many other pitfalls to watch out for too. Architecting and training neural networks is quite literally an art form. If it were painting, this post would be teaching you how to hold a brush and what the primary colors are. There are many, many techniques to learn beyond what is written here to use as tools in your toolbox. The information in this post will allow you to succeed in training neural networks, but there is a lot more to learn to get higher levels of accuracy from your nets!

Neural Networks Learn Using Gradient Descent

Let’s take a look at a simple neural network where we’ve chosen random values for the weights and the bias:

If given two floating point inputs, we’d calculate the output of the network like this:

$Output = Activation(Input0 * Weight0 + Input1 * Weight1 + Bias)$

Plugging in the specific values for the weights and biases, it looks like this:

$Output = Activation(Input0 * 0.23 + Input1 * -0.1 + 0.3)$

Let’s say that we want this network to output a zero when we give an input of 1,0, and that we don’t care what it outputs otherwise. We’ll plug 1 and 0 in for Input0 and Input1 respectively and see what the output of the network is right now:

$Output = Activation(1* 0.23 + 0 * -0.1 + 0.3) \\ Output = Activation(0.53)$

For the activation function, we are going to use a common one called the sigmoid activation function, which is also sometimes called the logistic activation function. It looks like this:

$\sigma(x) = \frac{1}{1+e^{-x}}$

Without going into too much detail, the reason why sigmoid is so commonly used is because it’s a smoother and differentiable version of the step function.

Applying that activation function to our output neuron, we get this:

$Output = Activation(0.53) \\ Output = \sigma(0.53) \\ Output = 0.6295$

So, we plugged in 1 and 0, but instead of getting a 0 out, we got 0.6295. Our weights and biases are wrong, but how do we correct them?

The secret to correcting our weights and biases is whichever of these terms seem least scary to you: slopes, derivatives, gradients.

If “slope” was the least scary term to you, you probably remember the line formula $y=mx+b$ and that the m value was the “rise over run” or the slope of the line. Well believe it or not, that’s all a derivative is. A derivative is just the slope of a function at a specific point on that function. Even if a function is curved, you can pick a point on the graph and get a slope at that point. The notation for a derivative is $\frac{dy}{dx}$ , which literally means “change in y divided by change in x”, or “delta y divided by delta x”, which is literally rise over run.

In the case of a linear function (a line), it has the same derivative over the entire thing, so you can take a step size of any size on the x axis and multiply that step size by $\frac{dy}{dx}$ to figure out how much to add or subtract from y to stay on the line.

In the case of a non linear function, the derivative can change from one point to the next, so this slope is only guaranteed to be accurate for an infinitely small step size. In practice, people just often use “small” step sizes and calling it good enough, which is what we’ll be doing momentarily.

Now that you realize you already knew what a derivative is, we have to talk about partial derivatives. There really isn’t anything very scary about them and they still mean the exact same thing – they are the slope! They are even calculated the exact same way, but they use a fancier looking d in their notation: $\frac{\partial y}{\partial x}$ .

The reason partial derivatives even exist is because if you have a function of multiple variables like $z=f(x,y)=x^2+3y+2$ , you have two variables that you can take the derivative of. You can calculate $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ . The first value tells you how much the z value changes for a change in x, the second value tells you how much the z value changes for a change in y.

By the way, if you are curious, the partial derivatives for that function above are below. When calculating partial derivatives, any variable that isn’t the one you care about, you just treat as a constant and do normal derivation.

$\frac{\partial z}{\partial x} = 2x\\ \frac{\partial z}{\partial y} = 3\\$

If you put both of those values together into a vector $(\frac{\partial z}{\partial x},\frac{\partial z}{\partial y})$ you have what is called the gradient vector.

The gradient vector has an interesting property, which is that it points in the direction that makes the function output grow the most. Basically, if you think of your function as a surface, it points up the steepest direction of the surface, from the point you evaluated the function at.

We are going to use that property to train our neural network by doing the following:

Calculate the gradient of a function that describes the error in our network. This means we will have the partial derivatives of all the weights and biases in the network.
Multiply the gradient by a small “learning rate” value, such as 0.05
Subtract these scaled derivatives from the weights and biases to decrease the error a small amount.

This technique is called steepest gradient descent (SGD) and when we do the above, our error will decrease by a small amount. The only exception is that if we use too large of a learning rate, it’s possible that we make the error grow, but usually the error will decrease.

We will do the above over and over, until either the error is small enough, or we’ve decided we’ve tried enough iterations that we think the neural network is never going to learn the things we want to teach it. If the network doesn’t learn, it means it needs to be re-architected with a different structure, different numbers of neurons and layers, different activation functions, etc. This is part of the “art” that I mentioned earlier.

Before moving on, there is one last thing to talk about: global minimums vs local minimums.

Imagine that the function describing the error in our network is visualized as bumpy ground. When we initialize our weights and biases to random numbers we are basically just choosing a random location on the ground to start at. From there, we act like a ball, and just roll down hill from wherever we are. We are definitely going to get to the bottom of SOME bump / hole in the ground, but there is absolutely no reason to except that we’ll get to the bottom of the DEEPEST bump / hole.

The problem is that SGD will find a LOCAL minimum – whatever we are closest too – but it might not find the GLOBAL minimum.

In practice, this doesn’t seem to be too large of a problem, at least for people casually using neural nets like you and me, but it is one of the active areas of research in neural networks: how do we do better at finding more global minimums?

You might notice the strange language I’m using where I say we have a function that describes the error, instead of just saying we use the error itself. The function I’m talking about is called the “cost function” and the reason for this is that different ways of describing the error give us different desirable properties.

For instance, a common cost function is to use mean squared error of the actual output compared to the desired output.

For a single training example, you plug the input into the network and calculate the output. You then plug the actual output and the target output into the function below:

$Cost = ||target-output||^2$

In other words, you take the vector of the neuron outputs, subtract it from the actual output that we wanted, calculate the length of the resulting vector and square it. This gives you the squared error.

The reason we use squared error in the cost function is because this way error in either direction is a positive number, so when gradient descent does it’s work, we’ll find the smallest magnitude of error, regardless of whether it’s positive or negative amounts. We could use absolute value, but absolute value isn’t differentiable, while squaring is.

To handle calculating the cost of multiple inputs and outputs, you just take the average of the squared error for each piece of training data. This gives you the mean squared error as the cost function across all inputs. You also average the derivatives to get the combined gradient.

More on Training

Before we go into backpropagation, I want to re-iterate this point: Neural Networks Learn Using Gradient Descent.

All you need is the gradient vector of the cost function, aka the partial derivatives of all the weights and the biases for the cost.

Backpropagation gets you the gradient vector, but it isn’t the only way to do so!

Another way to do it is to use dual numbers which you can read about on my post about them: Multivariable Dual Numbers & Automatic Differentiation.

Using dual numbers, you would evaluate the output of the network, using dual numbers instead of floating point numbers, and at the end you’d have your gradient vector. It’s not quite as efficient as backpropagation (or so I’ve heard, I haven’t tried it), but if you know how dual numbers work, it’s super easy to implement.

Another way to get the gradient vector is by doing so numerically using finite differences. You can read about numerical derivatives on my post here: Finite Differences

Basically what you would do is if you were trying to calculate the partial derivative of a weight, like $\frac{\partial Cost}{\partial Weight0}$ , you would first calculate the cost of the network as usual, then you would add a small value to Weight0 and evaluate the cost again. You subtract the new cost from the old cost, and divide by the small value you added to Weight0. This will give you the partial derivative for that weight value. You’d repeat this for all your weights and biases.

Since realistic neural networks often have MANY MANY weights and biases, calculating the gradient numerically is a REALLY REALLY slow process because of how many times you have to run the network to get cost values with adjusted weights. The only upside is that this method is even easier to implement than dual numbers. You can literally stop reading and go do this right now if you want to 😛

Lastly, there is a way to train neural networks which doesn’t use derivatives or the gradient vector, but instead uses the more brute force-ish method of genetic algorithms.

Using genetic algorithms to train neural networks is a huge topic even to summarize, but basically you create a bunch of random networks, see how they do, and try combining features of networks that did well. You also let some of the “losers” reproduce as well, and add in some random mutation to help stay out of local minimums. Repeat this for many many generations, and you can end up with a well trained network!

Here’s a fun video visualizing neural networks being trained by genetic algorithms: Youtube: Learning using a genetic algorithm on a neural network

Backpropagation is Just the Chain Rule!

Going back to our talk of dual numbers for a second, dual numbers are useful for what is called “forward mode automatic differentiation”.

Backpropagation actually uses “reverse mode automatic differentiation”, so the two techniques are pretty closely tied, but they are both made possible by what is known as the chain rule.

The chain rule basically says that if you can write a derivative like this:

$dy/dx$

That you can also write it like this:

$dy/du*du/dx$

That might look weird or confusing, but since we know that derivatives are actual values, aka actual ratios, aka actual FRACTIONS, let’s think back to fractions for a moment.

$3/2 = 1.5$

So far so good? Now let’s choose some number out of the air – say, 5 – and do the same thing we did with the chain rule
$3/2 = \\ 3/5 * 5/2 = \\ 15/10 = \\ 3/2 = \\ 1.5$

Due to doing the reverse of cross cancellation, we are able to inject multiplicative terms into fractions (and derivatives!) and come up with the same answer.

Ok, but who cares?

Well, when we are evaluating the output of a neural network for given input, we have lots of equations nested in each other. We have neurons feeding into neurons feeding into neurons etc, with the logistic activation function at each step.

Instead of trying to figure out how to calculate the derivatives of the weights and biases for the entire monster equation (it’s common to have hundreds or thousands of neurons or more!), we can instead calculate derivatives for each step we do when evaluating the network and then compose them together.

Basically, we can break the problem into small bites instead of having to deal with the equation in it’s entirety.

Instead of calculating the derivative of how a specific weight affects the cost directly, we can instead calculate these:

dCost/dOutput: The derivative of how a neuron’s output affects cost
dOutput/dWeightedInput: The derivative of how the weighted input of a neuron affects a neuron’s output
dWeightedInput/dWeight: The derivative of how a weight affects the weighted input of a neuron

Then, when we multiply them all together, we get the real value that we want:
dCost/dOutput * dOutput/dWeightedInput * dWeightedInput/dWeight = dCost/dWeight

Now that we understand all the basic parts of back propagation, I think it’d be best to work through some examples of increasing complexity to see how it all actually fits together!

Backpropagation Example 1: Single Neuron, One Training Example

This example takes one input and uses a single neuron to make one output. The neuron is only trained to output a 0 when given a 1 as input, all other behavior is undefined. This is implemented as the Example1() function in the sample code.

Backpropagation Example 2: Single Neuron, Two Training Examples

Let’s start with the same neural network from last time:

This time, we are going to teach it not only that it should output 0 when given a 1, but also that it should output 1 when given a 0.

We have two training examples, and we are training the neuron to act like a NOT gate. This is implemented as the Example2() function in the sample code.

The first thing we do is calculate the derivatives (gradient vector) for each of the inputs.

We already calculated the “input 1, output 0” derivatives in the last example:
$\frac{\partial Cost}{\partial Weight} = 0.1476 \\ \frac{\partial Cost}{\partial Bias} = 0.1476$

If we follow the same steps with the “input 0, output 1” training example we get these:
$\frac{\partial Cost}{\partial Weight} = 0.0 \\ \frac{\partial Cost}{\partial Bias} = -0.0887$

To get the actual derivatives to train the network with, we just average them!
$\frac{\partial Cost}{\partial Weight} = 0.0738 \\ \frac{\partial Cost}{\partial Bias} = 0.0294$

From there, we do the same adjustments as before to the weight and bias values to get a weight of 0.2631 and a bias of 0.4853.

If you are wondering how to calculate the cost, again you just take the cost of each training example and average them. Adjusting the weight and bias values causes the cost to drop from 0.1547 to 0.1515, so we have made progress.

It takes 10 times as many iterations with these two training examples to get the same level of error as it did with only one training example though.

As we saw in the last example, after 10,000 iterations, the error was 0.007176.

In this example, after 100,000 iterations, the error is 0.007141. At that point, weight is -9.879733 and bias is 4.837278

Backpropagation Example 3: Two Neurons in One Layer

Here is the next example, implemented as Example3() in the sample code. Two input neurons feed to two neurons in a single layer giving two outputs.

Let’s look at how we’d calculate the derivatives needed to train this network using the training example that when we give the network 01 as input that it should give out 10 as output.

First comes the forward pass where we calculate the network’s output when we give it 01 as input.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0*0.2+1*0.8+0.5 \\ Z0=1.3 \\ \\ O0=\sigma(1.3) \\ O0=0.7858\\ \\ Z1=input0*weight2+input0*weight3+bias1\\ Z1=0*0.6+1*0.4+0.1\\ Z1=0.5\\ \\ O1=\sigma(0.5)\\ O1=0.6225$

Next we calculate a cost. We don’t strictly need to do this step since we don’t use this value during backpropagation, but this will be useful to verify that we’ve improved things after an iteration of training.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(1,0)-(0.7858,0.6225)||^2\\ Cost=0.5*||(0.2142,-0.6225)||^2\\ Cost=0.5*0.6583^2\\ Cost=0.2167$

Now we begin the backwards pass to calculate the derivatives that we’ll need for training.

Let’s calculate dCost/dZ0 aka the error in neuron 0. We’ll do this by calculating dCost/dO0, then dO0/dZ0 and then multiplying them together to get dCost/dZ0. Just like before, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias0.

$\frac{\partial Cost}{\partial O0}=O0-target0\\ \frac{\partial Cost}{\partial O0}=0.7858-1\\ \frac{\partial Cost}{\partial O0}=-0.2142\\ \\ \frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.7858 * 0.2142\\ \frac{\partial O0}{\partial Z0} = 0.1683\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = -0.2142 * 0.1683\\ \frac{\partial Cost}{\partial Z0} = -0.0360\\ \\ \frac{\partial Cost}{\partial Bias0} = -0.0360$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1 by multiplying it by dZ0/dWeight0 and dZ0/dWeight1, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0} \\ \frac{\partial Cost}{\partial Weight0} = -0.0360 * 0 \\ \frac{\partial Cost}{\partial Weight0} = 0\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1} \\ \frac{\partial Cost}{\partial Weight1} = -0.0360 * 1 \\ \frac{\partial Cost}{\partial Weight1} = -0.0360$

Next we need to calculate dCost/dZ1 aka the error in neuron 1. We’ll do this like before. We’ll calculate dCost/dO1, then dO1/dZ1 and then multiplying them together to get dCost/dZ1. Again, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias1.

$\frac{\partial Cost}{\partial O1}=O1-target1\\ \frac{\partial Cost}{\partial O1}=0.6225-0\\ \frac{\partial Cost}{\partial O1}=0.6225\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.6225 * 0.3775\\ \frac{\partial O1}{\partial Z1} = 0.235\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.6225 * 0.235\\ \frac{\partial Cost}{\partial Z1} = 0.1463\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.1463$

Just like with neuron 0, we can use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3 by multiplying it by dZ1/dWeight2 and dZ1/dWeight2, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2} \\ \frac{\partial Cost}{\partial Weight2} = 0.1463 * 0 \\ \frac{\partial Cost}{\partial Weight2} = 0\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3} \\ \frac{\partial Cost}{\partial Weight3} = 0.1463 * 1 \\ \frac{\partial Cost}{\partial Weight3} = 0.1463$

After using these derivatives to update the weights and biases with a learning rate of 0.5, they become:
Weight0 = 0.2
Weight1 = 0.818
Weight2 = 0.6
Weight3 = 0.3269
Bias0 = 0.518
Bias1 = 0.0269

Using these values, the cost becomes 0.1943, which dropped from 0.2167, so we have indeed made progress with our learning!

Interestingly, it takes about twice as many trainings as example 1 to get a similar level of error. In this case, 20,000 iterations of learning results in an error of 0.007142.

If we have the network learn the four patterns below instead:
00 = 00
01 = 10
10 = 10
11 = 11

It takes 520,000 learning iterations to get to an error of 0.007223.

Backpropagation Example 4: Two Layers, Two Neurons Each

This is the last example, implemented as Example4() in the sample code. Two input neurons feed to two neurons in a hidden layer, feeding into two neurons in the output layer giving two outputs. This is the exact same network that is walked through on this page which is also linked to at the end of this post: A Step by Step Backpropagation Example

First comes the forward pass where we calculate the network’s output. We’ll give it 0.05 and 0.1 as input, and we’ll say our desired output is 0.01 and 0.99.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0.05*0.15+0.1*0.2+0.35 \\ Z0=0.3775 \\ \\ O0=\sigma(0.3775) \\ O0=0.5933 \\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0.05*0.25+0.1*0.3+0.35\\ Z1=0.3925\\ \\ O1=\sigma(0.3925)\\ O1=0.5969\\ \\ Z2=O0*weight4+O1*weight5+bias2\\ Z2=0.5933*0.4+0.5969*0.45+0.6\\ Z2=1.106\\ \\ O2=\sigma(1.106)\\ O2=0.7514\\ \\ Z3=O0*weight6+O1*weight7+bias3\\ Z3=0.5933*0.5+0.5969*0.55+0.6\\ Z3=1.225\\ \\ O3=\sigma(1.225)\\ O3=0.7729$

Next we calculate the cost, taking O2 and O3 as our actual output, and 0.01 and 0.99 as our target (desired) output.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(0.01,0.99)-(0.7514,0.7729)||^2\\ Cost=0.5*||(-0.7414,-0.2171)||^2\\ Cost=0.5*0.7725^2\\ Cost=0.2984$

Now we start the backward pass to calculate the derivatives for training.

Neuron 2

First we’ll calculate dCost/dZ2 aka the error in neuron 2, remembering that the value is also dCost/dBias2.

$\frac{\partial Cost}{\partial O2}=O2-target0\\ \frac{\partial Cost}{\partial O2}=0.7514-0.01\\ \frac{\partial Cost}{\partial O2}=0.7414\\ \\ \frac{\partial O2}{\partial Z2} = O2 * (1-O2)\\ \frac{\partial O2}{\partial Z2} = 0.7514 * 0.2486\\ \frac{\partial O2}{\partial Z2} = 0.1868\\ \\ \frac{\partial Cost}{\partial Z2} = \frac{\partial Cost}{\partial O2} * \frac{\partial O2}{\partial Z2}\\ \frac{\partial Cost}{\partial Z2} = 0.7414 * 0.1868\\ \frac{\partial Cost}{\partial Z2} = 0.1385\\ \\ \frac{\partial Cost}{\partial Bias2} = 0.1385$

We can use dCost/dZ2 to calculate dCost/dWeight4 and dCost/dWeight5.

$\frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight4}\\ \frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * O0\\ \frac{\partial Cost}{\partial Weight4} = 0.1385 * 0.5933\\ \frac{\partial Cost}{\partial Weight4} = 0.0822\\ \\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight5}\\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * O1\\ \frac{\partial Cost}{\partial Weight5} = 0.1385 * 0.5969\\ \frac{\partial Cost}{\partial Weight5} = 0.0827\\$

Neuron 3

Next we’ll calculate dCost/dZ3 aka the error in neuron 3, which is also dCost/dBias3.

$\frac{\partial Cost}{\partial O3}=O3-target1\\ \frac{\partial Cost}{\partial O3}=0.7729-0.99\\ \frac{\partial Cost}{\partial O3}=-0.2171\\ \\ \frac{\partial O3}{\partial Z3} = O3 * (1-O3)\\ \frac{\partial O3}{\partial Z3} = 0.7729 * 0.2271\\ \frac{\partial O3}{\partial Z3} = 0.1755\\ \\ \frac{\partial Cost}{\partial Z3} = \frac{\partial Cost}{\partial O3} * \frac{\partial O3}{\partial Z3}\\ \frac{\partial Cost}{\partial Z3} = -0.2171 * 0.1755\\ \frac{\partial Cost}{\partial Z3} = -0.0381\\ \\ \frac{\partial Cost}{\partial Bias3} = -0.0381$

We can use dCost/dZ3 to calculate dCost/dWeight6 and dCost/dWeight7.

$\frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight6}\\ \frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * O0\\ \frac{\partial Cost}{\partial Weight6} = -0.0381 * 0.5933\\ \frac{\partial Cost}{\partial Weight6} = -0.0226\\ \\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight7}\\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * O1\\ \frac{\partial Cost}{\partial Weight7} = -0.0381 * 0.5969\\ \frac{\partial Cost}{\partial Weight7} = -0.0227\\$

Neuron 0

Next, we want to calculate dCost/dO0, but doing that requires us to do something new. Neuron 0 affects both neuron 2 and neuron 3, which means that it affects the cost through those two neurons as well. That means our calculation for dCost/dO0 is going to be slightly different, where we add the derivatives of both paths together. Let’s work through it:

$\frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O0} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O0}\\ \frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * Weight4 + \frac{\partial Cost}{\partial Z3} * Weight6\\ \frac{\partial Cost}{\partial O0} = 0.1385 * 0.4 - 0.0381 * 0.5\\ \frac{\partial Cost}{\partial O0} = 0.0364$

We can then continue and calculate dCost/dZ0, which is also dCost/dBias0, and the error in neuron 0.

$\frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.5933 * 0.4067\\ \frac{\partial O0}{\partial Z0} = 0.2413\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = 0.0364 * 0.2413\\ \frac{\partial Cost}{\partial Z0} = 0.0088\\ \\ \frac{\partial Cost}{\partial Bias0} = 0.0088$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0}\\ \frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * input0\\ \frac{\partial Cost}{\partial Weight0} = 0.0088 * 0.05\\ \frac{\partial Cost}{\partial Weight0} = 0.0004\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1}\\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * input1\\ \frac{\partial Cost}{\partial Weight1} = 0.0088 * 0.1\\ \frac{\partial Cost}{\partial Weight1} = 0.0009\\$

Neuron 1

We are almost done, so hang in there. For our home stretch, we need to calculate dCost/dO1 similarly as we did for dCost/dO0, and then use that to calculate the derivatives of bias1 and weight2 and weight3.

$\frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O1} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O1}\\ \frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * Weight5 + \frac{\partial Cost}{\partial Z3} * Weight7\\ \frac{\partial Cost}{\partial O1} = 0.1385 * 0.45 - 0.0381 * 0.55\\ \frac{\partial Cost}{\partial O1} = 0.0414\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.5969 * 0.4031\\ \frac{\partial O1}{\partial Z1} = 0.2406\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.0414 * 0.2406\\ \frac{\partial Cost}{\partial Z1} = 0.01\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.01$

Lastly, we will use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2}\\ \frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * input0\\ \frac{\partial Cost}{\partial Weight2} = 0.01 * 0.05\\ \frac{\partial Cost}{\partial Weight2} = 0.0005\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3}\\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * input1\\ \frac{\partial Cost}{\partial Weight3} = 0.01 * 0.1\\ \frac{\partial Cost}{\partial Weight3} = 0.001\\$

Backpropagation Done

Phew, we have all the derivatives we need now.

Here’s our new weights and biases using a learning rate of 0.5:

Weight0 = 0.15 – (0.5 * 0.0004) = 0.1498
Weight1 = 0.2 – (0.5 * 0.0009) = 0.1996
Weight2 = 0.25 – (0.5 * 0.0005) = 0.2498
Weight3 = 0.3 – (0.5 * 0.001) = 0.2995
Weight4 = 0.4 – (0.5 * 0.0822) = 0.3589
Weight5 = 0.45 – (0.5 * 0.0827) = 0.4087
Weight6 = 0.5 – (0.5 * -0.0226) = 0.5113
Weight7 = 0.55 – (0.5 * -0.0227) = 0.5614
Bias0 = 0.35 – (0.5 * 0.0088) = 0.3456
Bias1 = 0.35 – (0.5 * 0.01) = 0.345
Bias2 = 0.6 – (0.5 * 0.1385) = 0.5308
Bias3 = 0.6 – (0.5 * -0.0381) = 0.6191

Using these new values, the cost function value drops from 0.2984 to 0.2839, so we have made progress!

Interestingly, it only takes 5,000 iterations of learning for this network to reach an error of 0.007157, when it took 10,000 iterations of learning for example 1 to get to 0.007176.

Before moving on, take a look at the weight adjustments above. You might notice that the derivatives for the weights are much smaller for weights 0,1,2,3 compared to weights 4,5,6,7. The reason for this is because weights 0,1,2,3 appear earlier in the network. The problem is that earlier layer neurons don’t learn as fast as later layer neurons and this is caused by the nature of the neuron activation functions – specifically, that the sigmoid function has a long tail near 0 and 1 – and is called the “vanishing gradient problem”. The opposite effect can also happen however, where earlier layer gradients explode to super huge numbers, so the more general term is called the “unstable gradient problem”. This is an active area of research on how to address, and this becomes more and more of a problem the more layers you have in your network.

You can use other activation functions such as tanh, identity, relu and others to try and get around this problem. If trying different activation functions, the forward pass (evaluation of a neural network) as well as the backpropagation of error pass remain the same, but of course the calculation for getting O from Z changes, and of course, calculating the derivative deltaO/deltaZ becomes different. Everything else remains the same.

Check the links at the bottom of the post for more information about this!

Sample Code

Below is the sample code which implements all the back propagation examples we worked through above.

Note that this code is meant to be readable and understandable. The code is not meant to be re-usable or highly efficient.

A more efficient implementation would use SIMD instructions, multithreading, stochastic gradient descent, and other things.

It’s also useful to note that calculating a neuron’s Z value is actually a dot product and an addition and that the addition can be handled within the dot product by adding a “fake input” to each neuron that is a constant of 1. This lets you do a dot product to calculate the Z value of a neuron, which you can take further and combine into matrix operations to calculate multiple neuron values at once. You’ll often see neural networks described in matrix notation because of this, but I have avoided that in this post to try and make things more clear to programmers who may not be as comfortable thinking in strictly matrix notation.

#include <stdio.h>
#include <array>

// Nonzero value enables csv logging.
#define LOG_TO_CSV_NUMSAMPLES() 50

// ===== Example 1 - One Neuron, One training Example =====

void Example1RunNetwork (
	float input, float desiredOutput,
	float weight, float bias,
	float& error, float& cost, float& actualOutput,
	float& deltaCost_deltaWeight, float& deltaCost_deltaBias, float& deltaCost_deltaInput
) {
	// calculate Z (weighted input) and O (activation function of weighted input) for the neuron
	float Z = input * weight + bias;
	float O = 1.0f / (1.0f + std::exp(-Z));

	// the actual output of the network is the activation of the neuron
	actualOutput = O;

	// calculate error
	error = std::abs(desiredOutput - actualOutput);

	// calculate cost
	cost = 0.5f * error * error;

	// calculate how much a change in neuron activation affects the cost function
	// deltaCost/deltaO = O - target
	float deltaCost_deltaO = O - desiredOutput;

	// calculate how much a change in neuron weighted input affects neuron activation
	// deltaO/deltaZ = O * (1 - O)
	float deltaO_deltaZ = O * (1 - O);


	// calculate how much a change in a neuron's weighted input affects the cost function.
	// This is deltaCost/deltaZ, which equals deltaCost/deltaO * deltaO/deltaZ
	// This is also deltaCost/deltaBias and is also refered to as the error of the neuron
	float neuronError = deltaCost_deltaO * deltaO_deltaZ;
	deltaCost_deltaBias = neuronError;

	// calculate how much a change in the weight affects the cost function.
	// deltaCost/deltaWeight = deltaCost/deltaO * deltaO/deltaZ * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * input
	deltaCost_deltaWeight = neuronError * input;


	// As a bonus, calculate how much a change in the input affects the cost function.
	// Follows same logic as deltaCost/deltaWeight, but deltaZ/deltaInput is the weight.
	// deltaCost/deltaInput = neuronError * weight
	deltaCost_deltaInput = neuronError * weight;
}

void Example1 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example1.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 10000;

	// training data
	// input: 1, output: 0
	const std::array<float, 2> c_trainingData = {1.0f, 0.0f};

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float error = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		// run the network to get error and derivatives
		float output = 0.0f;
		float cost = 0.0f;
		float deltaCost_deltaWeight = 0.0f;
		float deltaCost_deltaBias = 0.0f;
		float deltaCost_deltaInput = 0.0f;
		Example1RunNetwork(c_trainingData[0], c_trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, error, cost, weight, bias, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= deltaCost_deltaWeight * c_learningRate;
		bias -= deltaCost_deltaBias * c_learningRate;
	}

	printf("Example1 Final Error: %fn", error);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 2 - One Neuron, Two training Examples =====

void Example2 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example2.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 100000;

	// training data
	// input: 1, output: 0
	// input: 0, output: 1
	const std::array<std::array<float, 2>, 2> c_trainingData = { {
		{1.0f, 0.0f},
		{0.0f, 1.0f}
	} };

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		avgError = 0.0f;
		float avgOutput = 0.0f;
		float avgCost = 0.0f;
		float avgDeltaCost_deltaWeight = 0.0f;
		float avgDeltaCost_deltaBias = 0.0f;
		float avgDeltaCost_deltaInput = 0.0f;

		// run the network to get error and derivatives for each training example
		for (const std::array<float, 2>& trainingData : c_trainingData)
		{
			float error = 0.0f;
			float output = 0.0f;
			float cost = 0.0f;
			float deltaCost_deltaWeight = 0.0f;
			float deltaCost_deltaBias = 0.0f;
			float deltaCost_deltaInput = 0.0f;
			Example1RunNetwork(trainingData[0], trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

			avgError += error;
			avgOutput += output;
			avgCost += cost;
			avgDeltaCost_deltaWeight += deltaCost_deltaWeight;
			avgDeltaCost_deltaBias += deltaCost_deltaBias;
			avgDeltaCost_deltaInput += deltaCost_deltaInput;
		}

		avgError /= (float)c_trainingData.size();
		avgOutput /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		avgDeltaCost_deltaWeight /= (float)c_trainingData.size();
		avgDeltaCost_deltaBias /= (float)c_trainingData.size();
		avgDeltaCost_deltaInput /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, avgError, avgCost, weight, bias, avgDeltaCost_deltaWeight, avgDeltaCost_deltaBias, avgDeltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= avgDeltaCost_deltaWeight * c_learningRate;
		bias -= avgDeltaCost_deltaBias * c_learningRate;
	}

	printf("Example2 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 3 - Two inputs, two neurons in one layer =====

struct SExample3Training
{
	std::array<float, 2> m_input;
	std::array<float, 2> m_output;
};

void Example3RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 4>& weights, const std::array<float, 2>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 4>& deltaCost_deltaWeights, std::array<float, 2>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {

	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// the actual output of the network is the activation of the neurons
	actualOutput[0] = O0;
	actualOutput[1] = O1;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = O0 - target0
	float deltaCost_deltaO0 = O0 - desiredOutput[0];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = O1 - target1
	float deltaCost_deltaO1 = O1 - desiredOutput[1];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO0/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO/deltaZ1 * deltaZ0/deltaWeight1
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ/deltaWeight1
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example3 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example3.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 520000;

	// training data: OR/AND
	// input: 00, output: 00
	// input: 01, output: 10
	// input: 10, output: 10
	// input: 11, output: 11
	const std::array<SExample3Training, 4> c_trainingData = { {
		{{0.0f, 0.0f}, {0.0f, 0.0f}},
		{{0.0f, 1.0f}, {1.0f, 0.0f}},
		{{1.0f, 0.0f}, {1.0f, 0.0f}},
		{{1.0f, 1.0f}, {1.0f, 1.0f}},
	} };

	// starting weight and bias values
	std::array<float, 4> weights = { 0.2f, 0.8f, 0.6f, 0.4f };
	std::array<float, 2> biases = { 0.5f, 0.1f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		//float avgCost = 0.0f;
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaBiases = { 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 4> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaBiases = { 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example3RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example3 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 4 - Two layers with two neurons in each layer =====

void Example4RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 8>& weights, const std::array<float, 4>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 8>& deltaCost_deltaWeights, std::array<float, 4>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {
	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// calculate Z2 and O2 for neuron2
	float Z2 = O0 * weights[4] + O1 * weights[5] + biases[2];
	float O2 = 1.0f / (1.0f + std::exp(-Z2));

	// calculate Z3 and O3 for neuron3
	float Z3 = O0 * weights[6] + O1 * weights[7] + biases[3];
	float O3 = 1.0f / (1.0f + std::exp(-Z3));

	// the actual output of the network is the activation of the output layer neurons
	actualOutput[0] = O2;
	actualOutput[1] = O3;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 2 -----

	// calculate how much a change in neuron 2 activation affects the cost function
	// deltaCost/deltaO2 = O2 - target0
	float deltaCost_deltaO2 = O2 - desiredOutput[0];

	// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
	// deltaO2/deltaZ2 = O2 * (1 - O2)
	float deltaO2_deltaZ2 = O2 * (1 - O2);

	// calculate how much a change in neuron 2 weighted input affects the cost function.
	// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
	// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
	float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
	deltaCost_deltaBiases[2] = neuron2Error;

	// calculate how much a change in weight4 affects the cost function.
	// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * O0
	// similar thing for weight5
	deltaCost_deltaWeights[4] = neuron2Error * O0;
	deltaCost_deltaWeights[5] = neuron2Error * O1;

	//----- Neuron 3 -----

	// calculate how much a change in neuron 3 activation affects the cost function
	// deltaCost/deltaO3 = O3 - target1
	float deltaCost_deltaO3 = O3 - desiredOutput[1];

	// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
	// deltaO3/deltaZ3 = O3 * (1 - O3)
	float deltaO3_deltaZ3 = O3 * (1 - O3);

	// calculate how much a change in neuron 3 weighted input affects the cost function.
	// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
	// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
	float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
	deltaCost_deltaBiases[3] = neuron3Error;

	// calculate how much a change in weight6 affects the cost function.
	// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * O0
	// similar thing for weight7
	deltaCost_deltaWeights[6] = neuron3Error * O0;
	deltaCost_deltaWeights[7] = neuron3Error * O1;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
	// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
	float deltaCost_deltaO0 = neuron2Error * weights[4] + neuron3Error * weights[6];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
	// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
	float deltaCost_deltaO1 = neuron2Error * weights[5] + neuron3Error * weights[7];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO1/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example4 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example4.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 5000;

	// training data: 0.05, 0.1 in = 0.01, 0.99 out
	const std::array<SExample3Training, 1> c_trainingData = { {
		{{0.05f, 0.1f}, {0.01f, 0.99f}},
	} };

	// starting weight and bias values
	std::array<float, 8> weights = { 0.15f, 0.2f, 0.25f, 0.3f, 0.4f, 0.45f, 0.5f, 0.55f};
	std::array<float, 4> biases = { 0.35f, 0.35f, 0.6f, 0.6f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 8> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 8> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 4> deltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example4RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example4 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

int main (int argc, char **argv)
{
	Example1();
	Example2();
	Example3();
	Example4();
	system("pause");
	return 0;
}

Closing & Links

The sample code outputs csv files showing how the values of the networks change over time. One of the reasons for this is because I want to show you error over time.

Below is example 4’s error over time, as we do it’s 5,000 learning iterations.

The other examples show a similarly shaped graph, where there is a lot of learning in the very beginning, and then there is a very long tail of learning very slowly.

When you train neural networks as I’ve described them, you will almost always see this, and sometimes will also see a slow learning time at the BEGINNING of the training.

This issue is also due to the activation function used, just like the unstable gradient problem, and is also an active area of research.

To help fix this issue, there is something called a “cross entropy cost function” which you can use instead of the mean squared error cost function I have been using.

That cost function essentially cancels out the non linearity of the activation function so that you get nicer linear learning progress, and can get networks to learn more quickly and evenly. However, it only cancels out the non linearity for the LAST layer in the network. This means it’s still a problem for networks that have more layers.

Lastly, there is an entirely different thing you can use backpropagation for. We adjusted the weights and biases to get our desired output for the desired inputs. What if instead we adjusted our inputs to give us the desired outputs?

You can do that by using backpropagation to calculate the dCost/dInput derivatives and using those to adjust the input, in the exact same way we adjusted the weights and biases.

You can use this to do some interesting things, including:

finding images that a network will recognize as a familiar object, that a human wouldn’t. Start with static as input to the network, and adjust inputs to give the desired output.
Modifying images that a network recognizes, into images it doesn’t recognize, but a human would. Start with a well recognized image, and adjust inputs using gradient ASCENT (add the derivatives, don’t subtract them) until the network stops recognizing it.

Believe it or not, this is how all those creepy “deep dream” images were made that came out of google as well, like the one below.

Now that you know the basics, you are ready to learn some more if you are interested. If you still have some questions about things I did or didn’t talk about, these resources might help you make sense of it too. I used these resources and they were all very helpful! You can also give me a shout in the comments below, or on twitter at @Atrix256.

A Step by Step Backpropagation Example
Neural Networks and Deep Learning
Backpropogation is Just Steepest Descent with Automatic Differentiation
Chain Rule
Deep Vis

Multivariable Dual Numbers & Automatic Differentiation

In a previous post I showed how to use dual numbers to be able to get both the value and derivative of a function at the same time:
Dual Numbers & Automatic Differentiation

That post mentions that you can extend it to multivariable functions but doesn’t explain how. This post is that explanation, including simple working C++ code!

Extending this to multivariable functions is useful for ray marching, calculating analytical surface normals and also likely useful for training a neural network if you want an alternative to back propagation. I’m not sure about the efficiency comparison of this versus back propagation but I intend on looking into it (:

How Does it Work?

It turns out to be really simple to use dual numbers with multivariable functions. The end result is that you want a partial derivative for each variable in the equation, so to do that, you just have a dual number per variable, and process the entire equation for each of those dual numbers separately.

We’ll work through an example. Let’s find the partial derivatives of x and y of the function $3x^2-2y^3$ , at input (5,2).

We’ll start by finding the derivative of x, and then the derivative of y.

Example: df/dx

We start by making a dual number for our x value, remembering that the real part is the actual value for x, and the dual part is the derivative of x, which is 1:

$5+1\epsilon$

or:

$5+\epsilon$

We multiply that value by itself to get the $x^2$ value, keeping in mind that $\epsilon^2$ is zero:
$(5+\epsilon)*(5+\epsilon)= \\ 25+10\epsilon+\epsilon^2= \\ 25+10\epsilon \\$

Next we need to multiply that by 3 to get the $3x^2$ term:

$3*(25+10\epsilon) = 75+30\epsilon$

Putting that aside for a moment, we need to make the $2y^3$ term. We start by making our y value:

$2+0\epsilon$

or:

$2$

If you are wondering why it has a zero for the epsilon term, it’s because when we are calculating the partial derivative of x, y is a constant, so has a derivative of zero.

Next, we multiply this y value by itself twice to get the $y^3$ value:

$2*2*2=8$

We then multiply it by 2 to get the $2y^3$ term:

$8*2=16$

Now that we have our two terms, we subtract the y term from the x term to get our final result:

$75+30\epsilon-16 = \\ 59+30\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30.

That checks out, let’s move on to the derivative of y!

Example: df/dy

Calculating the derivative of y is very similar to calculating the derivative of x, except that it’s the x term that has an epsilon value (derivative) of 0, instead of the y term. The y term has the epsilon value of 1 this time as well. We’ll work through it to see how it plays out.

First up, we need to make the value for x:

$5+0\epsilon$

or:

$5$

Next we square it and multiply it by 3 to get the $3x^2$ term:

$5*5*3=75$

Next we need to make the value for y, remembering that we use an epsilon value of 1 since the derivative of y is 1 this time around.

$2+\epsilon$

We cube that value and multiply by 2 to get the $2y^3$ term:
$2*(2+\epsilon)*(2+\epsilon)*(2+\epsilon)= \\ 2*(2+\epsilon)*(4+4\epsilon+\epsilon^2)= \\ 2*(2+\epsilon)*(4+4\epsilon)= \\ 2*(8+12\epsilon+4\epsilon^2)= \\ 2*(8+12\epsilon)= \\ 16+24\epsilon$

Now we subtract the y term from the x term to get the final result:

$75 - (16+24\epsilon)= \\ 59-24\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of y at that point is -24.

That also checks out, so we got the correct value and partial derivatives for the equation.

Reducing Redundancy

There was quite a bit of redundancy when working through the x and y derivatives wasn’t there? Increasing the number of variables will increase the amount of redundancy too, so it doesn’t scale up well.

Luckily there is a way to address this. Basically, instead of making two dual numbers which have two items, you make them share the real value (since it’s the same for both, as is the work to make it) and append the dual values for x and y to it.

Thus, instead of having:

$x'=(a+b\epsilon) \\ y'=(a+b\epsilon)$

You have:

$(a+b\epsilon_x+c\epsilon_y)$

Then, in your math or in your program, you treat it as if it’s two different dual numbers packed into one. This lets you do the work for the real number once instead of twice, but still lets you do your dual number work for each variable independently.

While it’s probably easiest to think of these as two dual numbers packed into one value, there is actually a mathematical basis for it as well, which may or may not surprise you.

Check out what happens when we multiply two of these together, keeping in mind that multiplying ANY two epsilon values together becomes zero, even if they are different epsilons:

$(a+b\epsilon_x+c\epsilon_y) * (d+e\epsilon_x+f\epsilon_y)= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + be\epsilon_x^2 + bf\epsilon_x\epsilon_y + cd\epsilon_y + ce\epsilon_x\epsilon_y + cf\epsilon_y^2= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + cd\epsilon_y= \\ ad + (ae+bd)\epsilon_x + (af+cd)\epsilon_y$

The interesting thing is that the above result gives you the same values as if you did the same work for two dual numbers individually.

Let’s see this three component dual number in action by re-doing the example again. Note that this pattern scales up to ANY number of variables!

Example: Both Derivatives (Gradient Vector)

Our goal is to calculate the value and partial derivatives of the function $3x^2-2y^3$ at location (5,2).

First we make our x value:

$5 + 1\epsilon_x + 0\epsilon_y$

or:

$5 + \epsilon_x$

We square that and multiply it by 3 to get our $3x^2$ term:

$3*(5 + \epsilon_x)*(5 + \epsilon_x)= \\ 3*(25+10\epsilon_x+\epsilon_x^2)= \\ 3*(25+10\epsilon_x)= \\ 75+30\epsilon_x$

Next, we make our y value:

$2 + 0\epsilon_x + 1\epsilon_y$

or:

$2 + \epsilon_y$

We cube it and multiply it by 2 to get our $2x^3$ term:

$16+24\epsilon_y$

Lastly we subtract the y term from the x term to get our final answer:

$(75+30\epsilon_x) - (16+24\epsilon_y)= \\ 59+30\epsilon_x-24\epsilon_y$

The result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30, and the derivative of y at that point is -24.

Neat, right?!

Example Code

Here is the example code output:

Here is the code that generated it:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>
 
#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

	// constructor to make a constant
	CDualNumber (float f = 0.0f) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
	}

	// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
	CDualNumber (float f, size_t variableIndex) {
		m_real = f;
		std::fill(m_dual.begin(), m_dual.end(), 0.0f);
		m_dual[variableIndex] = 1.0f;
	}

	// storage for real and dual values
	float							m_real;
	std::array<float, NUMVARIABLES> m_dual;
};
 
//----------------------------------------------------------------------
// Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real + b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
	return ret;
}
 
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real - b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real * b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = a.m_real / b.m_real;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	float sqrtReal = sqrt(a.m_real);
	ret.m_real = sqrtReal;
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = pow(a.m_real, y);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = sin(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = cos(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = tan(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
	return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
	CDualNumber<NUMVARIABLES> ret;
	ret.m_real = tan(a.m_real);
	for (size_t i = 0; i < NUMVARIABLES; ++i)
		ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
	return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
	return x * x * (T(3.0f) - T(2.0f) * x);
}
 
//----------------------------------------------------------------------
// Test Functions
//----------------------------------------------------------------------
 
void TestSmoothStep (float input)
{
	// create a dual number as the value of x
	CDualNumber<1> x(input, 0);

	// calculate value and derivative using dual numbers
	CDualNumber<1> y = SmoothStep(x);

	// calculate numeric derivative using central differences
	float derivNumeric = (SmoothStep(input + EPSILON) - SmoothStep(input - EPSILON)) / (2.0f * EPSILON);

	// calculate actual derivative
	float derivActual = 6.0f * input - 6.0f * input * input;

	// show value and derivatives
	printf("(smoothstep) y=3x^2-2x^3  (x=%0.4f)n", input);
	printf("  y = %0.4fn", y.m_real);
	printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
	printf("  actual dy/dx = %0.4fn", derivActual);
	printf("  numeric dy/dx = %0.4fnn", derivNumeric);
}

void TestTrig (float input)
{
	// create a dual number as the value of x
	CDualNumber<1> x(input, 0);

	// sin
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = sin(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (sin(input + EPSILON) - sin(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = cos(input);

		// show value and derivatives
		printf("sin(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// cos
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = cos(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (cos(input + EPSILON) - cos(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = -sin(input);

		// show value and derivatives
		printf("cos(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// tan
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = tan(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (tan(input + EPSILON) - tan(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = 1.0f / (cos(input)*cos(input));

		// show value and derivatives
		printf("tan(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// atan
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = atan(x);

		// calculate numeric derivative using central differences
		float derivNumeric = (atan(input + EPSILON) - atan(input - EPSILON)) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = 1.0f / (1.0f + input * input);

		// show value and derivatives
		printf("atan(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}
}

void TestSimple (float input)
{
	// create a dual number as the value of x
	CDualNumber<1> x(input, 0);

	// sqrt
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = CDualNumber<1>(3.0f) / sqrt(x);

		// calculate numeric derivative using central differences
		float derivNumeric = ((3.0f / sqrt(input + EPSILON)) - (3.0f / sqrt(input - EPSILON))) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = -3.0f / (2.0f * pow(input, 3.0f / 2.0f));

		// show value and derivatives
		printf("3/sqrt(%0.4f) = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}

	// pow
	{
		// calculate value and derivative using dual numbers
		CDualNumber<1> y = pow(x + CDualNumber<1>(1.0f), 1.337f);

		// calculate numeric derivative using central differences
		float derivNumeric = ((pow(input + 1.0f + EPSILON, 1.337f)) - (pow(input + 1.0f - EPSILON, 1.337f))) / (2.0f * EPSILON);

		// calculate actual derivative
		float derivActual = 1.337f * pow(input + 1.0f, 0.337f);

		// show value and derivatives
		printf("(%0.4f+1)^1.337 = %0.4fn", input, y.m_real);
		printf("  dual# dy/dx = %0.4fn", y.m_dual[0]);
		printf("  actual dy/dx = %0.4fn", derivActual);
		printf("  numeric dy/dx = %0.4fnn", derivNumeric);
	}
}

void Test2D (float inputx, float inputy)
{
	// create dual numbers as the value of x and y
	CDualNumber<2> x(inputx, 0);
	CDualNumber<2> y(inputy, 1);

	// z = 3x^2 - 2y^3
	{
		// calculate value and partial derivatives using dual numbers
		CDualNumber<2> z = CDualNumber<2>(3.0f) * x * x - CDualNumber<2>(2.0f) * y * y * y;

		// calculate numeric partial derivatives using central differences
		auto f = [] (float x, float y) {
			return 3.0f * x * x - 2.0f * y * y * y;
		};
		float derivNumericX = (f(inputx + EPSILON, inputy) - f(inputx - EPSILON, inputy)) / (2.0f * EPSILON);
		float derivNumericY = (f(inputx, inputy + EPSILON) - f(inputx, inputy - EPSILON)) / (2.0f * EPSILON);

		// calculate actual partial derivatives
		float derivActualX = 6.0f * inputx;
		float derivActualY = -6.0f * inputy * inputy;

		// show value and derivatives
		printf("z=3x^2-2y^3 (x = %0.4f, y = %0.4f)n", inputx, inputy);
		printf("  z = %0.4fn", z.m_real);
		printf("  dual# dz/dx = %0.4fn", z.m_dual[0]);
		printf("  dual# dz/dy = %0.4fn", z.m_dual[1]);
		printf("  actual dz/dx = %0.4fn", derivActualX);
		printf("  actual dz/dy = %0.4fn", derivActualY);
		printf("  numeric dz/dx = %0.4fn", derivNumericX);
		printf("  numeric dz/dy = %0.4fnn", derivNumericY);
	}
}

void Test3D (float inputx, float inputy, float inputz)
{
	// create dual numbers as the value of x and y
	CDualNumber<3> x(inputx, 0);
	CDualNumber<3> y(inputy, 1);
	CDualNumber<3> z(inputz, 2);

	// w = sin(x*cos(2*y)) / tan(z)
	{
		// calculate value and partial derivatives using dual numbers
		CDualNumber<3> w = sin(x * cos(CDualNumber<3>(2.0f)*y)) / tan(z);

		// calculate numeric partial derivatives using central differences
		auto f = [] (float x, float y, float z) {
			return sin(x*cos(2.0f*y)) / tan(z);
		};
		float derivNumericX = (f(inputx + EPSILON, inputy, inputz) - f(inputx - EPSILON, inputy, inputz)) / (2.0f * EPSILON);
		float derivNumericY = (f(inputx, inputy + EPSILON, inputz) - f(inputx, inputy - EPSILON, inputz)) / (2.0f * EPSILON);
		float derivNumericZ = (f(inputx, inputy, inputz + EPSILON) - f(inputx, inputy, inputz - EPSILON)) / (2.0f * EPSILON);

		// calculate actual partial derivatives
		float derivActualX = cos(inputx*cos(2.0f*inputy))*cos(2.0f * inputy) / tan(inputz);
		float derivActualY = cos(inputx*cos(2.0f*inputy)) *-2.0f*inputx*sin(2.0f*inputy) / tan(inputz);
		float derivActualZ = sin(inputx * cos(2.0f * inputy)) / -(sin(inputz) * sin(inputz));

		// show value and derivatives
		printf("w=sin(x*cos(2*y))/tan(z) (x = %0.4f, y = %0.4f, z = %0.4f)n", inputx, inputy, inputz);
		printf("  w = %0.4fn", w.m_real);
		printf("  dual# dw/dx = %0.4fn", w.m_dual[0]);
		printf("  dual# dw/dy = %0.4fn", w.m_dual[1]);
		printf("  dual# dw/dz = %0.4fn", w.m_dual[2]);
		printf("  actual dw/dx = %0.4fn", derivActualX);
		printf("  actual dw/dy = %0.4fn", derivActualY);
		printf("  actual dw/dz = %0.4fn", derivActualZ);
		printf("  numeric dw/dx = %0.4fn", derivNumericX);
		printf("  numeric dw/dy = %0.4fn", derivNumericY);
		printf("  numeric dw/dz = %0.4fnn", derivNumericZ);
	}
}

int main (int argc, char **argv)
{
	TestSmoothStep(0.5f);
	TestSmoothStep(0.75f);
	TestTrig(PI * 0.25f);
	TestSimple(3.0f);
	Test2D(1.5f, 3.28f);
	Test3D(7.12f, 8.93f, 12.01f);
    return 0;
}

Closing

One of the neatest things about dual numbers is that they give precise results. They are not approximations and they are not numerical methods, unlike the central differences method that I compared them to in the example program (More info on numerical derivatives here: Finite Differences). Using dual numbers gives you exact derivatives, within the limitations of (eg) floating point math.

It turns out that backpropagation (the method that is commonly used to train neural networks) is just steepest gradient descent. You can read about that here: Backpropogation is Just Steepest Descent with Automatic Differentiation

That makes me wonder how dual numbers would do in run time speed compared to back propagation as well as numerical methods for getting the gradient to adjust a neural network during training.

If I had to guess, I’d say that dual numbers may be slightly slower than backpropagation, but not as slow as numerical methods which are going to be much, much slower. We’ll see though. It may be much easier to implement neural network learning using dual numbers compared to backpropagation, so may be worth an exploration and write up, even if only to make neural networks a little bit more accessible to people.

Comments, corrections, etc? Let me know in the comments below, or contact me on twitter at @Atrix256

Incremental Least Squares Surface and Hyper-Volume Fitting

The last post showed how to fit a $y=f(x)$ equation to a set of 2d data points, using least squares fitting. It allowed you to do this getting only one data point at a time, and still come up with the same answer as if you had all the data points at once, so it was an incremental, or “online” algorithm.

This post generalizes that process to equations of any dimension such as $z=f(x,y)$ , $w=f(x,y,z)$ or greater.

Below is an image of a surface that is degree (2,2). This is a screenshot taken from the interactive webgl2 demo I made for this post: Least Squares Surface Fitting

How Do You Do It?

The two main players from the last post were the ATA matrix and the ATY vector. These two could be incrementally updated as new data points came in, which would allow you to do an incremental (or “online”) least squares fit of a curve.

When working with surfaces and volumes, you have the same things basically. Both the ATA matrix and the ATY vector still exist, but they contain slightly different information – which I’ll explain lower down. However, the ATY vector is renamed, since in the case of a surface it should be called ATZ, and for a volume it should be called ATW. I call it ATV to generalize it, where v stands for value, and represents the last component in a data point, which is the output value we are trying to fit given the input values. The input values are the rest of the components of the data point.

At the end, you once again need to solve the equation $A^TA*c=A^Tv$ to calculate the coefficients (named c) of the equation.

It’s all pretty similar to fitting a curve, but the details change a bit. Let’s work through an example to see how it differs.

Example: Bilinear Surface Fitting

Let’s fit 4 data points to a bilinear surface, otherwise known as a surface that is linear on each axis, or a surface of degree(1,1):
(0,0,5)
(0,1,3)
(1,0,8)
(1,1,2)

Since we are fitting those data points with a bilinear surface, we are looking for a function that takes in x,y values and gives as output the z coordinate. We want a surface that gives us the closest answer possible (minimizing the sum of the squared difference for each input data point) for the data points we do have, so that we can give it other data points and get z values as output that approximate what we expect to see for that input.

A linear equation looks like this, with coefficients A and B:
$y=Ax+B$

Since we want a bilinear equation this time around, this is the equation we are going to end up with, after solving for the coefficients A,B,C,D:
$y=Axy+Bx+Cy+D$

The first step is to make the A matrix. In the last post, this matrix was made up of powers of the x coordinates. In this post, they are actually going to be made up of the permutation of powers of the x and y coordinates.

Last time the matrix looked like this:
$A = \begin{bmatrix} x_0^0 & x_0^1 & x_0^2 \\ x_1^0 & x_1^1 & x_1^2 \\ x_2^0 & x_2^1 & x_2^2 \\ x_3^0 & x_3^1 & x_3^2 \\ \end{bmatrix}$

This time, the matrix is going to look like this:
$A = \begin{bmatrix} x_0^0y_0^0 & x_0^0y_0^1 & x_0^1y_0^0 & x_0^1y_0^1 \\ x_1^0y_1^0 & x_1^0y_1^1 & x_1^1y_1^0 & x_1^1y_1^1 \\ x_2^0y_2^0 & x_2^0y_2^1 & x_2^1y_2^0 & x_2^1y_2^1 \\ x_3^0y_3^0 & x_3^0y_3^1 & x_3^1y_3^0 & x_3^1y_3^1 \\ \end{bmatrix}$

Simplifying that matrix a bit, it looks like this:
$A = \begin{bmatrix} 1 & y_0 & x_0 & x_0y_0 \\ 1 & y_1 & x_1 & x_1y_1 \\ 1 & y_2 & x_2 & x_2y_2 \\ 1 & y_3 & x_3 & x_3y_3 \\ \end{bmatrix}$

To simplify it even further, there is one row in the A matrix per data point, where the row looks like this:
$\begin{bmatrix} 1 & y & x & xy \\ \end{bmatrix}$

You can see that every permutation of the powers of x and y for each data point is present in the matrix.

The A matrix for our data points is this:
$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix}$

Next we need to calculate the ATA matrix by multiplying the transpose of that matrix, by that matrix.

$A^TA = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} * \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix} = \begin{bmatrix} 4 & 2 & 2 & 1 \\ 2 & 2 & 1 & 1 \\ 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix}$

Taking the inverse of that matrix we get this:

$(A^TA)^{-1} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \\ \end{bmatrix}$

Next we need to calculate the ATV vector (formerly known as ATY). We calculate that by multiplying the transpose of the A matrix by the Z values:

$A^TV = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} * \begin{bmatrix} 5 \\ 3 \\ 8 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 18 \\ 5 \\ 10 \\ 2 \\ \end{bmatrix}$

Lastly we multiply the inversed ATA matrix by the ATV vector to get our coefficients.

$\begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \\ \end{bmatrix} * \begin{bmatrix} 18 \\ 5 \\ 10 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 5 \\ -2 \\ 3 \\ -4 \\ \end{bmatrix}$

In the last post, the coefficients we got out were in x power order, so the first (top) was for the $x^0$ term, the next was for the $x^1$ term etc.

This time around, the coefficients are in the same order as the permutations of the powers of x and y:
$\begin{bmatrix} 1 & y & x & xy \\ \end{bmatrix}$

That makes our final equation this:
$z = -4xy+3x-2y+5$

If you plug in the (x,y) values from the data set we fit, you’ll see that you get the corresponding z values as output. We perfectly fit the data set!

The process isn’t too different from last post and not too difficult either right?

Let’s see if we can generalize and formalize things a bit.

Some Observations

Firstly you may be wondering how we come up with the correct permutation of powers of our inputs. It actually doesn’t matter so long as you are consistent. You can have your A matrix rows have the powers in any order, so long as all orders are present, and you use the same order in all operations.

Regarding storage sizes needed, the storage of surfaces and (hyper) volumes are a bit different and generally larger than curves.

To see how, let’s look at the powers of the ATA matrix of a bilinear surface, using the ordering of powers that we used in the example:

$\begin{bmatrix} x^0y^0 & x^0y^1 & x^1y^0 & x^1y^1 \\ x^0y^1 & x^0y^2 & x^1y^1 & x^1y^2 \\ x^1y^0 & x^1y^1 & x^2y^0 & x^2y^1 \\ x^1y^1 & x^1y^2 & x^2y^1 & x^2y^2 \\ \end{bmatrix}$

Let’s rewrite it as just the powers:

$\begin{bmatrix} 00 & 01 & 10 & 11 \\ 01 & 02 & 11 & 12 \\ 10 & 11 & 20 & 21 \\ 11 & 12 & 21 & 22 \\ \end{bmatrix}$

And the permutation we used as just powers to help us find the pattern in the powers of x and y in the ATA matrix:
$\begin{bmatrix} 00 & 01 & 10 & 11 \\ \end{bmatrix}$

Can you find the pattern of the powers used at the different spots in the ATA matrix?

I had to stare at it for a while before I figured it out but it’s this: For the i,j location in the ATA matrix, the powers of x and y are the powers of x and y in the i permutation added to the powers of x and y in the j permutation.

For example, $A^TA_{0,2}$ has xy powers of 10. Permutation 0 has powers of 0,0 and permutation 2 has powers of 1,0, so we add those together to get powers 1,0.

Another example, $A^TA_{2,3}$ has xy powers of 21. Permutation 2 has powers of 1,0 and permutation 3 has powers 1,1. Adding those together we get 2,1 which is correct.

That’s a bit more complex than last post, not too much more difficult to construct the ATA matrix directly – and also construct it incrementally as new data points come in!

How many unique values are there in the ATA matrix though? We need to know this to know how much storage we need.

In the last post, we needed (degree+1)*2–1 values to store the unique ATA matrix values. That can also be written as degree*2+1.

It turns out that when generalizing this to surfaces and volumes, that we need to take the product of that for each axis.

For instance, a surface has ((degreeX*2)+1)*((degreeY*2)+1) unique values. A volume has ((degreeX*2)+1)*((degreeY*2)+1)*((degreeZ*2)+1) unique values.

The pattern continues for higher dimensions, as well as lower, since you can see how in the curve case, it’s the same formula as it was before.

For the same ATA matrix size, a surface has more unique values than a curve does.

As far as what those values actually are, they are the full permutations of the powers of a surface (or hyper volume) that is one degree higher on each axis. For a bilinear surface, that means the 9 permutations of x and y for powers 0,1 and 2:
$x^0y^0,x^0y^1,x^0y^2,x^1y^0,x^1y^1,x^1y^2,x^2y^0,x^2y^1,x^2y^2$
Or simplified:
$1,y,y^2,x,xy,xy^2,x^2,x^2y,x^2y^2$

What about the ATV vector?

For the bilinear case, The ATV vector is the sums of the permutations of x,y multiplied by z, for every data point. In other words, you add this to ATV for each data point:
$\begin{bmatrix} z & yz & xz & xyz \\ \end{bmatrix}$

How much storage space do we need in general for the ATV vector then? it’s the product of (degree+1) for each axis.

For instance, a surface has (degreeX+1)*(degreeY+1) values in ATV, and a volume has (degreeX+1)*(degreeY+1)*(degreeZ+1).

You may also be wondering how many data points are required minimum to fit a curve, surface or hypervolume to a data set. The answer is that you need as many data points as there are terms in the polynomial. We are trying to solve for the polynomial coefficients, so there are as many unknowns as there are polynomial terms.

How many polynomial terms are there? There are as many terms as there are permutations of the axes powers involved. In other words, the size of ATV is also the minimum number of points you need to fit a curve, surface, or hypervolume to a data set.

Measuring Quality of a Fit

You are probably wondering if there’s a way to calculate how good of a fit you have for a given data set. It turns out that there are a few ways to calculate a value for this.

The value I use in the code below and in the demos is called $R^2$ or residue squared.

First you calculate the average (mean) output value from the input data set.

Then you calculate SSTot which is the sum of the square of the mean subtracted from each input point’s output value. Pseudo code:

SSTot = 0;
for (point p in points)
  SSTot += (p.out - mean)^2;

You then calculate SSRes which is the sum of the square of the fitted function evaluated at a point, subtracted from each input points’ output value. Pseudo code:

SSRes= 0;
for (point p in points)
  SSRes += (p.out - f(p.in))^2;

The final value for R^2 is 1-SSRes/SSTot.

The value is nice because it’s unitless, and since SSRes and SSTot is a sum of squares, SSRes/SSTot is basically the value that the fitting algorithm minimizes. The value is subtracted from 1 so that it’s a fit quality metric. A value of 0 is a bad fit, and a value of 1 is a good fit and generally it will be between those values.

If you want to read more about this, check out this link: Coefficient of Determination

Example Code

Here is a run from the sample code:

And here is the source code:

#include <stdio.h>
#include <array>

#define FILTER_ZERO_COEFFICIENTS true // if false, will show terms which have a coefficient of 0

//====================================================================
template<size_t N>
using TVector = std::array<float, N>;

template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

//====================================================================
// Specify a degree per axis.
// 1 = linear, 2 = quadratic, etc
template <size_t... DEGREES>
class COnlineLeastSquaresFitter
{
public:
    COnlineLeastSquaresFitter ()
    {
        // initialize our sums to zero
        std::fill(m_SummedPowers.begin(), m_SummedPowers.end(), 0.0f);
        std::fill(m_SummedPowersTimesValues.begin(), m_SummedPowersTimesValues.end(), 0.0f);
    }

	// Calculate how many summed powers we need.
	// Product of degree*2+1 for each axis.
	template <class T>
	constexpr static size_t NumSummedPowers(T degree)
	{
		return degree * 2 + 1;
	}
	template <class T, class... DEGREES>
	constexpr static size_t NumSummedPowers(T first, DEGREES... degrees)
	{
		return NumSummedPowers(first) * NumSummedPowers(degrees...);
	}

	// Calculate how many coefficients we have for our equation.
	// Product of degree+1 for each axis.
	template <class T>
	constexpr static size_t NumCoefficients(T degree)
	{
		return (degree + 1);
	}
	template <class T, class... DEGREES>
	constexpr static size_t NumCoefficients(T first, DEGREES... degrees)
	{
		return NumCoefficients(first) * NumCoefficients(degrees...);
	}

	// Helper function to get degree of specific axis
	static size_t Degree (size_t axisIndex)
	{
		static const std::array<size_t, c_dimension-1> c_degrees = { DEGREES... };
		return c_degrees[axisIndex];
	}
	
	// static const values
	static const size_t c_dimension = sizeof...(DEGREES) + 1; 
	static const size_t c_numCoefficients = NumCoefficients(DEGREES...);
	static const size_t c_numSummedPowers = NumSummedPowers(DEGREES...);

	// Typedefs
	typedef TVector<c_numCoefficients> TCoefficients;
	typedef TVector<c_dimension> TDataPoint;

	// Function for converting from an index to a specific power permutation
	static void IndexToPowers (size_t index, std::array<size_t, c_dimension-1>& powers, size_t maxDegreeMultiply, size_t maxDegreeAdd)
	{
		for (int i = c_dimension-2; i >= 0; --i)
		{
			size_t degree = Degree(i) * maxDegreeMultiply + maxDegreeAdd;
			powers[i] = index % degree;
			index = index / degree;
		}
	}

	// Function for converting from a specific power permuation back into an index
	static size_t PowersToIndex (std::array<size_t, c_dimension - 1>& powers, size_t maxDegreeMultiply, size_t maxDegreeAdd)
	{
		size_t ret = 0;
		for (int i = 0; i < c_dimension - 1; ++i)
		{
			ret *= Degree(i) * maxDegreeMultiply + maxDegreeAdd;
			ret += powers[i];
		}
		return ret;
	}

	// Add a datapoint to our fitting
    void AddDataPoint (const TDataPoint& dataPoint)
    {
		// Note: It'd be a good idea to memoize the powers and calculate them through repeated
		// multiplication, instead of calculating them on demand each time, using std::pow.

        // add the summed powers of the input values
		std::array<size_t, c_dimension-1> powers;
        for (size_t i = 0; i < m_SummedPowers.size(); ++i)
        {
			IndexToPowers(i, powers, 2, 1);
			float valueAdd = 1.0;
			for (size_t j = 0; j < c_dimension - 1; ++j)
				valueAdd *= (float)std::pow(dataPoint[j], powers[j]);
			m_SummedPowers[i] += valueAdd;
        }

        // add the summed powers of the input value, multiplied by the output value
        for (size_t i = 0; i < m_SummedPowersTimesValues.size(); ++i)
        {
			IndexToPowers(i, powers, 1, 1);
			float valueAdd = dataPoint[c_dimension - 1];
			for (size_t j = 0; j < c_dimension-1; ++j)
				valueAdd *= (float)std::pow(dataPoint[j], powers[j]);
			m_SummedPowersTimesValues[i] += valueAdd;
        }
    }

	// Get the coefficients of the equation fit to the points
    bool CalculateCoefficients (TCoefficients& coefficients) const
    {
		// make the ATA matrix
		std::array<size_t, c_dimension - 1> powersi;
		std::array<size_t, c_dimension - 1> powersj;
		std::array<size_t, c_dimension - 1> summedPowers;
		TMatrix<c_numCoefficients, c_numCoefficients> ATA;
		for (size_t j = 0; j < c_numCoefficients; ++j)
		{
			IndexToPowers(j, powersj, 1, 1);

			for (size_t i = 0; i < c_numCoefficients; ++i)
			{
				IndexToPowers(i, powersi, 1, 1);

				for (size_t k = 0; k < c_dimension - 1; ++k)
					summedPowers[k] = powersi[k] + powersj[k];

				size_t summedPowersIndex = PowersToIndex(summedPowers, 2, 1);
				ATA[j][i] = m_SummedPowers[summedPowersIndex];
			}
		}

		// solve: ATA * coefficients = m_SummedPowers
		// for the coefficients vector, using Gaussian elimination.
		coefficients = m_SummedPowersTimesValues;
		for (size_t i = 0; i < c_numCoefficients; ++i)
		{
			for (size_t j = 0; j < c_numCoefficients; ++j)
			{
				if (ATA[i][i] == 0.0f)
					return false;

				float c = ((i == j) - ATA[j][i]) / ATA[i][i];
				coefficients[j] += c*coefficients[i];
				for (size_t k = 0; k < c_numCoefficients; ++k)
					ATA[j][k] += c*ATA[i][k];
			}
		}

		// Note: this is the old, "bad" way to solve the equation using matrix inversion.
		// It's a worse choice for larger matrices, and surfaces and volumes use larger matrices than curves in general.
		/*
		// Inverse the ATA matrix
		TMatrix<c_numCoefficients, c_numCoefficients> ATAInverse;
		if (!InvertMatrix(ATA, ATAInverse))
			return false;

		// calculate the coefficients
		for (size_t i = 0; i < c_numCoefficients; ++i)
			coefficients[i] = DotProduct(ATAInverse[i], m_SummedPowersTimesValues);
		*/

		return true;
    }

private:
	//Storage Requirements:
	// Summed Powers = Product of degree*2+1 for each axis.
	// Summed Powers Times Values = Product of degree+1 for each axis.
    TVector<c_numSummedPowers>		m_SummedPowers;
	TVector<c_numCoefficients>		m_SummedPowersTimesValues;
};

//====================================================================
char AxisIndexToLetter (size_t axisIndex)
{
	// x,y,z,w,v,u,t,....
	if (axisIndex < 3)
		return 'x' + char(axisIndex);
	else
		return 'x' + 2 - char(axisIndex);
}

//====================================================================
template <class T, size_t M, size_t N>
float EvaluateFunction (const T& fitter, const TVector<M>& dataPoint, const TVector<N>& coefficients)
{
	float ret = 0.0f;
	for (size_t i = 0; i < coefficients.size(); ++i)
	{
		// start with the coefficient
		float term = coefficients[i];

		// then the powers of the input variables
		std::array<size_t, T::c_dimension - 1> powers;
		fitter.IndexToPowers(i, powers, 1, 1);
		for (size_t j = 0; j < powers.size(); ++j)
			term *= (float)std::pow(dataPoint[j], powers[j]);

		// add this term to our return value
		ret += term;
	}
	return ret;
}

//====================================================================
template <size_t... DEGREES>
void DoTest (const std::initializer_list<TVector<sizeof...(DEGREES)+1>>& data)
{
	// say what we are are going to do
	printf("Fitting a function of degree (");
	for (size_t i = 0; i < COnlineLeastSquaresFitter<DEGREES...>::c_dimension - 1; ++i)
	{
		if (i > 0)
			printf(",");
		printf("%zi", COnlineLeastSquaresFitter<DEGREES...>::Degree(i));
	}
	printf(") to %zi data points: n", data.size());

	// show input data points
	for (const COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
	{
		printf("  (");
		for (size_t i = 0; i < dataPoint.size(); ++i)
		{
			if (i > 0)
				printf(", ");
			printf("%0.2f", dataPoint[i]);
		}
		printf(")n");
	}

	// fit data
	COnlineLeastSquaresFitter<DEGREES...> fitter;
    for (const COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
        fitter.AddDataPoint(dataPoint);

	// calculate coefficients if we can
	COnlineLeastSquaresFitter<DEGREES...>::TCoefficients coefficients;
	bool success = fitter.CalculateCoefficients(coefficients);
	if (!success)
	{
		printf("Could not calculate coefficients!nn");
		return;
	}

	// print the polynomial
	bool firstTerm = true;
	printf("%c = ", AxisIndexToLetter(sizeof...(DEGREES)));
    bool showedATerm = false;
	for (int i = (int)coefficients.size() - 1; i >= 0; --i)
	{
		// don't show zero terms
		if (FILTER_ZERO_COEFFICIENTS && std::abs(coefficients[i]) < 0.00001f)
			continue;

        showedATerm = true;

		// show an add or subtract between terms
		float coefficient = coefficients[i];
		if (firstTerm)
			firstTerm = false;
		else if (coefficient >= 0.0f)
			printf(" + ");
		else
		{
			coefficient *= -1.0f;
			printf(" - ");
		}

		printf("%0.2f", coefficient);

		std::array<size_t, COnlineLeastSquaresFitter<DEGREES...>::c_dimension - 1> powers;
		fitter.IndexToPowers(i, powers, 1, 1);

		for (size_t j = 0; j < powers.size(); ++j)
		{
			if (powers[j] > 0)
				printf("%c", AxisIndexToLetter(j));
			if (powers[j] > 1)
				printf("^%zi", powers[j]);
		}
	}
    if (!showedATerm)
        printf("0");
	printf("n");

	// Calculate and show R^2 value.
	float rSquared = 1.0f;
	if (data.size() > 0)
	{
		float mean = 0.0f;
		for (const COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
			mean += dataPoint[sizeof...(DEGREES)];
		mean /= data.size();
		float SSTot = 0.0f;
		float SSRes = 0.0f;
		for (const COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
		{
			float value = dataPoint[sizeof...(DEGREES)] - mean;
			SSTot += value*value;

			value = dataPoint[sizeof...(DEGREES)] - EvaluateFunction(fitter, dataPoint, coefficients);
			SSRes += value*value;
		}
		if (SSTot != 0.0f)
			rSquared = 1.0f - SSRes / SSTot;
	}
	printf("R^2 = %0.4fnn", rSquared);
}

//====================================================================
int main (int argc, char **argv)
{
	// bilinear - 4 data points
	DoTest<1, 1>(
		{
			TVector<3>{ 0.0f, 0.0f, 5.0f },
			TVector<3>{ 0.0f, 1.0f, 3.0f },
			TVector<3>{ 1.0f, 0.0f, 8.0f },
			TVector<3>{ 1.0f, 1.0f, 2.0f },
		}
	);

	// biquadratic - 9 data points
	DoTest<2, 2>(
		{
			TVector<3>{ 0.0f, 0.0f, 8.0f },
			TVector<3>{ 0.0f, 1.0f, 4.0f },
			TVector<3>{ 0.0f, 2.0f, 6.0f },
			TVector<3>{ 1.0f, 0.0f, 5.0f },
			TVector<3>{ 1.0f, 1.0f, 2.0f },
			TVector<3>{ 1.0f, 2.0f, 1.0f },
			TVector<3>{ 2.0f, 0.0f, 7.0f },
			TVector<3>{ 2.0f, 1.0f, 9.0f },
			TVector<3>{ 2.0f, 2.5f, 12.0f },
		}
	);

	// trilinear - 8 data points
	DoTest<1, 1, 1>(
		{
			TVector<4>{ 0.0f, 0.0f, 0.0f, 8.0f },
			TVector<4>{ 0.0f, 0.0f, 1.0f, 4.0f },
			TVector<4>{ 0.0f, 1.0f, 0.0f, 6.0f },
			TVector<4>{ 0.0f, 1.0f, 1.0f, 5.0f },
			TVector<4>{ 1.0f, 0.0f, 0.0f, 2.0f },
			TVector<4>{ 1.0f, 0.0f, 1.0f, 1.0f },
			TVector<4>{ 1.0f, 1.0f, 0.0f, 7.0f },
			TVector<4>{ 1.0f, 1.0f, 1.0f, 9.0f },
		}
	);

	// trilinear - 9 data points
	DoTest<1, 1, 1>(
		{
			TVector<4>{ 0.0f, 0.0f, 0.0f, 8.0f },
			TVector<4>{ 0.0f, 0.0f, 1.0f, 4.0f },
			TVector<4>{ 0.0f, 1.0f, 0.0f, 6.0f },
			TVector<4>{ 0.0f, 1.0f, 1.0f, 5.0f },
			TVector<4>{ 1.0f, 0.0f, 0.0f, 2.0f },
			TVector<4>{ 1.0f, 0.0f, 1.0f, 1.0f },
			TVector<4>{ 1.0f, 1.0f, 0.0f, 7.0f },
			TVector<4>{ 1.0f, 1.0f, 1.0f, 9.0f },
			TVector<4>{ 0.5f, 0.5f, 0.5f, 12.0f },
		}
	);

	// Linear - 2 data points
    DoTest<1>(
        {
            TVector<2>{ 1.0f, 2.0f },
            TVector<2>{ 2.0f, 4.0f },
        }
    );

	// Quadratic - 4 data points
    DoTest<2>(
        {
            TVector<2>{ 1.0f, 5.0f },
			TVector<2>{ 2.0f, 16.0f },
			TVector<2>{ 3.0f, 31.0f },
			TVector<2>{ 4.0f, 16.0f },
        }
    );

	// Cubic - 4 data points
    DoTest<3>(
        {
            TVector<2>{ 1.0f, 5.0f },
            TVector<2>{ 2.0f, 16.0f },
			TVector<2>{ 3.0f, 31.0f },
			TVector<2>{ 4.0f, 16.0f },
        }
    );

    system("pause");
    return 0;
}

Closing

The next logical step here for me would be to figure out how to break the equation for a surface or hypervolume up into multiple equations, like you’d have with a tensor product surface/hypervolume equation. It would also be interesting to see how to convert from these multidimensional polynomials to multidimensional Bernstein basis functions, which are otherwise known as Bezier rectangles (and Bezier hypercubes i guess).

The last post inverted the ATA matrix and multiplied by ATY to get the coefficients. Thanks to some feedback on reddit, I found out that is NOT how you want to solve this sort of equation. I ended up going with Gaussian elimination for this post which is more numerically robust while also being less computation to calculate. There are other options out there too that may be even better choices. I’ve found out that in general, if you are inverting a matrix in code, or even just using an inverted matrix that has already been given to you, you are probably doing it wrong. You can read more about that here: John D. Cook: Don’t invert that matrix.

I didn’t go over what to do if you don’t have enough data points because if you find yourself in that situation, you can either decrease the degree of one of the axes, or you could remove and axis completely if you wanted to. It’s situational and ambiguous what parameter to decrease when you don’t have enough data points to fit a specific curve or hypervolume, but it’s still possible to decay the fit to a lower degree or dimension if you hit this situation, because you will already have all the values you need in the ATA matrix values and the ATV vector. I leave that to you to decide how to handle it in your own usage cases. Something interesting to note is that ATA[0][0] is STILL the count of how many data points you have, so you can use this value to know how much you need to decay your fit to be able to handle the data set.

In the WebGL2 demo I mention, I use a finite difference method to calculate the normals of the surface, however since the surface is described by a polynomial, it’d be trivial to calculate the coefficients for the equations that described the partial derivatives of the surface for each axis and use those instead.

I also wanted to mention that in the case of surfaces and hypervolumes it’s still possible to get an imperfect fit to your data set, even though you may give the exact minimum required number of control points. The reason for this is that not all axes are necesarily created equal. If you have a surface of degree (1,2) it’s linear on the x axis, but quadratic on the y axis, and requires a minimum of 6 data points to be able to fit a data set. As you can imagine, it’s possible to give data points such that the data points ARE NOT LINEAR on the x axis. When this happens, the surface will not be a perfect fit.

Lastly, you may be wondering how to fit data sets where there is more than one output value, like an equation of the form $(z,w)=f(x,y)$ .

I’m not aware of any ways to least square fit that as a whole, but apparently a common practice is to fit one equation to z and another to w and treat them independently. There is a math stack exchange question about that here: Math Stack Exchange: Least square fitting multiple values

Here is the webgl demo that goes with this post again:
Least Squares Surface Fitting

Thanks for reading, and please speak up if you have anything to add or correct, or a comment to make!

Incremental Least Squares Curve Fitting

This Post In Short:

Fit a curve of degree N to a data set, getting data points 1 at a time.
Storage Required: 3*N+2 values.
Update Complexity: roughly 3*N+2 additions and multiplies.
Finalize Complexity: Solving Ax=b where A is an (N+1)x(N+1) matrix and b is a known vector. (Sample code inverts A matrix and multiplies by b, Gaussian elimination is better though).
Simple C++ code and HTML5 demo at bottom!

I was recently reading a post from a buddy on OIT or “Order Independent Transparency” which is an open problem in graphics:
Fourier series based OIT and why it won’t work

In the article he talks about trying to approximate a function per pixel and shows the details of some methods he tried. One of the difficulties with the problem is that during a render you can get any number of triangles affecting a specific pixel, but you need a fixed and bounded size amount of storage per pixel for those variable numbers of data points.

That made me wonder: Is there an algorithm that can approximate a data set with a function, getting only one data point at a time, and end up with a decent approximation?

It turns out that there is one, at least one that I am happy with: Incremental Least Squares Curve Fitting.

While this perhaps doesn’t address all the problems that need addressing for OIT specifically, I think this is a great technique for programming in general, and I’m betting it still has it’s uses in graphics, for other times when you want to approximate a data set per pixel.

We’ll work through a math oriented way to do it, and then we’ll convert it into an equivalent and simpler programmer friendly version.

At the bottom of the post is some simple C++ that implements everything we talk about and the image below is a screenshot of an an interactive HTML5 demo I made: Least Squares Curve Fitting

Mathy Version

I found out about this technique by asking on math stack exchange and got a great (if not mathematically dense!) answer:
Math Stack Exchange: Creating a function incrementally

I have to admit, I’m not so great with matrices outside of the typical graphics/gamedev usage cases of transormation and related, so it took me a few days to work through it and understand it all. If reading that answer made your eyes go blurry, give my explanation a shot. I’m hoping I gave step by step details enough such that you too can understand what the heck he was talking about. If not, let me know where you got lost and I can explain better and update the post.

The first thing we need to do is figure out what degree of a function we want to approximate our data with. For our example we’ll pick a degree 2 function, also known as a quadratic function. That means that when we are done we will get out a function of the form below:

$y=ax^2+bx+c$

We will give data points to the equation and it will calculate the values of a,b and c that approximate our function by minimizing the sum of the squared distance from each point to the curve.

We’ll deal with regular least squared fitting before moving onto incremental, so here’s the data set we’ll be fitting our quadratic curve to:

$(1,5),(2,16),(3,31),(4,50)$

The x values in my data set start at 1 and count up by 1, but that is not a requirement. You can use whatever x and y values you want to fit a curve to.

Next we need to calculate the matrix $A$ , where $A_{jk} = x_j^k$ and the matrix has NumDataPoints rows and Degree+1 columns. It looks like the below for a quadratic curve fitting 4 data points:

$A = \begin{bmatrix} x_0^0 & x_0^1 & x_0^2 \\ x_1^0 & x_1^1 & x_1^2 \\ x_2^0 & x_2^1 & x_2^2 \\ x_3^0 & x_3^1 & x_3^2 \\ \end{bmatrix}$

When we plug in our specific x values we get this:

$A = \begin{bmatrix} 1^0 & 1^1 & 1^2 \\ 2^0 & 2^1 & 2^2 \\ 3^0 & 3^1 & 3^2 \\ 4^0 & 4^1 & 4^2 \\ \end{bmatrix}$

Calculating it out we get this:

$A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ \end{bmatrix}$

Next we need to calculate the matrix $A^TA$ , which we do below by multiplying the transpose of A by A:

$A^TA = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \\ 1 & 4 & 9 & 16 \\ \end{bmatrix} * \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ \end{bmatrix} = \begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \\ \end{bmatrix}$

Next we need to find the inverse of that matrix to get $(A^TA)^{-1}$ . The inverse is:

$(A^TA)^{-1} = \begin{bmatrix} 31/4 & -27/4 & 5/4 \\ -27/4 & 129/20 & -5/4 \\ 5/4 & -5/4 & 1/4 \\ \end{bmatrix}$

The next thing we need to calculate is $A^TY$ , which is the transpose of A multiplied by all of the Y values of our data:

$A^TY = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \\ 1 & 4 & 9 & 16 \\ \end{bmatrix} * \begin{bmatrix} 5 \\ 16 \\ 31 \\ 50 \\ \end{bmatrix} = \begin{bmatrix} 102 & 330 & 1148 \\ \end{bmatrix}$

And finally, to calculate the coefficients of our quadratic function, we need to calculate $(A^TA)^{-1}*A^TY$ :

$(A^TA)^{-1}*A^TY = \begin{bmatrix} 31/4 & -27/4 & 5/4 \\ -27/4 & 129/20 & -5/4 \\ 5/4 & -5/4 & 1/4 \\ \end{bmatrix} * \begin{bmatrix} 102 \\ 330 \\ 1148 \\ \end{bmatrix} = \begin{bmatrix} -2 & 5 & 2 \\ \end{bmatrix}$

Those coefficients are listed in power order of x, so the first value -2 is the coefficient for x^0, 5 is the coefficient for x^1 and 2 is the coefficient for x^2. That gives us the equation:

$y=2x^2+5x-2$

If you plug in the x values from our data set, you’ll find that this curve perfectly fits all 4 data points.

It won’t always be (and usually won’t be) that a resulting curve matches the input set for all values. It just so happened that this time it does. The only guarantee you’ll get when fitting a curve to the data points is that the squared distance of the point to the curve (distance on the Y axis only, so vertical distance), is minimized for all data points.

Now that we’ve worked through the math, let’s make some observations and make it more programmer friendly.

Making it Programmer Friendly

Let’s look at the $A^TA$ matrix again:

$\begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \\ \end{bmatrix}$

One thing you probably noticed right away is that it’s symmetric across the diagonal. Another thing you may have noticed is that there are only 5 unique values in that matrix.

As it turns out, those 5 values are just the sum of the x values, when those x values are raised to increasing powers.

If you take all x values of our data set, raise them to the 0th power and sum the results, you get 4.
If you take all x values of our data set, raise them to the 1st power and sum the results, you get 10.
If you take all x values of our data set, raise them to the 2nd power and sum the results, you get 30.
If you take all x values of our data set, raise them to the 3rd power and sum the results, you get 100.
If you take all x values of our data set, raise them to the 4th power and sum the results, you get 354.

Further more, the power of the x values in each part of the matrix is the zero based x axis index plus the zero based y axis index. Check out what i mean below, which shows which power the x values are taken to before being summed for each location in the matrix:

$\begin{bmatrix} 0 & 1 & 2 \\ 1 & 2 & 3 \\ 2 & 3 & 4 \\ \end{bmatrix}$

That is interesting for two reasons…

This tells us that we only really need to store the 5 unique values, and that we can reconstruct the full matrix later when it’s time to calculate the coefficients.
It also tells us that if we’ve fit a curve to some data points, but then want to add a new data point, that we can just raise the x value of our new data point to the different powers and add it into these 5 values we already have stored. In other words, the $A^TA$ matrix can be incrementally adjusted as new data comes in.

This generalizes beyond quadratic functions too luckily. If you are fitting your data points with a degree N curve, the $A^TA$ matrix will have N+1 rows, and N+1 columns, but will only have (N+1)*2-1 unique values stored in it. Those values will be the sum of the x values taken from the 0th power up to the (N+1)*2-2th power.

As a concrete example, a cubic fit will have an $A^TA$ array that is 4×4, which will only have 7 unique values stored in it. Those values will be the x values raised to the 0th power and summed, all the way up to the x values raised to the 6th power and summed.

So, the $A^TA$ matrix has a fixed storage amount of (degree+1)*2 – 1 values, and it can be incrementally updated.

That is great, but there is another value we need to look at too, which is the $A^TY$ vector. Let’s see that again:

$\begin{bmatrix} 102 & 330 & 1148 \\ \end{bmatrix}$

There are some patterns to this vector too luckily. You may have noticed that the first entry is the sum of the Y values from our data set. It’s actually the sum of the y values multiplied by the x values raised to the 0th power.

The next number is the sum of the y values multiplied by the x values raised to the 1st power, and so on.

To generalize it, each entry in that vector is the sum of taking the x from each data point, raising it to the power that is the index in the vector, and multiplying it by the y value.

Taking each data point’s x value, raising it to the 0th power, multiplying by the y value, and summing the results gives you 102.
Taking each data point’s x value, raising it to the 1st power, multiplying by the y value, and summing the results gives you 330.
Taking each data point’s x value, raising it to the 2nd power, multiplying by the y value, and summing the results gives you 1148.

So, this vector is incrementally updatable too. When you get a new data point, for each entry in the vector, you take the x value to the specific power, multiply by y, and add that result to the entry in the vector.

This generalizes for other curve types as well. If you are fitting your data points with a degree N curve, the $A^TY$ vector will have N+1 entries, corresponding to the powers: 0,1,…N.

As a concrete example, a cubic fit will have an $A^TY$ vector of size 4, corresponding to the powers: 0,1,2,3.

Combining the storage needs of the values needed for the $A^TA$ matrix, as well as the values needed for the $A^TY$ vector, the amount of storage we need for a degree N curve fit is 3*N+2 values.

Algorithm Summarized

Here is a summary of the algorithm:

First decide on the degree of the fit you want. Call it N.
Ensure you have storage space for 3*N+2 values and initialize them all to zero. These represent the (N+1)*2-1 values needed for the $A^TA$ matrix values, as well as the N+1 values needed for the $A^TY$ vector.
For each data point you get, you will need to update both the matrix values, as well as the vector valuess. (Note that there are not the same number of values in ATA and ATY!)
- for(i in ATA) ATA[i] += x^i
- for(i in ATY) ATY[i] += x^i*y
When it’s time to calculate the coefficients of your polynomial, convert the ATA values back into the $A^TA$ matrix, invert it and multiply that by the $A^TY$ value.

Pretty simple right?

Not Having Enough Points

When working through the mathier version of this algorithm, you may have noticed that if we were trying to fit using a degree N curve, that we needed N+1 data points at minimum for the math to even be able to happen.

So, you might ask, what happens in the real world, such as in a pixel shader, where we decide to do a cubic fit, but end up only getting 1 data point, instead of the 4 minimum that we need?

Well, first off, if you use the programmer friendly method of incrementally updating ATA and ATY, you’ll end up with an uninvertible matrix (0 determinant), but that doesn’t really help us any besides telling us when we don’t have enough data.

There is something pretty awesome hiding here though. Let’s look at the ATA matrix and ATY values from our quadratic example again.

$A^TA = \begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \\ \end{bmatrix}$

$A^TY = \begin{bmatrix} 102 & 330 & 1148 \\ \end{bmatrix}$

The above values are for a quadratic fit. What if we wanted a linear fit instead? Well… the upper left 2×2 matrix in ATA is the ATA matrix for the linear fit! Also, the first two values in the ATY vector is the ATY vector if we were doing a linear fit.

$A^TA = \begin{bmatrix} 4 & 10 \\ 10 & 30 \\ \end{bmatrix}$

$A^TY = \begin{bmatrix} 102 & 330 \\ \end{bmatrix}$

You can verify that the linear fit above is correct if you want, but let’s take it down another degree, down to approximating the fit with a point. They become scalars instead of matrices and vectors:

$A^TA = 4 \\ A^TY = 102$

If we take the inverse of ATA and multiply it by ATY, we get:

$1/4 * 102 = 25.5$

if you average the Y values of our input data, you’ll find that it is indeed 25.5, so we have verified that it does give us a degree 0 fit.

This is neat and all, but how can we KNOW if we’ve collected enough data or not? Do we just try to invert our ATA matrix, and if it fails, try one degree lower, repeatedly, until we succeed or fail at a degree 0 approximation? Do we maybe instead store a counter to keep track of how many points we have seen?

Luckily no, and maybe you have already put it together. The first value in the ATA array actually TELLS you how many points you have been given. You can use that to decide what degree you are going to have to actually fit the data set to when it’s time to calculate your coefficients, to avoid the uninvertible matrix and still get your data fit.

Interesting Tid Bits

Something pretty awesome about this algorithm is that it can work in a multithreaded fashion very easily. One way would be to break apart the work into multiple job threads, have them calculate ATA and ATY independently, and then sum them all together on the main thread. Another way to do it would be to let all threads share the same ATA and ATY storage, but to use an atomic add operation to update them.

Going the atomic add route, I think this could be a relatively GPU friendly algorithm. You could use actual atomic operations in your shader code, or you could use alpha blending to add your result in.

Even though we saw it in the last section, I’ll mention it again. If you do a degree 0 curve fit to data points (aka fitting a point to the data), this algorithm is mathematically equivalent to just taking the average y value. The ATA values will have a single value which is the sum of the x values to the 0th degree, so will be the count of how many x items there are. The ATY values will also have only a single value, which will be the sum of the x^0*y values, so will be the sum of the y values. Taking the inverse of our 1×1 ATA matrix will give us one divided by how many items there are, so when we multiply that by the ATA vector which only has one item, it will be the same as if we divided our Y value sum by how many data points we had. So, in a way, this algorithm seems to be some sort of generalization of averaging, which is weird to me.

Another cool thing: if you have the minimum number of data points for your degree (aka degree+1 data points) or fewer, you can actually use the ATA and ATY values to get back your ORIGINAL data points – both the X and the Y values! I’ll leave it as an exercise for you, but if you look at it, you will always have more equations than you do unknowns.

If reconstructing the original data points is important to you, you could also have this algorithm operate in two modes.

Basically, always use the ATA[0] storage space to count the number of data points you’ve been given, since that is it’s entire purpose. You can then use the rest of the storage space as RAW data storage for your 2d input values. As soon as adding another value would cause you to overflow your storage, you could process your data points into the correct format of storing just ATA and ATY values, so that from then on, it was an approximation, instead of explicit point storage.

When decoding those values, you would use the ATA[0] storage space to know whether the rest of the storage contained ATA and ATY values, or if they contained data points. If they contained data points, you would also know how many there were, and where they were in the storage space, using the same logic to read data points out as you used to put them back in – basically like saying that the first data point goes immediately after ATA[0], the second data point after that, etc.

The last neat thing, let’s say that you are in a pixel shader as an exmaple, and that you wanted to approximate 2 different values for each pixel, but let’s say that the X value was always the same for these data sets – maybe you are approximating two different values over depth of the pixel for instance so X of both data points is the depth, but the Y values of the data points are different.

If you find yourself in a situation like this, you don’t actually need to pay the full cost of storage to have a second approximation curve.

Since the ATA values are all based on powers of the x values only, the ATA values would be the same for both of these data sets. You only need to pay the cost of the ATY values for the second curve.

This means that fitting a curve costs an initial 3*degree+2 in storage, but each additional curve only costs degree+1 in storage.

Also, since the ATA storage for a curve of degree N also contains the same values used for a curve of degree N-1, N-2, etc, you don’t have to use the same degree when approximating multiple values using the same storage. Your ATA just has to be large enough to hold the largest degree curve, and then you can have ATY values that are sized to the degree of the curve you want to use to approximate each data set.

This way, if you have limited storage, you could perhaps cubically fit one data set, and then linearly fit another data set where accuracy isn’t as important.

For that example, you would pay 11 values of storage for the cubic fit, and then only 2 more values of storage to have a linear fit of some other data.

Example Code

There is some example code below that implements the ideas from this post.

The code is meant to be clear and readable firstly, with being a reasonably decent implementation second. If you are using this in a place where you want high precision and/or high speeds, there are likely both macro and micro optimizations and code changes to be made. The biggest of these is probably how the matrix is inverted.

You can read more on the reddit discussion: Reddit: Incremental Least Squares Curve Fitting

Here’s a run of the example code:

Here is the example code:

#include <stdio.h>
#include <array>

//====================================================================
template<size_t N>
using TVector = std::array<float, N>;

template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

template<size_t N>
using TSquareMatrix = TMatrix<N,N>;

typedef TVector<2> TDataPoint; 

//====================================================================
template <size_t N>
float DotProduct (const TVector<N>& A, const TVector<N>& B)
{
    float ret = 0.0f;
    for (size_t i = 0; i < N; ++i)
        ret += A[i] * B[i];
    return ret;
}

//====================================================================
template <size_t M, size_t N>
void TransposeMatrix (const TMatrix<M, N>& in, TMatrix<N, M>& result)
{
    for (size_t j = 0; j < M; ++j)
        for (size_t k = 0; k < N; ++k)
            result[k][j] = in[j][k];
}

//====================================================================
template <size_t M, size_t N>
void MinorMatrix (const TMatrix<M, N>& in, TMatrix<M-1, N-1>& out, size_t excludeI, size_t excludeJ)
{
    size_t destI = 0;
    for (size_t i = 0; i < N; ++i)
    {
        if (i != excludeI)
        {
            size_t destJ = 0;
            for (size_t j = 0; j < N; ++j)
            {
                if (j != excludeJ)
                {
                    out[destI][destJ] = in[i][j];
                    ++destJ;
                }
            }
            ++destI;
        }
    }
}

//====================================================================
template <size_t M, size_t N>
float Determinant (const TMatrix<M,N>& in)
{
    float determinant = 0.0f;
	TMatrix<M - 1, N - 1> minor;
    for (size_t j = 0; j < N; ++j)
    {
        MinorMatrix(in, minor, 0, j);

        float minorDeterminant = Determinant(minor);
        if (j % 2 == 1)
            minorDeterminant *= -1.0f;

        determinant += in[0][j] * minorDeterminant;
    }
    return determinant;
}

//====================================================================
template <>
float Determinant<1> (const TMatrix<1,1>& in)
{
	return in[0][0];
}

//====================================================================
template <size_t N>
bool InvertMatrix (const TSquareMatrix<N>& in, TSquareMatrix<N>& out)
{
    // calculate the cofactor matrix and determinant
    float determinant = 0.0f;
    TSquareMatrix<N> cofactors;
    TSquareMatrix<N-1> minor;
    for (size_t i = 0; i < N; ++i)
    {
        for (size_t j = 0; j < N; ++j)
        {
            MinorMatrix(in, minor, i, j);

            cofactors[i][j] = Determinant(minor);
            if ((i + j) % 2 == 1)
                cofactors[i][j] *= -1.0f;

            if (i == 0)
                determinant += in[i][j] * cofactors[i][j];
        }
    }

    // matrix cant be inverted if determinant is zero
    if (determinant == 0.0f)
        return false;

    // calculate the adjoint matrix into the out matrix
    TransposeMatrix(cofactors, out);

    // divide by determinant
    float oneOverDeterminant = 1.0f / determinant;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            out[i][j] *= oneOverDeterminant;
    return true;
}

//====================================================================
template <>
bool InvertMatrix<2> (const TSquareMatrix<2>& in, TSquareMatrix<2>& out)
{
    float determinant = Determinant(in);
    if (determinant == 0.0f)
        return false;

    float oneOverDeterminant = 1.0f / determinant;
    out[0][0] =  in[1][1] * oneOverDeterminant;
    out[0][1] = -in[0][1] * oneOverDeterminant;
    out[1][0] = -in[1][0] * oneOverDeterminant;
    out[1][1] =  in[0][0] * oneOverDeterminant;
    return true;
}

//====================================================================
template <size_t DEGREE>  // 1 = linear, 2 = quadratic, etc
class COnlineLeastSquaresFitter
{
public:
    COnlineLeastSquaresFitter ()
    {
        // initialize our sums to zero
        std::fill(m_SummedPowersX.begin(), m_SummedPowersX.end(), 0.0f);
        std::fill(m_SummedPowersXTimesY.begin(), m_SummedPowersXTimesY.end(), 0.0f);
    }

    void AddDataPoint (const TDataPoint& dataPoint)
    {
        // add the summed powers of the x value
        float xpow = 1.0f;
        for (size_t i = 0; i < m_SummedPowersX.size(); ++i)
        {
            m_SummedPowersX[i] += xpow;
            xpow *= dataPoint[0];
        }

        // add the summed powers of the x value, multiplied by the y value
        xpow = 1.0f;
        for (size_t i = 0; i < m_SummedPowersXTimesY.size(); ++i)
        {
            m_SummedPowersXTimesY[i] += xpow * dataPoint[1];
            xpow *= dataPoint[0];
        }
    }

    bool CalculateCoefficients (TVector<DEGREE+1>& coefficients) const
    {
        // initialize all coefficients to zero
        std::fill(coefficients.begin(), coefficients.end(), 0.0f);

        // calculate the coefficients
        return CalculateCoefficientsInternal<DEGREE>(coefficients);
    }

private:

    template <size_t EFFECTIVEDEGREE>
    bool CalculateCoefficientsInternal (TVector<DEGREE + 1>& coefficients) const
    {
        // if we don't have enough data points for this degree, try one degree less
        if (m_SummedPowersX[0] <= EFFECTIVEDEGREE)
            return CalculateCoefficientsInternal<EFFECTIVEDEGREE - 1>(coefficients);

        // Make the ATA matrix
        TMatrix<EFFECTIVEDEGREE + 1, EFFECTIVEDEGREE + 1> ATA;
        for (size_t i = 0; i < EFFECTIVEDEGREE + 1; ++i)
            for (size_t j = 0; j < EFFECTIVEDEGREE + 1; ++j)
                ATA[i][j] = m_SummedPowersX[i + j];

        // calculate inverse of ATA matrix
        TMatrix<EFFECTIVEDEGREE + 1, EFFECTIVEDEGREE + 1> ATAInverse;
        if (!InvertMatrix(ATA, ATAInverse))
            return false;

        // calculate the coefficients for this degree. The higher ones are already zeroed out.
        TVector<EFFECTIVEDEGREE + 1> summedPowersXTimesY;
        std::copy(m_SummedPowersXTimesY.begin(), m_SummedPowersXTimesY.begin() + EFFECTIVEDEGREE + 1, summedPowersXTimesY.begin());
        for (size_t i = 0; i < EFFECTIVEDEGREE + 1; ++i)
            coefficients[i] = DotProduct(ATAInverse[i], summedPowersXTimesY);
        return true;
    }

    // Base case when no points are given, or if you are fitting a degree 0 curve to the data set.
    template <>
    bool CalculateCoefficientsInternal<0> (TVector<DEGREE + 1>& coefficients) const
    {
        if (m_SummedPowersX[0] > 0.0f)
            coefficients[0] = m_SummedPowersXTimesY[0] / m_SummedPowersX[0];
        return true;
    }

    // Total storage space (# of floats) needed is 3 * DEGREE + 2
    // Where y is number of values that need to be stored and x is the degree of the polynomial
    TVector<(DEGREE + 1) * 2 - 1>   m_SummedPowersX;
    TVector<DEGREE + 1>             m_SummedPowersXTimesY;
};

//====================================================================
template <size_t DEGREE>
void DoTest(const std::initializer_list<TDataPoint>& data)
{
	printf("Fitting a curve of degree %zi to %zi data points:n", DEGREE, data.size());

    COnlineLeastSquaresFitter<DEGREE> fitter;

	// show data
    for (const TDataPoint& dataPoint : data)
		printf("  (%0.2f, %0.2f)n", dataPoint[0], dataPoint[1]);

	// fit data
    for (const TDataPoint& dataPoint : data)
        fitter.AddDataPoint(dataPoint);

	// calculate coefficients if we can
	TVector<DEGREE+1> coefficients;
	bool success = fitter.CalculateCoefficients(coefficients);
	if (!success)
	{
		printf("ATA Matrix could not be inverted!n");
		return;
	}

	// print the polynomial
	bool firstTerm = true;
	printf("y = ");
    bool showedATerm = false;
	for (int i = (int)coefficients.size() - 1; i >= 0; --i)
	{
		// don't show zero terms
		if (std::abs(coefficients[i]) < 0.00001f)
			continue;

        showedATerm = true;

		// show an add or subtract between terms
		float coefficient = coefficients[i];
		if (firstTerm)
			firstTerm = false;
		else if (coefficient >= 0.0f)
			printf(" + ");
		else
		{
			coefficient *= -1.0f;
			printf(" - ");
		}

		printf("%0.2f", coefficient);

		if (i > 0)
			printf("x");

		if (i > 1)
			printf("^%i", i);
	}
    if (!showedATerm)
        printf("0");
	printf("nn");
}

//====================================================================
int main (int argc, char **argv)
{
	// Point - 1 data points
    DoTest<0>(
        {
            TDataPoint{ 1.0f, 2.0f },
        }
    );

	// Point - 2 data points
    DoTest<0>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
        }
    );

	// Linear - 2 data points
    DoTest<1>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
        }
    );

	// Linear - 3 colinear data points
    DoTest<1>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
            TDataPoint{ 3.0f, 6.0f },
        }
    );

	// Linear - 3 non colinear data points
    DoTest<1>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
            TDataPoint{ 3.0f, 5.0f },
        }
    );

	// Quadratic - 3 colinear data points
    DoTest<2>(
        {
            TDataPoint{ 1.0f, 2.0f },
            TDataPoint{ 2.0f, 4.0f },
            TDataPoint{ 3.0f, 6.0f },
        }
    );

	// Quadratic - 3 data points
    DoTest<2>(
        {
            TDataPoint{ 1.0f, 5.0f },
            TDataPoint{ 2.0f, 16.0f },
            TDataPoint{ 3.0f, 31.0f },
        }
    );

	// Cubic - 4 data points
    DoTest<3>(
        {
            TDataPoint{ 1.0f, 5.0f },
            TDataPoint{ 2.0f, 16.0f },
            TDataPoint{ 3.0f, 31.0f },
            TDataPoint{ 4.0f, 16.0f },
        }
    );

	// Cubic - 2 data points
    DoTest<3>(
        {
            TDataPoint{ 1.0f, 7.0f },
            TDataPoint{ 3.0f, 17.0f },
        }
    );

	// Cubic - 1 data point
    DoTest<3>(
        {
            TDataPoint{ 1.0f, 7.0f },
        }
    );

	// Cubic - 0 data points
    DoTest<3>(
        {
        }
    );

    system("pause");
    return 0;
}

Feedback

There’s some interesting feedback on twitter.

Links

Here’s an interactive demo to let you get a feel for how least squares curve fitting behaves:
Least Squares Curve Fitting

Wolfram Mathworld: Least Squares Fitting
Wikipedia: Least Squares Fitting
Inverting a 2×2 Matrix
Inverting Larger Matrices

A good online polynomial curve fitting calculator

By the way, the term for an algorithm which works incrementally by taking only some of the data at a time is called an “online algorithm”. If you are ever in search of an online algorithm to do X (whatever X may be), using this term can be very helpful when searching for such an algorithm, or when asking people if such an algorithm exists (like on stack exchange). Unfortunately, online is a bit overloaded in the modern world, so it can also give false hits (;
Wikipedia: Online algorithm

The Secret to Writing Fast Code / How Fast Code Gets Slow

This is a “soft tech” post. If that isn’t your thing, don’t worry, I’ll be returning to some cool “hard tech” and interesting algorithms after this. I’ve been abusing the heck out of the GPU texture sampler lately, so be on the lookout for some posts on that soon (;

I’m about to show you some of the fastest code there is. It’s faster than the fastest real time raytracer, it’s faster than Duff’s Device.

Heck, despite the fact that it runs on a classical computer, it runs faster than Shor’s Algorithm which uses quantum computing to factor integers so quickly that it breaks modern cryptographic algorithms.

This code also runs faster than Grover’s Algorithm which is another quantum algorithm that can search an unsorted list in O(sqrt(N)).

Even when compiled in debug it runs faster than all of those things.

Are you ready? here it is…

// Some of the fastest code the world has ever seen
int main (int argc, char **argc)
{
    return 0;
}

Yes, the code does nothing and that is precisely why it runs so fast.

The Secret to Writing Fast Code

The secret to writing fast code, no matter what you are writing is simple: Don’t do anything that is too slow.

Follow me on a made up example to see what I’m talking about.

Let’s say you started with a main() function like i showed above and you decided you want to make a real time raytracer that runs on the CPU.

First thing you do is figure out what frame rate you want it to run at, at the desired resolution. From there, you know how many milliseconds you have to render each frame, and now you have a defined budget you need to stay inside of. If you stay in that budget, you’ll consider it a real time raytracer. If you go outside of that budget, it will no longer be real time, and will be a failed program.

You may get camera control working and primary rays intersecting a plane, and find you’ve used 10% of your budget and 90% of the budget remains. So far so good.

Next up you add some spheres and boxes, diffuse and specular shade them with a directional light and a couple point lights. You find that you’ve used 40% of your budget, and 60% remains. We are still looking good.

Next you decide you want to add reflection and refraction, allowing up to 3 ray bounces. You find you are at 80% of your budget and are still looking good. We are still running fast enough to be considered real time.

Now you say to yourself “You know what? I’m going to do 4x super sampling for anti aliasing!”, so you shoot 4 rays out per pixel instead of 1 and average them.

You profile and uh oh! You are at 320% of your budget! Your ray tracer is no longer real time!

What do you do now? Well, hopefully it’s obvious: DON’T DO THAT, IT’S TOO SLOW!

So you revert it and maybe drop in some FXAA as a post processing pass on your render each frame. Now you are at 95% of your budget let’s say.

Now you may want to add another feature, but with only 5% of your budget left you probably don’t have much performance to spare to do it.

So, you implement whatever it is, find that you are at 105% of your budget.

Unlike the 4x super sampling, which was 220% overbudget, this new feature being only 5% over budget isn’t THAT much. At this point you could profile something that already exists (maybe even your new feature) and see if you can improve it’s performance, or if you can find some clever solution that gives you a performance boost, at the cost of things you don’t care about, you can do that to get some performance back. This is a big part of the job as a successful programmer / software engineer – make trade offs where you gain benefits you care about, at the cost of things you do not care about.

At this point, you can also decide if this new feature is more desired than any of the existing features. If it is, and you can cut an old feature you don’t care about anymore, go for it and make the trade.

Rinse and repeat this process with new features and functionality until you have the features you want, that fit within the performance budget you have set.

Follow this recipe and you too will have your very own real time raytracer (BTW related:Making a Ray Traced Snake Game in Shadertoy).

Maintaining a performance budget isn’t magic. It’s basically subtractive synthesis. Carve time away from your performance budget by adding a feature, then optimize or remove features if you are over budget. Rinse and repeat until the sun burns out.

Ok, so if it’s so easy, why do we EVER have performance problems?

How Fast Code Gets Slow

Performance problems come up when we are not paying attention. Sometimes we cause them for ourselves, and sometimes things outside of our control cause them.

The biggest way we cause performance problems for ourselves is by NOT MEASURING.

If you don’t know how your changes affect performance, and performance is something you care about, you are going to have a bad time.

If you care about performance, measure performance regularly! Profile before and after your changes and compare the differences. Have automated tests that profile your software and report the results. Understand how your code behaves in the best and worst case. Watch out for algorithms that sometimes take a lot longer than their average case. Stable algorithms make for stable experiences (and stable frame rates in games). This is because algorithms that have “perf spikes” sometimes line up on the same frame, and you’ll have more erratic frame rate, which makes your game seem much worse than having a stable but lower frame rate.

But, again, performance problems aren’t always the programmers fault. Sometimes things outside of our control change and cause us perf problems.

Like what you might ask?

Well, let’s say that you are tasked with writing some very light database software which keeps track of all employee’s birthdays.

Maybe you use a hash map to store birthdays. The key is the string of the person’s name, and the value is a unix epoch timestamp.

Simple and to the point. Not over-engineered.

Everything runs quickly, your decisions about the engineering choices you made were appropriate and your software runs great.

Now, someone else has a great idea – we have this database software you wrote, what if we use it to keep track of all of our customers and end user birthdays as well?

So, while you are out on vacation, they make this happen. You come back and the “database” software you made is running super slow. There are hundreds of thousands of people stored in the database, and it takes several seconds to look up a single birthday. OUCH!

So hotshot, looks like your code isn’t so fast huh? Actually no, it’s just that your code was used for something other than the original intended usage case. If this was included in the original specs, you would have done something different (and more complex) to handle this need.

This was an exaggerated example, but this sort of thing happens ALL THE TIME.

If you are working on a piece of software, and the software requirements change, it could turn any of your previous good decisions into poor decisions in light of the new realities.

However, you likely don’t have time to go back and re-think and possibly re-work every single thing you had written up to that point. You move onward and upward, a little more heavy hearted.

The target moved, causing your code to rot a bit, and now things are likely in a less than ideal situation. You wouldn’t have planned for the code you have with the info you have now, but it’s the code you do have, and the code you have to stick with for the time being.

Every time that happens, you incur a little more tech debt / code complexity and likely performance problems as well.

You’ll find that things run a little slower than they should, and that you spend more time fighting symptoms with small changes and somewhat arbitrary rules – like telling people not to use name lengths more than 32 characters for maximum performance of your birthday database.

Unfortunately change is part of life, and very much part of software development, and it’s impossible for anyone to fully predict what sort of changes might be coming.

Those changes are often due to business decisions (feedback on product, jockying for a new position in the marketplace, etc), so are ultimately what give us our paychecks and are ultimately good things. Take it from me, who has worked at ~7 companies in 15 years. Companies that don’t change/adapt die.

So, change sucks for our code, but it’s good for our wallets and keeps us employed 😛

Eventually the less than ideal choices of the past affecting the present will reach some threshold where something will have to be done about it. This will likely happen at the point that it’s easier to refactor some code, than to keep fighting the problems it’s creating by being less than ideal, or when something that really NEEDS to happen CAN’T happen without more effort than the refactor would take.

When that happens, the refactor comes in, where you DO get to go back and rethink your decisions, with knowledge of the current realities.

The great thing about the refactor is that you probably have a lot of stuff that your code is doing which it doesn’t really even NEED to be doing.

Culling that dead functionality feels great, and it’s awesome watching your code become simple again. It’s also nice not having to explain why that section of code behaves the way it does (poorly) and the history of it coming to be. “No really, I do know better, but…!!!”

One of the best feelings as a programmer is looking at a complex chunk of code that has been a total pain, pressing the delete key, and getting a little bit closer back to the fastest code in the world:

// Some of the fastest code the world has ever seen
int main (int argc, char **argc)
{
    return 0;
}

PS: Another quality of a successful engineer is being able to constantly improve software as it’s touched. If you are working in an area of code, and you see something ugly that can be fixed quickly and easily, do it while you are there. Since the only constant in software development is change, and change causes code quality to continually degrade, make yourself a force of continual code improvement and help reverse the flow of the code flowing into the trash can.

Engines

In closing, I want to talk about game engines – 3rd party game engines, and re-using an engine from a different project. This also applies to using middleware.

Existing engines are great in that when you and your team know how to use them, you can get things set up very quickly. It lets you hit the ground running.

However, no engine is completely generic. No engine is completely flexible.

That means that when you use an existing engine, there will be some amount of features and functionality which were made without your specific usage case in mind.

You will be stuck in the world where from day 1 you are incurring the tech debt type problems I describe above, but you will likely be unable or unwilling to refactor everything to suit your needs specifically.

I don’t mention this to say that engines are bad. Lots of successful games have used engines made by other people, or re-used engines from previous projects.

However, it’s a different kind of beast using an existing engine.

Instead of making things that suit your needs, and then using them, you’ll be spending your time figuring out how to use the existing puzzle pieces to do what you want. You’ll also be spending time backtracking as you hit dead ends, or where your first cobbled together solution didn’t hold up to the new realities, and you need to find a new path to success that is more robust.

Just something to be aware of when you are looking at licensing or re-using an engine, and thinking that it’ll solve all your problems and be wonderful. Like all things, it comes at a cost!

Using an existing engine does put you ahead of the curve: At day 1 you already have several months of backlogged technical debt!

Unfortunately business realities mean we can’t all just always write brand new engines all the time. It’s unsustainable

Agree / Disagree / Have something to say?

Leave a comment below, or tweet at me on twitter: @Atrix256

Minimizing Code Complexity by Programming Declaratively

Writing good code is something all programmers aspire to, but the definition of what actually makes good code can be a bit tricky to pin down. The idea of good code varies from person to person, from language to language, and also varies between problem domains. Web services, embedded devices and game programming are few software domains that all have very different needs and so also have very different software development styles, methods and best practices.

I truly believe that we are in the stone ages of software development (ok, maybe the bronze age?), and that 100 years from now, people will be doing things radically differently than we do today because they (or we) will have figured out better best practices, and the languages of the day will usher people towards increased success with decreased effort.

This post is on something called declarative programming. The idea is nothing new, as prolog from 1972 is a declarative language, but the idea of declarative programming is something I don’t think is talked about enough in the context of code quality.

By the end of this read, I hope you will agree that programming declaratively by default is a good best practice that pertains to all languages and problem domains. If not, leave a comment and let me know why!

Declarative vs Imperative Programming

Declarative programming is when you write code that says what to do. Imperative programming is when you write code that says how to do it.

Below is some C++ code written imperatively. How long does it take you to figure out what the code is doing?

	int values[4] = { 8, 23, 2, 4 };
	int sum = 0;
	for (int i = 0; i < 4; ++i)
		sum += values[i];
	int temp = values[0];
	for (int i = 0; i < 3; ++i)
		values[i] = values[i + 1];
	values[3] = temp;

Hopefully it didn’t take you very long to understand the code, but you had to read it line by line and reason about what each piece was doing. It may not be difficult, but it wasn’t trivial.

Here is the same code with some comments, which helps it be understandable more quickly, assuming the comments haven’t become out of date (:

	// Declare array
	int values[4] = { 8, 23, 2, 4 };

	// Calculate sum
	int sum = 0;
	for (int i = 0; i < 4; ++i)
		sum += values[i];

	// Rotate array items one slot left.
	int temp = values[0];
	for (int i = 0; i < 3; ++i)
		values[i] = values[i + 1];
	values[3] = temp;

Here is some declarative code that does the same thing:

	int values[4] = { 8, 23, 2, 4 };
	int sum = SumArray(values);
	RotateArrayIndices(values, -1);

The code is a lot quicker and easier to understand. In fact the comments aren’t even needed anymore because the code is basically what the comments were.

Comments are often declarative, saying what to do right next to the imperative code that says how to do it. If your code is also declarative though, there is no need for the declarative comments because they are redundant! In fact, if you decide to start trying to write code more declaratively, one way to do so is if you ever find yourself writing a declarative comment to explain what some code is doing, wrap it in a new function, or see if there is an existing function you ought to be using instead.

As a quick tangent, you can use the newer C++ features to make code more declarative, like the below. You arguably should be doing that when possible, if your code base uses STL, a custom STL implementation, or an in house STL type replacement, but I want to stress that this is a completely separate topic than whether or not we should be using new C++ features. Some folks not used to STL will find the below hard to read compared to the last example, which takes away from the main point. So, if you aren’t a fan of STL due to it’s readability (I agree!), or it’s performance characteristics (I also agree!), don’t worry about it. For people on the other side of the fence, you can take this as a pro STL argument though, as it does make code more declarative, if the readability and perf things aren’t impacting you.

	std::array<int,4> values = { 8, 23, 2, 4 };
	int sum = std::accumulate(values.begin(), values.end(), 0);
	std::rotate(values.begin(), values.begin() + 1, values.end());

We Already Live in a Semi-Declarative World

When reading the tip about using (declarative) comments as a hint for when to break some functionality out into another function, you may be thinking to yourself: “Wait, isn’t that just the rule about keeping functions small, like to a few lines per function?”

Yeah, that is true. There is overlap between that rule and writing declarative code. IMO declarative code is a more general version of that rule. That rule is part of making code declarative, and gives some of the benefits, but isn’t the whole story.

The concept of D.R.Y. “Don’t Repeat Yourself” also ends up causing your code to become more declarative. When you are repeating yourself, it’s often because you are either duplicating code, or because there is boiler plate code that must be added in multiple places to make something work. By applying D.R.Y. and making a single authoritative source of your information or work, you end up taking imperative details out of the code, thus making what remains more declarative. For more information on that check out this post of mine: Macro Lists For The Win

TDD

If your particular engineering culture uses TDD (test driven development), you may also say “Hey, this isn’t actually anything special, this is also what you get when you use TDD.”

Yes, that is also true. Test driven development forces you to write code such that each individual unit of work is broken up into it’s own contained, commonly stateless, function or object.

It’s suggested that the biggest value of TDD comes not from the actual testing, but from how TDD forces you to organize your code into small logical units of work, that are isolatable from the rest of the code.

In other words, TDD forces you to make smaller functions that do exactly what they say by their name and nothing else. Sound familiar? Yep, that is declarative programming.

Compilers, Optimizers and Interpreters

The whole goal of compilers, optimizers and interpreters is to make it so you the coder can be more declarative and less imperative.

Compilers make it so you don’t have to write assembly (assembly being just about as imperative as you can get!). You can instead write higher level concepts about what you want done – like loop over an array and sum up the values – instead of having to write the assembly (or machine code!) to load values into memory or registers, do work, and write them back out to memory or registers.

Similarly, the whole goal of optimizers are to take code where you describe what you want to happen, and find a way to do the equivalent functionality in a faster way. In other words, you give the WHAT and it figures out the HOW. That is declarative programming.

Interestingly, switch statements are declarative as well. You tell the compiler what items to test for at run time but leave it up to the compiler to figure out how to test for them. It turns out that switch statements can decide at compile time whether they want to use binary searching, if/else if statements, or other tricks to try and make an efficient lookup for the values you’ve provided.

Surprised to hear that? Give this a read: Something You May Not Know About the Switch Statement in C/C++

Similarly, profile guided optimization (PGO) is a way for the optimizer to know how your code actually behaves at runtime, to get a better idea at what machine code it ought to generate to be more optimal. Again, it’s taking your more declarative high level instructions, and creating imperative low level instructions that actually handle the HOW of doing what your code wants to do in a better way.

C#

If you’ve spent any time using C#, I’ll bet you’ve come to the same conclusion I have: If it takes you more than one line of code to do a single logical unit of work (read a file into a string, sort a list, etc), then you are probably doing it wrong, and there is probably some built in functionality already there to do it for you.

When used “correctly”, C# really tends to be declarative.

C++ Advancements Towards Being Declarative

In the old days of C, there were no constructors or destructors. This meant that you had to code carefully and initialize, deinitialize things at just the right time.

These were imperative details that if you got wrong, would cause bugs and make a bad day for you and the users of your software.

C++ improved on this problem by adding constructors and destructors. You could now put these imperative details off in another section and then not worry about it in the bulk of the code. C++ made C code more declarative by letting you focus more on the WHAT to do, and less on HOW to do it, in every line of code.

In more recent years, we’ve seen C++ get a lot of upgrades, many of which make C++ more declarative. In other words, common things are now getting language and/or STL library support.

For one, there are many operations built in which people used to do by hand that are now built in – such as std::sort or std::qsort. You no longer have to write out a sorting algorithm imperatively, you just use std::sort and move on.

Another really good example of C++ getting more declarative is lambdas. Lambdas look fancy and new, but they are really just a syntactic shortcut to doing something we could do all along. When you make a lambda, the compiler makes a class for you that overloads the parentheses operator, has storage for your captures and captures those captures. A struct that looks like this is called a functor and has existed for a long time before lambdas ever entered C++. The only difference is that if you want to use a functor now, you don’t have to go through a bunch of nitty gritty imperative details for making your functor class. Now, you just defined a lambda and move on.

Domain Specific Languages

Domain specific languages – aka DSLs – exist to let people write code meant for specific purposes. Some examples of DSLs are:

HTML – what we use to make static webpages
XSLT – a language to transform XML data into other data
SQL – a language to query information from databases
Regex – a language to search text

Believe it or not, DSL is a synonym of declarative programming languages.

HTML for instance completely cuts out things like loops, memory allocation and image loading, and lets you just specify how a web page should look. HTML is declarative because you deal only with the issues in the problem space, not with the imperative details of how to make it all happen.

It’s similar for the others in the list, and other DSLs not on the list. They all try to remove complexity you don’t care about to try and distill a language that deals only with the things in the problem space.

Our Job As Programmers

As programmers, it’s only one part of our job to make “good code” that is easy to read and easy to maintain, but many non programmers would laugh to hear that we spend so much time thinking about that.

The other part of our job is the end result of what our program does. This is what non programmers focus more heavily on of course, and is ultimately what makes software successful or not – at least in the short term. Software needs to do good stuff well to be successful, but if you don’t make good code, you are going to sink your business in bugs, inflexibility, maintenance costs, etc.

Programmers mainly either write code for use by other programmers (such as libraries and APIs), or they make software for use by other people.

In both cases, the job is really that we are trying to hide away imperative details (implementation complexity) and give our customers a domain specific language to do what they want to do in the easiest and best way possible. It’s very important in both cases that the users of your API or the users of your software don’t have to deal with things outside the problem space. They want to work declaratively, saying only what to do, and have our software handle the imperative details of how to do it. They paid for the software so they didn’t have to deal with those details.

As an example, when you work in an excel spreadsheet and it does an average of a row of columns, it doesn’t make you decide whether it should use SIMD instructions to do the math or scalar instructions. You don’t really care, and it almost certainly doesn’t matter enough to anyone using excel which to do, so excel just does whatever it does internally to give you what you asked for.

It can be a challenge knowing what needs to be hidden away when making an API or software for users, but that comes from understanding what it is that your customers actually need and what they are trying to do, which is already a super important step.

The good news is that you don’t have to perfectly match the customers needs to improve their lives. Any imperative details that you can remove is a win. I’m not talking about taking away abilities that people want and should have, I’m talking about removing “chores”, especially ones that if done wrong can cause problems – like nulling out a pointer after deleting it, or initializing all members of a class when creating an object, or the details of loading an image into memory.

None of this should really be that surprising to anyone, but hopefully thinking of these things in a declarative vs imperative way formalizes and generalizes some ideas.

Why Wouldn’t You Program Declaratively?

Purely declarative programming means that you only describe the things you care about and nothing else. If this is true, why wouldn’t you ALWAYS program declaratively? In fact, why do imperative languages even exist? Why would you ever want to waste time dealing with what you by definition did not care about?

Well, for one, it’s impossible to nail down what it is exactly that people do and do not care about, especially in something like a programming language which is likely to be used for lots of different purposes by lots of different people. It’s been real eye opening seeing the diverse needs of the C++ community in recent years for instance. As a C++ game programmer, surrounded by primarily C++ game programmers, I thought I knew what the language needed, but there are lots of things I never considered because I don’t have a need for, unlike some other C++ programmers out there.

Another big point is that declarative languages by definition are a sort of black box. You tell it what to do but not how. It has to figure out the details of how to do it in a good way. The problem is that the compiler (or similar process) has limited abilities to make these choices, and also has limited information about the problem space.

For instance, a declarative language may let you work with a set and say “put item X into the set” and “does item Y exist in this set?”. You can imagine it could perhaps use a hash table, where each hash bucket was a linked list of values. This way, when you queried if the item Y was in the set, it could hash it, then do comparisons against whatever items were in that bucket.

That implementation is fairly reasonable for many programs.

What if instead, you want to keep a set of unique visitors to a popular website, like say google.com? That set is going to use a ton of memory and/or be very slow because it’s going to be HUGE.

In that case, someone is likely to go with a probabilistic algorithm perhaps (maybe a bloom filter), where it’s ok that the answer isn’t exactly right, because the memory usage and computation time drops off significantly going probabilistic, and actually makes the feature possible.

The declarative language is likely not going to be smart enough to figure out that it should use a probabilistic algorithm, and nobody ever told it that it could.

Sure, you could add probabilistic set support to the declarative language, and people could specifically ask for it when they need it (they care about it, so it should be part of the DSL), but we could make this argument about many other things. The point is just that without super smart AI and lots more information (and freedom to make decisions independently of humans), a declarative language is going to be pretty limited in how well it can do in all situations.

Because of this, it’s important for the programmer to be able to profile processing time and other resource usage, and be able to “go imperative” where needed to address any problems that come up.

This is similar to how when writing C++, when we REALLY NEED TO, we can write some inline assembly. The C++ is the more declarative language, that allows you to write more imperative assembly when you need to.

It’s important to note that I’m not saying that declarative programming is inherently slower than imperative programming though. Declarative languages can actually be much faster and more efficient with resources believe it or not. In the example at the beginning of the post where i used std::rotate to replace a loop that moved items in an array, it’s possible that std::rotate uses a memmove to move the bulk of the items, instead of an item by item copy like what I coded. That would be a much better solution, especially for large array sizes.

So, declarative programming isn’t necessarily slower than imperative programming, but, for the times it isn’t doing well enough, we need a way to turn off “auto pilot” mode and give imperative instructions for how to do something better.

In more human terms: If you asked someone to go get the mail, you likely will say “can you get my mail? here’s the key and it’s in box 62.”. You wouldn’t tell the person how to walk to the door, open it, walk out, close it, etc. However, if there were special instructions such as “please check the package locker too”, you would give those details.

Basically, you give only the details that are needed, as simply as possible, but reserve the right to give as fine grained details as needed, when they are needed.

So, i propose this:

We as programmers ought to be programming declaratively by default, only resorting to imperative programming when we need to.
Our job is to empower our customers to work declaratively by making them good DSLs (aka interfaces), but we should remember that it might be important to let them also go more imperative when needed.

Thanks for reading, and let me know your thoughts!

Links

Here are some interesting links about managing code complexity and writing high quality code:
Youtube: Simplicity Matters (36 minutes)
Functions Should Be Short And Sweet, But Why?
Bitsquid: Managing Coupling
Thoughts on Declarative and Imperative Languages
Declarative vs. Imperative Programming for the Web

Low Tech Homomorphic Encryption

Homomorphic encryption is a special type of encryption that lets you do calculations on encrypted values as if they weren’t encrypted. One reason it’s desired is that secure computing could be done in the cloud, if practical homomorphic encryption were available.

Homomorphic encryption has been a hot research topic since 2009, when Craig Gentry figured out a way to do it while working on his PhD. Since then, people have been working on making it better, faster and more efficient.

You can read more about a basic incarnation of his ideas in my blog posts:
Super Simple Symmetric Leveled Homomorphic Encryption Implementation
Improving the Security of the Super Simple Symmetric Leveled Homomorphic Encryption Implementation

This post is about a low tech type of homomorphic encryption that anyone can easily do and understand. There is also some very simple C++ that implements it.

This idea may very well be known about publically, but I’m not aware of any places that talk about it. I may just be ignorant of them though so ::shrug::

Quick Update

I’ve gotten some feedback on this article, the most often feedback being that this is obfuscation not encryption. I think that’s a fair assessment as the secret value you are trying to protect is in no way transformed, but is just hidden. This post could easily be titled Homomorphic Obfuscation, and perhaps should be.

To see other feedback and responses to this post, check out the reddit links at the bottom!

The Idea

The idea is actually super simple:

Take the value you want to encrypt.
Hide it in a list of a bunch of other random values, and remember where it is in the list. The position in the list is your key.
Send this list to an untrusted party.
They do the same calculation on every item in the list and send it back.
Since you know which value was your secret value, you know which answer is the one you care about.

At the end of that process, you have the resulting value, and they have no idea what value was your secret value. You have done, by definition, homomorphic encryption!

There is a caveat of course… they know that your secret value was ONE of the values on the list.

Security Details

The thing here is that security is a sliding scale between resource usage (computation time, RAM, network bandwidth, etc) and security.

The list size is your security parameter in this case.

A larger list of random values means that it takes longer to transfer the data, more memory to store it, it takes longer to do the homomorphic computations, but the untrusted party is less sure about what your secret value is.

On the other hand, a shorter list is faster to transmit, easier to store, quicker to compute with, but the untrusted party has a better idea what your secret value is.

For maximal security you can just take this to the extreme – if your secret value is a 32 bit floating point number, you could make a list with all possible 2^32 floating point numbers in it, have them do the calculation and send it back. You can even do an optimization here and not even generate or send the list, but rather just have the person doing the calculations generate the full 2^32 float list, do the calculations, and send you the results.

That gets pretty big pretty fast though. That list would actually be 16 gigabytes, but the untrusted party would know almost nothing about your value, other than it can be represented by a 32 bit floating point number.

Depending on your security needs, you might be ok with shortening your list a bit to bring that number down. Making your list only be one million numbers long (999,999 random numbers and one value you actually care about), your list is only 3.8 megabytes.

Not quite as bad.

Some Interesting Abilities

Using this homomorphic encryption, like other homomorphic encryption, you can do computation involving multiple encrypted values. AKA you could multiply two encrypted values together. To do this, you are going to need to encrypt all values involved using the same key. In other words, they are going to have to be at the same index in each of their respective lists of random numbers.

Something else that is interesting is that you can also encode MULTIPLE secret values in your encrypted value list. You could have 1 secret value at index 50 and another at index 100 for instance. Doing this, you get a sort of homomorphic SIMD setup.

Homomorphic SIMD is actually a real thing in other homomorphic encryption methods as well. Check out this paper for instance:
Fully Homomorphic SIMD Operations

The only problem with homomorphic SIMD is that adding more secret values to the same encrypted list decreases the security, since there are more values in the list that you don’t want other people to know about.

You can of course also modify encrypted values by unencrypted values. You could multiply an encrypted value by 3, by multiplying every value in the list by 3.

Extending to Public Key Cryptography

If you wanted to use asymmetric key cryptography (public/private keys) instead of symmetric key cryptography, that is doable here too.

What you would do is have the public key public as per usual, and that key would be used in a public key algorithm to encrypt the index of the secret value in the random list.

Doing this, the person who has the private key would be able to receive the list and encrypted index, decrypt the index, and then get the secret value out using that index.

Sample Code Tests

The sample code only does Symmetric key encryption, and does these 3 tests:

Encrypts two floating point numbers into a single list, SIMD style, does an operation on the encrypted values, then unencrypts and verifies the results.
Does the same with two sets of floats (three floats in each set), to show how you can make encrypted values interact with each other. Does the operation, then unencrypts and verifies the results.
Encrypts three values of a 3 byte structure, does an operation on the encrypted values, then unencrypts and verifies the results.

All secret data was hidden in lists of 10,000,000 random values. That made the first two tests (the ones done with 4 byte floats) have encrypted files of 38.1MB (40,000,000 bytes), and the last test (the one done with a 3 byte struct) had a file size of 28.6 MB (30,000,000 bytes).

Here are the timing of the above tests:

Sample Code

Here is the header, LTHE.h:

/*

Written by Alan Wolfe
http://blog.demofox.org
Tweets by Atrix256

*/

#pragma once
#include <vector>
#include <random>

// A static class with template functions in it.
// A namespace would be nice, except I want to hide some things as private.
class LTHE
{
public:

    //=================================================================================
    template <typename T>
    static bool Encrypt (std::vector<T> values, size_t listSize, const char* fileName, std::vector<size_t>& keys, bool generateKeys = true)
    {
        // Make sure we have a list that is at least as long as the values we want to encrypt
        if (values.size() > listSize)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): values.size() > listSize.n");
            return false;
        }

        // Generate a list of keys if we are told to
        // Ideally you want to take the first M items of a cryptographically secure shuffle
        // of N items.
        // This could be done with format preserving encryption or some other method
        // to make it not roll and check, and also more secure random.
        if (generateKeys)
        {
            keys.clear();
            for (size_t i = 0, c = values.size(); i < c; ++i)
            {
                size_t newKey;
                do
                {
                    newKey = RandomInt<size_t>(0, listSize - 1);
                }
                while (std::find(keys.begin(), keys.end(), newKey) != keys.end());
                keys.push_back(newKey);
            }
        }

        // make a file of random values, size of T, count of <listSize> 
        FILE *file = fopen(fileName, "w+b");
        if (!file)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.n", fileName);
            return false;
        }

        // Note: this may not be the most efficient way to generate this much random data or 
        // write it all to the file.
        // In a real crypto usage case, you'd want a crypto secure random number generator.
        // You'd also want to make sure the random numbers had the same properties as your
        // input values to help anonymize them better.
        // Like if your numbers are not whole numbers, you don't want to generate only whole numbers.
        // Or if your numbers are salaries, you may not want purely random values, but more "salaryish"
        // looking numbers.
        // You could alternately just do all 2^N possible values which would definitely anonymize
        // the values you wanted to encrypt.  This is maximum security, but also takes most
        // memory and most processing time.
        size_t numUint32s = (listSize * sizeof(T)) / sizeof(uint32_t);
        size_t numExtraBytes = (listSize * sizeof(T)) % sizeof(uint32_t);
        for (size_t i = 0; i < numUint32s; ++i)
        {
            uint32_t value = RandomInt<uint32_t>();
            if (fwrite(&value, sizeof(value), 1, file) != 1)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write random numbers (uint32s).n");
                fclose(file);
                return false;
            }
        }
        for (size_t i = 0; i < numExtraBytes; ++i)
        {
            uint8_t value = RandomInt<uint8_t>();
            if (fwrite(&value, sizeof(value), 1, file) != 1)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write random numbers (extra bytes).n");
                fclose(file);
                return false;
            }
        }

        // Now put the values in the file where they go, based on their key
        for (size_t i = 0, c = values.size(); i < c; ++i)
        {
            long pos = (long)(keys[i] * sizeof(T));
            if (fseek(file, pos, SEEK_SET) != 0)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not fseek.n");
                fclose(file);
                return false;
            }
            if (fwrite(&values[i], sizeof(values[i]), 1, file) != 1)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write secret value.n");
                fclose(file);
                return false;
            }
        }

        // close file and return success
        fclose(file);
        return true;
    }

    //=================================================================================
    template <typename T, typename LAMBDA>
    static bool TransformHomomorphically (const char* srcFileName, const char* destFileName, const LAMBDA& function)
    {
        // open the source and dest file if we can
        FILE *srcFile = fopen(srcFileName, "rb");
        if (!srcFile)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.n", srcFileName);
            return false;
        }
        FILE *destFile = fopen(destFileName, "w+b");
        if (!destFile)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.n", destFileName);
            fclose(srcFile);
            return false;
        }

        // Process the data in the file and write it back out.
        // This could be done much better.
        // We could read more from the file at once.
        // We could use SIMD.
        // We could go multithreaded.
        // We could do this on the GPU for large data sets and longer transformations! Assuming data transfer time isn't too prohibitive.
        // We could decouple the disk access from processing, so it was reading and writing while it was processing.
        const size_t c_bufferSize = 1024;
        std::vector<T> dataBuffer;
        dataBuffer.resize(c_bufferSize);
        size_t elementsRead;
        do
        {
            // read data from the source file
            elementsRead = fread(&dataBuffer[0], sizeof(T), c_bufferSize, srcFile);

            // transform the data
            for (size_t i = 0; i < elementsRead; ++i)
                dataBuffer[i] = function(dataBuffer[i]);

            // write the transformed data to the dest file
            if (fwrite(&dataBuffer[0], sizeof(T), elementsRead, destFile) != elementsRead)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write transformed elements.n");
                fclose(srcFile);
                fclose(destFile);
                return false;
            }
        }
        while (!feof(srcFile));

        // close files and return success
        fclose(srcFile);
        fclose(destFile);
        return true;
    }

    //=================================================================================
    template <typename T, typename LAMBDA>
    static bool TransformHomomorphically (const char* src1FileName, const char* src2FileName, const char* destFileName, const LAMBDA& function)
    {
        // open the source and dest file if we can
        FILE *srcFile1 = fopen(src1FileName, "rb");
        if (!srcFile1)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.n", src1FileName);
            return false;
        }
        FILE *srcFile2 = fopen(src2FileName, "rb");
        if (!srcFile2)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.n", src2FileName);
            fclose(srcFile1);
            return false;
        }
        FILE *destFile = fopen(destFileName, "w+b");
        if (!destFile)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for writing.n", destFileName);
            fclose(srcFile1);
            fclose(srcFile2);
            return false;
        }

        // Process the data in the file and write it back out.
        // This could be done much better.
        // We could read more from the file at once.
        // We could use SIMD.
        // We could go multithreaded.
        // We could do this on the GPU for large data sets and longer transformations! Assuming data transfer time isn't too prohibitive.
        // We could decouple the disk access from processing, so it was reading and writing while it was processing.
        const size_t c_bufferSize = 1024;
        std::vector<T> dataBuffer1, dataBuffer2;
        dataBuffer1.resize(c_bufferSize);
        dataBuffer2.resize(c_bufferSize);
        size_t elementsRead1;
        size_t elementsRead2;
        do
        {
            // read data from the source files
            elementsRead1 = fread(&dataBuffer1[0], sizeof(T), c_bufferSize, srcFile1);
            elementsRead2 = fread(&dataBuffer2[0], sizeof(T), c_bufferSize, srcFile2);

            if (elementsRead1 != elementsRead2)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Different numbers of elements in each file!n");
                fclose(srcFile1);
                fclose(srcFile2);
                fclose(destFile);
                return false;
            }

            // transform the data
            for (size_t i = 0; i < elementsRead1; ++i)
                dataBuffer1[i] = function(dataBuffer1[i], dataBuffer2[i]);

            // write the transformed data to the dest file
            if (fwrite(&dataBuffer1[0], sizeof(T), elementsRead1, destFile) != elementsRead1)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not write transformed elements.n");
                fclose(srcFile1);
                fclose(srcFile2);
                fclose(destFile);
                return false;
            }
        }
        while (!feof(srcFile1));

        // close files and return success
        fclose(srcFile1);
        fclose(srcFile2);
        fclose(destFile);
        return true;
    }

    //=================================================================================
    template <typename T>
    static bool Decrypt (const char* fileName, std::vector<T>& values, std::vector<size_t>& keys)
    {
        // Open the file if we can
        FILE *file = fopen(fileName, "rb");
        if (!file)
        {
            fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not open %s for reading.n", fileName);
            return false;
        }

        // Read the values from the file.  The key is their location in the file.
        values.clear();
        for (size_t i = 0, c = keys.size(); i < c; ++i)
        {
            long pos = (long)(keys[i] * sizeof(T));
            if (fseek(file, pos, SEEK_SET) != 0)
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not fseek.n");
                fclose(file);
                return false;
            }
            T value;
            if (!fread(&value, sizeof(T), 1, file))
            {
                fprintf(stderr, "ERROR in " __FUNCTION__ "(): Could not decrypt value for key.n");
                fclose(file);
                return false;
            }
            values.push_back(value);
        }

        // Close file and return success
        fclose(file);
        return true;
    }

private:
    template <typename T>
    static T RandomInt (T min = std::numeric_limits<T>::min(), T max = std::numeric_limits<T>::max())
    {
        static std::random_device rd;
        static std::mt19937 mt(rd());
        static std::uniform_int<T> dist(min, max);
        return dist(mt);
    }
};

And here is the test program, main.cpp:

#include <stdio.h>
#include "LTHE.h"
#include <chrono>

//=================================================================================
// times a block of code
struct SBlockTimer
{
    SBlockTimer()
    {
        m_start = std::chrono::high_resolution_clock::now();
    }

    ~SBlockTimer()
    {
        std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
        printf("    %0.2f secondsn", seconds.count());
    }

    std::chrono::high_resolution_clock::time_point m_start;
};

//=================================================================================
float TransformDataUnitary (float& value)
{
    return (float)sqrt(value * 2.17f + 0.132);
}

//=================================================================================
float TransformDataBinary (float& value1, float value2)
{
    return (float)sqrt(value1 * value1 + value2 * value2);
}

//=================================================================================
struct SStruct
{
    uint8_t x, y, z;

    static SStruct Transform (const SStruct& b)
    {
        SStruct ret;
        ret.x = b.x * 2;
        ret.y = b.y * 3;
        ret.z = b.z * 4;
        return ret;
    }

    bool operator != (const SStruct& b) const
    {
        return b.x != x || b.y != y || b.z != z;
    }
};

//=================================================================================
int Test_FloatUnitaryOperation ()
{
    printf("n----- " __FUNCTION__ " -----n");

    // Encrypt the data
    printf("Encrypting data:  ");
    std::vector<float> secretValues = { 3.14159265359f, 435.0f };
    std::vector<size_t> keys;
    {
        SBlockTimer timer;
        if (!LTHE::Encrypt(secretValues, 10000000, "Encrypted.dat", keys))
        {
            fprintf(stderr, "Could not encrypt data.n");
            return -1;
        }
    }

    // Transform the data
    printf("Transforming data:");
    {
        SBlockTimer timer;
        if (!LTHE::TransformHomomorphically<float>("Encrypted.dat", "Transformed.dat", TransformDataUnitary))
        {
            fprintf(stderr, "Could not transform encrypt data.n");
            return -2;
        }
    }

    // Decrypt the data
    printf("Decrypting data:  ");
    std::vector<float> decryptedValues;
    {
        SBlockTimer timer;
        if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
        {
            fprintf(stderr, "Could not decrypt data.n");
            return -3;
        }
    }

    // Verify the data
    printf("Verifying data:   ");
    {
        SBlockTimer timer;
        for (size_t i = 0, c = secretValues.size(); i < c; ++i)
        {
            if (TransformDataUnitary(secretValues[i]) != decryptedValues[i])
            {
                fprintf(stderr, "decrypted value mismatch!n");
                return -4;
            }
        }
    }

    return 0;
}

//=================================================================================
int Test_FloatBinaryOperation ()
{
    printf("n----- " __FUNCTION__ " -----n");

    // Encrypt the data
    printf("Encrypting data:  ");
    std::vector<float> secretValues1 = { 3.14159265359f, 435.0f, 1.0f };
    std::vector<float> secretValues2 = { 1.0f, 5.0f, 9.0f };
    std::vector<size_t> keys;
    {
        SBlockTimer timer;
        if (!LTHE::Encrypt(secretValues1, 10000000, "Encrypted1.dat", keys))
        {
            fprintf(stderr, "Could not encrypt data.n");
            return -1;
        }
        if (!LTHE::Encrypt(secretValues2, 10000000, "Encrypted2.dat", keys, false)) // reuse the keys made for secretValues1
        {
            fprintf(stderr, "Could not encrypt data.n");
            return -1;
        }
    }

    // Transform the data
    printf("Transforming data:");
    {
        SBlockTimer timer;
        if (!LTHE::TransformHomomorphically<float>("Encrypted1.dat", "Encrypted2.dat", "Transformed.dat", TransformDataBinary))
        {
            fprintf(stderr, "Could not transform encrypt data.n");
            return -2;
        }
    }

    // Decrypt the data
    printf("Decrypting data:  ");
    std::vector<float> decryptedValues;
    {
        SBlockTimer timer;
        if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
        {
            fprintf(stderr, "Could not decrypt data.n");
            return -3;
        }
    }

    // Verify the data
    printf("Verifying data:   ");
    {
        SBlockTimer timer;
        for (size_t i = 0, c = secretValues1.size(); i < c; ++i)
        {
            if (TransformDataBinary(secretValues1[i], secretValues2[i]) != decryptedValues[i])
            {
                fprintf(stderr, "decrypted value mismatch!n");
                return -4;
            }
        }
    }

    return 0;
}

//=================================================================================
int Test_StructUnitaryOperation ()
{
    printf("n----- " __FUNCTION__ " -----n");

    // Encrypt the data
    printf("Encrypting data:  ");
    std::vector<SStruct> secretValues = { {0,1,2},{ 3,4,5 },{ 6,7,8 } };
    std::vector<size_t> keys;
    {
        SBlockTimer timer;
        if (!LTHE::Encrypt(secretValues, 10000000, "Encrypted.dat", keys))
        {
            fprintf(stderr, "Could not encrypt data.n");
            return -1;
        }
    }

    // Transform the data
    printf("Transforming data:");
    {
        SBlockTimer timer;
        if (!LTHE::TransformHomomorphically<SStruct>("Encrypted.dat", "Transformed.dat", SStruct::Transform))
        {
            fprintf(stderr, "Could not transform encrypt data.n");
            return -2;
        }
    }

    // Decrypt the data
    printf("Decrypting data:  ");
    std::vector<SStruct> decryptedValues;
    {
        SBlockTimer timer;
        if (!LTHE::Decrypt("Transformed.dat", decryptedValues, keys))
        {
            fprintf(stderr, "Could not decrypt data.n");
            return -3;
        }
    }

    // Verify the data
    printf("Verifying data:   ");
    {
        SBlockTimer timer;
        for (size_t i = 0, c = secretValues.size(); i < c; ++i)
        {
            if (SStruct::Transform(secretValues[i]) != decryptedValues[i])
            {
                fprintf(stderr, "decrypted value mismatch!n");
                return -4;
            }
        }
    }

    return 0;
}

//=================================================================================
int main (int argc, char **argv)
{
    // test doing an operation on a single encrypted float
    int ret = Test_FloatUnitaryOperation();
    if (ret != 0)
    {
        system("pause");
        return ret;
    }

    // test doing an operation on two encrypted floats
    ret = Test_FloatBinaryOperation();
    if (ret != 0)
    {
        system("pause");
        return ret;
    }

    // test doing an operation on a single 3 byte struct
    ret = Test_StructUnitaryOperation();
    if (ret != 0)
    {
        system("pause");
        return ret;
    }
    
    printf("nAll Tests Passed!nn");
    system("pause");
    return 0;
}

If you found this post interesting or useful, or you have anything to add or talk about, let me know!

Reddit discussion:
r/programming
r/cryptography

A Data Point for MSVC vs Clang Code Generation

I’m a windows PC game developer using MSVC 2015 update 1, working in C++.

More and more, I hear people talk about how great clang is, and that it generates much better code than MSVC, among other niceties.

I have been chalking it up to maybe just being a fad, but keeping my feelers out there to see if I can get some concrete comparitive info.

Well, I’ve stumbled on some of that concrete comparitive info and want to share it with anyone else who is wondering if clang really is better than MSVC. This is just one data point, but it feels like it’s just the tip of the iceberg.

On twitter, Jason Turner (@lefticus) had an interesting tweet:

I wasn’t really sure what he was getting at, so clicked the link to check it out (http://godbolt.org/g/WDpPYq).

It turned out to be very relevant to my interest, because his particular example is comparing a value to a bunch of compile time constants. That is basically the core of what I’ve been looking into with my last few posts asking whether code or data was faster!

This first particular example is comparing a compile time constant to other compile time constants, so the code completely melts away and at runtime just returns the compile time calculated result. That isn’t very interesting, but it is nice to see that clang did so much at compile time. FWIW MSVC was able to do this all at compile time as well, so they are even so far.

What is more interesting is what happens when you test a compile time unknown against compile time constants. Let’s check it out… (https://godbolt.org/g/dKBDSK)

What the assembly does is subtract 1 from the input (it’s unsigned so if 0, wraps around to max int value), and then compares it against 5 to know if it’s in the group or not. Clang realized the numbers were continuous and so made a nice optimization.

In this case, MSVC did similar in release x64:

#include <initializer_list>

template<typename U, typename ... T>
bool one_of(U&& u, T && ... t)
{
    bool match = false;
    (void)std::initializer_list<bool>{ (match = match || u == t)... };
    return match;
}

int main(int argc, char** argv)
{
00007FF62DB61000  lea         eax,[rcx-1]  
00007FF62DB61003  cmp         eax,4  
00007FF62DB61006  setbe       al  
    return one_of(argc, 1, 2, 3, 4, 5);
00007FF62DB61009  movzx       eax,al  
}
00007FF62DB6100C  ret

But in x86 release it did a bunch of if/else if/else if’s!

#include <initializer_list>

template<typename U, typename ... T>
bool one_of(U&& u, T && ... t)
{
    bool match = false;
    (void)std::initializer_list<bool>{ (match = match || u == t)... };
    return match;
}

int main(int argc, char** argv)
{
00331002  in          al,dx  
    return one_of(argc, 1, 2, 3, 4, 5);
00331003  mov         eax,dword ptr [argc]  
00331006  cmp         eax,1  
00331009  je          main+26h (0331026h)  
0033100B  cmp         eax,2  
0033100E  je          main+26h (0331026h)  
00331010  cmp         eax,3  
00331013  je          main+26h (0331026h)  
00331015  cmp         eax,4  
00331018  je          main+26h (0331026h)  
0033101A  cmp         eax,5  
0033101D  je          main+26h (0331026h)  
0033101F  xor         al,al  
00331021  movzx       eax,al  
}
00331024  pop         ebp  
00331025  ret  
    return one_of(argc, 1, 2, 3, 4, 5);
00331026  mov         al,1  
00331028  movzx       eax,al  
}
0033102B  pop         ebp  
0033102C  ret

You are probably asking “what does clang do in x86?” well it turns out it does the same thing as in x64, it doesn’t fall back to if/else if/else if like MVSC does (proof: add -m32 in goldbolt. https://godbolt.org/g/khnrtO). One point to clang!

What if the numbers are not so continuous though? It turns out it can actually switch to using a binary search! (https://godbolt.org/g/iBkqja)

MSVC on the other hand just does a bunch of if/else if/else if tests, in both x86 release and x64 release.

#include <initializer_list>

template<typename U, typename ... T>
bool one_of(U&& u, T && ... t)
{
    bool match = false;
    (void)std::initializer_list<bool>{ (match = match || u == t)... };
    return match;
}

int main(const int argc, const char *[])
{
00007FF6C05A1000  cmp         ecx,1AB42h  
00007FF6C05A1006  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1008  cmp         ecx,40Fh  
00007FF6C05A100E  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1010  cmp         ecx,0B131h  
00007FF6C05A1016  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1018  cmp         ecx,93BBh  
00007FF6C05A101E  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1020  cmp         ecx,121Bh  
00007FF6C05A1026  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1028  cmp         ecx,0EE9h  
00007FF6C05A102E  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1030  cmp         ecx,0E1Fh  
00007FF6C05A1036  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1038  cmp         ecx,995h  
00007FF6C05A103E  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1040  cmp         ecx,5FEh  
00007FF6C05A1046  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1048  cmp         ecx,5BFh  
00007FF6C05A104E  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1050  cmp         ecx,5  
00007FF6C05A1053  je          main+63h (07FF6C05A1063h)  
00007FF6C05A1055  cmp         ecx,0FFFEh  
00007FF6C05A105B  je          main+63h (07FF6C05A1063h)  
    return one_of(argc, 1471, 2453, 3817, 45361, 37819, 109378, 1534, 4635, 1039, 3615, 5, 65534);
00007FF6C05A105D  xor         al,al  
00007FF6C05A105F  movzx       eax,al  
}
00007FF6C05A1062  ret  
    return one_of(argc, 1471, 2453, 3817, 45361, 37819, 109378, 1534, 4635, 1039, 3615, 5, 65534);
00007FF6C05A1063  mov         al,1  
00007FF6C05A1065  movzx       eax,al  
}
00007FF6C05A1068  ret

Closing

This is just one data point about how clang is better than MSVC, but it is a data point. I’m betting there are more if we looked for them.

This makes me wonder how switch statements do in clang vs msvc, and also makes me wonder if clang ever uses jump tables or more advanced data structures in either switch statements, or other code that does comparison against a potentially large number of compile time constants. Those thoughts are driven by this things seen in this article: Something You May Not Know About the Switch Statement in C/C++

The examples above used C++14 level C++ to implement “one_of”. If you can use C++17 level C++, you can also implement it this way, which also does the binary search (Also written by Jason Turner):
https://godbolt.org/g/RZgjRQ

PS wouldn’t it be nice if godbolt supported MSVC so we could do this sort of analysis on MSVC code? It’s in the works, but unsure when it’ll be available. Apparently licensing isn’t the issue, so lets hope it comes sooner rather than later! If you want it, maybe ping @mattgodbolt and let him know how much you want that functionality (:

Have any other clang vs MSVC info? If so, I’d love to hear about it!