## Plastic Bag Ban – Semi Reusable Bag Kiosks a Better Solution?

I an in favor of people generating less trash, and have been amazed that where I live (southern California), people have taken a “plastic bag ban” so well in stride. It felt like one of those things where we couldn’t live without it, but it turns out we can quite easily. Maybe there’s a lesson there, but that’s not the point of this post so I’ll get back to it (;

Where I live, you can BUY plastic bags for use from the grocery store for about 10 cents each, which is cost prohibitive enough that people tend to bring their own bags.

Now let me tell you, I’m no dummy. I am pretty sure those re-usable plastic bags everyone has gotten use far more resources to make and create more pollution in the process so seem like they will take some usage before they hit a break even point compared to the disposable bags. Google seems to give mixed results, saying that some bags are better than others: Google: are reusable bags better for the environment.

I think that the sentiment of what’s going on is good for sure though, and I’m hoping it’s a net positive (?), and I think that there is some winning to be had here environmentally – if done right.

However, let’s put the environmental concerns on the back burner for a second.

Grocery store baggers now get handed all sorts of differently shaped bags of various capacity, and folks often want their stuff bagged specific ways to make sure it all fits in however many they brought. It’s also very common for people to forget their bags at home, leave them in the car, etc. This makes things a bit awkward and definitely not as fast and smooth as it used to be with disposable bags.

My idea to address these issues is this:

• When you check out from the store, there are semi-reusable bags in bails similar to how disposable bags are in bails now. You pay to use the bags, but the cost is mostly a refundable deposit.
• You take the bags home, unload, etc like normal, but ideally you do not throw the bags away.
• The next time you come to the store, you feed your empty bags into a kiosk at the front of the store (similar to the coin to cash machines many currently have) and it prints you out a voucher saying how many bags you returned.
• The machine bails the bags back up for use by the store employees at the register, or perhaps some off site service does this and washes the bags, gets rid of damaged ones etc.
• You shop as normal, but when you check out, you give them your bag voucher. If the voucher is for fewer bags than you need for this trip, you can pay for some extra ones. If your voucher is for more than you need this time, you get the deposit refunded on the ones you don’t need.

The idea here is that at the end of the day…

• The baggers have a uniform type of bag to work with, which makes their job easier and allows them to bag more quickly (like the old days!).
• The bags are ideally as environmentally friendly made as possible (heck, they could be made from corn, hemp, burlap, whatever).
• It isn’t a big deal if you forgot your bags at home or in the car – you pay a little extra to use new bags, but when you return them, you get most of that back.

There are some obvious issues to work through, including:

• Getting those kiosks into the stores, and ideally having all stores use compatible bags.
• Dealing with damaged or dirty bags. It would be nice if the machine or whatever process bails them is able to detect and segregate them somehow, and ideally have some washing / minimal recycling process off site to make new ones (like pulp and re-form?)
• The plastic bags are a profit center for grocery stores now. They would need sufficient incentive, or pressure from customers to make it happen.

So, that’s my idea. Environmentally friendly semi-reusable bags – but with the convenience we all have come to enjoy from disposable bags. The best of both worlds.

As a video game programmer this is far outside of my interest and ability, so please take this business idea if you have the desire and the means. Let’s make it happen!

## Neural Network Recipe: Recognize Handwritten Digits With 95% Accuracy

This post is a recipe for making a neural network which is able to recognize hand written numeric digits (0-9) with 95% accuracy.

The intent is that you can use this recipe (and included simple C++ code, and interactive web demo!) as a starting point for some hands on experimentation.

A recent post of mine talks about all the things used in this recipe so give it a read if you want more info about anything: How to Train Neural Networks With Backpropagation.

This recipe is also taken straight from this amazing website (but coded from scratch in C++ by myself), where it’s implemented in python: Using neural nets to recognize handwritten digits.

# Recipe

The neural network takes as input 28×28 greyscale images, so there will be 784 input neurons.

There is one hidden layer that has 30 neurons.

The final layer is the output layer which has 10 neurons.

The output neuron with the highest activation is the digit that was recognized. For instance if output neuron 0 had the highest activation, the network detected a 0. If output neuron 2 was highest, the network detected a 2.

The neurons use the sigmoid activation function, and the cost function used is half mean squared error.

Training uses a learning rate of 3.0 and the training data is processed by the network 30 times (aka 30 training epochs), using a minibatch size of 10.

A minibatch size of 10 just means that we calculate the gradient for 10 training samples at a time and adjust the weights and biases using that gradient. We do that for the entire (shuffled) 60,000 training items and call that a single epoch. 30 epochs mean we do this full process 30 times.

There are 60,000 items in the training data, mapping 28×28 greyscale images to what digit 0-9 they actually represent.

Besides the 60,000 training data items, there are also 10,000 separate items that are the test data. These test data items are items never seen by the network during training and are just used as a way to see how well the network has learned about the problem in general, versus learning about the specific training data items.

The test and training data is the MNIST data set. I have a link to zip file I made with the data in it below, but this is where I got the data from: The MNIST database of handwritten digits.

That is the entire recipe!

## Results

The 30 training epochs took 1 minute 22 seconds on my machine in release x64 (with REPORT_ERROR_WHILE_TRAINING() set to 0 to speed things up), but the code could be made to run faster by using SIMD, putting it on the GPU, getting multithreading involved or other things.

Below is a graph of the accuracy during the training epochs.

Notice that most learning happened very early on and then only slowly improved from there. This is due to our neuron activation functions and also our cost function. There are better choices for both, but this is also an ongoing area of research to improve in neural networks.

The end result of my training run is 95.32% accuracy but you may get slightly higher or lower due to random initialization of weights and biases. That sounds pretty high, but if you were actually using this, 4 or 5 numbers wrong out of 100 is a pretty big deal! The record for MNIST is 99.77% accuracy using “a committee of convolutional networks” where they distorted the input data during training to make it learn in a more generalized way (described as “committee of 35 conv. net, 1-20-P-40-P-150-10 [elastic distortions]”).

A better cost function would probably be the cross entropy cost function, a better activation function than sigmoid would probably be an ELU (Exponential Linear Unit). A soft max layer could be used instead of just taking the maximum output neuron as the answer. The weights could be initialized to smarter values. We could also use a convolutional layer to help let the network learn features in a way that didn’t also tie the features to specific locations in the images.

Many of these things are described in good detail at http://neuralnetworksanddeeplearning.com/, particularly in chapter 3 where they make a python implementation of a convolutional neural network which performs better than this one. I highly recommend checking that website out!

# HTML5 Demo

You can play with a network created with this recipe here: Recognize Handwritten Digit 95% Accuracy

Here is an example of it correctly detecting that I drew a 4.

The demo works “pretty well” but it does have a little less than 95% accuracy.

The reason for this though is that it isn’t comparing apples to apples.

A handwritten digit isn’t quite the same as a digit drawn with a mouse. Check out the image below to see 100 of the training images and see what i mean.

The demo finds the bounding box of the drawn image and rescales that bounding box to a 20×20 image, preserving the aspect ratio. It then puts that into a 28×28 image, using the center of mass of the pixels to center the smaller image in the larger one. This is how the MNIST data was generated, so makes the demo more accurate, but it also has the nice side effect of making it so you can draw a number of any size, in any part of the box, and it will treat it the same as if you drew it at a difference size, or in a different part of the box.

The code that goes with this post outputs the weights, biases and network structure in a json format that is very easy to drop into the html5 demo. This way, if you want to try different things in the network, it should be fairly low effort to adjust the demo to try your adjustments there as well.

Lastly, it might be interesting to get the derivatives of the inputs and play around with the input you gave it. Some experiments I can think of:

1. When it misclassifies what number you drew, have it adjust what you drew (the input) to be more like what the network would expect to see for that digit. This could help show why it misclassified your number.
2. Start with a well classified number and make it morph into something recognized by the network as a different number.
3. Start with a random static (noise) image and adjust it until the network recognizes it as a digit. It would be interesting to see if it looked anything like a number, or if it was still just static.

# Source Code

The source code and mnist data is on github at MNIST1, but is also included below for your convenience.

If grabbing the source code from below instead of github, you will need to extract this zip file into the working directory of the program as well. It contains the test data used for training the network.
mnist.zip

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <random>
#include <array>
#include <vector>
#include <algorithm>
#include <chrono>

typedef uint32_t uint32;
typedef uint16_t uint16;
typedef uint8_t uint8;

// Set to 1 to have it show error after each training and also writes it to an Error.csv file.
// Slows down the process a bit (+~50% time on my machine)
#define REPORT_ERROR_WHILE_TRAINING() 1

const size_t c_numInputNeurons = 784;
const size_t c_numHiddenNeurons = 30;  // NOTE: setting this to 100 hidden neurons can give better results, but also can be worse other times.
const size_t c_numOutputNeurons = 10;

const size_t c_trainingEpochs = 30;
const size_t c_miniBatchSize = 10;
const float c_learningRate = 3.0f;

// ============================================================================================
//                                     SBlockTimer
// ============================================================================================
// times a block of code
struct SBlockTimer
{
SBlockTimer (const char* label)
{
m_start = std::chrono::high_resolution_clock::now();
m_label = label;
}

~SBlockTimer ()
{
std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
printf("%s%0.2f seconds\n", m_label, seconds.count());
}

std::chrono::high_resolution_clock::time_point m_start;
const char* m_label;
};

// ============================================================================================
// ============================================================================================

inline uint32 EndianSwap (uint32 a)
{
return (a<<24) | ((a<<8) & 0x00ff0000) |
((a>>8) & 0x0000ff00) | (a>>24);
}

// MNIST data and file format description is from http://yann.lecun.com/exdb/mnist/
class CMNISTData
{
public:
CMNISTData ()
{
m_labelData = nullptr;
m_imageData = nullptr;

m_imageCount = 0;
m_labels = nullptr;
m_pixels = nullptr;
}

{
// set the expected image count
m_imageCount = training ? 60000 : 10000;

const char* labelsFileName = training ? "train-labels.idx1-ubyte" : "t10k-labels.idx1-ubyte";
FILE* file = fopen(labelsFileName,"rb");
if (!file)
{
printf("could not open %s for reading.\n", labelsFileName);
return false;
}
fseek(file, 0, SEEK_END);
long fileSize = ftell(file);
fseek(file, 0, SEEK_SET);
m_labelData = new uint8[fileSize];
fclose(file);

const char* imagesFileName = training ? "train-images.idx3-ubyte" : "t10k-images.idx3-ubyte";
file = fopen(imagesFileName, "rb");
if (!file)
{
printf("could not open %s for reading.\n", imagesFileName);
return false;
}
fseek(file, 0, SEEK_END);
fileSize = ftell(file);
fseek(file, 0, SEEK_SET);
m_imageData = new uint8[fileSize];
fclose(file);

// endian swap label file if needed, just first two uint32's.  The rest is uint8's.
uint32* data = (uint32*)m_labelData;
if (data[0] == 0x01080000)
{
data[0] = EndianSwap(data[0]);
data[1] = EndianSwap(data[1]);
}

// verify that the label file has the right header
if (data[0] != 2049 || data[1] != m_imageCount)
{
return false;
}
m_labels = (uint8*)&(data[2]);

// endian swap the image file if needed, just first 4 uint32's. The rest is uint8's.
data = (uint32*)m_imageData;
if (data[0] == 0x03080000)
{
data[0] = EndianSwap(data[0]);
data[1] = EndianSwap(data[1]);
data[2] = EndianSwap(data[2]);
data[3] = EndianSwap(data[3]);
}

// verify that the image file has the right header
if (data[0] != 2051 || data[1] != m_imageCount || data[2] != 28 || data[3] != 28)
{
return false;
}
m_pixels = (uint8*)&(data[4]);

// convert the pixels from uint8 to float
m_pixelsFloat.resize(28 * 28 * m_imageCount);
for (size_t i = 0; i < 28 * 28 * m_imageCount; ++i)
m_pixelsFloat[i] = float(m_pixels[i]) / 255.0f;

// success!
return true;
}

~CMNISTData ()
{
delete[] m_labelData;
delete[] m_imageData;
}

size_t NumImages () const { return m_imageCount; }

const float* GetImage (size_t index, uint8& label) const
{
label = m_labels[index];
return &m_pixelsFloat[index * 28 * 28];
}

private:
void* m_labelData;
void* m_imageData;

size_t m_imageCount;
uint8* m_labels;
uint8* m_pixels;

std::vector<float> m_pixelsFloat;
};

// ============================================================================================
//                                    NEURAL NETWORK
// ============================================================================================

template <size_t INPUTS, size_t HIDDEN_NEURONS, size_t OUTPUT_NEURONS>
class CNeuralNetwork
{
public:
CNeuralNetwork ()
{
// initialize weights and biases to a gaussian distribution random number with mean 0, stddev 1.0
std::random_device rd;
std::mt19937 e2(rd());
std::normal_distribution<float> dist(0, 1);

for (float& f : m_hiddenLayerBiases)
f = dist(e2);

for (float& f : m_outputLayerBiases)
f = dist(e2);

for (float& f : m_hiddenLayerWeights)
f = dist(e2);

for (float& f : m_outputLayerWeights)
f = dist(e2);
}

void Train (const CMNISTData& trainingData, size_t miniBatchSize, float learningRate)
{
// shuffle the order of the training data for our mini batches
if (m_trainingOrder.size() != trainingData.NumImages())
{
m_trainingOrder.resize(trainingData.NumImages());
size_t index = 0;
for (size_t& v : m_trainingOrder)
{
v = index;
++index;
}
}
static std::random_device rd;
static std::mt19937 e2(rd());
std::shuffle(m_trainingOrder.begin(), m_trainingOrder.end(), e2);

// process all minibatches until we are out of training examples
size_t trainingIndex = 0;
while (trainingIndex < trainingData.NumImages())
{
// Clear out minibatch derivatives.  We sum them up and then divide at the end of the minimatch
std::fill(m_miniBatchHiddenLayerBiasesDeltaCost.begin(), m_miniBatchHiddenLayerBiasesDeltaCost.end(), 0.0f);
std::fill(m_miniBatchOutputLayerBiasesDeltaCost.begin(), m_miniBatchOutputLayerBiasesDeltaCost.end(), 0.0f);
std::fill(m_miniBatchHiddenLayerWeightsDeltaCost.begin(), m_miniBatchHiddenLayerWeightsDeltaCost.end(), 0.0f);
std::fill(m_miniBatchOutputLayerWeightsDeltaCost.begin(), m_miniBatchOutputLayerWeightsDeltaCost.end(), 0.0f);

// process the minibatch
size_t miniBatchIndex = 0;
while (miniBatchIndex < miniBatchSize && trainingIndex < trainingData.NumImages())
{
// get the training item
uint8 imageLabel = 0;
const float* pixels = trainingData.GetImage(m_trainingOrder[trainingIndex], imageLabel);

// run the forward pass of the network
uint8 labelDetected = ForwardPass(pixels, imageLabel);

// run the backward pass to get derivatives of the cost function
BackwardPass(pixels, imageLabel);

// add the current derivatives into the minibatch derivative arrays so we can average them at the end of the minibatch via division.
for (size_t i = 0; i < m_hiddenLayerBiasesDeltaCost.size(); ++i)
m_miniBatchHiddenLayerBiasesDeltaCost[i] += m_hiddenLayerBiasesDeltaCost[i];
for (size_t i = 0; i < m_outputLayerBiasesDeltaCost.size(); ++i)
m_miniBatchOutputLayerBiasesDeltaCost[i] += m_outputLayerBiasesDeltaCost[i];
for (size_t i = 0; i < m_hiddenLayerWeightsDeltaCost.size(); ++i)
m_miniBatchHiddenLayerWeightsDeltaCost[i] += m_hiddenLayerWeightsDeltaCost[i];
for (size_t i = 0; i < m_outputLayerWeightsDeltaCost.size(); ++i)
m_miniBatchOutputLayerWeightsDeltaCost[i] += m_outputLayerWeightsDeltaCost[i];

// note that we've added another item to the minibatch, and that we've consumed another training example
++trainingIndex;
++miniBatchIndex;
}

// divide minibatch derivatives by how many items were in the minibatch, to get the average of the derivatives.
// NOTE: instead of doing this explicitly like in the commented code below, we'll do it implicitly
// by dividing the learning rate by miniBatchIndex.
/*
for (float& f : m_miniBatchHiddenLayerBiasesDeltaCost)
f /= float(miniBatchIndex);
for (float& f : m_miniBatchOutputLayerBiasesDeltaCost)
f /= float(miniBatchIndex);
for (float& f : m_miniBatchHiddenLayerWeightsDeltaCost)
f /= float(miniBatchIndex);
for (float& f : m_miniBatchOutputLayerWeightsDeltaCost)
f /= float(miniBatchIndex);
*/

float miniBatchLearningRate = learningRate / float(miniBatchIndex);

// apply training to biases and weights
for (size_t i = 0; i < m_hiddenLayerBiases.size(); ++i)
m_hiddenLayerBiases[i] -= m_miniBatchHiddenLayerBiasesDeltaCost[i] * miniBatchLearningRate;
for (size_t i = 0; i < m_outputLayerBiases.size(); ++i)
m_outputLayerBiases[i] -= m_miniBatchOutputLayerBiasesDeltaCost[i] * miniBatchLearningRate;
for (size_t i = 0; i < m_hiddenLayerWeights.size(); ++i)
m_hiddenLayerWeights[i] -= m_miniBatchHiddenLayerWeightsDeltaCost[i] * miniBatchLearningRate;
for (size_t i = 0; i < m_outputLayerWeights.size(); ++i)
m_outputLayerWeights[i] -= m_miniBatchOutputLayerWeightsDeltaCost[i] * miniBatchLearningRate;
}
}

// This function evaluates the network for the given input pixels and returns the label it thinks it is from 0-9
uint8 ForwardPass (const float* pixels, uint8 correctLabel)
{
// first do hidden layer
for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
{
float Z = m_hiddenLayerBiases[neuronIndex];

for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
Z += pixels[inputIndex] * m_hiddenLayerWeights[HiddenLayerWeightIndex(inputIndex, neuronIndex)];

m_hiddenLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
}

// then do output layer
for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
float Z = m_outputLayerBiases[neuronIndex];

for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
Z += m_hiddenLayerOutputs[inputIndex] * m_outputLayerWeights[OutputLayerWeightIndex(inputIndex, neuronIndex)];

m_outputLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
}

// calculate error.
// this is the magnitude of the vector that is Desired - Actual.
// Commenting out because it's not needed.
/*
{
error = 0.0f;
for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;
float diff = (desiredOutput - m_outputLayerOutputs[neuronIndex]);
error += diff * diff;
}
error = std::sqrt(error);
}
*/

// find the maximum value of the output layer and return that index as the label
float maxOutput = m_outputLayerOutputs[0];
uint8 maxLabel = 0;
for (uint8 neuronIndex = 1; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
if (m_outputLayerOutputs[neuronIndex] > maxOutput)
{
maxOutput = m_outputLayerOutputs[neuronIndex];
maxLabel = neuronIndex;
}
}
return maxLabel;
}

// Functions to get weights/bias values. Used to make the JSON file.
const std::array<float, HIDDEN_NEURONS>& GetHiddenLayerBiases () const { return m_hiddenLayerBiases; }
const std::array<float, OUTPUT_NEURONS>& GetOutputLayerBiases () const { return m_outputLayerBiases; }
const std::array<float, INPUTS * HIDDEN_NEURONS>& GetHiddenLayerWeights () const { return m_hiddenLayerWeights; }
const std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>& GetOutputLayerWeights () const { return m_outputLayerWeights; }

private:

static size_t HiddenLayerWeightIndex (size_t inputIndex, size_t hiddenLayerNeuronIndex)
{
return hiddenLayerNeuronIndex * INPUTS + inputIndex;
}

static size_t OutputLayerWeightIndex (size_t hiddenLayerNeuronIndex, size_t outputLayerNeuronIndex)
{
return outputLayerNeuronIndex * HIDDEN_NEURONS + hiddenLayerNeuronIndex;
}

// this function uses the neuron output values from the forward pass to backpropagate the error
// of the network to calculate the gradient needed for training.  It figures out what the error
// is by comparing the label it came up with to the label it should have come up with (correctLabel).
void BackwardPass (const float* pixels, uint8 correctLabel)
{
// since we are going backwards, do the output layer first
for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
// calculate deltaCost/deltaBias for each output neuron.
// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
//
// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
//
// deltaCost/deltaO = O - desiredOutput
// deltaO/deltaZ = O * (1 - O)
//
float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;

float deltaCost_deltaO = m_outputLayerOutputs[neuronIndex] - desiredOutput;
float deltaO_deltaZ = m_outputLayerOutputs[neuronIndex] * (1.0f - m_outputLayerOutputs[neuronIndex]);

m_outputLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

// calculate deltaCost/deltaWeight for each weight going into the neuron
//
// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
// deltaCost/deltaWeight = deltaCost/deltaBias * input
//
for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
m_outputLayerWeightsDeltaCost[OutputLayerWeightIndex(inputIndex, neuronIndex)] = m_outputLayerBiasesDeltaCost[neuronIndex] * m_hiddenLayerOutputs[inputIndex];
}

// then do the hidden layer
for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
{
// calculate deltaCost/deltaBias for each hidden neuron.
// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
//
// deltaCost/deltaO =
//   Sum for each output of this neuron:
//
// deltaTargetZ/deltaSourceO is the value of the weight connecting the source and target neuron.
//
// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
// deltaO/deltaZ = O * (1 - O)
//
float deltaCost_deltaO = 0.0f;
for (size_t destinationNeuronIndex = 0; destinationNeuronIndex < OUTPUT_NEURONS; ++destinationNeuronIndex)
deltaCost_deltaO += m_outputLayerBiasesDeltaCost[destinationNeuronIndex] * m_outputLayerWeights[OutputLayerWeightIndex(neuronIndex, destinationNeuronIndex)];
float deltaO_deltaZ = m_hiddenLayerOutputs[neuronIndex] * (1.0f - m_hiddenLayerOutputs[neuronIndex]);
m_hiddenLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

// calculate deltaCost/deltaWeight for each weight going into the neuron
//
// deltaCost/deltaWeight = deltaCost/deltaZ * deltaCost/deltaWeight
// deltaCost/deltaWeight = deltaCost/deltaBias * input
//
for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
m_hiddenLayerWeightsDeltaCost[HiddenLayerWeightIndex(inputIndex, neuronIndex)] = m_hiddenLayerBiasesDeltaCost[neuronIndex] * pixels[inputIndex];
}
}

private:

// biases and weights
std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiases;
std::array<float, OUTPUT_NEURONS>					m_outputLayerBiases;

std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeights;
std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeights;

// neuron activation values aka "O" values
std::array<float, HIDDEN_NEURONS>					m_hiddenLayerOutputs;
std::array<float, OUTPUT_NEURONS>					m_outputLayerOutputs;

// derivatives of biases and weights for a single training example
std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiasesDeltaCost;
std::array<float, OUTPUT_NEURONS>					m_outputLayerBiasesDeltaCost;

std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeightsDeltaCost;
std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeightsDeltaCost;

// derivatives of biases and weights for the minibatch. Average of all items in minibatch.
std::array<float, HIDDEN_NEURONS>					m_miniBatchHiddenLayerBiasesDeltaCost;
std::array<float, OUTPUT_NEURONS>					m_miniBatchOutputLayerBiasesDeltaCost;

std::array<float, INPUTS * HIDDEN_NEURONS>			m_miniBatchHiddenLayerWeightsDeltaCost;
std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_miniBatchOutputLayerWeightsDeltaCost;

// used for minibatch generation
std::vector<size_t>									m_trainingOrder;
};

// ============================================================================================
//                                   DRIVER PROGRAM
// ============================================================================================

// training and test data
CMNISTData g_trainingData;
CMNISTData g_testData;

// neural network
CNeuralNetwork<c_numInputNeurons, c_numHiddenNeurons, c_numOutputNeurons> g_neuralNetwork;

float GetDataAccuracy (const CMNISTData& data)
{
size_t correctItems = 0;
for (size_t i = 0, c = data.NumImages(); i < c; ++i)
{
uint8 label;
const float* pixels = data.GetImage(i, label);
uint8 detectedLabel = g_neuralNetwork.ForwardPass(pixels, label);

if (detectedLabel == label)
++correctItems;
}
return float(correctItems) / float(data.NumImages());
}

void ShowImage (const CMNISTData& data, size_t imageIndex)
{
uint8 label = 0;
const float* pixels = data.GetImage(imageIndex, label);
printf("showing a %i\n", label);
for (int iy = 0; iy < 28; ++iy)
{
for (int ix = 0; ix < 28; ++ix)
{
if (*pixels < 0.125)
printf(" ");
else
printf("+");
++pixels;
}
printf("\n");
}
}

int main (int argc, char** argv)
{
// load the MNIST data if we can
{
printf("Could not load mnist data, aborting!\n");
system("pause");
return 1;
}

#if REPORT_ERROR_WHILE_TRAINING()
FILE *file = fopen("Error.csv","w+t");
if (!file)
{
printf("Could not open Error.csv for writing, aborting!\n");
system("pause");
return 2;
}
fprintf(file, "\"Training Data Accuracy\",\"Testing Data Accuracy\"\n");
#endif

{
SBlockTimer timer("Training Time:  ");

// train the network, reporting error before each training
for (size_t epoch = 0; epoch < c_trainingEpochs; ++epoch)
{
#if REPORT_ERROR_WHILE_TRAINING()
float accuracyTraining = GetDataAccuracy(g_trainingData);
float accuracyTest = GetDataAccuracy(g_testData);
printf("Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
printf("Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);
fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
#endif

printf("Training epoch %zu / %zu...\n", epoch+1, c_trainingEpochs);
g_neuralNetwork.Train(g_trainingData, c_miniBatchSize, c_learningRate);
printf("\n");
}
}

// report final error
float accuracyTraining = GetDataAccuracy(g_trainingData);
float accuracyTest = GetDataAccuracy(g_testData);
printf("\nFinal Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
printf("Final Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);

#if REPORT_ERROR_WHILE_TRAINING()
fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
fclose(file);
#endif

// Write out the final weights and biases as JSON for use in the web demo
{
FILE* file = fopen("WeightsBiasesJSON.txt", "w+t");
fprintf(file, "{\n");

// network structure
fprintf(file, "  \"InputNeurons\":%zu,\n", c_numInputNeurons);
fprintf(file, "  \"HiddenNeurons\":%zu,\n", c_numHiddenNeurons);
fprintf(file, "  \"OutputNeurons\":%zu,\n", c_numOutputNeurons);

// HiddenBiases
auto hiddenBiases = g_neuralNetwork.GetHiddenLayerBiases();
fprintf(file, "  \"HiddenBiases\" : [\n");
for (size_t i = 0; i < hiddenBiases.size(); ++i)
{
fprintf(file, "    %f", hiddenBiases[i]);
if (i < hiddenBiases.size() -1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ],\n");

// HiddenWeights
auto hiddenWeights = g_neuralNetwork.GetHiddenLayerWeights();
fprintf(file, "  \"HiddenWeights\" : [\n");
for (size_t i = 0; i < hiddenWeights.size(); ++i)
{
fprintf(file, "    %f", hiddenWeights[i]);
if (i < hiddenWeights.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ],\n");

// OutputBiases
auto outputBiases = g_neuralNetwork.GetOutputLayerBiases();
fprintf(file, "  \"OutputBiases\" : [\n");
for (size_t i = 0; i < outputBiases.size(); ++i)
{
fprintf(file, "    %f", outputBiases[i]);
if (i < outputBiases.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ],\n");

// OutputWeights
auto outputWeights = g_neuralNetwork.GetOutputLayerWeights();
fprintf(file, "  \"OutputWeights\" : [\n");
for (size_t i = 0; i < outputWeights.size(); ++i)
{
fprintf(file, "    %f", outputWeights[i]);
if (i < outputWeights.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ]\n");

// done
fprintf(file, "}\n");
fclose(file);
}

// You can use the code like the below to visualize an image if you want to.
//ShowImage(g_testData, 0);

system("pause");
return 0;
}


Thanks for reading, and if you have any questions, comments, or just want to chat, hit me up in the comments below, or on twitter at @Atrix256.

## Neural Network Gradients: Backpropagation, Dual Numbers, Finite Differences

In the post How to Train Neural Networks With Backpropagation I said that you could also calculate the gradient of a neural network by using dual numbers or finite differences.

By special request, this is that post!

If you want an explanation of dual numbers, check out these posts:

If you want an explanation of finite differences, check out this post:
Finite Differences

Since the fundamentals are explained in the links above, we’ll go straight to the code.

We’ll be getting the gradient (learning values) for the network in example 4 in the backpropagation post:

Note that I am using “central differences” for the gradient, but it would be more efficient to do a forward or backward difference, at the cost of some accuracy. I’m using an epsilon of 0.001.

I didn’t compare the running times of each method as my code is meant to be readable, not fast, and the code isn’t doing enough work to make a meaningful performance test IMO. If you did do a speed test, the finite differences should be by far the slowest, backpropagation should be the fastest, and dual numbers are probably going to be closer to backpropagation than to finite differences.

The output of the program is below. Both backpropagation and dual numbers get the exact derivatives (within the tolerances of floating point math of course!) because they use the chain rule, whereas finite differences is a numerical approximation. This shows up in the fact that backpropagation and dual numbers agree for all values, where finite differences has some small error in the derivatives.

And here is the code:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>

#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

//----------------------------------------------------------------------
// Dual Number Class - CDualNumber
//----------------------------------------------------------------------

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

// constructor to make a constant
CDualNumber (float f = 0.0f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
CDualNumber (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// Set a constant value.
void Set (float f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// Set a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
void Set (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// storage for real and dual values
float                           m_real;
std::array<float, NUMVARIABLES> m_dual;
};

//----------------------------------------------------------------------
// Dual Number Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real + b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real - b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real * b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real / b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
return ret;
}

// NOTE: the "special functions" below all just use the chain rule, which you can also use to add more functions

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
float sqrtReal = sqrt(a.m_real);
ret.m_real = sqrtReal;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = pow(a.m_real, y);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> exp (const CDualNumber<NUMVARIABLES>& a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = exp(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * exp(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = sin(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = cos(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Driver Program
//----------------------------------------------------------------------

enum EWeightsBiases
{
e_weight0 = 0,
e_weight1,
e_weight2,
e_weight3,
e_weight4,
e_weight5,
e_weight6,
e_weight7,
e_bias0,
e_bias1,
e_bias2,
e_bias3,

e_count
};

// our dual number needs a "dual" for every value we want a derivative for: aka every weight and bias
typedef CDualNumber<EWeightsBiases::e_count> TDualNumber;

// templated so it can work for both the dual numbers, as well as the float finite differences
template <typename TBaseType>
void ForwardPass (const std::array<TBaseType, 2>& input, const std::array<TBaseType, 2>& desiredOutput, const std::array<TBaseType, EWeightsBiases::e_count>& weightsBiases, TBaseType& cost)
{
// calculate hidden layer neuron activations
TBaseType Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
TBaseType O0 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z0 * TBaseType(-1.0f)));

TBaseType Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
TBaseType O1 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z1 * TBaseType(-1.0f)));

// calculate output layer neuron activations
TBaseType Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
TBaseType O2 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z2 * TBaseType(-1.0f)));

TBaseType Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
TBaseType O3 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z3 * TBaseType(-1.0f)));

// calculate the cost: 0.5 * ||target-actual||^2
// aka cost = half (error squared)
TBaseType diff1 = TBaseType(desiredOutput[0]) - O2;
TBaseType diff2 = TBaseType(desiredOutput[1]) - O3;
cost = TBaseType(0.5f) * (diff1*diff1 + diff2*diff2);
}

// backpropagation
void ForwardPassAndBackpropagation (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, EWeightsBiases::e_count>& weightsBiases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, EWeightsBiases::e_count>& deltaWeightsBiases
) {
// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// calculate Z2 and O2 for neuron2
float Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
float O2 = 1.0f / (1.0f + std::exp(-Z2));

// calculate Z3 and O3 for neuron3
float Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
float O3 = 1.0f / (1.0f + std::exp(-Z3));

// the actual output of the network is the activation of the output layer neurons
actualOutput[0] = O2;
actualOutput[1] = O3;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 2 -----

// calculate how much a change in neuron 2 activation affects the cost function
// deltaCost/deltaO2 = O2 - target0
float deltaCost_deltaO2 = O2 - desiredOutput[0];

// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
// deltaO2/deltaZ2 = O2 * (1 - O2)
float deltaO2_deltaZ2 = O2 * (1 - O2);

// calculate how much a change in neuron 2 weighted input affects the cost function.
// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
deltaWeightsBiases[e_bias2] = neuron2Error;

// calculate how much a change in weight4 affects the cost function.
// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * O0
// similar thing for weight5
deltaWeightsBiases[e_weight4] = neuron2Error * O0;
deltaWeightsBiases[e_weight5] = neuron2Error * O1;

//----- Neuron 3 -----

// calculate how much a change in neuron 3 activation affects the cost function
// deltaCost/deltaO3 = O3 - target1
float deltaCost_deltaO3 = O3 - desiredOutput[1];

// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
// deltaO3/deltaZ3 = O3 * (1 - O3)
float deltaO3_deltaZ3 = O3 * (1 - O3);

// calculate how much a change in neuron 3 weighted input affects the cost function.
// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
deltaWeightsBiases[e_bias3] = neuron3Error;

// calculate how much a change in weight6 affects the cost function.
// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * O0
// similar thing for weight7
deltaWeightsBiases[e_weight6] = neuron3Error * O0;
deltaWeightsBiases[e_weight7] = neuron3Error * O1;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
float deltaCost_deltaO0 = neuron2Error * weightsBiases[e_weight4] + neuron3Error * weightsBiases[e_weight6];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaWeightsBiases[e_bias0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaWeightsBiases[e_weight0] = neuron0Error * input[0];
deltaWeightsBiases[e_weight1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
float deltaCost_deltaO1 = neuron2Error * weightsBiases[e_weight5] + neuron3Error * weightsBiases[e_weight7];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO1/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaWeightsBiases[e_bias1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaWeightsBiases[e_weight2] = neuron1Error * input[0];
deltaWeightsBiases[e_weight3] = neuron1Error * input[1];
}

int main (int argc, char **argv)
{

// weights and biases, inputs and desired output
const std::array<float, EWeightsBiases::e_count> weightsBiases =
{
0.15f, // e_weight0
0.2f,  // e_weight1
0.25f, // e_weight2
0.3f,  // e_weight3
0.4f,  // e_weight4
0.45f, // e_weight5
0.5f,  // e_weight6
0.55f, // e_weight7
0.35f, // e_bias0
0.35f, // e_bias1
0.6f,  // e_bias2
0.6f   // e_bias3
};

const std::array<float, 2> inputs =
{
0.05f,
0.1f
};

std::array<float, 2> desiredOutput = {
0.01f,
0.99f
};

// =====================================================
// ===== FINITE DIFFERENCES CALCULATED DERIVATIVES =====
// =====================================================

std::array<float, EWeightsBiases::e_count> weightsBiasesFloat;
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
weightsBiasesFloat[i] = weightsBiases[i];

// use central differences to approximate the gradient
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
{
float costSample1 = 0.0f;
weightsBiasesFloat[i] = weightsBiases[i] - EPSILON;
ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample1);

float costSample2 = 0.0f;
weightsBiasesFloat[i] = weightsBiases[i] + EPSILON;
ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample2);

gradientFiniteDifferences[i] = (costSample2 - costSample1) / (EPSILON * 2.0f);

weightsBiasesFloat[i] = weightsBiases[i];
}

// ==============================================
// ===== DUAL NUMBER CALCULATED DERIVATIVES =====
// ==============================================

// dual number weights and biases
std::array<TDualNumber, EWeightsBiases::e_count> weightsBiasesDual;
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
weightsBiasesDual[i].Set(weightsBiases[i], i);

// dual number inputs and desired output
std::array<TDualNumber, 2> inputsDual;
for (size_t i = 0; i < inputsDual.size(); ++i)
inputsDual[i].Set(inputs[i]);

std::array<TDualNumber, 2> desiredOutputDual;
for (size_t i = 0; i < desiredOutputDual.size(); ++i)
desiredOutputDual[i].Set(desiredOutput[i]);

// dual number derivatives

// ==================================================
// ===== BACKPROPAGATION CALCULATED DERIVATIVES =====
// ==================================================

float error;
float cost;
std::array<float, 2> actualOutput;
ForwardPassAndBackpropagation(inputs, desiredOutput, weightsBiases, error, cost, actualOutput, gradientBackPropagation);

// ==========================
// ===== Report Results =====
// ==========================

printf("Neural Network Gradient\n\nBackpropagation     Dual Numbers (Error)       Finite Differences (Error)\n");
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
{
printf("% 08f,         % 08f (% 08f),     % 08f (% 08f)\n",
);
}
printf("\n");

system("pause");
return 0;
}


## How to Train Neural Networks With Backpropagation

This post is an attempt to demystify backpropagation, which is the most common method for training neural networks. This post is broken into a few main sections:

1. Explanation
2. Working through examples
3. Simple sample C++ source code using only standard includes
4. Links to deeper resources to continue learning

Let’s talk about the basics of neural nets to start out, specifically multi layer perceptrons. This is a common type of neural network, and is the type we will be talking about today. There are other types of neural networks though such as convolutional neural networks, recurrent neural networks, Hopfield networks and more. The good news is that backpropagation applies to most other types of neural networks too, so what you learn here will be applicable to other types of networks.

# Basics of Neural Networks

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron. This network takes 3 floating point values as input, passes them through 4 neurons in a hidden layer and outputs two floating point values. The hidden layer neurons and the output layer neurons do processing of the values they are giving, but the input neurons do not.

To calculate the output value of a single neuron, you multiply every input into that neuron by a weight for that input, sum them up, and add a bias that is set for the neuron. This “weighted input” value is fed into an activation function and the result is the output value of that neuron. Here is a diagram for a single neuron:

The code for calculating the output of a single neuron could look like this:

float weightedInput = bias;

for (int i = 0; i < inputs.size(); ++i)
weightedInput += inputs[i] * weights[i];

float output = Activation(weightedInput);


To evaluate an entire network of neurons, you just repeat this process for all neurons in the network, going from left to right (from input to output).

Neural networks are basically black boxes. We train them to give specific ouputs when we give them specific inputs, but it is often difficult to understand what it is that they’ve learned, or what part of the data they are picking up on.

Training a neural network just means that we adjust the weight and bias values such that when we give specific inputs, we get the desired outputs from the network. Being able to figure out what weights and biases to use can be tricky, especially for networks with lots of layers and lots of neurons per layer. This post talks about how to do just that.

Regarding training, there is a funny story where some people trained a neural network to say whether or not a military tank was in a photograph. It had a very high accuracy rate with the test data they trained it with, but when they used it with new data, it had terrible accuracy. It turns out that the training data was a bit flawed. Pictures of tanks were all taken on a sunny day, and the pictures without tanks were taken on a cloudy day. The network learned how to detect whether a picture was of a sunny day or a cloudy day, not whether there was a tank in the photo or not!

This is one type of pitfall to watch out for when dealing with neural networks – having good training data – but there are many other pitfalls to watch out for too. Architecting and training neural networks is quite literally an art form. If it were painting, this post would be teaching you how to hold a brush and what the primary colors are. There are many, many techniques to learn beyond what is written here to use as tools in your toolbox. The information in this post will allow you to succeed in training neural networks, but there is a lot more to learn to get higher levels of accuracy from your nets!

# Neural Networks Learn Using Gradient Descent

Let’s take a look at a simple neural network where we’ve chosen random values for the weights and the bias:

If given two floating point inputs, we’d calculate the output of the network like this:

$Output = Activation(Input0 * Weight0 + Input1 * Weight1 + Bias)$

Plugging in the specific values for the weights and biases, it looks like this:

$Output = Activation(Input0 * 0.23 + Input1 * -0.1 + 0.3)$

Let’s say that we want this network to output a zero when we give an input of 1,0, and that we don’t care what it outputs otherwise. We’ll plug 1 and 0 in for Input0 and Input1 respectively and see what the output of the network is right now:

$Output = Activation(1* 0.23 + 0 * -0.1 + 0.3) \\ Output = Activation(0.53)$

For the activation function, we are going to use a common one called the sigmoid activation function, which is also sometimes called the logistic activation function. It looks like this:

$\sigma(x) = \frac{1}{1+e^{-x}}$

Without going into too much detail, the reason why sigmoid is so commonly used is because it’s a smoother and differentiable version of the step function.

Applying that activation function to our output neuron, we get this:

$Output = Activation(0.53) \\ Output = \sigma(0.53) \\ Output = 0.6295$

So, we plugged in 1 and 0, but instead of getting a 0 out, we got 0.6295. Our weights and biases are wrong, but how do we correct them?

The secret to correcting our weights and biases is whichever of these terms seem least scary to you: slopes, derivatives, gradients.

If “slope” was the least scary term to you, you probably remember the line formula $y=mx+b$ and that the m value was the “rise over run” or the slope of the line. Well believe it or not, that’s all a derivative is. A derivative is just the slope of a function at a specific point on that function. Even if a function is curved, you can pick a point on the graph and get a slope at that point. The notation for a derivative is $\frac{dy}{dx}$, which literally means “change in y divided by change in x”, or “delta y divided by delta x”, which is literally rise over run.

In the case of a linear function (a line), it has the same derivative over the entire thing, so you can take a step size of any size on the x axis and multiply that step size by $\frac{dy}{dx}$ to figure out how much to add or subtract from y to stay on the line.

In the case of a non linear function, the derivative can change from one point to the next, so this slope is only guaranteed to be accurate for an infinitely small step size. In practice, people just often use “small” step sizes and calling it good enough, which is what we’ll be doing momentarily.

Now that you realize you already knew what a derivative is, we have to talk about partial derivatives. There really isn’t anything very scary about them and they still mean the exact same thing – they are the slope! They are even calculated the exact same way, but they use a fancier looking d in their notation: $\frac{\partial y}{\partial x}$.

The reason partial derivatives even exist is because if you have a function of multiple variables like $z=f(x,y)=x^2+3y+2$, you have two variables that you can take the derivative of. You can calculate $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$. The first value tells you how much the z value changes for a change in x, the second value tells you how much the z value changes for a change in y.

By the way, if you are curious, the partial derivatives for that function above are below. When calculating partial derivatives, any variable that isn’t the one you care about, you just treat as a constant and do normal derivation.

$\frac{\partial z}{\partial x} = 2x\\ \frac{\partial z}{\partial y} = 3\\$

If you put both of those values together into a vector $(\frac{\partial z}{\partial x},\frac{\partial z}{\partial y})$ you have what is called the gradient vector.

The gradient vector has an interesting property, which is that it points in the direction that makes the function output grow the most. Basically, if you think of your function as a surface, it points up the steepest direction of the surface, from the point you evaluated the function at.

We are going to use that property to train our neural network by doing the following:

1. Calculate the gradient of a function that describes the error in our network. This means we will have the partial derivatives of all the weights and biases in the network.
2. Multiply the gradient by a small “learning rate” value, such as 0.05
3. Subtract these scaled derivatives from the weights and biases to decrease the error a small amount.

This technique is called steepest gradient descent (SGD) and when we do the above, our error will decrease by a small amount. The only exception is that if we use too large of a learning rate, it’s possible that we make the error grow, but usually the error will decrease.

We will do the above over and over, until either the error is small enough, or we’ve decided we’ve tried enough iterations that we think the neural network is never going to learn the things we want to teach it. If the network doesn’t learn, it means it needs to be re-architected with a different structure, different numbers of neurons and layers, different activation functions, etc. This is part of the “art” that I mentioned earlier.

Before moving on, there is one last thing to talk about: global minimums vs local minimums.

Imagine that the function describing the error in our network is visualized as bumpy ground. When we initialize our weights and biases to random numbers we are basically just choosing a random location on the ground to start at. From there, we act like a ball, and just roll down hill from wherever we are. We are definitely going to get to the bottom of SOME bump / hole in the ground, but there is absolutely no reason to except that we’ll get to the bottom of the DEEPEST bump / hole.

The problem is that SGD will find a LOCAL minimum – whatever we are closest too – but it might not find the GLOBAL minimum.

In practice, this doesn’t seem to be too large of a problem, at least for people casually using neural nets like you and me, but it is one of the active areas of research in neural networks: how do we do better at finding more global minimums?

You might notice the strange language I’m using where I say we have a function that describes the error, instead of just saying we use the error itself. The function I’m talking about is called the “cost function” and the reason for this is that different ways of describing the error give us different desirable properties.

For instance, a common cost function is to use mean squared error of the actual output compared to the desired output.

For a single training example, you plug the input into the network and calculate the output. You then plug the actual output and the target output into the function below:

$Cost = ||target-output||^2$

In other words, you take the vector of the neuron outputs, subtract it from the actual output that we wanted, calculate the length of the resulting vector and square it. This gives you the squared error.

The reason we use squared error in the cost function is because this way error in either direction is a positive number, so when gradient descent does it’s work, we’ll find the smallest magnitude of error, regardless of whether it’s positive or negative amounts. We could use absolute value, but absolute value isn’t differentiable, while squaring is.

To handle calculating the cost of multiple inputs and outputs, you just take the average of the squared error for each piece of training data. This gives you the mean squared error as the cost function across all inputs. You also average the derivatives to get the combined gradient.

# More on Training

Before we go into backpropagation, I want to re-iterate this point: Neural Networks Learn Using Gradient Descent.

All you need is the gradient vector of the cost function, aka the partial derivatives of all the weights and the biases for the cost.

Backpropagation gets you the gradient vector, but it isn’t the only way to do so!

Another way to do it is to use dual numbers which you can read about on my post about them: Multivariable Dual Numbers & Automatic Differentiation.

Using dual numbers, you would evaluate the output of the network, using dual numbers instead of floating point numbers, and at the end you’d have your gradient vector. It’s not quite as efficient as backpropagation (or so I’ve heard, I haven’t tried it), but if you know how dual numbers work, it’s super easy to implement.

Another way to get the gradient vector is by doing so numerically using finite differences. You can read about numerical derivatives on my post here: Finite Differences

Basically what you would do is if you were trying to calculate the partial derivative of a weight, like $\frac{\partial Cost}{\partial Weight0}$, you would first calculate the cost of the network as usual, then you would add a small value to Weight0 and evaluate the cost again. You subtract the new cost from the old cost, and divide by the small value you added to Weight0. This will give you the partial derivative for that weight value. You’d repeat this for all your weights and biases.

Since realistic neural networks often have MANY MANY weights and biases, calculating the gradient numerically is a REALLY REALLY slow process because of how many times you have to run the network to get cost values with adjusted weights. The only upside is that this method is even easier to implement than dual numbers. You can literally stop reading and go do this right now if you want to 😛

Lastly, there is a way to train neural networks which doesn’t use derivatives or the gradient vector, but instead uses the more brute force-ish method of genetic algorithms.

Using genetic algorithms to train neural networks is a huge topic even to summarize, but basically you create a bunch of random networks, see how they do, and try combining features of networks that did well. You also let some of the “losers” reproduce as well, and add in some random mutation to help stay out of local minimums. Repeat this for many many generations, and you can end up with a well trained network!

Here’s a fun video visualizing neural networks being trained by genetic algorithms: Youtube: Learning using a genetic algorithm on a neural network

# Backpropagation is Just the Chain Rule!

Going back to our talk of dual numbers for a second, dual numbers are useful for what is called “forward mode automatic differentiation”.

Backpropagation actually uses “reverse mode automatic differentiation”, so the two techniques are pretty closely tied, but they are both made possible by what is known as the chain rule.

The chain rule basically says that if you can write a derivative like this:

$dy/dx$

That you can also write it like this:

$dy/du*du/dx$

That might look weird or confusing, but since we know that derivatives are actual values, aka actual ratios, aka actual FRACTIONS, let’s think back to fractions for a moment.

$3/2 = 1.5$

So far so good? Now let’s choose some number out of the air – say, 5 – and do the same thing we did with the chain rule
$3/2 = \\ 3/5 * 5/2 = \\ 15/10 = \\ 3/2 = \\ 1.5$

Due to doing the reverse of cross cancellation, we are able to inject multiplicative terms into fractions (and derivatives!) and come up with the same answer.

Ok, but who cares?

Well, when we are evaluating the output of a neural network for given input, we have lots of equations nested in each other. We have neurons feeding into neurons feeding into neurons etc, with the logistic activation function at each step.

Instead of trying to figure out how to calculate the derivatives of the weights and biases for the entire monster equation (it’s common to have hundreds or thousands of neurons or more!), we can instead calculate derivatives for each step we do when evaluating the network and then compose them together.

Basically, we can break the problem into small bites instead of having to deal with the equation in it’s entirety.

Instead of calculating the derivative of how a specific weight affects the cost directly, we can instead calculate these:

1. dCost/dOutput: The derivative of how a neuron’s output affects cost
2. dOutput/dWeightedInput: The derivative of how the weighted input of a neuron affects a neuron’s output
3. dWeightedInput/dWeight: The derivative of how a weight affects the weighted input of a neuron

Then, when we multiply them all together, we get the real value that we want:
dCost/dOutput * dOutput/dWeightedInput * dWeightedInput/dWeight = dCost/dWeight

Now that we understand all the basic parts of back propagation, I think it’d be best to work through some examples of increasing complexity to see how it all actually fits together!

# Backpropagation Example 1: Single Neuron, One Training Example

This example takes one input and uses a single neuron to make one output. The neuron is only trained to output a 0 when given a 1 as input, all other behavior is undefined. This is implemented as the Example1() function in the sample code.

Let’s work through the training process manually.

We put a 1 as input and calculate Z which is the “weighted input” of the neuron.

$Z = input*weight+bias \\ Z = 1*0.3+0.5 \\ Z = 0.8$

Next, we need to calculate O which is the neuron output. We calculate this by putting the weighted input through the sigmoid activation function.

$O = \sigma(0.8) \\ O = 0.6900$

The last step of the “forward pass” is to calculate the cost. We are going to use half squared error as our cost, to slightly simplify some math.

$Cost = 0.5*||target - actual||^2\\ Cost = 0.5*||0-0.6900||^2\\ Cost = 0.5*0.4761\\ Cost = 0.238$

Now the backpropagation pass begins, which will give us the partial derivatives we need to be able to adjust this network and do one iteration of training.

First we want to calculate dCost/dO which tells us how much the cost changes when the neuron output changes. Thanks to the nice form of our cost function, that value is simple to calculate:

$\frac{\partial Cost}{\partial O} = O - target\\ \frac{\partial Cost}{\partial O} = 0.69 - 0\\ \frac{\partial Cost}{\partial O} = 0.69$

Next we want to calculate dO/dZ which tells us how much the neuron output changes when the neuron’s weighted input changes. The activation function we used also makes this very easy to calculate:

$\frac{\partial O}{\partial Z} = O * (1-O)\\ \frac{\partial O}{\partial Z} = 0.69 * 0.31\\ \frac{\partial O}{\partial Z} = 0.2139$

Next we are going to multiply these two values together to get dCost/dZ which tells us how much the cost changes when the weighted input to the neuron changes.

$\frac{\partial Cost}{\partial Z} = \frac{\partial Cost}{\partial O} * \frac{\partial O}{\partial Z}\\ \frac{\partial Cost}{\partial Z} = 0.69 * 0.2139 \\ \frac{\partial Cost}{\partial Z} = 0.1476$

This value has two special meanings. Firstly, this value represents the amount of error in this neuron. Secondly, this value represents dCost/dBias, which is one of the values we need to do training!

$\frac{\partial Cost}{\partial bias} = \frac{\partial Cost}{\partial Z}\\ \frac{\partial Cost}{\partial bias} = 0.1476\\$

Next, we need dZ/dWeight, which tells us how much the weighted input of the neuron changes when you change the weight. This is just the input value, which is 1. This makes intuitive sense because if you added 1.0 to the weight, Z would grow by whatever the input is.

Now that we have dCost/dZ and dZ/dWeight, we can calculate dCost/dWeight:

$\frac{\partial Cost}{\partial Weight} = \frac{\partial Cost}{\partial Z} * \frac{\partial Z}{\partial Weight}\\ \frac{\partial Cost}{\partial Weight} = 0.1476 * 1\\ \frac{\partial Cost}{\partial Weight} = 0.1476$

We now have the partial derivatives we need to be able to train our network!

Using a learning rate of 0.5, we’ll first update our weight:

$Weight = Weight - \frac{\partial Cost}{\partial Weight} * 0.5\\ Weight = 0.3 - 0.1476 * 0.5\\ Weight = 0.2262$

Then we’ll update our bias:

$Bias = Bias - \frac{\partial Cost}{\partial Bias } * 0.5\\ Bias = 0.5 - 0.1476 * 0.5\\ Bias = 0.4262$

Our network has learned a very small amount!

To verify this, let’s calculate the network’s output and cost with these new values.

$Z = input*weight+bias \\ Z = 1*0.2262+0.4262 \\ Z = 0.6524\\ \\ O = \sigma(0.6524) \\ O = 0.6576\\ \\ Cost = 0.5*||target - actual||^2\\ Cost = 0.5*||0-0.6576||^2\\ Cost = 0.5*0.4324\\ Cost = 0.2162$

Our cost decreased from 0.238 to 0.2162 due to our training, so we have indeed improved the error of the network, hooray!

After 10,000 iterations of this, the cost drops down to 0.000026 (weight = -2.564909, bias = -2.364907). That sounds pretty good, and it is decent, but since that is based on error squared, it looks more accurate than it is. The error at that point is 0.007176. Specifically, that means that when we input a 1, it outputs a 0.007176 instead of zero.

With a larger number of trainings it improves though. At 100,000 iterations it gives 0.002246 as output, and at 1,000,000 iterations it gives 0.000728 as output.

You could also try adjusting the learning rate parameter to see if you can make it get to a higher accuracy more quickly, and then perhaps dropping it down to a smaller number once you got there, to get the higher accuracy. Again, neural networks are an art and there are all sorts of techniques you can use to attempt to make them better (more accurate, learn faster, etc).

# Backpropagation Example 2: Single Neuron, Two Training Examples

This time, we are going to teach it not only that it should output 0 when given a 1, but also that it should output 1 when given a 0.

We have two training examples, and we are training the neuron to act like a NOT gate. This is implemented as the Example2() function in the sample code.

The first thing we do is calculate the derivatives (gradient vector) for each of the inputs.

We already calculated the “input 1, output 0” derivatives in the last example:
$\frac{\partial Cost}{\partial Weight} = 0.1476 \\ \frac{\partial Cost}{\partial Bias} = 0.1476$

If we follow the same steps with the “input 0, output 1” training example we get these:
$\frac{\partial Cost}{\partial Weight} = 0.0 \\ \frac{\partial Cost}{\partial Bias} = -0.0887$

To get the actual derivatives to train the network with, we just average them!
$\frac{\partial Cost}{\partial Weight} = 0.0738 \\ \frac{\partial Cost}{\partial Bias} = 0.0294$

From there, we do the same adjustments as before to the weight and bias values to get a weight of 0.2631 and a bias of 0.4853.

If you are wondering how to calculate the cost, again you just take the cost of each training example and average them. Adjusting the weight and bias values causes the cost to drop from 0.1547 to 0.1515, so we have made progress.

It takes 10 times as many iterations with these two training examples to get the same level of error as it did with only one training example though.

As we saw in the last example, after 10,000 iterations, the error was 0.007176.

In this example, after 100,000 iterations, the error is 0.007141. At that point, weight is -9.879733 and bias is 4.837278

# Backpropagation Example 3: Two Neurons in One Layer

Here is the next example, implemented as Example3() in the sample code. Two input neurons feed to two neurons in a single layer giving two outputs.

Let’s look at how we’d calculate the derivatives needed to train this network using the training example that when we give the network 01 as input that it should give out 10 as output.

First comes the forward pass where we calculate the network’s output when we give it 01 as input.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0*0.2+1*0.8+0.5 \\ Z0=1.3 \\ \\ O0=\sigma(1.3) \\ O0=0.7858\\ \\ Z1=input0*weight2+input0*weight3+bias1\\ Z1=0*0.6+1*0.4+0.1\\ Z1=0.5\\ \\ O1=\sigma(0.5)\\ O1=0.6225$

Next we calculate a cost. We don’t strictly need to do this step since we don’t use this value during backpropagation, but this will be useful to verify that we’ve improved things after an iteration of training.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(1,0)-(0.7858,0.6225)||^2\\ Cost=0.5*||(0.2142,-0.6225)||^2\\ Cost=0.5*0.6583^2\\ Cost=0.2167$

Now we begin the backwards pass to calculate the derivatives that we’ll need for training.

Let’s calculate dCost/dZ0 aka the error in neuron 0. We’ll do this by calculating dCost/dO0, then dO0/dZ0 and then multiplying them together to get dCost/dZ0. Just like before, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias0.

$\frac{\partial Cost}{\partial O0}=O0-target0\\ \frac{\partial Cost}{\partial O0}=0.7858-1\\ \frac{\partial Cost}{\partial O0}=-0.2142\\ \\ \frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.7858 * 0.2142\\ \frac{\partial O0}{\partial Z0} = 0.1683\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = -0.2142 * 0.1683\\ \frac{\partial Cost}{\partial Z0} = -0.0360\\ \\ \frac{\partial Cost}{\partial Bias0} = -0.0360$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1 by multiplying it by dZ0/dWeight0 and dZ0/dWeight1, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0} \\ \frac{\partial Cost}{\partial Weight0} = -0.0360 * 0 \\ \frac{\partial Cost}{\partial Weight0} = 0\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1} \\ \frac{\partial Cost}{\partial Weight1} = -0.0360 * 1 \\ \frac{\partial Cost}{\partial Weight1} = -0.0360$

Next we need to calculate dCost/dZ1 aka the error in neuron 1. We’ll do this like before. We’ll calculate dCost/dO1, then dO1/dZ1 and then multiplying them together to get dCost/dZ1. Again, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias1.

$\frac{\partial Cost}{\partial O1}=O1-target1\\ \frac{\partial Cost}{\partial O1}=0.6225-0\\ \frac{\partial Cost}{\partial O1}=0.6225\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.6225 * 0.3775\\ \frac{\partial O1}{\partial Z1} = 0.235\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.6225 * 0.235\\ \frac{\partial Cost}{\partial Z1} = 0.1463\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.1463$

Just like with neuron 0, we can use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3 by multiplying it by dZ1/dWeight2 and dZ1/dWeight2, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2} \\ \frac{\partial Cost}{\partial Weight2} = 0.1463 * 0 \\ \frac{\partial Cost}{\partial Weight2} = 0\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3} \\ \frac{\partial Cost}{\partial Weight3} = 0.1463 * 1 \\ \frac{\partial Cost}{\partial Weight3} = 0.1463$

After using these derivatives to update the weights and biases with a learning rate of 0.5, they become:
Weight0 = 0.2
Weight1 = 0.818
Weight2 = 0.6
Weight3 = 0.3269
Bias0 = 0.518
Bias1 = 0.0269

Using these values, the cost becomes 0.1943, which dropped from 0.2167, so we have indeed made progress with our learning!

Interestingly, it takes about twice as many trainings as example 1 to get a similar level of error. In this case, 20,000 iterations of learning results in an error of 0.007142.

If we have the network learn the four patterns below instead:
00 = 00
01 = 10
10 = 10
11 = 11

It takes 520,000 learning iterations to get to an error of 0.007223.

# Backpropagation Example 4: Two Layers, Two Neurons Each

This is the last example, implemented as Example4() in the sample code. Two input neurons feed to two neurons in a hidden layer, feeding into two neurons in the output layer giving two outputs. This is the exact same network that is walked through on this page which is also linked to at the end of this post: A Step by Step Backpropagation Example

First comes the forward pass where we calculate the network’s output. We’ll give it 0.05 and 0.1 as input, and we’ll say our desired output is 0.01 and 0.99.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0.05*0.15+0.1*0.2+0.35 \\ Z0=0.3775 \\ \\ O0=\sigma(0.3775) \\ O0=0.5933 \\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0.05*0.25+0.1*0.3+0.35\\ Z1=0.3925\\ \\ O1=\sigma(0.3925)\\ O1=0.5969\\ \\ Z2=O0*weight4+O1*weight5+bias2\\ Z2=0.5933*0.4+0.5969*0.45+0.6\\ Z2=1.106\\ \\ O2=\sigma(1.106)\\ O2=0.7514\\ \\ Z3=O0*weight6+O1*weight7+bias3\\ Z3=0.5933*0.5+0.5969*0.55+0.6\\ Z3=1.225\\ \\ O3=\sigma(1.225)\\ O3=0.7729$

Next we calculate the cost, taking O2 and O3 as our actual output, and 0.01 and 0.99 as our target (desired) output.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(0.01,0.99)-(0.7514,0.7729)||^2\\ Cost=0.5*||(-0.7414,-0.2171)||^2\\ Cost=0.5*0.7725^2\\ Cost=0.2984$

Now we start the backward pass to calculate the derivatives for training.

## Neuron 2

First we’ll calculate dCost/dZ2 aka the error in neuron 2, remembering that the value is also dCost/dBias2.

$\frac{\partial Cost}{\partial O2}=O2-target0\\ \frac{\partial Cost}{\partial O2}=0.7514-0.01\\ \frac{\partial Cost}{\partial O2}=0.7414\\ \\ \frac{\partial O2}{\partial Z2} = O2 * (1-O2)\\ \frac{\partial O2}{\partial Z2} = 0.7514 * 0.2486\\ \frac{\partial O2}{\partial Z2} = 0.1868\\ \\ \frac{\partial Cost}{\partial Z2} = \frac{\partial Cost}{\partial O2} * \frac{\partial O2}{\partial Z2}\\ \frac{\partial Cost}{\partial Z2} = 0.7414 * 0.1868\\ \frac{\partial Cost}{\partial Z2} = 0.1385\\ \\ \frac{\partial Cost}{\partial Bias2} = 0.1385$

We can use dCost/dZ2 to calculate dCost/dWeight4 and dCost/dWeight5.

$\frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight4}\\ \frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * O0\\ \frac{\partial Cost}{\partial Weight4} = 0.1385 * 0.5933\\ \frac{\partial Cost}{\partial Weight4} = 0.0822\\ \\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight5}\\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * O1\\ \frac{\partial Cost}{\partial Weight5} = 0.1385 * 0.5969\\ \frac{\partial Cost}{\partial Weight5} = 0.0827\\$

## Neuron 3

Next we’ll calculate dCost/dZ3 aka the error in neuron 3, which is also dCost/dBias3.

$\frac{\partial Cost}{\partial O3}=O3-target1\\ \frac{\partial Cost}{\partial O3}=0.7729-0.99\\ \frac{\partial Cost}{\partial O3}=-0.2171\\ \\ \frac{\partial O3}{\partial Z3} = O3 * (1-O3)\\ \frac{\partial O3}{\partial Z3} = 0.7729 * 0.2271\\ \frac{\partial O3}{\partial Z3} = 0.1755\\ \\ \frac{\partial Cost}{\partial Z3} = \frac{\partial Cost}{\partial O3} * \frac{\partial O3}{\partial Z3}\\ \frac{\partial Cost}{\partial Z3} = -0.2171 * 0.1755\\ \frac{\partial Cost}{\partial Z3} = -0.0381\\ \\ \frac{\partial Cost}{\partial Bias3} = -0.0381$

We can use dCost/dZ3 to calculate dCost/dWeight6 and dCost/dWeight7.

$\frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight6}\\ \frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * O0\\ \frac{\partial Cost}{\partial Weight6} = -0.0381 * 0.5933\\ \frac{\partial Cost}{\partial Weight6} = -0.0226\\ \\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight7}\\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * O1\\ \frac{\partial Cost}{\partial Weight7} = -0.0381 * 0.5969\\ \frac{\partial Cost}{\partial Weight7} = -0.0227\\$

## Neuron 0

Next, we want to calculate dCost/dO0, but doing that requires us to do something new. Neuron 0 affects both neuron 2 and neuron 3, which means that it affects the cost through those two neurons as well. That means our calculation for dCost/dO0 is going to be slightly different, where we add the derivatives of both paths together. Let’s work through it:

$\frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O0} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O0}\\ \frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * Weight4 + \frac{\partial Cost}{\partial Z3} * Weight6\\ \frac{\partial Cost}{\partial O0} = 0.1385 * 0.4 - 0.0381 * 0.5\\ \frac{\partial Cost}{\partial O0} = 0.0364$

We can then continue and calculate dCost/dZ0, which is also dCost/dBias0, and the error in neuron 0.

$\frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.5933 * 0.4067\\ \frac{\partial O0}{\partial Z0} = 0.2413\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = 0.0364 * 0.2413\\ \frac{\partial Cost}{\partial Z0} = 0.0088\\ \\ \frac{\partial Cost}{\partial Bias0} = 0.0088$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0}\\ \frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * input0\\ \frac{\partial Cost}{\partial Weight0} = 0.0088 * 0.05\\ \frac{\partial Cost}{\partial Weight0} = 0.0004\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1}\\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * input1\\ \frac{\partial Cost}{\partial Weight1} = 0.0088 * 0.1\\ \frac{\partial Cost}{\partial Weight1} = 0.0009\\$

## Neuron 1

We are almost done, so hang in there. For our home stretch, we need to calculate dCost/dO1 similarly as we did for dCost/dO0, and then use that to calculate the derivatives of bias1 and weight2 and weight3.

$\frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O1} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O1}\\ \frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * Weight5 + \frac{\partial Cost}{\partial Z3} * Weight7\\ \frac{\partial Cost}{\partial O1} = 0.1385 * 0.45 - 0.0381 * 0.55\\ \frac{\partial Cost}{\partial O1} = 0.0414\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.5969 * 0.4031\\ \frac{\partial O1}{\partial Z1} = 0.2406\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.0414 * 0.2406\\ \frac{\partial Cost}{\partial Z1} = 0.01\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.01$

Lastly, we will use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2}\\ \frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * input0\\ \frac{\partial Cost}{\partial Weight2} = 0.01 * 0.05\\ \frac{\partial Cost}{\partial Weight2} = 0.0005\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3}\\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * input1\\ \frac{\partial Cost}{\partial Weight3} = 0.01 * 0.1\\ \frac{\partial Cost}{\partial Weight3} = 0.001\\$

## Backpropagation Done

Phew, we have all the derivatives we need now.

Here’s our new weights and biases using a learning rate of 0.5:

Weight0 = 0.15 – (0.5 * 0.0004) = 0.1498
Weight1 = 0.2 – (0.5 * 0.0009) = 0.1996
Weight2 = 0.25 – (0.5 * 0.0005) = 0.2498
Weight3 = 0.3 – (0.5 * 0.001) = 0.2995
Weight4 = 0.4 – (0.5 * 0.0822) = 0.3589
Weight5 = 0.45 – (0.5 * 0.0827) = 0.4087
Weight6 = 0.5 – (0.5 * -0.0226) = 0.5113
Weight7 = 0.55 – (0.5 * -0.0227) = 0.5614
Bias0 = 0.35 – (0.5 * 0.0088) = 0.3456
Bias1 = 0.35 – (0.5 * 0.01) = 0.345
Bias2 = 0.6 – (0.5 * 0.1385) = 0.5308
Bias3 = 0.6 – (0.5 * -0.0381) = 0.6191

Using these new values, the cost function value drops from 0.2984 to 0.2839, so we have made progress!

Interestingly, it only takes 5,000 iterations of learning for this network to reach an error of 0.007157, when it took 10,000 iterations of learning for example 1 to get to 0.007176.

Before moving on, take a look at the weight adjustments above. You might notice that the derivatives for the weights are much smaller for weights 0,1,2,3 compared to weights 4,5,6,7. The reason for this is because weights 0,1,2,3 appear earlier in the network. The problem is that earlier layer neurons don’t learn as fast as later layer neurons and this is caused by the nature of the neuron activation functions – specifically, that the sigmoid function has a long tail near 0 and 1 – and is called the “vanishing gradient problem”. The opposite effect can also happen however, where earlier layer gradients explode to super huge numbers, so the more general term is called the “unstable gradient problem”. This is an active area of research on how to address, and this becomes more and more of a problem the more layers you have in your network.

You can use other activation functions such as tanh, identity, relu and others to try and get around this problem. If trying different activation functions, the forward pass (evaluation of a neural network) as well as the backpropagation of error pass remain the same, but of course the calculation for getting O from Z changes, and of course, calculating the derivative deltaO/deltaZ becomes different. Everything else remains the same.

# Sample Code

Below is the sample code which implements all the back propagation examples we worked through above.

Note that this code is meant to be readable and understandable. The code is not meant to be re-usable or highly efficient.

A more efficient implementation would use SIMD instructions, multithreading, stochastic gradient descent, and other things.

It’s also useful to note that calculating a neuron’s Z value is actually a dot product and an addition and that the addition can be handled within the dot product by adding a “fake input” to each neuron that is a constant of 1. This lets you do a dot product to calculate the Z value of a neuron, which you can take further and combine into matrix operations to calculate multiple neuron values at once. You’ll often see neural networks described in matrix notation because of this, but I have avoided that in this post to try and make things more clear to programmers who may not be as comfortable thinking in strictly matrix notation.

#include <stdio.h>
#include <array>

// Nonzero value enables csv logging.
#define LOG_TO_CSV_NUMSAMPLES() 50

// ===== Example 1 - One Neuron, One training Example =====

void Example1RunNetwork (
float input, float desiredOutput,
float weight, float bias,
float& error, float& cost, float& actualOutput,
float& deltaCost_deltaWeight, float& deltaCost_deltaBias, float& deltaCost_deltaInput
) {
// calculate Z (weighted input) and O (activation function of weighted input) for the neuron
float Z = input * weight + bias;
float O = 1.0f / (1.0f + std::exp(-Z));

// the actual output of the network is the activation of the neuron
actualOutput = O;

// calculate error
error = std::abs(desiredOutput - actualOutput);

// calculate cost
cost = 0.5f * error * error;

// calculate how much a change in neuron activation affects the cost function
// deltaCost/deltaO = O - target
float deltaCost_deltaO = O - desiredOutput;

// calculate how much a change in neuron weighted input affects neuron activation
// deltaO/deltaZ = O * (1 - O)
float deltaO_deltaZ = O * (1 - O);

// calculate how much a change in a neuron's weighted input affects the cost function.
// This is deltaCost/deltaZ, which equals deltaCost/deltaO * deltaO/deltaZ
// This is also deltaCost/deltaBias and is also refered to as the error of the neuron
float neuronError = deltaCost_deltaO * deltaO_deltaZ;
deltaCost_deltaBias = neuronError;

// calculate how much a change in the weight affects the cost function.
// deltaCost/deltaWeight = deltaCost/deltaO * deltaO/deltaZ * deltaZ/deltaWeight
// deltaCost/deltaWeight = neuronError * deltaZ/deltaWeight
// deltaCost/deltaWeight = neuronError * input
deltaCost_deltaWeight = neuronError * input;

// As a bonus, calculate how much a change in the input affects the cost function.
// Follows same logic as deltaCost/deltaWeight, but deltaZ/deltaInput is the weight.
// deltaCost/deltaInput = neuronError * weight
deltaCost_deltaInput = neuronError * weight;
}

void Example1 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example1.csv","w+t");
if (file != nullptr)
fprintf(file, "\"training index\",\"error\",\"cost\",\"weight\",\"bias\",\"dCost/dWeight\",\"dCost/dBias\",\"dCost/dInput\"\n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 10000;

// training data
// input: 1, output: 0
const std::array<float, 2> c_trainingData = {1.0f, 0.0f};

// starting weight and bias values
float weight = 0.3f;
float bias = 0.5f;

// iteratively train the network
float error = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
// run the network to get error and derivatives
float output = 0.0f;
float cost = 0.0f;
float deltaCost_deltaWeight = 0.0f;
float deltaCost_deltaBias = 0.0f;
float deltaCost_deltaInput = 0.0f;
Example1RunNetwork(c_trainingData[0], c_trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, "\"%zi\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\n", trainingIndex, error, cost, weight, bias, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);
}
#endif

weight -= deltaCost_deltaWeight * c_learningRate;
bias -= deltaCost_deltaBias * c_learningRate;
}

printf("Example1 Final Error: %f\n", error);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 2 - One Neuron, Two training Examples =====

void Example2 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example2.csv","w+t");
if (file != nullptr)
fprintf(file, "\"training index\",\"error\",\"cost\",\"weight\",\"bias\",\"dCost/dWeight\",\"dCost/dBias\",\"dCost/dInput\"\n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 100000;

// training data
// input: 1, output: 0
// input: 0, output: 1
const std::array<std::array<float, 2>, 2> c_trainingData = { {
{1.0f, 0.0f},
{0.0f, 1.0f}
} };

// starting weight and bias values
float weight = 0.3f;
float bias = 0.5f;

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
avgError = 0.0f;
float avgOutput = 0.0f;
float avgCost = 0.0f;
float avgDeltaCost_deltaWeight = 0.0f;
float avgDeltaCost_deltaBias = 0.0f;
float avgDeltaCost_deltaInput = 0.0f;

// run the network to get error and derivatives for each training example
for (const std::array<float, 2>& trainingData : c_trainingData)
{
float error = 0.0f;
float output = 0.0f;
float cost = 0.0f;
float deltaCost_deltaWeight = 0.0f;
float deltaCost_deltaBias = 0.0f;
float deltaCost_deltaInput = 0.0f;
Example1RunNetwork(trainingData[0], trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

avgError += error;
avgOutput += output;
avgCost += cost;
avgDeltaCost_deltaWeight += deltaCost_deltaWeight;
avgDeltaCost_deltaBias += deltaCost_deltaBias;
avgDeltaCost_deltaInput += deltaCost_deltaInput;
}

avgError /= (float)c_trainingData.size();
avgOutput /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
avgDeltaCost_deltaWeight /= (float)c_trainingData.size();
avgDeltaCost_deltaBias /= (float)c_trainingData.size();
avgDeltaCost_deltaInput /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, "\"%zi\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\"%f\",\n", trainingIndex, avgError, avgCost, weight, bias, avgDeltaCost_deltaWeight, avgDeltaCost_deltaBias, avgDeltaCost_deltaInput);
}
#endif

weight -= avgDeltaCost_deltaWeight * c_learningRate;
bias -= avgDeltaCost_deltaBias * c_learningRate;
}

printf("Example2 Final Error: %f\n", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 3 - Two inputs, two neurons in one layer =====

struct SExample3Training
{
std::array<float, 2> m_input;
std::array<float, 2> m_output;
};

void Example3RunNetwork (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, 4>& weights, const std::array<float, 2>& biases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, 4>& deltaCost_deltaWeights, std::array<float, 2>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {

// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// the actual output of the network is the activation of the neurons
actualOutput[0] = O0;
actualOutput[1] = O1;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = O0 - target0
float deltaCost_deltaO0 = O0 - desiredOutput[0];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaCost_deltaBiases[0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaCost_deltaWeights[0] = neuron0Error * input[0];
deltaCost_deltaWeights[1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = O1 - target1
float deltaCost_deltaO1 = O1 - desiredOutput[1];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO0/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaCost_deltaBiases[1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO/deltaZ1 * deltaZ0/deltaWeight1
// deltaCost/deltaWeight2 = neuron1Error * deltaZ/deltaWeight1
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaCost_deltaWeights[2] = neuron1Error * input[0];
deltaCost_deltaWeights[3] = neuron1Error * input[1];

//----- Input -----

// As a bonus, calculate how much a change in the inputs affect the cost function.
// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
// That means that...
// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
//                       = neuron0Error * weight0 + neuron1Error * weight2
// and
// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
//                       = neuron0Error * weight1 + neuron1Error * weight3
deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example3 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example3.csv","w+t");
if (file != nullptr)
fprintf(file, "\"training index\",\"error\",\"cost\"\n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 520000;

// training data: OR/AND
// input: 00, output: 00
// input: 01, output: 10
// input: 10, output: 10
// input: 11, output: 11
const std::array<SExample3Training, 4> c_trainingData = { {
{{0.0f, 0.0f}, {0.0f, 0.0f}},
{{0.0f, 1.0f}, {1.0f, 0.0f}},
{{1.0f, 0.0f}, {1.0f, 0.0f}},
{{1.0f, 1.0f}, {1.0f, 1.0f}},
} };

// starting weight and bias values
std::array<float, 4> weights = { 0.2f, 0.8f, 0.6f, 0.4f };
std::array<float, 2> biases = { 0.5f, 0.1f };

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
//float avgCost = 0.0f;
std::array<float, 2> avgOutput = { 0.0f, 0.0f };
std::array<float, 4> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaBiases = { 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
avgError = 0.0f;
float avgCost = 0.0;

// run the network to get error and derivatives for each training example
for (const SExample3Training& trainingData : c_trainingData)
{
float error = 0.0f;
std::array<float, 2> output = { 0.0f, 0.0f };
float cost = 0.0f;
std::array<float, 4> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaBiases = { 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
Example3RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

avgError += error;
avgCost += cost;
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] += output[i];
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
}

avgError /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, "\"%zi\",\"%f\",\"%f\"\n", trainingIndex, avgError, avgCost);
}
#endif

for (size_t i = 0; i < weights.size(); ++i)
weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
for (size_t i = 0; i < biases.size(); ++i)
biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
}

printf("Example3 Final Error: %f\n", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 4 - Two layers with two neurons in each layer =====

void Example4RunNetwork (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, 8>& weights, const std::array<float, 4>& biases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, 8>& deltaCost_deltaWeights, std::array<float, 4>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {
// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// calculate Z2 and O2 for neuron2
float Z2 = O0 * weights[4] + O1 * weights[5] + biases[2];
float O2 = 1.0f / (1.0f + std::exp(-Z2));

// calculate Z3 and O3 for neuron3
float Z3 = O0 * weights[6] + O1 * weights[7] + biases[3];
float O3 = 1.0f / (1.0f + std::exp(-Z3));

// the actual output of the network is the activation of the output layer neurons
actualOutput[0] = O2;
actualOutput[1] = O3;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 2 -----

// calculate how much a change in neuron 2 activation affects the cost function
// deltaCost/deltaO2 = O2 - target0
float deltaCost_deltaO2 = O2 - desiredOutput[0];

// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
// deltaO2/deltaZ2 = O2 * (1 - O2)
float deltaO2_deltaZ2 = O2 * (1 - O2);

// calculate how much a change in neuron 2 weighted input affects the cost function.
// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
deltaCost_deltaBiases[2] = neuron2Error;

// calculate how much a change in weight4 affects the cost function.
// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * O0
// similar thing for weight5
deltaCost_deltaWeights[4] = neuron2Error * O0;
deltaCost_deltaWeights[5] = neuron2Error * O1;

//----- Neuron 3 -----

// calculate how much a change in neuron 3 activation affects the cost function
// deltaCost/deltaO3 = O3 - target1
float deltaCost_deltaO3 = O3 - desiredOutput[1];

// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
// deltaO3/deltaZ3 = O3 * (1 - O3)
float deltaO3_deltaZ3 = O3 * (1 - O3);

// calculate how much a change in neuron 3 weighted input affects the cost function.
// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
deltaCost_deltaBiases[3] = neuron3Error;

// calculate how much a change in weight6 affects the cost function.
// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * O0
// similar thing for weight7
deltaCost_deltaWeights[6] = neuron3Error * O0;
deltaCost_deltaWeights[7] = neuron3Error * O1;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
float deltaCost_deltaO0 = neuron2Error * weights[4] + neuron3Error * weights[6];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaCost_deltaBiases[0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaCost_deltaWeights[0] = neuron0Error * input[0];
deltaCost_deltaWeights[1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
float deltaCost_deltaO1 = neuron2Error * weights[5] + neuron3Error * weights[7];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO1/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaCost_deltaBiases[1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaCost_deltaWeights[2] = neuron1Error * input[0];
deltaCost_deltaWeights[3] = neuron1Error * input[1];

//----- Input -----

// As a bonus, calculate how much a change in the inputs affect the cost function.
// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
// That means that...
// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
//                       = neuron0Error * weight0 + neuron1Error * weight2
// and
// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
//                       = neuron0Error * weight1 + neuron1Error * weight3
deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example4 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example4.csv","w+t");
if (file != nullptr)
fprintf(file, "\"training index\",\"error\",\"cost\"\n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 5000;

// training data: 0.05, 0.1 in = 0.01, 0.99 out
const std::array<SExample3Training, 1> c_trainingData = { {
{{0.05f, 0.1f}, {0.01f, 0.99f}},
} };

// starting weight and bias values
std::array<float, 8> weights = { 0.15f, 0.2f, 0.25f, 0.3f, 0.4f, 0.45f, 0.5f, 0.55f};
std::array<float, 4> biases = { 0.35f, 0.35f, 0.6f, 0.6f };

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
std::array<float, 2> avgOutput = { 0.0f, 0.0f };
std::array<float, 8> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 4> avgDeltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
avgError = 0.0f;
float avgCost = 0.0;

// run the network to get error and derivatives for each training example
for (const SExample3Training& trainingData : c_trainingData)
{
float error = 0.0f;
std::array<float, 2> output = { 0.0f, 0.0f };
float cost = 0.0f;
std::array<float, 8> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 4> deltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
Example4RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

avgError += error;
avgCost += cost;
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] += output[i];
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
}

avgError /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, "\"%zi\",\"%f\",\"%f\"\n", trainingIndex, avgError, avgCost);
}
#endif

for (size_t i = 0; i < weights.size(); ++i)
weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
for (size_t i = 0; i < biases.size(); ++i)
biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
}

printf("Example4 Final Error: %f\n", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

int main (int argc, char **argv)
{
Example1();
Example2();
Example3();
Example4();
system("pause");
return 0;
}


The sample code outputs csv files showing how the values of the networks change over time. One of the reasons for this is because I want to show you error over time.

Below is example 4’s error over time, as we do it’s 5,000 learning iterations.

The other examples show a similarly shaped graph, where there is a lot of learning in the very beginning, and then there is a very long tail of learning very slowly.

When you train neural networks as I’ve described them, you will almost always see this, and sometimes will also see a slow learning time at the BEGINNING of the training.

This issue is also due to the activation function used, just like the unstable gradient problem, and is also an active area of research.

To help fix this issue, there is something called a “cross entropy cost function” which you can use instead of the mean squared error cost function I have been using.

That cost function essentially cancels out the non linearity of the activation function so that you get nicer linear learning progress, and can get networks to learn more quickly and evenly. However, it only cancels out the non linearity for the LAST layer in the network. This means it’s still a problem for networks that have more layers.

Lastly, there is an entirely different thing you can use backpropagation for. We adjusted the weights and biases to get our desired output for the desired inputs. What if instead we adjusted our inputs to give us the desired outputs?

You can do that by using backpropagation to calculate the dCost/dInput derivatives and using those to adjust the input, in the exact same way we adjusted the weights and biases.

You can use this to do some interesting things, including:

1. finding images that a network will recognize as a familiar object, that a human wouldn’t. Start with static as input to the network, and adjust inputs to give the desired output.
2. Modifying images that a network recognizes, into images it doesn’t recognize, but a human would. Start with a well recognized image, and adjust inputs using gradient ASCENT (add the derivatives, don’t subtract them) until the network stops recognizing it.

Believe it or not, this is how all those creepy “deep dream” images were made that came out of google as well, like the one below.

Now that you know the basics, you are ready to learn some more if you are interested. If you still have some questions about things I did or didn’t talk about, these resources might help you make sense of it too. I used these resources and they were all very helpful! You can also give me a shout in the comments below, or on twitter at @Atrix256.

## Multivariable Dual Numbers & Automatic Differentiation

In a previous post I showed how to use dual numbers to be able to get both the value and derivative of a function at the same time:
Dual Numbers & Automatic Differentiation

That post mentions that you can extend it to multivariable functions but doesn’t explain how. This post is that explanation, including simple working C++ code!

Extending this to multivariable functions is useful for ray marching, calculating analytical surface normals and also likely useful for training a neural network if you want an alternative to back propagation. I’m not sure about the efficiency comparison of this versus back propagation but I intend on looking into it (:

# How Does it Work?

It turns out to be really simple to use dual numbers with multivariable functions. The end result is that you want a partial derivative for each variable in the equation, so to do that, you just have a dual number per variable, and process the entire equation for each of those dual numbers separately.

We’ll work through an example. Let’s find the partial derivatives of x and y of the function $3x^2-2y^3$, at input (5,2).

We’ll start by finding the derivative of x, and then the derivative of y.

# Example: df/dx

We start by making a dual number for our x value, remembering that the real part is the actual value for x, and the dual part is the derivative of x, which is 1:

$5+1\epsilon$

or:

$5+\epsilon$

We multiply that value by itself to get the $x^2$ value, keeping in mind that $\epsilon^2$ is zero:
$(5+\epsilon)*(5+\epsilon)= \\ 25+10\epsilon+\epsilon^2= \\ 25+10\epsilon \\$

Next we need to multiply that by 3 to get the $3x^2$ term:

$3*(25+10\epsilon) = 75+30\epsilon$

Putting that aside for a moment, we need to make the $2y^3$ term. We start by making our y value:

$2+0\epsilon$

or:

$2$

If you are wondering why it has a zero for the epsilon term, it’s because when we are calculating the partial derivative of x, y is a constant, so has a derivative of zero.

Next, we multiply this y value by itself twice to get the $y^3$ value:

$2*2*2=8$

We then multiply it by 2 to get the $2y^3$ term:

$8*2=16$

Now that we have our two terms, we subtract the y term from the x term to get our final result:

$75+30\epsilon-16 = \\ 59+30\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30.

That checks out, let’s move on to the derivative of y!

# Example: df/dy

Calculating the derivative of y is very similar to calculating the derivative of x, except that it’s the x term that has an epsilon value (derivative) of 0, instead of the y term. The y term has the epsilon value of 1 this time as well. We’ll work through it to see how it plays out.

First up, we need to make the value for x:

$5+0\epsilon$

or:

$5$

Next we square it and multiply it by 3 to get the $3x^2$ term:

$5*5*3=75$

Next we need to make the value for y, remembering that we use an epsilon value of 1 since the derivative of y is 1 this time around.

$2+\epsilon$

We cube that value and multiply by 2 to get the $2y^3$ term:
$2*(2+\epsilon)*(2+\epsilon)*(2+\epsilon)= \\ 2*(2+\epsilon)*(4+4\epsilon+\epsilon^2)= \\ 2*(2+\epsilon)*(4+4\epsilon)= \\ 2*(8+12\epsilon+4\epsilon^2)= \\ 2*(8+12\epsilon)= \\ 16+24\epsilon$

Now we subtract the y term from the x term to get the final result:

$75 - (16+24\epsilon)= \\ 59-24\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of y at that point is -24.

That also checks out, so we got the correct value and partial derivatives for the equation.

# Reducing Redundancy

There was quite a bit of redundancy when working through the x and y derivatives wasn’t there? Increasing the number of variables will increase the amount of redundancy too, so it doesn’t scale up well.

Luckily there is a way to address this. Basically, instead of making two dual numbers which have two items, you make them share the real value (since it’s the same for both, as is the work to make it) and append the dual values for x and y to it.

$x'=(a+b\epsilon) \\ y'=(a+b\epsilon)$

You have:

$(a+b\epsilon_x+c\epsilon_y)$

Then, in your math or in your program, you treat it as if it’s two different dual numbers packed into one. This lets you do the work for the real number once instead of twice, but still lets you do your dual number work for each variable independently.

While it’s probably easiest to think of these as two dual numbers packed into one value, there is actually a mathematical basis for it as well, which may or may not surprise you.

Check out what happens when we multiply two of these together, keeping in mind that multiplying ANY two epsilon values together becomes zero, even if they are different epsilons:

$(a+b\epsilon_x+c\epsilon_y) * (d+e\epsilon_x+f\epsilon_y)= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + be\epsilon_x^2 + bf\epsilon_x\epsilon_y + cd\epsilon_y + ce\epsilon_x\epsilon_y + cf\epsilon_y^2= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + cd\epsilon_y= \\ ad + (ae+bd)\epsilon_x + (af+cd)\epsilon_y$

The interesting thing is that the above result gives you the same values as if you did the same work for two dual numbers individually.

Let’s see this three component dual number in action by re-doing the example again. Note that this pattern scales up to ANY number of variables!

# Example: Both Derivatives (Gradient Vector)

Our goal is to calculate the value and partial derivatives of the function $3x^2-2y^3$ at location (5,2).

First we make our x value:

$5 + 1\epsilon_x + 0\epsilon_y$

or:

$5 + \epsilon_x$

We square that and multiply it by 3 to get our $3x^2$ term:

$3*(5 + \epsilon_x)*(5 + \epsilon_x)= \\ 3*(25+10\epsilon_x+\epsilon_x^2)= \\ 3*(25+10\epsilon_x)= \\ 75+30\epsilon_x$

Next, we make our y value:

$2 + 0\epsilon_x + 1\epsilon_y$

or:

$2 + \epsilon_y$

We cube it and multiply it by 2 to get our $2x^3$ term:

$16+24\epsilon_y$

Lastly we subtract the y term from the x term to get our final answer:

$(75+30\epsilon_x) - (16+24\epsilon_y)= \\ 59+30\epsilon_x-24\epsilon_y$

The result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30, and the derivative of y at that point is -24.

Neat, right?!

# Example Code

Here is the example code output:

Here is the code that generated it:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>

#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

// constructor to make a constant
CDualNumber (float f = 0.0f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
CDualNumber (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// storage for real and dual values
float							m_real;
std::array<float, NUMVARIABLES> m_dual;
};

//----------------------------------------------------------------------
// Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real + b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real - b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real * b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real / b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
float sqrtReal = sqrt(a.m_real);
ret.m_real = sqrtReal;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = pow(a.m_real, y);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = sin(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = cos(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Test Functions
//----------------------------------------------------------------------

void TestSmoothStep (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// calculate value and derivative using dual numbers
CDualNumber<1> y = SmoothStep(x);

// calculate numeric derivative using central differences
float derivNumeric = (SmoothStep(input + EPSILON) - SmoothStep(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 6.0f * input - 6.0f * input * input;

// show value and derivatives
printf("(smoothstep) y=3x^2-2x^3  (x=%0.4f)\n", input);
printf("  y = %0.4f\n", y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

void TestTrig (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// sin
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = sin(x);

// calculate numeric derivative using central differences
float derivNumeric = (sin(input + EPSILON) - sin(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = cos(input);

// show value and derivatives
printf("sin(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// cos
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = cos(x);

// calculate numeric derivative using central differences
float derivNumeric = (cos(input + EPSILON) - cos(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = -sin(input);

// show value and derivatives
printf("cos(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// tan
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = tan(x);

// calculate numeric derivative using central differences
float derivNumeric = (tan(input + EPSILON) - tan(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.0f / (cos(input)*cos(input));

// show value and derivatives
printf("tan(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// atan
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = atan(x);

// calculate numeric derivative using central differences
float derivNumeric = (atan(input + EPSILON) - atan(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.0f / (1.0f + input * input);

// show value and derivatives
printf("atan(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}
}

void TestSimple (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// sqrt
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = CDualNumber<1>(3.0f) / sqrt(x);

// calculate numeric derivative using central differences
float derivNumeric = ((3.0f / sqrt(input + EPSILON)) - (3.0f / sqrt(input - EPSILON))) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = -3.0f / (2.0f * pow(input, 3.0f / 2.0f));

// show value and derivatives
printf("3/sqrt(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// pow
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = pow(x + CDualNumber<1>(1.0f), 1.337f);

// calculate numeric derivative using central differences
float derivNumeric = ((pow(input + 1.0f + EPSILON, 1.337f)) - (pow(input + 1.0f - EPSILON, 1.337f))) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.337f * pow(input + 1.0f, 0.337f);

// show value and derivatives
printf("(%0.4f+1)^1.337 = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}
}

void Test2D (float inputx, float inputy)
{
// create dual numbers as the value of x and y
CDualNumber<2> x(inputx, 0);
CDualNumber<2> y(inputy, 1);

// z = 3x^2 - 2y^3
{
// calculate value and partial derivatives using dual numbers
CDualNumber<2> z = CDualNumber<2>(3.0f) * x * x - CDualNumber<2>(2.0f) * y * y * y;

// calculate numeric partial derivatives using central differences
auto f = [] (float x, float y) {
return 3.0f * x * x - 2.0f * y * y * y;
};
float derivNumericX = (f(inputx + EPSILON, inputy) - f(inputx - EPSILON, inputy)) / (2.0f * EPSILON);
float derivNumericY = (f(inputx, inputy + EPSILON) - f(inputx, inputy - EPSILON)) / (2.0f * EPSILON);

// calculate actual partial derivatives
float derivActualX = 6.0f * inputx;
float derivActualY = -6.0f * inputy * inputy;

// show value and derivatives
printf("z=3x^2-2y^3 (x = %0.4f, y = %0.4f)\n", inputx, inputy);
printf("  z = %0.4f\n", z.m_real);
printf("  dual# dz/dx = %0.4f\n", z.m_dual[0]);
printf("  dual# dz/dy = %0.4f\n", z.m_dual[1]);
printf("  actual dz/dx = %0.4f\n", derivActualX);
printf("  actual dz/dy = %0.4f\n", derivActualY);
printf("  numeric dz/dx = %0.4f\n", derivNumericX);
printf("  numeric dz/dy = %0.4f\n\n", derivNumericY);
}
}

void Test3D (float inputx, float inputy, float inputz)
{
// create dual numbers as the value of x and y
CDualNumber<3> x(inputx, 0);
CDualNumber<3> y(inputy, 1);
CDualNumber<3> z(inputz, 2);

// w = sin(x*cos(2*y)) / tan(z)
{
// calculate value and partial derivatives using dual numbers
CDualNumber<3> w = sin(x * cos(CDualNumber<3>(2.0f)*y)) / tan(z);

// calculate numeric partial derivatives using central differences
auto f = [] (float x, float y, float z) {
return sin(x*cos(2.0f*y)) / tan(z);
};
float derivNumericX = (f(inputx + EPSILON, inputy, inputz) - f(inputx - EPSILON, inputy, inputz)) / (2.0f * EPSILON);
float derivNumericY = (f(inputx, inputy + EPSILON, inputz) - f(inputx, inputy - EPSILON, inputz)) / (2.0f * EPSILON);
float derivNumericZ = (f(inputx, inputy, inputz + EPSILON) - f(inputx, inputy, inputz - EPSILON)) / (2.0f * EPSILON);

// calculate actual partial derivatives
float derivActualX = cos(inputx*cos(2.0f*inputy))*cos(2.0f * inputy) / tan(inputz);
float derivActualY = cos(inputx*cos(2.0f*inputy)) *-2.0f*inputx*sin(2.0f*inputy) / tan(inputz);
float derivActualZ = sin(inputx * cos(2.0f * inputy)) / -(sin(inputz) * sin(inputz));

// show value and derivatives
printf("w=sin(x*cos(2*y))/tan(z) (x = %0.4f, y = %0.4f, z = %0.4f)\n", inputx, inputy, inputz);
printf("  w = %0.4f\n", w.m_real);
printf("  dual# dw/dx = %0.4f\n", w.m_dual[0]);
printf("  dual# dw/dy = %0.4f\n", w.m_dual[1]);
printf("  dual# dw/dz = %0.4f\n", w.m_dual[2]);
printf("  actual dw/dx = %0.4f\n", derivActualX);
printf("  actual dw/dy = %0.4f\n", derivActualY);
printf("  actual dw/dz = %0.4f\n", derivActualZ);
printf("  numeric dw/dx = %0.4f\n", derivNumericX);
printf("  numeric dw/dy = %0.4f\n", derivNumericY);
printf("  numeric dw/dz = %0.4f\n\n", derivNumericZ);
}
}

int main (int argc, char **argv)
{
TestSmoothStep(0.5f);
TestSmoothStep(0.75f);
TestTrig(PI * 0.25f);
TestSimple(3.0f);
Test2D(1.5f, 3.28f);
Test3D(7.12f, 8.93f, 12.01f);
return 0;
}


# Closing

One of the neatest things about dual numbers is that they give precise results. They are not approximations and they are not numerical methods, unlike the central differences method that I compared them to in the example program (More info on numerical derivatives here: Finite Differences). Using dual numbers gives you exact derivatives, within the limitations of (eg) floating point math.

It turns out that backpropagation (the method that is commonly used to train neural networks) is just steepest gradient descent. You can read about that here: Backpropogation is Just Steepest Descent with Automatic Differentiation

That makes me wonder how dual numbers would do in run time speed compared to back propagation as well as numerical methods for getting the gradient to adjust a neural network during training.

If I had to guess, I’d say that dual numbers may be slightly slower than backpropagation, but not as slow as numerical methods which are going to be much, much slower. We’ll see though. It may be much easier to implement neural network learning using dual numbers compared to backpropagation, so may be worth an exploration and write up, even if only to make neural networks a little bit more accessible to people.

Comments, corrections, etc? Let me know in the comments below, or contact me on twitter at @Atrix256

## A Geometric Interpretation of Neural Networks

In the 90s before I was a professional programmer / game developer I looked at neural networks and found them interesting but got scared off by things like back propagation, which I wasn’t yet ready to understand.

With all the interesting machine learning things going on in modern times, I decided to have a look again and have been pleasantly surprised at how simple they are to understand. If you have knowledge of partial derivatives and gradients (like, if you’ve done any ray marching), you have the knowledge it takes to understand it.

Here are some really great resources I recomend for learning the nuts and bolts of how modern neural networks actually work:
Learn TensorFlow and deep learning, without a Ph.D.
Neural Networks and Deep Learning
A Neural Network Playground (Web Based NN sand box from google)

This post doesn’t require any understanding of neural networks or partial derivatives, so don’t worry if you don’t have that knowledge. The harder math comes up when training a neural network, but we are only going to be dealing with evaluating neural networks, which is much simpler.

## A Geometric Interpretation of a Neuron

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron.

To calculate the output value of a neuron, you multiply every input into that neuron by a weight for that input, and add a bias. This value is fed into an activation function (more on activation functions shortly) and the result is the output value of that neuron. Here is a diagram for a single neuron:

A more formal definition of a neuron’s output is below. $b$ is the bias, $w_j$ is the j’th weight and $i_j$ is the j’th input.
$Output = b+\sum_{j=0}^nw_ji_j$

You might notice that the above is really just the dot product of the weight vector and the input vector, with a value added on the end. We could re-write it like that:
$Output = Dot(w,i)+b$

Interestingly, that is the same equation that you use to find the distance of a point to a plane. Let’s say that we have a plane defined by a unit length normal N and a distance to the origin d, and we want to calculate the distance of a point P to that plane. We’d use this formula:
$Distance = Dot(N,P)+d$

This would give us a signed distance, where the value will be negative if we are in the negative half space defined by the plane, and positive otherwise.

This equation works if you are working in 3 dimensional space, but also works in general for any N dimensional point and plane definition.

What does this mean? Well, this tells us that every neuron in a neural network is essentially deciding what side of a hyperplane a point is on. Each neuron is doing a linear classification, saying if something is on side A or side B, and giving a distance of how far it is into A or B.

This also means that when you combine multiple neurons into a network, that an output neuron of that neural network tells you whether the input point is inside or outside of some shape, and by how much.

I find this interesting for two reason.

Firstly, it means that a neural network can be interpreted as encoding SHAPES, which means it could be used for modeling shapes. I’m interested in seeing what sort of shapes it’s capable of, and any sorts of behaviors this representation might have. I don’t expect it to be useful for, say, main stream game development (bonus if it is useful!) but at minimum it ought to be an interesting investigation to help understand neural networks a bit better.

Secondly, there is another machine learning algorithm called Support Vector Machines which are also based on being able to tell you which side of a separation a data point is on. However, unlike the above, SVM separations are not limited to plane tests and can use arbitrary shapes for separation. Does this mean that we are leaving something on the table for neural networks? Could we do better than we are to make networks with fewer layers and fewer neurons that do better classification by doing non linear separation tests?

Quick side note: besides classification, neural nets can help us with something called regression, which is where you fit a network to some analog data, instead of the more discrete problem of classification, which tells you what group something belongs to.

It turns out that the activation function of a neuron can have a pretty big influence on what sort of shapes are possible, which makes it so we aren’t strictly limited to shapes made up of planes and lines, and also means we aren’t necessarily leaving things on the table compared to SVM’s.

This all sort of gives me the feeling though that modern neural networks may not be the best possible algorithm for the types of things we use them for. I feel like we may need to generalize them beyond biological limitations to allow things like multiplications between weighted inputs instead of only sums. I feel like that sort of setup will be closer to whatever the real ideal “neural computation” model might be. However, the modern main stream neural models do have the benefit that they are evaluated very efficiently via dot products. They are particularly well suited for execution on GPUs which excel at performing homogenous operations on lots and lots of data. So, a more powerful and more general “neuron” may come at the cost of increased computational costs, which may make them less desirable in the end.

As a quick tangent, here is a paper from 2011 which shows a neural network model which does in fact allow for multiplication between neuron inputs. You then will likely be wanting exponents and other things, so while it’s a step in the right direction perhaps, it doesn’t yet seem to be the end all be all!
Sum-Product Networks: A New Deep Architecture

It’s also worth while to note that there are other flavors of neural networks, such as convolutional neural networks, which work quite a bit differently.

Let’s investigate this geometric interpretation of neurons as binary classifiers a bit, focusing on some different activation functions!

## Step Activation Function

The Heaviside step function is very simple. If you give it a value greater than zero, it returns a 1, else it returns a 0. That makes our neuron just spit out binary: either a 0 or a 1. The output of a neuron using the step activation function is just the below:

$Output = Dot(w,i)+b > 0$

The output of a neuron using the step activation function is true if the input point is in the positive half space of the plane that this neuron describes, else it returns false.

Let’s think about this in 2d. Let’s make a neural network that takes x and y as input and spits out a value. We can make an image that visualizes the range from (-1,-1) to (1,1). Negative values can be shown in blue, zero in white, and positive values in orange.

To start out, we’ll make a 2d plane (aka a line) that runs vertically and passes through the origin. That means it is a 2d plane with a normal of (1,0) and a distance from the origin of 0. In other words, our network will have a single neuron that has weights of (1,0) and a bias of 0. This is what the visualization looks like:

You can actually play around with the experiments of this post and do your own using an interactive visualization I made for this post. Click here to see this first experiment: Experiment: Vertical Seperation

We can change the normal (weights) to change the angle of the line, and we can also change the bias to move the line to it’s relative left or right. Here is the same network that has it’s weights adjusted to (1,1) with a bias of 0.1.

Experiment: Diagonal Separation

The normal (1,1) isn’t normalized though, which makes it so the distance from origin (aka the bias) isn’t really 0.1 units. The distance from origin is actually divided by the length of the normal to get the REAL distance to origin, so in the above image, where the normal is a bit more than 1.0, the line is actually less than 0.1 units from the origin.

Below is the visualization if we normalize the weights to (0.707,0.707) but leave the bias at 0.1 units. The result is that the line is actually 0.1 units away from the origin.

Experiment: Normalized Diagonal Separation

Recalling the description of our visualization, white is where the network outputs zero, while orange is where the network outputs a positive number (which is 1 in this case).

If we define three lines such that their negative half spaces don’t completely overlap, we can get a triangle where the network outputs a zero, while everywhere else it outputs a positive value. We do this by having three sibling neurons in the first layer which define three separate lines, and then in the output neuron we give them all a weight of 1. This makes it so the area outside the triangle is always a positive value, which step turns into 1, but inside the triangle, the value remains at 0.

We can turn this negative space triangle into a positive space triangle however by making the output neuron have a weight on the inputs of -1, and adding a bias of 0.1. This makes it so that pixels in the positive space of any of the lines will become a negative value. The negative space of those three lines get a small bias to make it be a positive value, resulting in the step function making the values be 0 outside of the triangle, and 1 inside the triangle. This gives us a positive space triangle:

Taking this a bit further, we can make the first layer define 6 lines, which make up two different triangles – a bigger one and a smaller one. We can then have a second layer which has two neurons – one which makes a positive space larger triangle, and one which makes a positive space smaller triangle. Then, in the output neuron we can give the larger triangle neuron a weight of 1, while giving the smaller triangle neuron a weight of -1. The end result is that we have subtracted the smaller triangle from the larger one:

Using the step function we have so far been limited to line based shapes. This has been due to the fact that we can only test our inputs against lines. There is a way around this though: Pass non linear input into the network!

Below is a circle with radius 0.5. The neural network has only a single input which is sqrt(x*x+y*y). The output neuron has a bias of -0.5 and a weight of 1. It’s the bias value which controls the radius of the circle.

You could pass other non linear inputs into the network to get a whole host of other shapes. You could pass sin(x) as an input for example, or x squared.

While the step function is inherently very much limited to linear tests, you can still do quite a lot of interesting non linear shapes (and data separations) by passing non linear input into the network.

Unfortunately though, you as a human would have to know the right non linear inputs to provide. The network can’t learn how to make optimal non linear separations when using the step function. That’s quite a limitation, but as I understand it, that’s how it works with support vector machines as well: a human can define non linear separations, but the human must decide the details of that separation.

BTW it seems like there could be a fun puzzle game here. Something like you have a fixed number of neurons that you can arrange in however many layers you want, and your goal is to separate blue data points from orange ones, or something like that. If you think that’d be a fun game, go make it with my blessing! I don’t have time to pursue it, so have at it (:

## Identity and Relu Activation Functions

The identity activation function doesn’t do anything. It’s the same as if no activation function is used. I’ve heard that it can be useful in regression, but it can also be useful for our geometric interpretation.

Below is the same circle experiment as before, but using the identity activation function instead of the step activation function:

Remembering that orange is positive, blue is negative, and white is zero, you can see that outside the circle is orange (positive) and inside the circle is blue (negative), while the outline of the circle itself is white. We are looking at a signed distance field of the circle! Every point on this image is a scalar value that says how far inside or outside that point is from the surface of the shape.

Signed distance fields are a popular way of rendering vector graphics on the GPU. They are often approximated by storing the distance field in a texture and sampling that texture at runtime. Storing them in a texture only requires a single color channel for storage, and as you zoom in to the shape, they preserve their shape a lot better than regular images. You can read more about SDF textures on my post: Distance Field Textures.

Considering the machine learning perspective, having a signed distance field is also an interesting proposition. It would allow you to do classification of input, but also let you know how deeply that input point is classified within it’s group. This could be a confidence level maybe, or could be interpreted in some other way, but it gives a sort of analog value to classification, which definitely seems like it could come in handy sometimes.

If we take our negative space triangle example from the last section and change it from using step activation to identity activation, we find that our technique doesn’t generalize naively though, as we see below. (It doesn’t generalize for the positive space triangle either)

The reason it doesn’t generalize is that the negatives and positives of pixel distances to each of the lines cancel out. Consider a pixel on the edge of the triangle: you are going to have a near zero value for the edge it’s on, and two larger magnitude negative values from the other edges it is in the negative half spaces of. Adding those all together is going to be a negative value.

To help sort this out we can use an activation function called “relu”, which returns 0 if the value it’s given is less than zero, otherwise it returns the value. This means that all our negative values become 0 and don’t affect the distance summation. If we switch all the neurons to using relu activation, we get this:

If you squint a bit, you can see a triangle in the white. If you open the experiment and turn on “discrete output” to force <= 0 to blue and > 0 to orange you get a nice image that shows you that the triangle is in fact still there.

Our result with relu is better than identity, but there are two problems with our resulting distance field.

Firstly it isn’t a signed distance field – there is no blue as you might notice. It only gives positive distances, for pixels that are outside the shape. This isn’t actually that big of an issue from a rendering perspective, as unsigned distance fields are still useful. It also doesn’t seem that big of an issue from a machine learning perspective, as it still gives some information about how deeply something is classified, even though it is only from one direction.

I think with some clever operations, you could probably create the internal negative signed distance using different operations, and then could compose it with the external positive distance in the output neuron by adding them together.

The second problem is a bigger deal though: The distance field is no longer accurate!

By summing the distance values, the distance is incorrect for points where the closest feature of the triangle is a vertex, because multiple lines are contributing their distance to the final value.

I can’t think of any good ways to correct that problem in the context of a neural network, but the distance is an approximation, and is correct for the edges, and also gets more correct the closer you get to the object, so this is still useful for rendering, and probably still useful for machine learning despite it being an imperfect measurement.

## Sigmoid and Hyperbolic Tangent Activation Function

The sigmoid function is basically a softer version of the step function and gives output between 0 and 1. The hyperbolic tangent activation function is also a softer version of the step function but gives output between -1 and 1.

Sigmoid:

Hyperbolic Tangent:

(images from Wolfram Mathworld)

They have different uses in machine learning, but I’ve found them to be visibly indistinguishable in my experiments after compensating for the different range of output values. It makes me think that smoothstep could probably be a decent activation function too, so long as your input was in the 0 to 1 range (maybe you could clamp input to 0 and 1?).

These activation functions let you get some non linearity in your neural network in a way that the learning algorithms can adjust, which is pretty neat. That puts us more into the realm where a neural net can find a good non linear separation for learned data. For the geometric perspective, this also lets us make more interesting non linear shapes.

Unfortunately, I haven’t been able to get a good understanding of how to use these functions to craft desired results.

It seems like if you add even numbers of hyperbolic tangents together in a neural network that you end up getting a lot of white, like below:

However, if you add an odd number of them together, it starts to look a bit more interesting, like this:

Other than that, it’s been difficult seeing a pattern that I can use to craft things. The two examples above were made by modifying the negative space triangle to use tanh instead of step.

## Closing

We’ve wandered a bit in the idea of interpreting neural networks geometrically but I feel like we’ve only barely scratched the surface. This also hasn’t been a very rigorous exploration, but more has just been about exploring the problem space to get a feeling for what might be possible.

It would be interesting to look more deeply into some of these areas, particularly for the case of distance field generation of shapes, or combining activation functions to get more diverse results.

While stumbling around, it seems like we may have gained some intuition about how neural networks work as well.

It feels like whenever you add a layer, you are adding the ability for a “logical operation” to happen within the network.

For instance, in the triangle cutout experiment, the first layer after the inputs defines the 6 individual lines of the two triangles and classifies input accordingly. The next layer combines those values into the two different triangle shapes. The layer after that converts them from negative space triangles to positive space triangles. Lastly, the output layer subtracts the smaller triangle’s values from the larger triangle’s values to make the final triangle outline shape.

Each layer has a logical operation it performs, which is based on the steps previous to it.

Another piece of intuition I’ve found is that it seems like adding more neurons to a layer allows it to do more work in parallel.

For instance, in the triangle cutout experiment, we created those two triangles in parallel, reserving some neurons in each layer for each triangle. It was only in the final step that we combined the values into a single output.

Since neurons can only access data from the previous network layer, it seems as though adding more neurons to layers can help push data forward from previous layers, to later layers that might need the data. But, it seems like it is most efficient to process input data as early as possible, so that you don’t have to shuttle it forward and waste layers / neurons / memory and computing power.

Here is some info on other activation functions:
Wikipedia:Activation Function

Here’s a link that talks about how perceptrons (step activated neural networks) relate to SVMs:
Hyperplane based Classification: Perceptron and (Intro to) Support Vector Machines

By the way, did I mention you can visualize neural networks in three dimensions as well?

Experiment: 3d Visualization

Here are the two visualizers of neural networks I made for this post using WebGL2:
Neural Network Visualization 2D
Neural Network Visualization 3D

If you play around with this stuff and find anything interesting, please share!

## My Old Master: How to Correct as a Mentor or a Teacher

Preface: I studied martial arts for a little over a decade (shaolin kempo at USSD) and learned a lot while I was there. Our teacher was a great guy who genuinely cared about his students, and in particular, taught my friends and I some really interesting things when we became instructors. I’d like to share some of that information with you in the “My Old Master” posts category. As cliched as it is, many things he taught us apply to all aspects of life, not just martial arts and I’d like other people to benefit from them.

My old master used to say…

“Praise, Correct, Praise. Even if you have to make something up, you need to say something positive.”

Let’s say that I’m teaching you how to punch and you aren’t quite doing it right.

Here are two things I might say to try and correct it:

1. “Keep your wrist straight when you are punching so you don’t hurt your hand” or…
2. “I really love how you are keeping your left hand up while you punch with your right. It’s doing a great job of protecting your head against your opponent hitting you back. Now try keeping your wrist straight when you punch so that you don’t hurt your hand.” Then I watch you try again and I say “Great, just like that, keep it up!”

Think about how those two responses make you feel for a second.

The first one likely makes you feel like you are messing up and need to fix it (a negative thing), while the second makes you feel like you were doing well and are now are doing even better.

What’s the difference? Well, like the opening quote says, I praised, corrected, and then praised. First I found something you were doing well, complimented you on it, gave a suggestion for improvement, and then praised you on doing (or attempting!) the correction.

This can be a great way to give people feedback, in a way that makes them feel better about themselves, and feel better about the feedback you are giving them. Instead of being a negative thing, it becomes a positive thing.

Pretty simple stuff, and if you practice this technique it starts to become second nature.

The quote says that if you can’t find anything positive to say, you should make it up. It shouldn’t be your first choice to make something up, because the more genuine you are about the praise, the better it will be. However, if you really can’t find anything nice to say, yes, you should make something up.

A person’s ego and self worth is a measurable quantity that is increased with praise and decreased with corrections or negative feedback (aka “you suck!”). When this tank of self worth gets too low, your student or mentee will feel worthless, get frustrated and/or start to get resentful at you.

This technique is useful because it allows you to give a correction while minimizing hit to the person’s self worth. In the end allows you to give MORE correction and help them more in the long run, just by phrasing your corrections differently. Another term for this is “complement sandwich” which you may have heard of before.

Another thing to be mindful of however is that you can only give so much feedback at any one time. The ego/self worth tank needs to refill after it’s diminished, and frankly, the person needs to absorb and internalize what you’ve taught them before they are ready for more.

Our teacher would say “it’s better having a mediocre black belt, than having a stellar white belt who quits out of frustration” and that’s very true. It’s better for them since a mediocre black belt is FAR SUPERIOR to a stellar white belt and much better able to protect themselves and their loved ones, but also better for the organization, since we are often teaching or mentoring in a “for profit” situation where the person we are trying to help is either a customer or a co-worker which the company is interested in keeping around.

Before wrapping it up, I heard a funny story regarding this topic about a special needs child and his or her parents. Just like everyone else, this child has a concept of self worth, however being disabled makes it very easy to feel depressed when you realize there are so many things you can’t do that other people can do. It’s difficult too for the parents to help the child feel better about themselves if they really can’t find an area he shines in. One day the parents noticed he loved to use tools and it clicked. They started loosening screws in the house and asking him to tighten them for him. “Jimmy, can you come tighten this screw up for us? You are so good at it!”.

I think that’s a cute story but really shows how we work as humans. Your job as (an effective) teacher, mentor, parent, boss or leader, is to teach whatever you need to teach, correct whatever you need to correct, but also to make sure you do so in a way that is least damaging to the person’s ego and self worth. They feel better about themselves, but you are also more effective at getting the job done. It matters!

So go out there and serve some compliment sandwiches, making sure to be as genuine as possible with your praise!

P.S. Yes people can have over inflated egos and feeding them more is only going to make things worse. That’s the topic of another post (;

## Raytracing Reflection, Refraction, Fresnel, Total Internal Reflection, and Beer’s Law

This post talks about how to render images like the below in real time using ray tracing. Some realism in the images come from reflection and refraction, but the real icing on the cake comes from Fresnel, total internal reflection and Beer’s law. We’ll look at the contributions of these features individually and talk about how to properly combine them for the greatest and most realistic results (:

The renderings come from a shadertoy that accompanies this post: Shadertoy: Reflect Refract TIR Fresnel RayT

My motivation for learning more about this stuff is that I’m starting to make a marble madness inspired game, and I’m planning to do hybrid rendering between rasterization and ray based techniques.

This post will focus more on how to make these features work for you in ray tracing, and less about the reasons for why they work the way they do. This post is more about practical implementations, and less about rigorous explanations.

If you have any questions, comments, corrections or additions, feel free to leave in the comments section at the bottom, or feel free to hit me up on twitter at @Atrix256.

This post is assumes you know how to do basic raytracing with ambient, diffuse and specular lighting, like the image below, which we are going to start with:

# Reflection

The first thing to talk about is reflection. More specifically we are going to be talking about SPECULAR reflection.

Specular is defined by the dictionary to mean “of, relating to, or having the properties of a mirror.”, so what we normally think of as reflection (like from a mirror) is in fact specular reflection.

There is non specular reflection, aka diffuse reflection, which just means that light is reflected off of a surface in a non mirror like way. This is accomplished with diffuse lighting where commonly we dot product the normal of a surface with the direction from the surface to the light to know how much to light that point on the surface.

If specular reflection is how mirrors reflect, and diffuse reflection is how regular diffuse surfaces work, then you might wonder what specular lighting is all about.

A specular highlight is actually just a cheap approximation to doing a mirror like specular reflection of a light source, so it is a cheap kind of specular reflection.

Let’s talk about how to do real mirror like specular reflection in a ray tracer.

When light hits a surface, some amount of the light will be reflected, and some amount of the light will be transmitted into the object. Transmitted light is used for the diffuse shading.

As you might imagine, the amount of light reflected plus the amount of light transmitted must equal the total amount of light that hit the surface. (note that some transmitted energy may be absorbed and given off as heat, or the object may be glowing, so may give off more light than received but let’s ignore that stuff for now.)

A common way to deal with this is to define a reflectivity of a surface as the amount of light it reflects in percent and use 100% minus that amount as the transmitted amount.

You might say that 2% of light that hits a surface reflects. That means that 98% of the light that hits a surface is transmitted and used for the diffuse shading.

When shading a point on the surface, you would calculate both the reflected color at that point on the surface, and the diffuse shaded color at that point, but you multiply the reflected color by 0.02 and the diffuse shaded color by 0.98 and add them together. Doing this gives a result like the below:

The higher the reflectivity percent, the more reflection you get, and the less diffuse shading you get. Turning down reflectivity has the opposite effect.

How do you calculate the reflected color of a point on a surface though?

You simply calculate the ray that reflected off of the surface, and calculate the color of what that ray hit.

To calculate a reflected ray, you need to know the direction that the ray was traveling when it hit the surface (The incident direction), you need to know the location that the ray hit the surface (the surface location), and you need to know the normal of the surface at the intersection point.

If you have those things you can calculate the reflected ray direction as this:

$ReflectRayLocation = SurfaceLocation \\ ReflectRayDirection = IncidentDirection - 2 * SurfaceNormal * DotProduct(IncidentDirection, SurfaceNormal)$

Note that in hlsl and glsl there is a function called “reflect” which takes the incident direction, and the surface normal, and returns you the reflected direction.

Doing the above is mathematically correct, but if you then try to raytrace from that ray, you may hit the same object you are reflecting off. One way to fight that problem is to push the ray positin a small amount away from the surface to make sure it misses it. You can do that for example like this:

$ReflectRayLocation = ReflectRayLocation + ReflectRayDirection * 0.01$

There are other ways to make sure you don’t hit the same object again, but pushing the ray away from the object a small amount really does work pretty nicely in practice.

# Refraction

I mentioned in the last section that whatever light wasn’t reflected when it hit a surface was transmitted. I also said that the transmitted light was used for diffuse shading.

In reality, it’s passed through the “bidirectional scattering distribution function” aka BSDF. You may have heard the term “bidirectional reflection distribution function” aka BRDF. A BRDF only deals with the hemisphere (half a sphere) that surrounds the surface normal. The BSDF on the other hand deals with the full sphere surrounding a surface normal so BRDF is a subset of what is possible with the BSDF.

Because the BSDF deals with an entire sphere, that means that it can handle reflection (specular and non specular) like the BRDF can, but it can also deal with transparency and refraction, where some of the light travels THROUGH an object.

In a path tracer where everything is physically accurate and mathematically precise, we would be interested in dealing with a BSDF and integrating over the sphere, but since we are working with a ray tracer, our physical accuracy needs are a lot lower – we only want a result that looks plausible.

What we are going to do for our transparency is calculate a direction that the ray is going to travel through the object, ray trace that ray to get a color, and use the transmitted light (the portion of light that isn’t reflected) as a multiplier of that color, that we add to the reflected amount of light.

If we have an object with 10% reflectivity, and 90% transmittance, but use that transmitted light for transparency, we’ll have a rendering like below:

Now that we have transparency, let’s talk about refraction. Refraction is a physical phenomenon where light bends (“changes speed” i guess but that sounds a bit suspicious for light), as it hits a boundary between two different surfaces.

Every material has a refractive index, and in fact, may have different refractive indices per light frequency. For our purposes, we’ll assume surfaces have the same refractive index for every frequency of light. There’s a list of refractive indices for a lot of different materials here: Index of refraction

How a ray bends when it refracts depends on the ration of the refractive index that it’s leaving to the refractive index that it’s entering. AKA outside/inside.

HLSL and GLSL have a function called refract which take the incident vector, the surface normal vector, and that ration of refractive indices, and return the refracted ray.

When you do a raytrace down the refracted ray, you will have the same problem as when tracing the reflected ray, that you may hit the same surface you just hit again erroneously. To help that, you once again just move the ray slightly away from the surface.

$RefractRayLocation = RefractRayLocation + RefractRayDirection * 0.01$

Here is a rendering where the sphere has 10% reflectivity, 90% transmittance, an air refractive index of 1.0, and a refractive index of 1.125 for the sphere. You can see how the light bends as it goes through the object and looks pretty neat!

# Fresnel

There is an interesting property in our world: If you look at something straight on, it’s the least reflective it will be. If you turn it nearly 90 degrees where it’s nearly edge on, it will be the most reflective it can be. Many surfaces will become almost perfectly reflective when you view them almost edge on. Go try it out with a wall in your house or a glass or other things.

Weird huh? That’s called Fresnel, and is something we can also make use of in ray tracing to get a more realistic image.

Instead of just always using the reflectivity amount of the surface, we use the Fresnel equation to figure out how much reflectivity an object should have based on the view angle versus the surface normal, so that when it’s more edge on it becomes more reflective. At minimum the reflectivity will be the reflectivity of the surface, and at maximum the reflectivity will be 100%.

The amount we transmit for either refraction or diffuse will be 100% minus however much percentage is reflective.

Here is the image showing normal reflection again:

Here is the image with Fresnel:

It looks quite a bit better with fresnel doesn’t it?!

Here’s a GLSL function of Schlick’s Fresnel approximation function. Notice that it takes the surface normal and incident vector, as well as the refractive index being left (n1) and the refractive index being entered (n2):

float FresnelReflectAmount (float n1, float n2, vec3 normal, vec3 incident)
{
// Schlick aproximation
float r0 = (n1-n2) / (n1+n2);
r0 *= r0;
float cosX = -dot(normal, incident);
if (n1 > n2)
{
float n = n1/n2;
float sinT2 = n*n*(1.0-cosX*cosX);
// Total internal reflection
if (sinT2 > 1.0)
return 1.0;
cosX = sqrt(1.0-sinT2);
}
float x = 1.0-cosX;
float ret = r0+(1.0-r0)*x*x*x*x*x;

// adjust reflect multiplier for object reflectivity
ret = (OBJECT_REFLECTIVITY + (1.0-OBJECT_REFLECTIVITY) * ret);
return ret;
}


Our tale of reflection is complete, so let’s go back to refraction / transparency and finish up the last two items.

# Total Internal Reflection

The way that the Fresnel equation works, it’s possible that when moving from a material with a higher index of refraction to a lower one, that the amount of refraction can actually drop to zero percent. In this case, the light doesn’t exit the higher refractive index object and instead ONLY reflects back into the object, because the reflective amount becomes 100%.

When this happens, it’s called “Total Internal Reflection” for hopefully obvious reasons (:

There isn’t a whole lot to say about this, because you can see that this is even accounted for in the GLSL Fresnel function from the last section.

However, if you are ever under water in a swimming pool and look up to see the water surface looking like a mirror, that is total internal reflection in action.

You can also see it in the render below, where you can see reflections on the inside of the walls of the object, especially on the bottom (floor) of the object:

# Beer’s Law

Beer’s law is the last item to talk about, and relates to transparent surfaces. Beer’s law deals with light being absorbed over distance as it travels through a material.

Beer’s law is why a thin piece of jello is mostly colorless, but a thicker piece of jello has a much richer and deeper color.

Here’s a cube with beer’s law absorbing red and green light over distance. You should notice that where the light travels less distance through the cube that it’s not as blue, because not as much red and green light has been absorbed:

To apply beer’s law, you first figure out how long a ray has traveled through the absorbing material (by tracing the ray inside the object to see where it hits the back side). You then do this calculation:

vec3 color = RayTrace(rayPos, rayDir);
vec3 absorb = exp(-OBJECT_ABSORB * absorbDistance);
color *= absorb;


OBJECT_ABSORB is an RGB value that describes how much of each color channel absorbs over distance. For example, a value of (8.0, 2.0, 0.1) would make red get absorbed the fastest, then green, then blue, so would result in a blueish, slightly green color object.

# Putting it All Together

Now that we have the individual components worked out let’s talk about how to put it together.

Firstly, when a ray hits any surface, you need to use the Fresnel equation to get the multiplier for the reflected and transmitted light.

Next you calculate the reflected and transmitted light by recursively raytracing. The transmitted light is either used for the diffuse shading, the transparency/refracted ray, or some combination of those two (technically, it’s all up to the BSDF we are approximating).

Then, you multiply the reflected light by the reflected amount from the Fresnel equation, and the transmitted amount by 100%-reflectionAmount and add the results together.

(Quick side note, if you have emissive color on the surface aka the object glows, you would also add that in here).

Since raytracing is recursive, you would do this each time a ray intersected with an object. In other words, each intersection causes the ray to split into two rays.

As you can imagine, this can make the rendering quite complex, especially on the GPU where you can’t even do real recursion.

One way to help limit the recursiveness a bit is when you are calculating your reflection and transmittance amounts, you can choose a threshold like say 1% where if the reflection is under 1% it clamps it to 0%, and if it’s greater than 99% it clamps it at 100%. You can then choose not to recurse down a specific ray if the ray’s multiplier is 0. The end result will be that reflections or transmittance rays that don’t contribute much to the end result won’t be followed at all.

If you are willing to sacrifice some visual quality to not have to split your ray into two at each object intersection, you could also figure out if reflection or transmittance has the higher multiplier, and make the ray only follow one of the paths. If you were doing this in a path tracer, you could choose which one to follow randomly, using the multiplier as a weight for the random selection.

The problem in both of these two optimizations is that the multiplier is only half of the information though so may incorrectly choose the less meaningful contribution. The other half of the information is the color of the ray if you followed it. The reason for this is that you might have a low multiplier for a really bright spot (caustics have this problem commonly!), or vice versa you may have a high multiplier for a dull featureless spot. With path tracing, if you take enough samples, it all washes out in the averages, and with ray tracing, maybe you accept that it will do the wrong thing sometimes, to stay a real time algorithm, but it’s important to know how this type of choice can fail for you. (Side note, this sort of stuff is what importance sampling in path tracing is all about – trying to make rays follow more meaningful paths to get better results quicker).

When doing real time raytracing you also often have to decide how many times you want to allow a ray to bounce around. In the shadertoy that goes with this post, that parameter is MAX_RAY_BOUNCES and I have it set to 10.

Interestingly, setting it to 1 has no visible impact on the sphere at all, which is a nice improvement. For the box, a value of 3 seems to be the maximum it needs. 3 also seems to be the magic number for the geometric gem type shape.

So, 10 is overkill, but i left it at that in case people play with parameters and change them to values which would require more bounces.

Lastly I wanted to mention that in the scene that I rendered, I did a small “trick” to make it so I didn’t need to do full recursive splitting of rays at each intersection. I did this by making it so the main object in the center of the scene was the only object that had reflection.

This way, I only need to split the ray into two if i hit the main object. Furthermore, when I’m splitting the ray at the main object, the ray that gets the color for the outside world (versus the inside of the object) is a single non recursive ray cast since it can’t hit anything reflective. The result is that at each intersection of the sphere, i do a simple non recursive ray cast for the ray that is going outside of the main object, then i continue the iterative ray on the inside of the object until i run out of bounces.

Doing this causes a recursive process to become an iterative one, which is much friendlier on the gpu.

Below is the final render again from the shadertoy. The parameters are:

• The refractive index of the air is 1.00029
• The refractive index inside the objects are 1.125
• Reflectivity is 1%
• The absorption for beers law is (8.0, 8.0, 3.0)

## Incremental Least Squares Surface and Hyper-Volume Fitting

The last post showed how to fit a $y=f(x)$ equation to a set of 2d data points, using least squares fitting. It allowed you to do this getting only one data point at a time, and still come up with the same answer as if you had all the data points at once, so it was an incremental, or “online” algorithm.

This post generalizes that process to equations of any dimension such as $z=f(x,y)$, $w=f(x,y,z)$ or greater.

Below is an image of a surface that is degree (2,2). This is a screenshot taken from the interactive webgl2 demo I made for this post: Least Squares Surface Fitting

# How Do You Do It?

The two main players from the last post were the ATA matrix and the ATY vector. These two could be incrementally updated as new data points came in, which would allow you to do an incremental (or “online”) least squares fit of a curve.

When working with surfaces and volumes, you have the same things basically. Both the ATA matrix and the ATY vector still exist, but they contain slightly different information – which I’ll explain lower down. However, the ATY vector is renamed, since in the case of a surface it should be called ATZ, and for a volume it should be called ATW. I call it ATV to generalize it, where v stands for value, and represents the last component in a data point, which is the output value we are trying to fit given the input values. The input values are the rest of the components of the data point.

At the end, you once again need to solve the equation $A^TA*c=A^Tv$ to calculate the coefficients (named c) of the equation.

It’s all pretty similar to fitting a curve, but the details change a bit. Let’s work through an example to see how it differs.

# Example: Bilinear Surface Fitting

Let’s fit 4 data points to a bilinear surface, otherwise known as a surface that is linear on each axis, or a surface of degree(1,1):
(0,0,5)
(0,1,3)
(1,0,8)
(1,1,2)

Since we are fitting those data points with a bilinear surface, we are looking for a function that takes in x,y values and gives as output the z coordinate. We want a surface that gives us the closest answer possible (minimizing the sum of the squared difference for each input data point) for the data points we do have, so that we can give it other data points and get z values as output that approximate what we expect to see for that input.

A linear equation looks like this, with coefficients A and B:
$y=Ax+B$

Since we want a bilinear equation this time around, this is the equation we are going to end up with, after solving for the coefficients A,B,C,D:
$y=Axy+Bx+Cy+D$

The first step is to make the A matrix. In the last post, this matrix was made up of powers of the x coordinates. In this post, they are actually going to be made up of the permutation of powers of the x and y coordinates.

Last time the matrix looked like this:
$A = \begin{bmatrix} x_0^0 & x_0^1 & x_0^2 \\ x_1^0 & x_1^1 & x_1^2 \\ x_2^0 & x_2^1 & x_2^2 \\ x_3^0 & x_3^1 & x_3^2 \\ \end{bmatrix}$

This time, the matrix is going to look like this:
$A = \begin{bmatrix} x_0^0y_0^0 & x_0^0y_0^1 & x_0^1y_0^0 & x_0^1y_0^1 \\ x_1^0y_1^0 & x_1^0y_1^1 & x_1^1y_1^0 & x_1^1y_1^1 \\ x_2^0y_2^0 & x_2^0y_2^1 & x_2^1y_2^0 & x_2^1y_2^1 \\ x_3^0y_3^0 & x_3^0y_3^1 & x_3^1y_3^0 & x_3^1y_3^1 \\ \end{bmatrix}$

Simplifying that matrix a bit, it looks like this:
$A = \begin{bmatrix} 1 & y_0 & x_0 & x_0y_0 \\ 1 & y_1 & x_1 & x_1y_1 \\ 1 & y_2 & x_2 & x_2y_2 \\ 1 & y_3 & x_3 & x_3y_3 \\ \end{bmatrix}$

To simplify it even further, there is one row in the A matrix per data point, where the row looks like this:
$\begin{bmatrix} 1 & y & x & xy \\ \end{bmatrix}$

You can see that every permutation of the powers of x and y for each data point is present in the matrix.

The A matrix for our data points is this:
$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix}$

Next we need to calculate the ATA matrix by multiplying the transpose of that matrix, by that matrix.

$A^TA = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} * \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix} = \begin{bmatrix} 4 & 2 & 2 & 1 \\ 2 & 2 & 1 & 1 \\ 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix}$

Taking the inverse of that matrix we get this:

$(A^TA)^{-1} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \\ \end{bmatrix}$

Next we need to calculate the ATV vector (formerly known as ATY). We calculate that by multiplying the transpose of the A matrix by the Z values:

$A^TV = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} * \begin{bmatrix} 5 \\ 3 \\ 8 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 18 \\ 5 \\ 10 \\ 2 \\ \end{bmatrix}$

Lastly we multiply the inversed ATA matrix by the ATV vector to get our coefficients.

$\begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \\ \end{bmatrix} * \begin{bmatrix} 18 \\ 5 \\ 10 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 5 \\ -2 \\ 3 \\ -4 \\ \end{bmatrix}$

In the last post, the coefficients we got out were in x power order, so the first (top) was for the $x^0$ term, the next was for the $x^1$ term etc.

This time around, the coefficients are in the same order as the permutations of the powers of x and y:
$\begin{bmatrix} 1 & y & x & xy \\ \end{bmatrix}$

That makes our final equation this:
$z = -4xy+3x-2y+5$

If you plug in the (x,y) values from the data set we fit, you’ll see that you get the corresponding z values as output. We perfectly fit the data set!

The process isn’t too different from last post and not too difficult either right?

Let’s see if we can generalize and formalize things a bit.

# Some Observations

Firstly you may be wondering how we come up with the correct permutation of powers of our inputs. It actually doesn’t matter so long as you are consistent. You can have your A matrix rows have the powers in any order, so long as all orders are present, and you use the same order in all operations.

Regarding storage sizes needed, the storage of surfaces and (hyper) volumes are a bit different and generally larger than curves.

To see how, let’s look at the powers of the ATA matrix of a bilinear surface, using the ordering of powers that we used in the example:

$\begin{bmatrix} x^0y^0 & x^0y^1 & x^1y^0 & x^1y^1 \\ x^0y^1 & x^0y^2 & x^1y^1 & x^1y^2 \\ x^1y^0 & x^1y^1 & x^2y^0 & x^2y^1 \\ x^1y^1 & x^1y^2 & x^2y^1 & x^2y^2 \\ \end{bmatrix}$

Let’s rewrite it as just the powers:

$\begin{bmatrix} 00 & 01 & 10 & 11 \\ 01 & 02 & 11 & 12 \\ 10 & 11 & 20 & 21 \\ 11 & 12 & 21 & 22 \\ \end{bmatrix}$

And the permutation we used as just powers to help us find the pattern in the powers of x and y in the ATA matrix:
$\begin{bmatrix} 00 & 01 & 10 & 11 \\ \end{bmatrix}$

Can you find the pattern of the powers used at the different spots in the ATA matrix?

I had to stare at it for a while before I figured it out but it’s this: For the i,j location in the ATA matrix, the powers of x and y are the powers of x and y in the i permutation added to the powers of x and y in the j permutation.

For example, $A^TA_{0,2}$ has xy powers of 10. Permutation 0 has powers of 0,0 and permutation 2 has powers of 1,0, so we add those together to get powers 1,0.

Another example, $A^TA_{2,3}$ has xy powers of 21. Permutation 2 has powers of 1,0 and permutation 3 has powers 1,1. Adding those together we get 2,1 which is correct.

That’s a bit more complex than last post, not too much more difficult to construct the ATA matrix directly – and also construct it incrementally as new data points come in!

How many unique values are there in the ATA matrix though? We need to know this to know how much storage we need.

In the last post, we needed (degree+1)*2–1 values to store the unique ATA matrix values. That can also be written as degree*2+1.

It turns out that when generalizing this to surfaces and volumes, that we need to take the product of that for each axis.

For instance, a surface has ((degreeX*2)+1)*((degreeY*2)+1) unique values. A volume has ((degreeX*2)+1)*((degreeY*2)+1)*((degreeZ*2)+1) unique values.

The pattern continues for higher dimensions, as well as lower, since you can see how in the curve case, it’s the same formula as it was before.

For the same ATA matrix size, a surface has more unique values than a curve does.

As far as what those values actually are, they are the full permutations of the powers of a surface (or hyper volume) that is one degree higher on each axis. For a bilinear surface, that means the 9 permutations of x and y for powers 0,1 and 2:
$x^0y^0,x^0y^1,x^0y^2,x^1y^0,x^1y^1,x^1y^2,x^2y^0,x^2y^1,x^2y^2$
Or simplified:
$1,y,y^2,x,xy,xy^2,x^2,x^2y,x^2y^2$

For the bilinear case, The ATV vector is the sums of the permutations of x,y multiplied by z, for every data point. In other words, you add this to ATV for each data point:
$\begin{bmatrix} z & yz & xz & xyz \\ \end{bmatrix}$

How much storage space do we need in general for the ATV vector then? it’s the product of (degree+1) for each axis.

For instance, a surface has (degreeX+1)*(degreeY+1) values in ATV, and a volume has (degreeX+1)*(degreeY+1)*(degreeZ+1).

You may also be wondering how many data points are required minimum to fit a curve, surface or hypervolume to a data set. The answer is that you need as many data points as there are terms in the polynomial. We are trying to solve for the polynomial coefficients, so there are as many unknowns as there are polynomial terms.

How many polynomial terms are there? There are as many terms as there are permutations of the axes powers involved. In other words, the size of ATV is also the minimum number of points you need to fit a curve, surface, or hypervolume to a data set.

# Measuring Quality of a Fit

You are probably wondering if there’s a way to calculate how good of a fit you have for a given data set. It turns out that there are a few ways to calculate a value for this.

The value I use in the code below and in the demos is called $R^2$ or residue squared.

First you calculate the average (mean) output value from the input data set.

Then you calculate SSTot which is the sum of the square of the mean subtracted from each input point’s output value. Pseudo code:

SSTot = 0;
for (point p in points)
SSTot += (p.out - mean)^2;


You then calculate SSRes which is the sum of the square of the fitted function evaluated at a point, subtracted from each input points’ output value. Pseudo code:

SSRes= 0;
for (point p in points)
SSRes += (p.out - f(p.in))^2;


The final value for R^2 is 1-SSRes/SSTot.

The value is nice because it’s unitless, and since SSRes and SSTot is a sum of squares, SSRes/SSTot is basically the value that the fitting algorithm minimizes. The value is subtracted from 1 so that it’s a fit quality metric. A value of 0 is a bad fit, and a value of 1 is a good fit and generally it will be between those values.

# Example Code

Here is a run from the sample code:

And here is the source code:

#include <stdio.h>
#include <array>

#define FILTER_ZERO_COEFFICIENTS true // if false, will show terms which have a coefficient of 0

//====================================================================
template<size_t N>
using TVector = std::array<float, N>;

template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

//====================================================================
// Specify a degree per axis.
// 1 = linear, 2 = quadratic, etc
template <size_t... DEGREES>
class COnlineLeastSquaresFitter
{
public:
COnlineLeastSquaresFitter ()
{
// initialize our sums to zero
std::fill(m_SummedPowers.begin(), m_SummedPowers.end(), 0.0f);
std::fill(m_SummedPowersTimesValues.begin(), m_SummedPowersTimesValues.end(), 0.0f);
}

// Calculate how many summed powers we need.
// Product of degree*2+1 for each axis.
template <class T>
constexpr static size_t NumSummedPowers(T degree)
{
return degree * 2 + 1;
}
template <class T, class... DEGREES>
constexpr static size_t NumSummedPowers(T first, DEGREES... degrees)
{
return NumSummedPowers(first) * NumSummedPowers(degrees...);
}

// Calculate how many coefficients we have for our equation.
// Product of degree+1 for each axis.
template <class T>
constexpr static size_t NumCoefficients(T degree)
{
return (degree + 1);
}
template <class T, class... DEGREES>
constexpr static size_t NumCoefficients(T first, DEGREES... degrees)
{
return NumCoefficients(first) * NumCoefficients(degrees...);
}

// Helper function to get degree of specific axis
static size_t Degree (size_t axisIndex)
{
static const std::array<size_t, c_dimension-1> c_degrees = { DEGREES... };
return c_degrees[axisIndex];
}

// static const values
static const size_t c_dimension = sizeof...(DEGREES) + 1;
static const size_t c_numCoefficients = NumCoefficients(DEGREES...);
static const size_t c_numSummedPowers = NumSummedPowers(DEGREES...);

// Typedefs
typedef TVector<c_numCoefficients> TCoefficients;
typedef TVector<c_dimension> TDataPoint;

// Function for converting from an index to a specific power permutation
static void IndexToPowers (size_t index, std::array<size_t, c_dimension-1>& powers, size_t maxDegreeMultiply, size_t maxDegreeAdd)
{
for (int i = c_dimension-2; i >= 0; --i)
{
size_t degree = Degree(i) * maxDegreeMultiply + maxDegreeAdd;
powers[i] = index % degree;
index = index / degree;
}
}

// Function for converting from a specific power permuation back into an index
static size_t PowersToIndex (std::array<size_t, c_dimension - 1>& powers, size_t maxDegreeMultiply, size_t maxDegreeAdd)
{
size_t ret = 0;
for (int i = 0; i < c_dimension - 1; ++i)
{
ret *= Degree(i) * maxDegreeMultiply + maxDegreeAdd;
ret += powers[i];
}
return ret;
}

// Add a datapoint to our fitting
{
// Note: It'd be a good idea to memoize the powers and calculate them through repeated
// multiplication, instead of calculating them on demand each time, using std::pow.

// add the summed powers of the input values
std::array<size_t, c_dimension-1> powers;
for (size_t i = 0; i < m_SummedPowers.size(); ++i)
{
IndexToPowers(i, powers, 2, 1);
for (size_t j = 0; j < c_dimension - 1; ++j)
}

// add the summed powers of the input value, multiplied by the output value
for (size_t i = 0; i < m_SummedPowersTimesValues.size(); ++i)
{
IndexToPowers(i, powers, 1, 1);
float valueAdd = dataPoint[c_dimension - 1];
for (size_t j = 0; j < c_dimension-1; ++j)
}
}

// Get the coefficients of the equation fit to the points
bool CalculateCoefficients (TCoefficients& coefficients) const
{
// make the ATA matrix
std::array<size_t, c_dimension - 1> powersi;
std::array<size_t, c_dimension - 1> powersj;
std::array<size_t, c_dimension - 1> summedPowers;
TMatrix<c_numCoefficients, c_numCoefficients> ATA;
for (size_t j = 0; j < c_numCoefficients; ++j)
{
IndexToPowers(j, powersj, 1, 1);

for (size_t i = 0; i < c_numCoefficients; ++i)
{
IndexToPowers(i, powersi, 1, 1);

for (size_t k = 0; k < c_dimension - 1; ++k)
summedPowers[k] = powersi[k] + powersj[k];

size_t summedPowersIndex = PowersToIndex(summedPowers, 2, 1);
ATA[j][i] = m_SummedPowers[summedPowersIndex];
}
}

// solve: ATA * coefficients = m_SummedPowers
// for the coefficients vector, using Gaussian elimination.
coefficients = m_SummedPowersTimesValues;
for (size_t i = 0; i < c_numCoefficients; ++i)
{
for (size_t j = 0; j < c_numCoefficients; ++j)
{
if (ATA[i][i] == 0.0f)
return false;

float c = ((i == j) - ATA[j][i]) / ATA[i][i];
coefficients[j] += c*coefficients[i];
for (size_t k = 0; k < c_numCoefficients; ++k)
ATA[j][k] += c*ATA[i][k];
}
}

// Note: this is the old, "bad" way to solve the equation using matrix inversion.
// It's a worse choice for larger matrices, and surfaces and volumes use larger matrices than curves in general.
/*
// Inverse the ATA matrix
TMatrix<c_numCoefficients, c_numCoefficients> ATAInverse;
if (!InvertMatrix(ATA, ATAInverse))
return false;

// calculate the coefficients
for (size_t i = 0; i < c_numCoefficients; ++i)
coefficients[i] = DotProduct(ATAInverse[i], m_SummedPowersTimesValues);
*/

return true;
}

private:
//Storage Requirements:
// Summed Powers = Product of degree*2+1 for each axis.
// Summed Powers Times Values = Product of degree+1 for each axis.
TVector<c_numSummedPowers>		m_SummedPowers;
TVector<c_numCoefficients>		m_SummedPowersTimesValues;
};

//====================================================================
char AxisIndexToLetter (size_t axisIndex)
{
// x,y,z,w,v,u,t,....
if (axisIndex < 3)
return 'x' + char(axisIndex);
else
return 'x' + 2 - char(axisIndex);
}

//====================================================================
template <class T, size_t M, size_t N>
float EvaluateFunction (const T& fitter, const TVector<M>& dataPoint, const TVector<N>& coefficients)
{
float ret = 0.0f;
for (size_t i = 0; i < coefficients.size(); ++i)
{
float term = coefficients[i];

// then the powers of the input variables
std::array<size_t, T::c_dimension - 1> powers;
fitter.IndexToPowers(i, powers, 1, 1);
for (size_t j = 0; j < powers.size(); ++j)
term *= (float)std::pow(dataPoint[j], powers[j]);

// add this term to our return value
ret += term;
}
return ret;
}

//====================================================================
template <size_t... DEGREES>
void DoTest (const std::initializer_list<TVector<sizeof...(DEGREES)+1>>& data)
{
// say what we are are going to do
printf("Fitting a function of degree (");
for (size_t i = 0; i < COnlineLeastSquaresFitter<DEGREES...>::c_dimension - 1; ++i)
{
if (i > 0)
printf(",");
printf("%zi", COnlineLeastSquaresFitter<DEGREES...>::Degree(i));
}
printf(") to %zi data points: \n", data.size());

// show input data points
for (const COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
{
printf("  (");
for (size_t i = 0; i < dataPoint.size(); ++i)
{
if (i > 0)
printf(", ");
printf("%0.2f", dataPoint[i]);
}
printf(")\n");
}

// fit data
COnlineLeastSquaresFitter<DEGREES...> fitter;
for (const COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)

// calculate coefficients if we can
COnlineLeastSquaresFitter<DEGREES...>::TCoefficients coefficients;
bool success = fitter.CalculateCoefficients(coefficients);
if (!success)
{
printf("Could not calculate coefficients!\n\n");
return;
}

// print the polynomial
bool firstTerm = true;
printf("%c = ", AxisIndexToLetter(sizeof...(DEGREES)));
bool showedATerm = false;
for (int i = (int)coefficients.size() - 1; i >= 0; --i)
{
// don't show zero terms
if (FILTER_ZERO_COEFFICIENTS && std::abs(coefficients[i]) < 0.00001f)
continue;

showedATerm = true;

// show an add or subtract between terms
float coefficient = coefficients[i];
if (firstTerm)
firstTerm = false;
else if (coefficient >= 0.0f)
printf(" + ");
else
{
coefficient *= -1.0f;
printf(" - ");
}

printf("%0.2f", coefficient);

std::array<size_t, COnlineLeastSquaresFitter<DEGREES...>::c_dimension - 1> powers;
fitter.IndexToPowers(i, powers, 1, 1);

for (size_t j = 0; j < powers.size(); ++j)
{
if (powers[j] > 0)
printf("%c", AxisIndexToLetter(j));
if (powers[j] > 1)
printf("^%zi", powers[j]);
}
}
if (!showedATerm)
printf("0");
printf("\n");

// Calculate and show R^2 value.
float rSquared = 1.0f;
if (data.size() > 0)
{
float mean = 0.0f;
for (const COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
mean += dataPoint[sizeof...(DEGREES)];
mean /= data.size();
float SSTot = 0.0f;
float SSRes = 0.0f;
for (const COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
{
float value = dataPoint[sizeof...(DEGREES)] - mean;
SSTot += value*value;

value = dataPoint[sizeof...(DEGREES)] - EvaluateFunction(fitter, dataPoint, coefficients);
SSRes += value*value;
}
if (SSTot != 0.0f)
rSquared = 1.0f - SSRes / SSTot;
}
printf("R^2 = %0.4f\n\n", rSquared);
}

//====================================================================
int main (int argc, char **argv)
{
// bilinear - 4 data points
DoTest<1, 1>(
{
TVector<3>{ 0.0f, 0.0f, 5.0f },
TVector<3>{ 0.0f, 1.0f, 3.0f },
TVector<3>{ 1.0f, 0.0f, 8.0f },
TVector<3>{ 1.0f, 1.0f, 2.0f },
}
);

// biquadratic - 9 data points
DoTest<2, 2>(
{
TVector<3>{ 0.0f, 0.0f, 8.0f },
TVector<3>{ 0.0f, 1.0f, 4.0f },
TVector<3>{ 0.0f, 2.0f, 6.0f },
TVector<3>{ 1.0f, 0.0f, 5.0f },
TVector<3>{ 1.0f, 1.0f, 2.0f },
TVector<3>{ 1.0f, 2.0f, 1.0f },
TVector<3>{ 2.0f, 0.0f, 7.0f },
TVector<3>{ 2.0f, 1.0f, 9.0f },
TVector<3>{ 2.0f, 2.5f, 12.0f },
}
);

// trilinear - 8 data points
DoTest<1, 1, 1>(
{
TVector<4>{ 0.0f, 0.0f, 0.0f, 8.0f },
TVector<4>{ 0.0f, 0.0f, 1.0f, 4.0f },
TVector<4>{ 0.0f, 1.0f, 0.0f, 6.0f },
TVector<4>{ 0.0f, 1.0f, 1.0f, 5.0f },
TVector<4>{ 1.0f, 0.0f, 0.0f, 2.0f },
TVector<4>{ 1.0f, 0.0f, 1.0f, 1.0f },
TVector<4>{ 1.0f, 1.0f, 0.0f, 7.0f },
TVector<4>{ 1.0f, 1.0f, 1.0f, 9.0f },
}
);

// trilinear - 9 data points
DoTest<1, 1, 1>(
{
TVector<4>{ 0.0f, 0.0f, 0.0f, 8.0f },
TVector<4>{ 0.0f, 0.0f, 1.0f, 4.0f },
TVector<4>{ 0.0f, 1.0f, 0.0f, 6.0f },
TVector<4>{ 0.0f, 1.0f, 1.0f, 5.0f },
TVector<4>{ 1.0f, 0.0f, 0.0f, 2.0f },
TVector<4>{ 1.0f, 0.0f, 1.0f, 1.0f },
TVector<4>{ 1.0f, 1.0f, 0.0f, 7.0f },
TVector<4>{ 1.0f, 1.0f, 1.0f, 9.0f },
TVector<4>{ 0.5f, 0.5f, 0.5f, 12.0f },
}
);

// Linear - 2 data points
DoTest<1>(
{
TVector<2>{ 1.0f, 2.0f },
TVector<2>{ 2.0f, 4.0f },
}
);

// Quadratic - 4 data points
DoTest<2>(
{
TVector<2>{ 1.0f, 5.0f },
TVector<2>{ 2.0f, 16.0f },
TVector<2>{ 3.0f, 31.0f },
TVector<2>{ 4.0f, 16.0f },
}
);

// Cubic - 4 data points
DoTest<3>(
{
TVector<2>{ 1.0f, 5.0f },
TVector<2>{ 2.0f, 16.0f },
TVector<2>{ 3.0f, 31.0f },
TVector<2>{ 4.0f, 16.0f },
}
);

system("pause");
return 0;
}


# Closing

The next logical step here for me would be to figure out how to break the equation for a surface or hypervolume up into multiple equations, like you’d have with a tensor product surface/hypervolume equation. It would also be interesting to see how to convert from these multidimensional polynomials to multidimensional Bernstein basis functions, which are otherwise known as Bezier rectangles (and Bezier hypercubes i guess).

The last post inverted the ATA matrix and multiplied by ATY to get the coefficients. Thanks to some feedback on reddit, I found out that is NOT how you want to solve this sort of equation. I ended up going with Gaussian elimination for this post which is more numerically robust while also being less computation to calculate. There are other options out there too that may be even better choices. I’ve found out that in general, if you are inverting a matrix in code, or even just using an inverted matrix that has already been given to you, you are probably doing it wrong. You can read more about that here: John D. Cook: Don’t invert that matrix.

I didn’t go over what to do if you don’t have enough data points because if you find yourself in that situation, you can either decrease the degree of one of the axes, or you could remove and axis completely if you wanted to. It’s situational and ambiguous what parameter to decrease when you don’t have enough data points to fit a specific curve or hypervolume, but it’s still possible to decay the fit to a lower degree or dimension if you hit this situation, because you will already have all the values you need in the ATA matrix values and the ATV vector. I leave that to you to decide how to handle it in your own usage cases. Something interesting to note is that ATA[0][0] is STILL the count of how many data points you have, so you can use this value to know how much you need to decay your fit to be able to handle the data set.

In the WebGL2 demo I mention, I use a finite difference method to calculate the normals of the surface, however since the surface is described by a polynomial, it’d be trivial to calculate the coefficients for the equations that described the partial derivatives of the surface for each axis and use those instead.

I also wanted to mention that in the case of surfaces and hypervolumes it’s still possible to get an imperfect fit to your data set, even though you may give the exact minimum required number of control points. The reason for this is that not all axes are necesarily created equal. If you have a surface of degree (1,2) it’s linear on the x axis, but quadratic on the y axis, and requires a minimum of 6 data points to be able to fit a data set. As you can imagine, it’s possible to give data points such that the data points ARE NOT LINEAR on the x axis. When this happens, the surface will not be a perfect fit.

Lastly, you may be wondering how to fit data sets where there is more than one output value, like an equation of the form $(z,w)=f(x,y)$.

I’m not aware of any ways to least square fit that as a whole, but apparently a common practice is to fit one equation to z and another to w and treat them independently. There is a math stack exchange question about that here: Math Stack Exchange: Least square fitting multiple values

Here is the webgl demo that goes with this post again:
Least Squares Surface Fitting

Thanks for reading, and please speak up if you have anything to add or correct, or a comment to make!

## Incremental Least Squares Curve Fitting

This Post In Short:

• Fit a curve of degree N to a data set, getting data points 1 at a time.
• Storage Required: 3*N+2 values.
• Update Complexity: roughly 3*N+2 additions and multiplies.
• Finalize Complexity: Solving Ax=b where A is an (N+1)x(N+1) matrix and b is a known vector. (Sample code inverts A matrix and multiplies by b, Gaussian elimination is better though).
• Simple C++ code and HTML5 demo at bottom!

I was recently reading a post from a buddy on OIT or “Order Independent Transparency” which is an open problem in graphics:
Fourier series based OIT and why it won’t work

In the article he talks about trying to approximate a function per pixel and shows the details of some methods he tried. One of the difficulties with the problem is that during a render you can get any number of triangles affecting a specific pixel, but you need a fixed and bounded size amount of storage per pixel for those variable numbers of data points.

That made me wonder: Is there an algorithm that can approximate a data set with a function, getting only one data point at a time, and end up with a decent approximation?

It turns out that there is one, at least one that I am happy with: Incremental Least Squares Curve Fitting.

While this perhaps doesn’t address all the problems that need addressing for OIT specifically, I think this is a great technique for programming in general, and I’m betting it still has it’s uses in graphics, for other times when you want to approximate a data set per pixel.

We’ll work through a math oriented way to do it, and then we’ll convert it into an equivalent and simpler programmer friendly version.

At the bottom of the post is some simple C++ that implements everything we talk about and the image below is a screenshot of an an interactive HTML5 demo I made: Least Squares Curve Fitting

# Mathy Version

Math Stack Exchange: Creating a function incrementally

I have to admit, I’m not so great with matrices outside of the typical graphics/gamedev usage cases of transormation and related, so it took me a few days to work through it and understand it all. If reading that answer made your eyes go blurry, give my explanation a shot. I’m hoping I gave step by step details enough such that you too can understand what the heck he was talking about. If not, let me know where you got lost and I can explain better and update the post.

The first thing we need to do is figure out what degree of a function we want to approximate our data with. For our example we’ll pick a degree 2 function, also known as a quadratic function. That means that when we are done we will get out a function of the form below:

$y=ax^2+bx+c$

We will give data points to the equation and it will calculate the values of a,b and c that approximate our function by minimizing the sum of the squared distance from each point to the curve.

We’ll deal with regular least squared fitting before moving onto incremental, so here’s the data set we’ll be fitting our quadratic curve to:

$(1,5),(2,16),(3,31),(4,50)$

The x values in my data set start at 1 and count up by 1, but that is not a requirement. You can use whatever x and y values you want to fit a curve to.

Next we need to calculate the matrix $A$, where $A_{jk} = x_j^k$ and the matrix has NumDataPoints rows and Degree+1 columns. It looks like the below for a quadratic curve fitting 4 data points:

$A = \begin{bmatrix} x_0^0 & x_0^1 & x_0^2 \\ x_1^0 & x_1^1 & x_1^2 \\ x_2^0 & x_2^1 & x_2^2 \\ x_3^0 & x_3^1 & x_3^2 \\ \end{bmatrix}$

When we plug in our specific x values we get this:

$A = \begin{bmatrix} 1^0 & 1^1 & 1^2 \\ 2^0 & 2^1 & 2^2 \\ 3^0 & 3^1 & 3^2 \\ 4^0 & 4^1 & 4^2 \\ \end{bmatrix}$

Calculating it out we get this:

$A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ \end{bmatrix}$

Next we need to calculate the matrix $A^TA$, which we do below by multiplying the transpose of A by A:

$A^TA = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \\ 1 & 4 & 9 & 16 \\ \end{bmatrix} * \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ \end{bmatrix} = \begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \\ \end{bmatrix}$

Next we need to find the inverse of that matrix to get $(A^TA)^{-1}$. The inverse is:

$(A^TA)^{-1} = \begin{bmatrix} 31/4 & -27/4 & 5/4 \\ -27/4 & 129/20 & -5/4 \\ 5/4 & -5/4 & 1/4 \\ \end{bmatrix}$

The next thing we need to calculate is $A^TY$, which is the transpose of A multiplied by all of the Y values of our data:

$A^TY = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \\ 1 & 4 & 9 & 16 \\ \end{bmatrix} * \begin{bmatrix} 5 \\ 16 \\ 31 \\ 50 \\ \end{bmatrix} = \begin{bmatrix} 102 & 330 & 1148 \\ \end{bmatrix}$

And finally, to calculate the coefficients of our quadratic function, we need to calculate $(A^TA)^{-1}*A^TY$:

$(A^TA)^{-1}*A^TY = \begin{bmatrix} 31/4 & -27/4 & 5/4 \\ -27/4 & 129/20 & -5/4 \\ 5/4 & -5/4 & 1/4 \\ \end{bmatrix} * \begin{bmatrix} 102 \\ 330 \\ 1148 \\ \end{bmatrix} = \begin{bmatrix} -2 & 5 & 2 \\ \end{bmatrix}$

Those coefficients are listed in power order of x, so the first value -2 is the coefficient for x^0, 5 is the coefficient for x^1 and 2 is the coefficient for x^2. That gives us the equation:

$y=2x^2+5x-2$

If you plug in the x values from our data set, you’ll find that this curve perfectly fits all 4 data points.

It won’t always be (and usually won’t be) that a resulting curve matches the input set for all values. It just so happened that this time it does. The only guarantee you’ll get when fitting a curve to the data points is that the squared distance of the point to the curve (distance on the Y axis only, so vertical distance), is minimized for all data points.

Now that we’ve worked through the math, let’s make some observations and make it more programmer friendly.

# Making it Programmer Friendly

Let’s look at the $A^TA$ matrix again:

$\begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \\ \end{bmatrix}$

One thing you probably noticed right away is that it’s symmetric across the diagonal. Another thing you may have noticed is that there are only 5 unique values in that matrix.

As it turns out, those 5 values are just the sum of the x values, when those x values are raised to increasing powers.

• If you take all x values of our data set, raise them to the 0th power and sum the results, you get 4.
• If you take all x values of our data set, raise them to the 1st power and sum the results, you get 10.
• If you take all x values of our data set, raise them to the 2nd power and sum the results, you get 30.
• If you take all x values of our data set, raise them to the 3rd power and sum the results, you get 100.
• If you take all x values of our data set, raise them to the 4th power and sum the results, you get 354.

Further more, the power of the x values in each part of the matrix is the zero based x axis index plus the zero based y axis index. Check out what i mean below, which shows which power the x values are taken to before being summed for each location in the matrix:

$\begin{bmatrix} 0 & 1 & 2 \\ 1 & 2 & 3 \\ 2 & 3 & 4 \\ \end{bmatrix}$

That is interesting for two reasons…

1. This tells us that we only really need to store the 5 unique values, and that we can reconstruct the full matrix later when it’s time to calculate the coefficients.
2. It also tells us that if we’ve fit a curve to some data points, but then want to add a new data point, that we can just raise the x value of our new data point to the different powers and add it into these 5 values we already have stored. In other words, the $A^TA$ matrix can be incrementally adjusted as new data comes in.

This generalizes beyond quadratic functions too luckily. If you are fitting your data points with a degree N curve, the $A^TA$ matrix will have N+1 rows, and N+1 columns, but will only have (N+1)*2-1 unique values stored in it. Those values will be the sum of the x values taken from the 0th power up to the (N+1)*2-2th power.

As a concrete example, a cubic fit will have an $A^TA$ array that is 4×4, which will only have 7 unique values stored in it. Those values will be the x values raised to the 0th power and summed, all the way up to the x values raised to the 6th power and summed.

So, the $A^TA$ matrix has a fixed storage amount of (degree+1)*2 – 1 values, and it can be incrementally updated.

That is great, but there is another value we need to look at too, which is the $A^TY$ vector. Let’s see that again:

$\begin{bmatrix} 102 & 330 & 1148 \\ \end{bmatrix}$

There are some patterns to this vector too luckily. You may have noticed that the first entry is the sum of the Y values from our data set. It’s actually the sum of the y values multiplied by the x values raised to the 0th power.

The next number is the sum of the y values multiplied by the x values raised to the 1st power, and so on.

To generalize it, each entry in that vector is the sum of taking the x from each data point, raising it to the power that is the index in the vector, and multiplying it by the y value.

• Taking each data point’s x value, raising it to the 0th power, multiplying by the y value, and summing the results gives you 102.
• Taking each data point’s x value, raising it to the 1st power, multiplying by the y value, and summing the results gives you 330.
• Taking each data point’s x value, raising it to the 2nd power, multiplying by the y value, and summing the results gives you 1148.

So, this vector is incrementally updatable too. When you get a new data point, for each entry in the vector, you take the x value to the specific power, multiply by y, and add that result to the entry in the vector.

This generalizes for other curve types as well. If you are fitting your data points with a degree N curve, the $A^TY$ vector will have N+1 entries, corresponding to the powers: 0,1,…N.

As a concrete example, a cubic fit will have an $A^TY$ vector of size 4, corresponding to the powers: 0,1,2,3.

Combining the storage needs of the values needed for the $A^TA$ matrix, as well as the values needed for the $A^TY$ vector, the amount of storage we need for a degree N curve fit is 3*N+2 values.

# Algorithm Summarized

Here is a summary of the algorithm:

1. First decide on the degree of the fit you want. Call it N.
2. Ensure you have storage space for 3*N+2 values and initialize them all to zero. These represent the (N+1)*2-1 values needed for the $A^TA$ matrix values, as well as the N+1 values needed for the $A^TY$ vector.
3. For each data point you get, you will need to update both the $A^TA$ matrix values, as well as the $A^TY$ vector valuess. (Note that there are not the same number of values in ATA and ATY!)
• for(i in ATA) ATA[i] += x^i
• for(i in ATY) ATY[i] += x^i*y
4. When it’s time to calculate the coefficients of your polynomial, convert the ATA values back into the $A^TA$ matrix, invert it and multiply that by the $A^TY$ value.

Pretty simple right?

# Not Having Enough Points

When working through the mathier version of this algorithm, you may have noticed that if we were trying to fit using a degree N curve, that we needed N+1 data points at minimum for the math to even be able to happen.

So, you might ask, what happens in the real world, such as in a pixel shader, where we decide to do a cubic fit, but end up only getting 1 data point, instead of the 4 minimum that we need?

Well, first off, if you use the programmer friendly method of incrementally updating ATA and ATY, you’ll end up with an uninvertible matrix (0 determinant), but that doesn’t really help us any besides telling us when we don’t have enough data.

There is something pretty awesome hiding here though. Let’s look at the ATA matrix and ATY values from our quadratic example again.

$A^TA = \begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \\ \end{bmatrix}$

$A^TY = \begin{bmatrix} 102 & 330 & 1148 \\ \end{bmatrix}$

The above values are for a quadratic fit. What if we wanted a linear fit instead? Well… the upper left 2×2 matrix in ATA is the ATA matrix for the linear fit! Also, the first two values in the ATY vector is the ATY vector if we were doing a linear fit.

$A^TA = \begin{bmatrix} 4 & 10 \\ 10 & 30 \\ \end{bmatrix}$

$A^TY = \begin{bmatrix} 102 & 330 \\ \end{bmatrix}$

You can verify that the linear fit above is correct if you want, but let’s take it down another degree, down to approximating the fit with a point. They become scalars instead of matrices and vectors:

$A^TA = 4 \\ A^TY = 102$

If we take the inverse of ATA and multiply it by ATY, we get:

$1/4 * 102 = 25.5$

if you average the Y values of our input data, you’ll find that it is indeed 25.5, so we have verified that it does give us a degree 0 fit.

This is neat and all, but how can we KNOW if we’ve collected enough data or not? Do we just try to invert our ATA matrix, and if it fails, try one degree lower, repeatedly, until we succeed or fail at a degree 0 approximation? Do we maybe instead store a counter to keep track of how many points we have seen?

Luckily no, and maybe you have already put it together. The first value in the ATA array actually TELLS you how many points you have been given. You can use that to decide what degree you are going to have to actually fit the data set to when it’s time to calculate your coefficients, to avoid the uninvertible matrix and still get your data fit.

# Interesting Tid Bits

Something pretty awesome about this algorithm is that it can work in a multithreaded fashion very easily. One way would be to break apart the work into multiple job threads, have them calculate ATA and ATY independently, and then sum them all together on the main thread. Another way to do it would be to let all threads share the same ATA and ATY storage, but to use an atomic add operation to update them.

Going the atomic add route, I think this could be a relatively GPU friendly algorithm. You could use actual atomic operations in your shader code, or you could use alpha blending to add your result in.

Even though we saw it in the last section, I’ll mention it again. If you do a degree 0 curve fit to data points (aka fitting a point to the data), this algorithm is mathematically equivalent to just taking the average y value. The ATA values will have a single value which is the sum of the x values to the 0th degree, so will be the count of how many x items there are. The ATY values will also have only a single value, which will be the sum of the x^0*y values, so will be the sum of the y values. Taking the inverse of our 1×1 ATA matrix will give us one divided by how many items there are, so when we multiply that by the ATA vector which only has one item, it will be the same as if we divided our Y value sum by how many data points we had. So, in a way, this algorithm seems to be some sort of generalization of averaging, which is weird to me.

Another cool thing: if you have the minimum number of data points for your degree (aka degree+1 data points) or fewer, you can actually use the ATA and ATY values to get back your ORIGINAL data points – both the X and the Y values! I’ll leave it as an exercise for you, but if you look at it, you will always have more equations than you do unknowns.

If reconstructing the original data points is important to you, you could also have this algorithm operate in two modes.

Basically, always use the ATA[0] storage space to count the number of data points you’ve been given, since that is it’s entire purpose. You can then use the rest of the storage space as RAW data storage for your 2d input values. As soon as adding another value would cause you to overflow your storage, you could process your data points into the correct format of storing just ATA and ATY values, so that from then on, it was an approximation, instead of explicit point storage.

When decoding those values, you would use the ATA[0] storage space to know whether the rest of the storage contained ATA and ATY values, or if they contained data points. If they contained data points, you would also know how many there were, and where they were in the storage space, using the same logic to read data points out as you used to put them back in – basically like saying that the first data point goes immediately after ATA[0], the second data point after that, etc.

The last neat thing, let’s say that you are in a pixel shader as an exmaple, and that you wanted to approximate 2 different values for each pixel, but let’s say that the X value was always the same for these data sets – maybe you are approximating two different values over depth of the pixel for instance so X of both data points is the depth, but the Y values of the data points are different.

If you find yourself in a situation like this, you don’t actually need to pay the full cost of storage to have a second approximation curve.

Since the ATA values are all based on powers of the x values only, the ATA values would be the same for both of these data sets. You only need to pay the cost of the ATY values for the second curve.

This means that fitting a curve costs an initial 3*degree+2 in storage, but each additional curve only costs degree+1 in storage.

Also, since the ATA storage for a curve of degree N also contains the same values used for a curve of degree N-1, N-2, etc, you don’t have to use the same degree when approximating multiple values using the same storage. Your ATA just has to be large enough to hold the largest degree curve, and then you can have ATY values that are sized to the degree of the curve you want to use to approximate each data set.

This way, if you have limited storage, you could perhaps cubically fit one data set, and then linearly fit another data set where accuracy isn’t as important.

For that example, you would pay 11 values of storage for the cubic fit, and then only 2 more values of storage to have a linear fit of some other data.

# Example Code

There is some example code below that implements the ideas from this post.

The code is meant to be clear and readable firstly, with being a reasonably decent implementation second. If you are using this in a place where you want high precision and/or high speeds, there are likely both macro and micro optimizations and code changes to be made. The biggest of these is probably how the matrix is inverted.

You can read more on the reddit discussion: Reddit: Incremental Least Squares Curve Fitting

Here’s a run of the example code:

Here is the example code:

#include <stdio.h>
#include <array>

//====================================================================
template<size_t N>
using TVector = std::array<float, N>;

template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

template<size_t N>
using TSquareMatrix = TMatrix<N,N>;

typedef TVector<2> TDataPoint;

//====================================================================
template <size_t N>
float DotProduct (const TVector<N>& A, const TVector<N>& B)
{
float ret = 0.0f;
for (size_t i = 0; i < N; ++i)
ret += A[i] * B[i];
return ret;
}

//====================================================================
template <size_t M, size_t N>
void TransposeMatrix (const TMatrix<M, N>& in, TMatrix<N, M>& result)
{
for (size_t j = 0; j < M; ++j)
for (size_t k = 0; k < N; ++k)
result[k][j] = in[j][k];
}

//====================================================================
template <size_t M, size_t N>
void MinorMatrix (const TMatrix<M, N>& in, TMatrix<M-1, N-1>& out, size_t excludeI, size_t excludeJ)
{
size_t destI = 0;
for (size_t i = 0; i < N; ++i)
{
if (i != excludeI)
{
size_t destJ = 0;
for (size_t j = 0; j < N; ++j)
{
if (j != excludeJ)
{
out[destI][destJ] = in[i][j];
++destJ;
}
}
++destI;
}
}
}

//====================================================================
template <size_t M, size_t N>
float Determinant (const TMatrix<M,N>& in)
{
float determinant = 0.0f;
TMatrix<M - 1, N - 1> minor;
for (size_t j = 0; j < N; ++j)
{
MinorMatrix(in, minor, 0, j);

float minorDeterminant = Determinant(minor);
if (j % 2 == 1)
minorDeterminant *= -1.0f;

determinant += in[0][j] * minorDeterminant;
}
return determinant;
}

//====================================================================
template <>
float Determinant<1> (const TMatrix<1,1>& in)
{
return in[0][0];
}

//====================================================================
template <size_t N>
bool InvertMatrix (const TSquareMatrix<N>& in, TSquareMatrix<N>& out)
{
// calculate the cofactor matrix and determinant
float determinant = 0.0f;
TSquareMatrix<N> cofactors;
TSquareMatrix<N-1> minor;
for (size_t i = 0; i < N; ++i)
{
for (size_t j = 0; j < N; ++j)
{
MinorMatrix(in, minor, i, j);

cofactors[i][j] = Determinant(minor);
if ((i + j) % 2 == 1)
cofactors[i][j] *= -1.0f;

if (i == 0)
determinant += in[i][j] * cofactors[i][j];
}
}

// matrix cant be inverted if determinant is zero
if (determinant == 0.0f)
return false;

// calculate the adjoint matrix into the out matrix
TransposeMatrix(cofactors, out);

// divide by determinant
float oneOverDeterminant = 1.0f / determinant;
for (size_t i = 0; i < N; ++i)
for (size_t j = 0; j < N; ++j)
out[i][j] *= oneOverDeterminant;
return true;
}

//====================================================================
template <>
bool InvertMatrix<2> (const TSquareMatrix<2>& in, TSquareMatrix<2>& out)
{
float determinant = Determinant(in);
if (determinant == 0.0f)
return false;

float oneOverDeterminant = 1.0f / determinant;
out[0][0] =  in[1][1] * oneOverDeterminant;
out[0][1] = -in[0][1] * oneOverDeterminant;
out[1][0] = -in[1][0] * oneOverDeterminant;
out[1][1] =  in[0][0] * oneOverDeterminant;
return true;
}

//====================================================================
template <size_t DEGREE>  // 1 = linear, 2 = quadratic, etc
class COnlineLeastSquaresFitter
{
public:
COnlineLeastSquaresFitter ()
{
// initialize our sums to zero
std::fill(m_SummedPowersX.begin(), m_SummedPowersX.end(), 0.0f);
std::fill(m_SummedPowersXTimesY.begin(), m_SummedPowersXTimesY.end(), 0.0f);
}

{
// add the summed powers of the x value
float xpow = 1.0f;
for (size_t i = 0; i < m_SummedPowersX.size(); ++i)
{
m_SummedPowersX[i] += xpow;
xpow *= dataPoint[0];
}

// add the summed powers of the x value, multiplied by the y value
xpow = 1.0f;
for (size_t i = 0; i < m_SummedPowersXTimesY.size(); ++i)
{
m_SummedPowersXTimesY[i] += xpow * dataPoint[1];
xpow *= dataPoint[0];
}
}

bool CalculateCoefficients (TVector<DEGREE+1>& coefficients) const
{
// initialize all coefficients to zero
std::fill(coefficients.begin(), coefficients.end(), 0.0f);

// calculate the coefficients
return CalculateCoefficientsInternal<DEGREE>(coefficients);
}

private:

template <size_t EFFECTIVEDEGREE>
bool CalculateCoefficientsInternal (TVector<DEGREE + 1>& coefficients) const
{
// if we don't have enough data points for this degree, try one degree less
if (m_SummedPowersX[0] <= EFFECTIVEDEGREE)
return CalculateCoefficientsInternal<EFFECTIVEDEGREE - 1>(coefficients);

// Make the ATA matrix
TMatrix<EFFECTIVEDEGREE + 1, EFFECTIVEDEGREE + 1> ATA;
for (size_t i = 0; i < EFFECTIVEDEGREE + 1; ++i)
for (size_t j = 0; j < EFFECTIVEDEGREE + 1; ++j)
ATA[i][j] = m_SummedPowersX[i + j];

// calculate inverse of ATA matrix
TMatrix<EFFECTIVEDEGREE + 1, EFFECTIVEDEGREE + 1> ATAInverse;
if (!InvertMatrix(ATA, ATAInverse))
return false;

// calculate the coefficients for this degree. The higher ones are already zeroed out.
TVector<EFFECTIVEDEGREE + 1> summedPowersXTimesY;
std::copy(m_SummedPowersXTimesY.begin(), m_SummedPowersXTimesY.begin() + EFFECTIVEDEGREE + 1, summedPowersXTimesY.begin());
for (size_t i = 0; i < EFFECTIVEDEGREE + 1; ++i)
coefficients[i] = DotProduct(ATAInverse[i], summedPowersXTimesY);
return true;
}

// Base case when no points are given, or if you are fitting a degree 0 curve to the data set.
template <>
bool CalculateCoefficientsInternal<0> (TVector<DEGREE + 1>& coefficients) const
{
if (m_SummedPowersX[0] > 0.0f)
coefficients[0] = m_SummedPowersXTimesY[0] / m_SummedPowersX[0];
return true;
}

// Total storage space (# of floats) needed is 3 * DEGREE + 2
// Where y is number of values that need to be stored and x is the degree of the polynomial
TVector<(DEGREE + 1) * 2 - 1>   m_SummedPowersX;
TVector<DEGREE + 1>             m_SummedPowersXTimesY;
};

//====================================================================
template <size_t DEGREE>
void DoTest(const std::initializer_list<TDataPoint>& data)
{
printf("Fitting a curve of degree %zi to %zi data points:\n", DEGREE, data.size());

COnlineLeastSquaresFitter<DEGREE> fitter;

// show data
for (const TDataPoint& dataPoint : data)
printf("  (%0.2f, %0.2f)\n", dataPoint[0], dataPoint[1]);

// fit data
for (const TDataPoint& dataPoint : data)

// calculate coefficients if we can
TVector<DEGREE+1> coefficients;
bool success = fitter.CalculateCoefficients(coefficients);
if (!success)
{
printf("ATA Matrix could not be inverted!\n");
return;
}

// print the polynomial
bool firstTerm = true;
printf("y = ");
bool showedATerm = false;
for (int i = (int)coefficients.size() - 1; i >= 0; --i)
{
// don't show zero terms
if (std::abs(coefficients[i]) < 0.00001f)
continue;

showedATerm = true;

// show an add or subtract between terms
float coefficient = coefficients[i];
if (firstTerm)
firstTerm = false;
else if (coefficient >= 0.0f)
printf(" + ");
else
{
coefficient *= -1.0f;
printf(" - ");
}

printf("%0.2f", coefficient);

if (i > 0)
printf("x");

if (i > 1)
printf("^%i", i);
}
if (!showedATerm)
printf("0");
printf("\n\n");
}

//====================================================================
int main (int argc, char **argv)
{
// Point - 1 data points
DoTest<0>(
{
TDataPoint{ 1.0f, 2.0f },
}
);

// Point - 2 data points
DoTest<0>(
{
TDataPoint{ 1.0f, 2.0f },
TDataPoint{ 2.0f, 4.0f },
}
);

// Linear - 2 data points
DoTest<1>(
{
TDataPoint{ 1.0f, 2.0f },
TDataPoint{ 2.0f, 4.0f },
}
);

// Linear - 3 colinear data points
DoTest<1>(
{
TDataPoint{ 1.0f, 2.0f },
TDataPoint{ 2.0f, 4.0f },
TDataPoint{ 3.0f, 6.0f },
}
);

// Linear - 3 non colinear data points
DoTest<1>(
{
TDataPoint{ 1.0f, 2.0f },
TDataPoint{ 2.0f, 4.0f },
TDataPoint{ 3.0f, 5.0f },
}
);

// Quadratic - 3 colinear data points
DoTest<2>(
{
TDataPoint{ 1.0f, 2.0f },
TDataPoint{ 2.0f, 4.0f },
TDataPoint{ 3.0f, 6.0f },
}
);

// Quadratic - 3 data points
DoTest<2>(
{
TDataPoint{ 1.0f, 5.0f },
TDataPoint{ 2.0f, 16.0f },
TDataPoint{ 3.0f, 31.0f },
}
);

// Cubic - 4 data points
DoTest<3>(
{
TDataPoint{ 1.0f, 5.0f },
TDataPoint{ 2.0f, 16.0f },
TDataPoint{ 3.0f, 31.0f },
TDataPoint{ 4.0f, 16.0f },
}
);

// Cubic - 2 data points
DoTest<3>(
{
TDataPoint{ 1.0f, 7.0f },
TDataPoint{ 3.0f, 17.0f },
}
);

// Cubic - 1 data point
DoTest<3>(
{
TDataPoint{ 1.0f, 7.0f },
}
);

// Cubic - 0 data points
DoTest<3>(
{
}
);

system("pause");
return 0;
}


# Feedback

There’s some interesting feedback on twitter.