Counting Bits & The Normal Distribution

I recently saw some interesting posts on twitter about the normal distribution:

I’m not really a statistics kind of guy, but knowing that probability distributions come up in graphics (like in PBR and path tracing), it seemed like a good time to level up my knowledge in this area while sharing an interesting technique for generating normally distributed random numbers.

Basics

Below is an image showing a few normal (aka Gaussian) distributions (from wikipedia).

Normal distributions are defined by these parameters:

• $\mu$ – “mu” is the mean. This is the average value of the distribution. This is where the center (peak) of the curve is on the x axis.
• $\sigma^2$ – “sigma squared” is the variance, and is just the standard deviation squared. I find standard deviation more intuitive to think about.
• $\sigma$ – “sigma” is the standard deviation, which (surprise surprise!) is the square root of the variance. This controls the “width” of the graph. The area under the curve is 1.0, so as you increase standard deviation and make the graph wider, it also gets shorter.

Here’s a diagram of standard deviations to help understand them (also from wikipedia):

I find the standard deviation intuitive because 68.2% of the data is within one standard deviation from the mean (on the plus and minus side of the mean). 95.4% of the data is within two standard deviations of the mean.

Standard deviation is given in the same units as the data itself, so if a bell curve described scores on a test, with a mean of 80 and a standard deviation of 5, it means that 68.2% of the students got between 75 and 85 points on the test, and that 95.4% of the students got between 70 and 90 points on the test.

The normal distribution is what’s called a “probability density function” or pdf, which means that the y axis of the graph describes the relative likelihood of the number on the x axis being chosen at random.

This means that if you have a normal distribution that has a specific mean and variance (standard deviation), that numbers closer to the mean are more likely to be chosen randomly, while numbers farther away are less likely. The variance controls how the probability drops off as you get farther away from the mean.

Thinking about standard deviation again, 68.2% of the random numbers generated will be within 1 standard deviation of the mean (+1 std dev or -1 std dev). 95.4% will be within 2 standard deviations.

Generating Normal Distribution Random Numbers – Coin Flips

Generating uniform random numbers, where every number is as likely as every other number, is pretty simple. In the physical world, you can roll some dice or flip some coins. In the software world, you can use PRNGs.

How would you generate random numbers that follow a normal distribution though?

In C++, there is std::normal_distribution that can do this for you. There is also something called the Box-Muller transform that can turn uniformly distributed random numbers into normal distribution random numbers (info here: Generating Gaussian Random Numbers).

I want to talk about something else though and hopefully build some better intuition.

First let’s look at coin flips.

If you flip a fair coin a million times and keep a count of how many heads and tails you saw, you might get 500014 heads and 499986 tails (I got this with a PRNG – std::mt19937). That is a pretty uniform distribution of values in the range of [0,1]. (breadcrumb: pascal’s triangle row 2 is 1,1)

Let’s flip two coins at a time though and add our values together (say that heads is 0 and tails is 1). Here’s what that graph looks like:

Out of 1 million flips, 250639 had no tails, 500308 had one tail, and 249053 had two tails. It might seem weird that they aren’t all even, but it makes more sense when you look at the outcome of flipping two coins: we can get heads/heads (00), heads/tails (01), tails/heads (10) or tails/tails (11). Two of the four possibilities have a single tails, so it makes sense that flipping two coins and getting one coin being a tail would be twice as likely as getting no tails or two tails. (breadcrumb: pascal’s triangle row 3 is 1,2,1)

What happens when we sum 3 coins? With a million flips I got 125113 0’s, 375763 1’s, 373905 2’s and 125219 3’s.

If you work out the possible combinations, there is 1 way to get 0, 3 ways to get 1, 3 ways to get 2 and 1 way to get 3. Those numbers almost exactly follow that 1, 3, 3, 1 probability. (breadcrumb: pascal’s triangle row 4 is 1,3,3,1)

If we flip 100 coins and sum them, we get this:

That looks a bit like the normal distribution graphs at the beginning of this post doesn’t it?

Flipping and summing coins will get you something called the “Binomial Distribution”, and the interesting thing there is that the binomial distribution approaches the normal distribution the more coins you are summing together. At an infinite number of coins, it is the normal distribution.

Generating Normal Distribution Random Numbers – Dice Rolls

What if instead of flipping coins, we roll dice?

Well, rolling a 4 sided die a million times, you get each number roughly the same percentage of the time, as you’d expect: about 25% each. 250125 0’s, 250103 1’s, 249700 2’s, 250072 3’s.

If we sum two 4 sided dice rolls we get this:

If we sum three 4 sided dice rolls we get this:

And if we sum one hundred we get this, which sure looks like a normal distribution:

This isn’t limited to four sided dice though, here’s one hundred 6 sided dice being summed:

With dice, the counts of the individual faces follow what’s called a “multinomial distribution” instead of a binomial one, but as the number of dice goes to infinity, the sum also approaches the normal distribution.

This means you can get a normal distribution with not only coins, but any sided dice in general.

An even stronger statement than that is the Central Limit Theorem, which says that if you have random numbers from ANY distribution, if you add enough of them together, the sum will often approach a normal distribution.

Strange huh?

Generating Normal Distribution Random Numbers – Counting Bits

Now comes a fun way of generating random numbers which follow a normal distribution. Are you ready for it?

Simply generate an N bit random number and return how many 1 bits are set.

That gives you a random number that follows a normal distribution!

One problem with this is that you have very low “resolution” random numbers. Counting the bits of a 64 bit random number for instance, you can only return 0 through 64 so there are only 65 possible random numbers.

That is a pretty big limitation, but if you need normal distribution numbers calculated quickly and don’t mind if they are low resolution (like in a pixel shader?), this technique could work well for you.

Another problem though is that you don’t have control over the variance or the mean of the distribution.

That isn’t a super huge deal though because you can easily convert numbers from one normal distribution into another normal distribution.

To do so, you get your normal distribution random number. First you subtract the mean of the distribution to make it centered on 0 (have a mean of 0). You then divide it by the standard deviation to make it be part of a distribution which has a standard deviation of 1.

At this point you have a random number from a normal distribution which has a mean of 0 and a standard deviation of 1.

Next, you multiply the number by the standard deviation of the distribution you want, and lastly you add the mean of the distribution you want.

That’s pretty simple (and is implemented in the source code at the bottom of this post), but to do this you need to know what standard deviation (variance) and mean you are starting with.

If you have some way to generate random numbers in [0, N) and you are summing M of those numbers together, the mean is $M*(N-1)/2$. Note that if you instead are generating random numbers in [1,N], the mean instead is $M*(N+1)/2$.

The variance in either case is $M*(N^2-1)/12$. The standard deviation is the square root of that.

Using that information you have everything you need to generate normal distribution random numbers of a specified mean and variance.

Thanks to @fahickman for the help on calculating mean and variance of dice roll sums.

Code

Here is the source code I used to generate the data behind the graphs in this post. There is also an implementation of the bit counting algorithm I mentioned, which converts to the desired mean and variance.

#define _CRT_SECURE_NO_WARNINGS

#include <array>
#include <random>
#include <cmath>
#include <stdint.h>
#include <stdio.h>
#include <limits>

const size_t c_maxNumSamples = 1000000;
const char* c_fileName = "results.csv";

template <size_t DiceRange, size_t DiceCount, size_t NumBuckets>
void DumpBucketCountsAddRandomNumbers (size_t numSamples, const std::array<size_t, NumBuckets>& bucketCounts)
{
// open file for append if we can
FILE* file = fopen(c_fileName, "a+t");
if (!file)
return;

// write the info
float mean = float(DiceCount) * float(DiceRange - 1.0f) / 2.0f;
float variance = float(DiceCount) * (float(DiceRange) * float(DiceRange) - 1.0f) / 12.0f;
if (numSamples == 1)
{
fprintf(file, "\"%zu random numbers [0,%zu) added together (sum %zud%zu). %zu buckets.  Mean = %0.2f.  Variance = %0.2f.  StdDev = %0.2f.\"\n", DiceCount, DiceRange, DiceCount, DiceRange, NumBuckets, mean, variance, std::sqrt(variance));
fprintf(file, "\"\"");
for (size_t i = 0; i < NumBuckets; ++i)
fprintf(file, ",\"%zu\"", i);
fprintf(file, "\n");
}
fprintf(file, "\"%zu samples\",", numSamples);

// report the samples
for (size_t count : bucketCounts)
fprintf(file, "\"%zu\",", count);

fprintf(file, "\"\"\n");
if (numSamples == c_maxNumSamples)
fprintf(file, "\n");

// close file
fclose(file);
}

template <size_t DiceSides, size_t DiceCount>
void AddRandomNumbersTest ()
{
std::mt19937 rng;
rng.seed(std::random_device()());
std::uniform_int_distribution<size_t> dist(size_t(0), DiceSides - 1);

std::array<size_t, (DiceSides - 1) * DiceCount + 1> bucketCounts = { 0 };

size_t nextDump = 1;
for (size_t i = 0; i < c_maxNumSamples; ++i)
{
size_t sum = 0;
for (size_t j = 0; j < DiceCount; ++j)
sum += dist(rng);

bucketCounts[sum]++;

if (i + 1 == nextDump)
{
DumpBucketCountsAddRandomNumbers<DiceSides, DiceCount>(nextDump, bucketCounts);
nextDump *= 10;
}
}
}

template <size_t NumBuckets>
void DumpBucketCountsCountBits (size_t numSamples, const std::array<size_t, NumBuckets>& bucketCounts)
{
// open file for append if we can
FILE* file = fopen(c_fileName, "a+t");
if (!file)
return;

// write the info
float mean = float(NumBuckets-1) * 1.0f / 2.0f;
float variance = float(NumBuckets-1) * 3.0f / 12.0f;
if (numSamples == 1)
{
fprintf(file, "\"%zu random bits (coin flips) added together. %zu buckets.  Mean = %0.2f.  Variance = %0.2f.  StdDev = %0.2f.\"\n", NumBuckets - 1, NumBuckets, mean, variance, std::sqrt(variance));
fprintf(file, "\"\"");
for (size_t i = 0; i < NumBuckets; ++i)
fprintf(file, ",\"%zu\"", i);
fprintf(file, "\n");
}
fprintf(file, "\"%zu samples\",", numSamples);

// report the samples
for (size_t count : bucketCounts)
fprintf(file, "\"%zu\",", count);

fprintf(file, "\"\"\n");
if (numSamples == c_maxNumSamples)
fprintf(file, "\n");

// close file
fclose(file);
}

template <size_t NumBits> // aka NumCoinFlips!
void CountBitsTest ()
{

size_t maxValue = 0;
for (size_t i = 0; i < NumBits; ++i)
maxValue = (maxValue << 1) | 1;

std::mt19937 rng;
rng.seed(std::random_device()());
std::uniform_int_distribution<size_t> dist(0, maxValue);

std::array<size_t, NumBits + 1> bucketCounts = { 0 };

size_t nextDump = 1;
for (size_t i = 0; i < c_maxNumSamples; ++i)
{
size_t sum = 0;
size_t number = dist(rng);
while (number)
{
if (number & 1)
++sum;
number = number >> 1;
}

bucketCounts[sum]++;

if (i + 1 == nextDump)
{
DumpBucketCountsCountBits(nextDump, bucketCounts);
nextDump *= 10;
}
}
}

float GenerateNormalRandomNumber (float mean, float variance)
{
static std::mt19937 rng;
static std::uniform_int_distribution<uint64_t> dist(0, (uint64_t)-1);

static bool seeded = false;
if (!seeded)
{
seeded = true;
rng.seed(std::random_device()());
}

// generate our normally distributed random number from 0 to 64 by counting set bits.
//
float sum = 0.0f;
uint64_t number = dist(rng);
while (number)
{
if (number & 1)
sum += 1.0f;
number = number >> 1;
}

// convert from: mean 32, variance 16, stddev 4
// to: mean 0, variance 1, stddev 1
float ret = sum;
ret -= 32.0f;
ret /= 4.0f;

// convert to the specified mean and variance
ret *= std::sqrt(variance);
ret += mean;
return ret;
}

void VerifyGenerateNormalRandomNumber (float mean, float variance)
{
// open file for append if we can
FILE* file = fopen(c_fileName, "a+t");
if (!file)
return;

// write info
fprintf(file, "\"Normal Distributed Random Numbers. mean = %0.2f.  variance = %0.2f.  stddev = %0.2f\"\n", mean, variance, std::sqrt(variance));

// write some random numbers
fprintf(file, "\"100 numbers\"");
for (size_t i = 0; i < 100; ++i)
fprintf(file, ",\"%f\"", GenerateNormalRandomNumber(mean, variance));
fprintf(file, "\n\n");

// close file
fclose(file);
}

int main (int argc, char **argv)
{
// clear out the file
FILE* file = fopen(c_fileName, "w+t");
if (file)
fclose(file);

// coin flips
{
// flip a fair coin
AddRandomNumbersTest<2, 1>();

// flip two coins and sum them
AddRandomNumbersTest<2, 2>();

// sum 3 coin flips
AddRandomNumbersTest<2, 3>();

// sum 100 coin flips
AddRandomNumbersTest<2, 100>();
}

// dice rolls
{
// roll a 4 sided die
AddRandomNumbersTest<4, 1>();

// sum two 4 sided dice
AddRandomNumbersTest<4, 2>();

// sum three 4 sided dice
AddRandomNumbersTest<4, 3>();

// sum one hundred 4 sided dice
AddRandomNumbersTest<4, 100>();

// sum one hundred 6 sided dice
AddRandomNumbersTest<6, 100>();
}

CountBitsTest<8>();
CountBitsTest<16>();
CountBitsTest<32>();
CountBitsTest<64>();

VerifyGenerateNormalRandomNumber(0.0f, 20.0f);

VerifyGenerateNormalRandomNumber(0.0f, 10.0f);

VerifyGenerateNormalRandomNumber(5.0f, 10.0f);

return 0;
}


WebGL PBR Implementation

Just want to see the demo? Click the link below. Warning: it loads quite a few images, some of which are ~10MB, so may take some time to load (it does report loading progress though):

http://demofox.org/WebGLPBR/

There is a great PBR (Physically Based Rendering) tutorial at: https://learnopengl.com/#!PBR/Theory

I followed that tutorial, making a WebGL PBR implementation as I went, but also making some C++ for pre-integrating diffuse and specular IBL (Image Based Lighting) and making the splitsum texture.

Pre-integrating the diffuse and specular (and using the splitsum texture) allows you to use an object’s surroundings as light sources, which is more in line with how real life works; we don’t just have point lights and directional lights in the real world, we have objects that glow because they are illuminated by light sources, and we have light sources which are in odd shapes.

It’s possible that there are one or more math errors or bugs in the C++ as well as in my WebGL PBR implementation. At some point in the future I’ll dig deeper into the math of PBR and try to write up some simple blog posts about it, at which point I’ll be able to be more confident about correctness than “well, it looks right…”.

The source code for the C++ pre-integrations are on github:
IBL Diffuse Cube Map Integration
IBL Specular Cube Map Integration + Split Sum Texture

The WebGL PBR implementation is also on github:
WebGLPBR

Here are some screenshots:

Learn WebGL2:

https://webgl2fundamentals.org/

Free PBR Materials:

http://freepbr.com/materials/rusted-iron-pbr-metal-material-alt/

https://learnopengl.com/#!PBR/Theory

https://disney-animation.s3.amazonaws.com/library/s2012_pbs_disney_brdf_notes_v2.pdf

SIMD / GPU Friendly Branchless Binary Search

The other day I was thinking about how you might do a binary search branchlessly. I came up with a way, and I’m pretty sure I’m not the first to come up with it, but it was fun to think about and I wanted to share my solution.

Here it is searching a list of 8 items in 3 steps:

size_t BinarySearch8 (size_t needle, const size_t haystack[8])
{
size_t ret = (haystack[4] <= needle) ? 4 : 0;
ret += (haystack[ret + 2] <= needle) ? 2 : 0;
ret += (haystack[ret + 1] <= needle) ? 1 : 0;
return ret;
}


The three steps it does are:

1. The list has 8 items in it. We test index 4 to see if we need to continue searching index 0 to 3, or index 4 to 7. The returned index becomes either 0xx or 1xx.
2. The list has 4 items in it now. We test index 2 to see if we need to continue searching index 0 to 1, or index 2 to 3. The returned index becomes one of the following: 00x, 01x, 10x or 11x.
3. The list has 2 items in it. We test index 1 to see if we need to take the left item or the right item. The returned index becomes: 000, 001, 010, 011, 100, 101, 110, or 111.

But Big O Complexity is Worse!

Usually a binary search can take up to $O(\log N)$ steps, where $N$ is the number of items in the list. In this post’s solution, it always takes $\log_2 N$ steps.

It probably seems odd that this branchless version could be considered an improvement when its big O complexity is always the worst case of a regular binary search. That is strange, but in the world of SIMD and shader programs, going branchless can be a big win that is not captured by looking at big O complexity. (Note that cache coherency and thread contention are two other things not captured by big O complexity.)

Also, when working in something like video games or other interactive simulations, an even frame rate is more important than a high frame rate for making a game look and feel smooth. Because of this, if you have algorithms that have very fast common cases but much slower worst cases, you may actually prefer to use an algorithm that is slower in the common case but faster in the worst case just to keep performance more consistent. Using an algorithm such as this, which has a constant amount of work regardless of input can be a good trade off there.

Lastly, in cryptographic applications, attackers can gather secret information by seeing how long certain operations take. For instance, if you use a shorter password than other people, an attacker may be able to detect that by seeing that it consistently takes you a little bit less time to login than other people. They now have an idea of the length of your password, and maybe will brute force you, knowing that you are low hanging fruit!

These timing based attacks can be thwarted by algorithms which run at a constant time regardless of input. This algorithm is one of those algorithms.

As an example of another algorithm that runs in constant time regardless of input, check out CORDIC math. I really want to write up a post on that someday, it’s pretty cool stuff.

You might have noticed that if the item you are searching for isn’t in the list, the function doesn’t return anything indicating that, and you might think that’s strange.

This function actually just returns the largest index whose value isn’t greater than the value you are searching for. If all the numbers are greater than the value you are searching for, it returns zero.

This might seem odd but this can actually come in handy if the list you are searching represents something like animation data, where there are keyframes sorted by time, and you want to find which two keyframes you are between so that you can interpolate.

To actually test if your value was in the list, you could do an extra check:

size_t searchValue = 3;
size_t index = BinarySearch8(searchValue, list);
bool found = (list[index] == searchValue);


If you need that extra check, it’s easy enough to add, and if you don’t need that extra check, it’s nice to not have it.

Without Ternary Operator

If in your setup you don’t have a ternary operator, or if the ternary operator isn’t branchless for you, you get the same results using multiplication:

size_t BinarySearch8 (size_t needle, const size_t haystack[8])
{
size_t ret = (haystack[4] <= needle) * 4;
ret += (haystack[ret + 2] <= needle) * 2;
ret += (haystack[ret + 1] <= needle) * 1;
return ret;
}


Note that on some platforms, the less than or equal test will be a branch! None of the platforms or compilers I tested had that issue but if you find yourself hitting that issue, you can do a branchless test via subtraction or similar.

Here is a godbolt link that lets you view the assembly for various compilers. When you open the link you’ll see clang doing this work branchlessly.
View Assembly

@adamjmiles from twitter also verified that GCN does it branchlessly, which you can see at the link below. Thanks for that!
View GCN Assembly

Something to keep in mind for the non GPU case though is that if you were doing this in SIMD, you’d be using SIMD intrinsics.

Larger Lists

It’s trivial to search larger numbers of values. Here it is searching 16 items in 4 steps:

size_t BinarySearch16 (size_t needle, const size_t haystack[16])
{
size_t ret = (haystack[8] <= needle) ? 8 : 0;
ret += (haystack[ret + 4] <= needle) ? 4 : 0;
ret += (haystack[ret + 2] <= needle) ? 2 : 0;
ret += (haystack[ret + 1] <= needle) ? 1 : 0;
return ret;
}


And here it is searching 32 items in 5 steps:

size_t BinarySearch32 (size_t needle, const size_t haystack[32])
{
size_t ret = (haystack[16] <= needle) ? 16 : 0;
ret += (haystack[ret + 8] <= needle) ? 8 : 0;
ret += (haystack[ret + 4] <= needle) ? 4 : 0;
ret += (haystack[ret + 2] <= needle) ? 2 : 0;
ret += (haystack[ret + 1] <= needle) ? 1 : 0;
return ret;
}


Non Power of 2 Lists

Let’s say that your list is not a perfect power of two in length. GASP!

You can still use the technique, but you treat the list as if it has the next power of 2 number of items, and then make sure your indices stay in range. The nice part here is that you don’t have to do extra work on the index at each step of the way, only in the places where it’s possible for the index to go out of range.

Here it is searching an array of size 7 in 3 steps:

size_t BinarySearch7 (size_t needle, const size_t haystack[7])
{
size_t ret = 0;
size_t testIndex = 0;

// test index is at most 4, so is within range.
testIndex = ret + 4;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 6, so is within range.
testIndex = ret + 2;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 7, so could be out of range.
// use min() to make sure the index stays in range.
testIndex = std::min<size_t>(ret + 1, 6);
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

return ret;
}


There are some other techniques for dealing with non power of 2 sized lists that you can find in the links at the bottom, but there was one particularly interesting technique that my friend and ex-boss James came up with.

Basically, you start out with something like this if you were searching a list of 7 items:

// 7 because the list has 7 items in it.
// 4 because it's half of the next power of 2 that is >= 7.
ret = (haystack[4] <= needle) * (7-4);


The result is that instead of having ret go to either 0 or 4, it goes to 0 or 3.

From there, in both cases you have 4 items in your sublist remaining, so you don’t need to worry about the index going out of bounds from that point on.

Code

Here’s some working code demonstrating the ideas above, as well as its output.

#include <algorithm>
#include <stdio.h>
#include <stdlib.h>

size_t BinarySearch8 (size_t needle, const size_t haystack[8])
{
// using ternary operator
size_t ret = (haystack[4] <= needle) ? 4 : 0;
ret += (haystack[ret + 2] <= needle) ? 2 : 0;
ret += (haystack[ret + 1] <= needle) ? 1 : 0;
return ret;
}

size_t BinarySearch8b (size_t needle, const size_t haystack[8])
{
// using multiplication
size_t ret = (haystack[4] <= needle) * 4;
ret += (haystack[ret + 2] <= needle) * 2;
ret += (haystack[ret + 1] <= needle) * 1;
return ret;
}

size_t BinarySearch7 (size_t needle, const size_t haystack[7])
{
// non perfect power of 2.  use min() to keep it from going out of bounds.
size_t ret = 0;
size_t testIndex = 0;

// test index is 4, so is within range.
testIndex = ret + 4;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 6, so is within range.
testIndex = ret + 2;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 7, so could be out of range.
// use min() to make sure the index stays in range.
testIndex = std::min<size_t>(ret + 1, 6);
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

return ret;
}

int main (int argc, char **argv)
{
// search a list of size 8
{
// show the data
printf("Searching through a list with 8 items:\n");
size_t data[8] = { 1, 3, 5, 6, 9, 11, 15, 21 };
printf("data = [");
for (size_t i = 0; i < sizeof(data)/sizeof(data[0]); ++i)
{
if (i > 0)
printf(", ");
printf("%zu", data[i]);
}
printf("]\n");

// do some searches on it using ternary operation based function
printf("\nTernary based searches:\n");
#define FIND(needle) printf("Find " #needle ": index = %zu, value = %zu, found = %s\n", BinarySearch8(needle, data), data[BinarySearch8(needle, data)], data[BinarySearch8(needle, data)] == needle ? "true" : "false");
FIND(2);
FIND(3);
FIND(0);
FIND(22);
FIND(16);
FIND(15);
FIND(21);
#undef FIND

// do some searches on it using multiplication based function
printf("\nMultiplication based searches:\n");
#define FIND(needle) printf("Find " #needle ": index = %zu, value = %zu, found = %s\n", BinarySearch8b(needle, data), data[BinarySearch8b(needle, data)], data[BinarySearch8b(needle, data)] == needle ? "true" : "false");
FIND(2);
FIND(3);
FIND(0);
FIND(22);
FIND(16);
FIND(15);
FIND(21);
#undef FIND

printf("\n\n\n\n");
}

// search a list of size 7
{
// show the data
printf("Searching through a list with 7 items:\n");
size_t data[7] = { 1, 3, 5, 6, 9, 11, 15};
printf("data = [");
for (size_t i = 0; i < sizeof(data)/sizeof(data[0]); ++i)
{
if (i > 0)
printf(", ");
printf("%zu", data[i]);
}
printf("]\n");

// do some searches on it using ternary operation based function
printf("\nTernary based searches:\n");
#define FIND(needle) printf("Find " #needle ": index = %zu, value = %zu, found = %s\n", BinarySearch7(needle, data), data[BinarySearch7(needle, data)], data[BinarySearch7(needle, data)] == needle ? "true" : "false");
FIND(2);
FIND(3);
FIND(0);
FIND(22);
FIND(16);
FIND(15);
FIND(21);
#undef FIND

printf("\n\n\n\n");
}

system("pause");
return 0;
}


Closing

Another facet of binary searching is that it isn’t the most cache friendly algorithm out there. There might be some value in combining the above with the information in the link below.

Cache-friendly binary search

If you like this sort of thing, here is an interesting paper from this year (2017):
Array Layouts For Comparison-Based Searching

And further down the rabbit hole a bit, this talks about re-ordering the search array to fit things into a cache line better:
https://people.mpi-inf.mpg.de/~rgemulla/publications/schlegel09search.pdf

Taking the next step is Intel and Oracle’s FAST paper:
http://www.timkaldewey.de/pubs/FAST__TODS11.pdf

Florian Gross from twitch made me aware of the last two links and also mentioned his master’s thesis in this area (thank you Florian!):

@rygorous mentioned on twitter some improvements such as ternary and quaternary search, as well as a way to handle the case of non power of 2 sized lists without extra index checks:

Thanks to everyone who gave feedback. It’s a very interesting topic, of which this post only seems to scratch the surface!

Hopefully you found this interesting. Questions, comments, corrections, let me know!

When Random Numbers Are Too Random: Low Discrepancy Sequences

Random numbers can be useful in graphics and game development, but they have a pesky and sometimes undesirable habit of clumping together.

This is a problem in path tracing and monte carlo integration when you take N samples, but the samples aren’t well spread across the sampling range.

This can also be a problem for situations like when you are randomly placing objects in the world or generating treasure for a treasure chest. You don’t want your randomly placed trees to only be in one part of the forest, and you don’t want a player to get only trash items or only godly items when they open a treasure chest. Ideally you want to have some randomness, but you don’t want the random number generator to give you all of the same or similar random numbers.

The problem is that random numbers can be TOO random, like in the image below, where you can see clumps and large gaps between the 100 samples.

For cases like that, when you want random numbers that are a little bit more well distributed, you might find some use in low discrepancy sequences.

The standalone C++ code (one source file, standard headers, no libraries to link to) I used to generate the data and images are at the bottom of this post, as well as some links to more resources.

What Is Discrepancy?

In this context, discrepancy is a measurement of the highest or lowest density of points in a sequence. High discrepancy means that there is either a large area of empty space, or that there is an area that has a high density of points. Low discrepancy means that there are neither, and that your points are more or less pretty evenly distributed.

The lowest discrepancy possible has no randomness at all, and in the 1 dimensional case means that the points are evenly distributed on a grid. For monte carlo integration and the game dev usage cases I mentioned, we do want some randomness, we just want the random points to be spread out a little more evenly.

If more formal math notation is your thing, discrepancy is defined as:
$D_N(P) = \sup_{B \in J} \left| \frac{A(B;P)}{N} - \lambda_s(B) \right|$

(definition from the Wikipedia article on equidistributed sequences)

For monte carlo integration specifically, this is the behavior each thing gives you:

• High Discrepancy: Random Numbers / White Noise aka Uniform Distribution – At lower sample counts, convergence is slower (and has higher variance) due to the possibility of not getting good coverage over the area you are integrating. At higher sample counts, this problem disappears. (Hint: real time graphics and preview renderings use a smaller number of samples)
• Lowest Discrepancy: Regular Grid – This will cause aliasing, unlike the other “random” based sampling, which trades aliasing for noise. Noise is preferred over aliasing.
• Low Discrepancy: Low Discrepancy Sequences – At lower sample counts, this will have faster convergence by having better coverage of the sampling space, but will use randomness to get rid of aliasing by introducing noise.

Also interesting to note, Quasi Monte Carlo has provably better asymptotic convergence than regular monte carlo integration.

1 Dimensional Sequences

We’ll first look at 1 dimensional sequences.

Grid

Here are 100 samples evenly spaced:

Random Numbers (White Noise)

This is actually a high discrepancy sequence. To generate this, you just use a standard random number generator to pick 100 points between 0 and 1. I used std::mt19937 with a std::uniform_real_distribution from 0 to 1:

Subrandom Numbers

Subrandom numbers are ways to decrease the discrepancy of white noise.

One way to do this is to break the sampling space in half. You then generate even numbered samples in the first half of the space, and odd numbered samples in the second half of the space.

There’s no reason you can’t generalize this into more divisions of space though.

This splits the space into 4 regions:

8 regions:

16 regions:

32 regions:

There are other ways to generate subrandom numbers though. One way is to generate random numbers between 0 and 0.5, add them to the last sample plus 0.5, and keep only the fractional part of the result. This gives you a random walk type setup.

Here is that:

Uniform Sampling + Jitter

If you take the first subrandom idea to the logical maximum, you break your sample space up into N sections and place one point within those N sections to make a low discrepancy sequence made up of N points.

Another way to look at this is that you do uniform sampling, but add some random jitter to the samples, between +/- half a uniform sample size, to keep the samples in their own areas.

This is that:

Interestingly, I have heard that Pixar invented this technique.

Irrational Numbers

Rational numbers are numbers which can be described as fractions, such as 0.75 which can be expressed as 3/4. Irrational numbers are numbers which CANNOT be described as fractions, such as pi, or the golden ratio, or the square root of a prime number.

Interestingly you can use irrational numbers to generate low discrepancy sequences. You start with some value (could be 0, or could be a random number), add the irrational number, and modulus against 1.0. To get the next sample you add the irrational value again, and modulus against 1.0 again. Rinse and repeat until you get as many samples as you want.

Some values work better than others though, and according to Wikipedia, the golden ratio (1.61803398875…) is provably the best choice.
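
Generating these is just an add and a modulus per sample. A sketch:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Additive recurrence: repeatedly add an irrational number to a seed and
// wrap into [0,1).  The golden ratio (1.61803398875...) works best.
std::vector<float> IrrationalSequence (size_t count, float irrational, float seed)
{
    std::vector<float> samples(count);
    float sample = seed;
    for (size_t i = 0; i < count; ++i)
    {
        sample = std::fmod(sample + irrational, 1.0f);
        samples[i] = sample;
    }
    return samples;
}
```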

Here is the golden ratio, using 4 different random (white noise) starting values:

Here I’ve used the square root of 2, with 4 different starting random numbers again:

Lastly, here is pi, with 4 random starting values:

Van der Corput Sequence

The Van der Corput sequence is the 1d equivalent of the Halton sequence which we’ll talk about later.

To generate a value in the Van der Corput sequence, you convert the index of your sample into some base.

For instance if it was base 2, you would convert your index to binary. If it was base 16, you would convert your index to hexadecimal.

Now, instead of treating the digits as if they are $B^0$, $B^1$, $B^2$, etc (where B is the base), you instead treat them as $B^{-1}$, $B^{-2}$, $B^{-3}$ and so on. In other words, you multiply each digit by a fraction and add up the results.

To show a couple quick examples, let’s say we wanted sample 6 in the sequence of base 2.

First we convert 6 to binary which is 110. From right to left, we have 3 digits: a 0 in the 1’s place, a 1 in the 2’s place, and a 1 in the 4’s place. $0*1 + 1*2 + 1*4 = 6$, so we can see that 110 is in fact 6 in binary.

To get the Van der Corput value for this, instead of treating it as the 1’s, 2’s and 4’s digit, we treat it as the 1/2, 1/4 and 1/8’s digit.

$0 * 1/2 + 1 * 1/4 + 1 * 1/8 = 3/8$.

So, sample 6 in the Van der Corput sequence using base 2 is 3/8.

Let’s try sample 21 in base 3.

First we convert 21 to base 3 which is 210. We can verify this is right by seeing that $0 * 1 + 1 * 3 + 2 * 9 = 21$.

Instead of a 1’s, 3’s and 9’s digit, we are going to treat it like a 1/3, 1/9 and 1/27 digit.

$0 * 1/3 + 1 * 1/9 + 2 * 1/27 = 5/27$

So, sample 21 in the Van der Corput sequence using base 3 is 5/27.
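
That digit-flipping can be written as a small function which works for any base (a sketch; the function name is just for illustration). Sample 6 in base 2 gives 3/8 and sample 21 in base 3 gives 5/27, matching the worked examples above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Van der Corput radical inverse: mirror the base-B digits of the index
// around the decimal point, so 6 = 110 in base 2 becomes 0.011 = 3/8.
float VanDerCorput (size_t index, size_t base)
{
    float result = 0.0f;
    float denominator = float(base);
    while (index > 0)
    {
        result += float(index % base) / denominator; // digit / B^(position+1)
        index /= base;
        denominator *= float(base);
    }
    return result;
}
```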

Here is the Van der Corput sequence for base 2:

Here it is for base 3:

Base 4:

Base 5:

Sobol

One dimensional Sobol is actually just the Van der Corput sequence base 2 re-arranged a little bit, but it’s generated differently.

You start with 0 (either using it as sample 0 or sample -1, doesn’t matter which), and for each sample you do this:

1. Calculate the Ruler function value for the current sample’s index(more info in a second)
2. Make the direction vector by shifting 1 left (in binary) 31 - ruler times.
3. XOR the last sample by the direction vector to get the new sample
4. To interpret the sample as a floating point number you divide it by $2^{32}$

That might sound completely different than the Van der Corput sequence but it actually is the same thing – just re-ordered.

In the final step when dividing by $2^{32}$, we are really just interpreting the binary number as a fraction just like before, but it’s the LEFT most digit that is the 1/2 spot, not the RIGHT most digit.

The Ruler Function goes like: 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, …

It’s pretty easy to calculate too. The ruler function of an index (starting at 1) is the zero based position of the right most 1 bit after converting the number to binary – in other words, the number of trailing zeros.

1 in binary is 001 so Ruler(1) is 0.
2 in binary is 010 so Ruler(2) is 1.
3 in binary is 011 so Ruler(3) is 0.
4 in binary is 100 so Ruler(4) is 2.
5 in binary is 101 so Ruler(5) is 0.
and so on.
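
In code, the ruler function is just counting trailing zero bits, and a Sobol step is a single XOR (a sketch; function names are just for illustration). Starting from 0, the first three samples come out to 0.5, 0.75 and 0.25 after dividing by $2^{32}$.

```cpp
#include <cassert>
#include <cstdint>

// Ruler function: zero-based position of the lowest set bit, i.e. the
// number of trailing zeros in the binary representation.
uint32_t Ruler (uint32_t n)
{
    uint32_t ret = 0;
    while (n != 0 && (n & 1) == 0)
    {
        n >>= 1;
        ++ret;
    }
    return ret;
}

// One step of 1D Sobol: XOR the previous sample with the direction vector
// (1 shifted left 31 - ruler times).  Divide by 2^32 to get a float.
uint32_t NextSobol (uint32_t lastSample, uint32_t index)
{
    return lastSample ^ (uint32_t(1) << (31 - Ruler(index)));
}
```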

Here is 1D Sobol:

Hammersley

In one dimension, the Hammersley sequence is the same as the base 2 Van der Corput sequence, and in the same order. That may sound strange, but Hammersley is really a 2d sequence which I broke down into 1d for comparison. The one thing that makes Hammersley unique in the 1d case is that you can truncate bits.

Truncating bits doesn’t seem that useful for 1d Hammersley, but it’s useful info to know. Look at the 2d version of Hammersley for a fairer evaluation, because it’s meant to be a 2d sequence.

Here is Hammersley:

With 1 bit truncated:

With 2 bits truncated:

Poisson Disc

Poisson disc points are points which are densely packed, but have a minimum distance from each other.

Computer scientists are still working out good algorithms to generate these points efficiently.

I use “Mitchell’s Best-Candidate” which means that when you want to generate a new point in the sequence, you generate N new points, and choose whichever point is farthest away from the other points you’ve generated so far.
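
A sketch of Mitchell’s Best-Candidate in 1d, assuming `rand()` as the random source (the function name is just for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

// Mitchell's Best-Candidate: for each new point, try `candidates` random
// positions and keep the one farthest from its nearest existing point.
std::vector<float> BestCandidate1D (size_t count, size_t candidates)
{
    std::vector<float> samples;
    for (size_t s = 0; s < count; ++s)
    {
        float bestX = 0.0f;
        float bestMinDist = -1.0f;
        for (size_t c = 0; c < candidates; ++c)
        {
            float x = float(rand()) / float(RAND_MAX);
            float minDist = 2.0f; // farther than any distance within [0,1]
            for (float existing : samples)
                minDist = std::min(minDist, std::fabs(existing - x));
            if (minDist > bestMinDist)
            {
                bestMinDist = minDist;
                bestX = x;
            }
        }
        samples.push_back(bestX);
    }
    return samples;
}
```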

Here it is where N is 100:

2 Dimensional Sequences

Next up, let’s look at some 2 dimensional sequences.

Grid

Below is 2d uniform samples on a grid.

Note that uniform grid is not particularly low discrepancy for the 2d case! More info here: Is it expected that uniform points would have non zero discrepancy?

Random

Here are 100 random points:

Uniform Grid + Jitter

Here is a uniform grid that has random jitter applied to the points. Jittered grid is a pretty commonly used low discrepancy sampling technique that has good success.

Subrandom

Just like in 1 dimension, you can apply the subrandom ideas to 2 dimensions, where you divide the X and Y axes into some number of sections and randomly choose points within the sections.

If you divide X and Y into the same number of sections though, you are going to have a problem because some areas are not going to have any points in them.

@Reedbeta pointed out that instead of using i%x and i%y, you could use i%x and (i/x)%y to make it pick points in all regions.

Picking different numbers for X and Y can be another way to give good results. Here’s dividing X and Y into 2 and 3 sections respectively:

If you choose co-prime numbers of divisions for each axis, you get the maximal period before the pattern of regions repeats. 2 and 3 are co-prime so the last example is a good example of that, but here is 3 and 11:

Here is 3 and 97. 97 is large enough that with only doing 100 samples, we are almost doing jittered grid on the y axis.

Here is the other subrandom technique from 1d, where we start with a random value for X and Y, and then for each sample add 0.5 plus a random number between 0 and 0.5 to each axis, making a “random walk” type setup again:

Halton

The Halton sequence is just the Van der Corput sequence, but using a different base on each axis.
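
Since each axis is just a radical inverse, a 2d Halton point can be sketched as (function name just for illustration):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Halton: the Van der Corput radical inverse with a different (ideally
// co-prime) base per axis.
std::array<float, 2> Halton2D (size_t index, size_t baseX, size_t baseY)
{
    auto radicalInverse = [] (size_t i, size_t base)
    {
        float result = 0.0f;
        float denominator = float(base);
        while (i > 0)
        {
            result += float(i % base) / denominator;
            i /= base;
            denominator *= float(base);
        }
        return result;
    };
    return { radicalInverse(index, baseX), radicalInverse(index, baseY) };
}
```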

Here is the Halton sequence where X and Y use bases 2 and 3:

Here it is using bases 5 and 7:

Here are bases 13 and 9:

Irrational Numbers

The irrational numbers technique can be used for 2d as well but I wasn’t able to find out how to make it give decent looking output that didn’t have an obvious diagonal pattern in them. Bart Wronski shared a neat paper that explains how to use the golden ratio in 2d with great success: Golden Ratio Sequences For Low-Discrepancy Sampling

This uses the golden ratio for the X axis and the square root of 2 for the Y axis. Below that is the same, with a random starting point, to make it give a different sequence.

Here X axis uses square root of 2 and Y axis uses square root of 3. Below that is a random starting point, which gives the same discrepancy.

Hammersley

In 2 dimensions, the Hammersley sequence uses the 1d Hammersley sequence for the X axis: instead of treating the binary version of the index as an integer, you treat its digits as fractions, like you do for Van der Corput, and sum them up.

For the Y axis, you just reverse the bits and then do the same!
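
Here is a sketch of one 2d Hammersley point for a sequence of $2^{numBits}$ samples, following that description (function names just for illustration):

```cpp
#include <cassert>
#include <cstdint>

// 2D Hammersley: X is the base 2 radical inverse of the index; Y reverses
// the index's bits within numBits first, then does the same.
void Hammersley2D (uint32_t index, uint32_t numBits, float& x, float& y)
{
    // reverse the bits of the index for the y axis
    uint32_t reversed = 0;
    for (uint32_t bit = 0; bit < numBits; ++bit)
        if (index & (uint32_t(1) << bit))
            reversed |= uint32_t(1) << (numBits - 1 - bit);

    // radical inverse base 2: lowest bit is the 1/2 digit, next is 1/4, etc
    auto radicalInverse2 = [] (uint32_t n)
    {
        float result = 0.0f;
        float base = 0.5f;
        for (; n != 0; n >>= 1, base *= 0.5f)
            if (n & 1)
                result += base;
        return result;
    };

    x = radicalInverse2(index);
    y = radicalInverse2(reversed);
}
```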

Here is the Hammersley sequence. Note we would have to take 128 samples (not just the 100 we did) if we wanted it to fill the entire square with samples.

Truncating bits is somewhat useful in 2d. Here is 1 bit truncated:

2 bits truncated:

Poisson Disc

Using the same method we did for 1d, we can generate points in 2d space:

N Rooks

There is a sampling pattern called N-Rooks where you put N rooks onto a chess board and arrange them such that no two are in the same row or column.

A way to generate these samples is to realize that there will be only one rook per row, and that none of them will ever be in the same column. So, you make an array that has numbers 0 to N-1, and then shuffle the array. The index into the array is the row, and the value in the array is the column.
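
A sketch of that shuffle-based construction (the function name and seeding are just for illustration):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// N-Rooks: one sample per row, with shuffled column indices guaranteeing
// one sample per column as well.
std::vector<std::array<float, 2>> NRooks (size_t count, uint32_t seed)
{
    // the index into the array is the row, the value is the column
    std::vector<size_t> columns(count);
    std::iota(columns.begin(), columns.end(), size_t(0));
    std::mt19937 rng(seed);
    std::shuffle(columns.begin(), columns.end(), rng);

    std::vector<std::array<float, 2>> samples(count);
    for (size_t row = 0; row < count; ++row)
    {
        samples[row][0] = float(columns[row]) / float(count - 1);
        samples[row][1] = float(row) / float(count - 1);
    }
    return samples;
}
```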

Here are 100 rooks:

Sobol

Sobol in two dimensions is more complex to explain so I’ll link you to the source I used: Sobol Sequences Made Simple.

The 1D Sobol already covered is used for the X axis, and then something more complex is used for the Y axis:

Links

Bart Wronski has a really great series on a related topic: Dithering in Games

Wikipedia: Low Discrepancy Sequence

Wikipedia: Halton Sequence

Wikipedia: Van der Corput Sequence

Using Fibonacci Sequence To Generate Colors

Deeper info and usage cases for low discrepancy sequences

Poisson-Disc Sampling

Low discrepancy sequences are related to blue noise. Where white noise contains all frequencies evenly, blue noise has more high frequencies and fewer low frequencies. Blue noise is essentially the ultimate in low discrepancy, but can be expensive to compute. Here are some pages on blue noise:

Free Blue Noise Textures

The problem with 3D blue noise

Stippling and Blue Noise

Vegetation placement in “The Witness”

Here are some links from @marc_b_reynolds:
Sobol (low-discrepancy) sequence in 1-3D, stratified in 2-4D.
Classic binary-reflected gray code.
Sobol.h

Weyl Sequence

Code

#define _CRT_SECURE_NO_WARNINGS

#include <windows.h>  // for bitmap headers and performance counter.  Sorry non windows people!
#include <vector>
#include <stdint.h>
#include <stdio.h>
#include <cmath>
#include <random>
#include <array>
#include <algorithm>
#include <stdlib.h>
#include <set>

typedef uint8_t uint8;

#define NUM_SAMPLES 100  // to simplify some 2d code, this must be a perfect square
#define NUM_SAMPLES_FOR_COLORING 100

// Turning this on will slow things down significantly because it's an O(N^5) operation for 2d!
#define CALCULATE_DISCREPANCY 0

#define IMAGE1D_WIDTH 600
#define IMAGE1D_HEIGHT 50
#define IMAGE2D_WIDTH 300
#define IMAGE2D_HEIGHT 300

#define IMAGE_PAD 30  // blank padding around the plotted data
#define IMAGE1D_CENTERY ((IMAGE1D_HEIGHT + IMAGE_PAD * 2) / 2)

#define AXIS_HEIGHT 40
#define DATA_HEIGHT 20
#define DATA_WIDTH 2

#define COLOR_FILL SColor(255,255,255)
#define COLOR_AXIS SColor(0, 0, 0)

//======================================================================================
struct SImageData
{
SImageData ()
: m_width(0)
, m_height(0)
{ }

size_t m_width;
size_t m_height;
size_t m_pitch;
std::vector<uint8> m_pixels;
};

struct SColor
{
SColor (uint8 _R = 0, uint8 _G = 0, uint8 _B = 0)
: R(_R), G(_G), B(_B)
{ }

uint8 B, G, R;
};

//======================================================================================
bool SaveImage (const char *fileName, const SImageData &image)
{
// open the file if we can
FILE *file;
file = fopen(fileName, "wb");
if (!file) {
printf("Could not save %s\n", fileName);
return false;
}

// prepare the bitmap headers
BITMAPFILEHEADER header = {};
BITMAPINFOHEADER infoHeader = {};
header.bfType = 0x4D42;  // "BM"
header.bfOffBits = sizeof(header) + sizeof(infoHeader);
header.bfSize = header.bfOffBits + (DWORD)image.m_pixels.size();
infoHeader.biSize = sizeof(infoHeader);
infoHeader.biWidth = (LONG)image.m_width;
infoHeader.biHeight = (LONG)image.m_height;
infoHeader.biPlanes = 1;
infoHeader.biBitCount = 24;
infoHeader.biSizeImage = (DWORD)image.m_pixels.size();

// write the data and close the file
fwrite(&header, sizeof(header), 1, file);
fwrite(&infoHeader, sizeof(infoHeader), 1, file);
fwrite(&image.m_pixels[0], image.m_pixels.size(), 1, file);
fclose(file);

return true;
}

//======================================================================================
void ImageInit (SImageData& image, size_t width, size_t height)
{
image.m_width = width;
image.m_height = height;
image.m_pitch = 4 * ((width * 24 + 31) / 32);
image.m_pixels.resize(image.m_pitch * image.m_height);
std::fill(image.m_pixels.begin(), image.m_pixels.end(), 0);
}

//======================================================================================
void ImageClear (SImageData& image, const SColor& color)
{
uint8* row = &image.m_pixels[0];
for (size_t rowIndex = 0; rowIndex < image.m_height; ++rowIndex)
{
SColor* pixels = (SColor*)row;
std::fill(pixels, pixels + image.m_width, color);

row += image.m_pitch;
}
}

//======================================================================================
void ImageBox (SImageData& image, size_t x1, size_t x2, size_t y1, size_t y2, const SColor& color)
{
for (size_t y = y1; y < y2; ++y)
{
uint8* row = &image.m_pixels[y * image.m_pitch];
SColor* start = &((SColor*)row)[x1];
std::fill(start, start + x2 - x1, color);
}
}

//======================================================================================
float Distance (float x1, float y1, float x2, float y2)
{
float dx = (x2 - x1);
float dy = (y2 - y1);

return std::sqrtf(dx*dx + dy*dy);
}

//======================================================================================
SColor DataPointColor (size_t sampleIndex)
{
SColor ret;
float percent = (float(sampleIndex) / (float(NUM_SAMPLES_FOR_COLORING) - 1.0f));

ret.R = uint8((1.0f - percent) * 255.0f);
ret.G = 0;
ret.B = uint8(percent * 255.0f);

float mag = (float)sqrt(ret.R*ret.R + ret.G*ret.G + ret.B*ret.B);
ret.R = uint8((float(ret.R) / mag)*255.0f);
ret.G = uint8((float(ret.G) / mag)*255.0f);
ret.B = uint8((float(ret.B) / mag)*255.0f);

return ret;
}

//======================================================================================
float RandomFloat (float min, float max)
{
static std::random_device rd;
static std::mt19937 mt(rd());
std::uniform_real_distribution<float> dist(min, max);
return dist(mt);
}

//======================================================================================
size_t Ruler (size_t n)
{
size_t ret = 0;
while (n != 0 && (n & 1) == 0)
{
n /= 2;
++ret;
}
return ret;
}

//======================================================================================
float CalculateDiscrepancy1D (const std::array<float, NUM_SAMPLES>& samples)
{
// some info about calculating discrepancy
// https://math.stackexchange.com/questions/1681562/how-to-calculate-discrepancy-of-a-sequence

// Calculates the discrepancy of this data.
// Assumes the data is [0,1) for valid sample range
std::array<float, NUM_SAMPLES> sortedSamples = samples;
std::sort(sortedSamples.begin(), sortedSamples.end());

float maxDifference = 0.0f;
for (size_t startIndex = 0; startIndex <= NUM_SAMPLES; ++startIndex)
{
// startIndex 0 = 0.0f.  startIndex 1 = sortedSamples[0]. etc

float startValue = 0.0f;
if (startIndex > 0)
startValue = sortedSamples[startIndex - 1];

for (size_t stopIndex = startIndex; stopIndex <= NUM_SAMPLES; ++stopIndex)
{
// stopIndex 0 = sortedSamples[0].  stopIndex N = 1.0f. etc

float stopValue = 1.0f;
if (stopIndex < NUM_SAMPLES)
stopValue = sortedSamples[stopIndex];

float length = stopValue - startValue;

// open interval (startValue, stopValue)
size_t countInside = 0;
for (float sample : samples)
{
if (sample > startValue &&
sample < stopValue)
{
++countInside;
}
}
float density = float(countInside) / float(NUM_SAMPLES);
float difference = std::abs(density - length);
if (difference > maxDifference)
maxDifference = difference;

// closed interval [startValue, stopValue]
countInside = 0;
for (float sample : samples)
{
if (sample >= startValue &&
sample <= stopValue)
{
++countInside;
}
}
density = float(countInside) / float(NUM_SAMPLES);
difference = std::abs(density - length);
if (difference > maxDifference)
maxDifference = difference;
}
}
return maxDifference;
}

//======================================================================================
float CalculateDiscrepancy2D (const std::array<std::array<float, 2>, NUM_SAMPLES>& samples)
{
// some info about calculating discrepancy
// https://math.stackexchange.com/questions/1681562/how-to-calculate-discrepancy-of-a-sequence

// Calculates the discrepancy of this data.
// Assumes the data is [0,1) for valid sample range.

// Get the sorted list of unique values on each axis
std::set<float> setSamplesX;
std::set<float> setSamplesY;
for (const std::array<float, 2>& sample : samples)
{
setSamplesX.insert(sample[0]);
setSamplesY.insert(sample[1]);
}
std::vector<float> sortedXSamples;
std::vector<float> sortedYSamples;
sortedXSamples.reserve(setSamplesX.size());
sortedYSamples.reserve(setSamplesY.size());
for (float f : setSamplesX)
sortedXSamples.push_back(f);
for (float f : setSamplesY)
sortedYSamples.push_back(f);

// Get the sorted list of samples on the X axis, for faster interval testing
std::array<std::array<float, 2>, NUM_SAMPLES> sortedSamplesX = samples;
std::sort(sortedSamplesX.begin(), sortedSamplesX.end(),
[] (const std::array<float, 2>& itemA, const std::array<float, 2>& itemB)
{
return itemA[0] < itemB[0];
}
);

// calculate discrepancy
float maxDifference = 0.0f;
for (size_t startIndexY = 0; startIndexY <= sortedYSamples.size(); ++startIndexY)
{
float startValueY = 0.0f;
if (startIndexY > 0)
startValueY = *(sortedYSamples.begin() + startIndexY - 1);

for (size_t startIndexX = 0; startIndexX <= sortedXSamples.size(); ++startIndexX)
{
float startValueX = 0.0f;
if (startIndexX > 0)
startValueX = *(sortedXSamples.begin() + startIndexX - 1);

for (size_t stopIndexY = startIndexY; stopIndexY <= sortedYSamples.size(); ++stopIndexY)
{
float stopValueY = 1.0f;
if (stopIndexY < sortedYSamples.size())
stopValueY = sortedYSamples[stopIndexY];

for (size_t stopIndexX = startIndexX; stopIndexX <= sortedXSamples.size(); ++stopIndexX)
{
float stopValueX = 1.0f;
if (stopIndexX < sortedXSamples.size())
stopValueX = sortedXSamples[stopIndexX];

// calculate area
float length = stopValueX - startValueX;
float height = stopValueY - startValueY;
float area = length * height;

// open interval (startValue, stopValue)
size_t countInside = 0;
for (const std::array<float, 2>& sample : samples)
{
if (sample[0] > startValueX &&
sample[1] > startValueY &&
sample[0] < stopValueX &&
sample[1] < stopValueY)
{
++countInside;
}
}
float density = float(countInside) / float(NUM_SAMPLES);
float difference = std::abs(density - area);
if (difference > maxDifference)
maxDifference = difference;

// closed interval [startValue, stopValue]
countInside = 0;
for (const std::array<float, 2>& sample : samples)
{
if (sample[0] >= startValueX &&
sample[1] >= startValueY &&
sample[0] <= stopValueX &&
sample[1] <= stopValueY)
{
++countInside;
}
}
density = float(countInside) / float(NUM_SAMPLES);
difference = std::abs(density - area);
if (difference > maxDifference)
maxDifference = difference;
}
}
}
}

return maxDifference;
}

//======================================================================================
void Test1D (const char* fileName, const std::array<float, NUM_SAMPLES>& samples)
{
// create and clear the image
SImageData image;
ImageInit(image, IMAGE1D_WIDTH + IMAGE_PAD * 2, IMAGE1D_HEIGHT + IMAGE_PAD * 2);

// setup the canvas
ImageClear(image, COLOR_FILL);

// calculate the discrepancy
#if CALCULATE_DISCREPANCY
float discrepancy = CalculateDiscrepancy1D(samples);
printf("%s Discrepancy = %0.2f%%\n", fileName, discrepancy*100.0f);
#endif

// draw the sample points
size_t i = 0;
for (float f: samples)
{
size_t pos = size_t(f * float(IMAGE1D_WIDTH)) + IMAGE_PAD;
ImageBox(image, pos, pos + 1, IMAGE1D_CENTERY - DATA_HEIGHT / 2, IMAGE1D_CENTERY + DATA_HEIGHT / 2, DataPointColor(i));
++i;
}

// draw the axes lines. horizontal first then the two vertical
ImageBox(image, IMAGE_PAD, IMAGE1D_WIDTH + IMAGE_PAD + 1, IMAGE1D_CENTERY, IMAGE1D_CENTERY + 1, COLOR_AXIS);
ImageBox(image, IMAGE_PAD, IMAGE_PAD + 1, IMAGE1D_CENTERY - AXIS_HEIGHT / 2, IMAGE1D_CENTERY + AXIS_HEIGHT / 2, COLOR_AXIS);
ImageBox(image, IMAGE1D_WIDTH + IMAGE_PAD, IMAGE1D_WIDTH + IMAGE_PAD + 1, IMAGE1D_CENTERY - AXIS_HEIGHT / 2, IMAGE1D_CENTERY + AXIS_HEIGHT / 2, COLOR_AXIS);

// save the image
SaveImage(fileName, image);
}

//======================================================================================
void Test2D (const char* fileName, const std::array<std::array<float,2>, NUM_SAMPLES>& samples)
{
// create and clear the image
SImageData image;
ImageInit(image, IMAGE2D_WIDTH + IMAGE_PAD * 2, IMAGE2D_HEIGHT + IMAGE_PAD * 2);

// setup the canvas
ImageClear(image, COLOR_FILL);

// calculate the discrepancy
#if CALCULATE_DISCREPANCY
float discrepancy = CalculateDiscrepancy2D(samples);
printf("%s Discrepancy = %0.2f%%\n", fileName, discrepancy*100.0f);
#endif

// draw the sample points
size_t i = 0;
for (const std::array<float, 2>& sample : samples)
{
size_t posx = size_t(sample[0] * float(IMAGE2D_WIDTH)) + IMAGE_PAD;
size_t posy = size_t(sample[1] * float(IMAGE2D_HEIGHT)) + IMAGE_PAD;
ImageBox(image, posx - 1, posx + 1, posy - 1, posy + 1, DataPointColor(i));
++i;
}

// horizontal lines
ImageBox(image, IMAGE_PAD - 1, IMAGE2D_WIDTH + IMAGE_PAD + 1, IMAGE_PAD - 1, IMAGE_PAD, COLOR_AXIS);
ImageBox(image, IMAGE_PAD - 1, IMAGE2D_WIDTH + IMAGE_PAD + 1, IMAGE2D_HEIGHT + IMAGE_PAD, IMAGE2D_HEIGHT + IMAGE_PAD + 1, COLOR_AXIS);

// vertical lines
ImageBox(image, IMAGE_PAD - 1, IMAGE_PAD, IMAGE_PAD - 1, IMAGE2D_HEIGHT + IMAGE_PAD + 1, COLOR_AXIS);
ImageBox(image, IMAGE2D_WIDTH + IMAGE_PAD, IMAGE2D_WIDTH + IMAGE_PAD + 1, IMAGE_PAD - 1, IMAGE2D_HEIGHT + IMAGE_PAD + 1, COLOR_AXIS);

// save the image
SaveImage(fileName, image);
}

//======================================================================================
void TestUniform1D (bool jitter)
{
// calculate the sample points
const float c_cellSize = 1.0f / float(NUM_SAMPLES+1);
std::array<float, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i] = float(i+1) / float(NUM_SAMPLES+1);
if (jitter)
samples[i] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);
}

// save bitmap etc
if (jitter)
Test1D("1DUniformJitter.bmp", samples);
else
Test1D("1DUniform.bmp", samples);
}

//======================================================================================
void TestUniformRandom1D ()
{
// calculate the sample points
const float c_halfJitter = 1.0f / float((NUM_SAMPLES + 1) * 2);
std::array<float, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
samples[i] = RandomFloat(0.0f, 1.0f);

// save bitmap etc
Test1D("1DUniformRandom.bmp", samples);
}

//======================================================================================
void TestSubRandomA1D (size_t numRegions)
{
const float c_randomRange = 1.0f / float(numRegions);

// calculate the sample points
const float c_halfJitter = 1.0f / float((NUM_SAMPLES + 1) * 2);
std::array<float, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i] = RandomFloat(0.0f, c_randomRange);
samples[i] += float(i % numRegions) / float(numRegions);
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "1DSubRandomA_%zu.bmp", numRegions);
Test1D(fileName, samples);
}

//======================================================================================
void TestSubRandomB1D ()
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
float sample = RandomFloat(0.0f, 0.5f);
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
sample = std::fmodf(sample + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
samples[i] = sample;
}

// save bitmap etc
Test1D("1DSubRandomB.bmp", samples);
}

//======================================================================================
void TestVanDerCorput (size_t base)
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i] = 0.0f;
float denominator = float(base);
size_t n = i;
while (n > 0)
{
size_t multiplier = n % base;
samples[i] += float(multiplier) / denominator;
n = n / base;
denominator *= base;
}
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "1DVanDerCorput_%zu.bmp", base);
Test1D(fileName, samples);
}

//======================================================================================
void TestIrrational1D (float irrational, float seed)
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
float sample = seed;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
sample = std::fmodf(sample + irrational, 1.0f);
samples[i] = sample;
}

// save bitmap etc
char irrationalStr[256];
sprintf(irrationalStr, "%f", irrational);
char seedStr[256];
sprintf(seedStr, "%f", seed);
char fileName[256];
sprintf(fileName, "1DIrrational_%s_%s.bmp", &irrationalStr[2], &seedStr[2]);
Test1D(fileName, samples);
}

//======================================================================================
void TestSobol1D ()
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
size_t sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
size_t ruler = Ruler(i + 1);
size_t direction = size_t(size_t(1) << size_t(31 - ruler));
sampleInt = sampleInt ^ direction;
samples[i] = float(sampleInt) / std::pow(2.0f, 32.0f);
}

// save bitmap etc
Test1D("1DSobol.bmp", samples);
}

//======================================================================================
void TestHammersley1D (size_t truncateBits)
{
// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
size_t sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
size_t n = i >> truncateBits;
float base = 1.0f / 2.0f;
samples[i] = 0.0f;
while (n)
{
if (n & 1)
samples[i] += base;
n /= 2;
base /= 2.0f;
}
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "1DHammersley_%zu.bmp", truncateBits);
Test1D(fileName, samples);
}

//======================================================================================
float MinimumDistance1D (const std::array<float, NUM_SAMPLES>& samples, size_t numSamples, float x)
{
// Used by poisson.
// This returns the minimum distance that point (x) is away from the sample points, from [0, numSamples).
float minimumDistance = 0.0f;
for (size_t i = 0; i < numSamples; ++i)
{
float distance = std::abs(samples[i] - x);
if (i == 0 || distance < minimumDistance)
minimumDistance = distance;
}
return minimumDistance;
}

//======================================================================================
void TestPoisson1D ()
{
// every time we want to place a point, we generate this many points and choose the one farthest away from all the other points (largest minimum distance)
const size_t c_bestOfAttempts = 100;

// calculate the sample points
std::array<float, NUM_SAMPLES> samples;
for (size_t sampleIndex = 0; sampleIndex < NUM_SAMPLES; ++sampleIndex)
{
// generate some random points and keep the one that has the largest minimum distance from any of the existing points
float bestX = 0.0f;
float bestMinDistance = 0.0f;
for (size_t attempt = 0; attempt < c_bestOfAttempts; ++attempt)
{
float attemptX = RandomFloat(0.0f, 1.0f);
float minDistance = MinimumDistance1D(samples, sampleIndex, attemptX);

if (minDistance > bestMinDistance)
{
bestX = attemptX;
bestMinDistance = minDistance;
}
}
samples[sampleIndex] = bestX;
}

// save bitmap etc
Test1D("1DPoisson.bmp", samples);
}

//======================================================================================
void TestUniform2D (bool jitter)
{
// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
const float c_cellSize = 1.0f / float(c_oneSide+1);
for (size_t iy = 0; iy < c_oneSide; ++iy)
{
for (size_t ix = 0; ix < c_oneSide; ++ix)
{
size_t sampleIndex = iy * c_oneSide + ix;

samples[sampleIndex][0] = float(ix + 1) / (float(c_oneSide + 1));
if (jitter)
samples[sampleIndex][0] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);

samples[sampleIndex][1] = float(iy + 1) / (float(c_oneSide) + 1.0f);
if (jitter)
samples[sampleIndex][1] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);
}
}

// save bitmap etc
if (jitter)
Test2D("2DUniformJitter.bmp", samples);
else
Test2D("2DUniform.bmp", samples);
}

//======================================================================================
void TestUniformRandom2D ()
{
// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
const float c_halfJitter = 1.0f / float((c_oneSide + 1) * 2);
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i][0] = RandomFloat(0.0f, 1.0f);
samples[i][1] = RandomFloat(0.0f, 1.0f);
}

// save bitmap etc
Test2D("2DUniformRandom.bmp", samples);
}

//======================================================================================
void TestSubRandomA2D (size_t regionsX, size_t regionsY)
{
const float c_randomRangeX = 1.0f / float(regionsX);
const float c_randomRangeY = 1.0f / float(regionsY);

// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samples[i][0] = RandomFloat(0.0f, c_randomRangeX);
samples[i][0] += float(i % regionsX) / float(regionsX);

samples[i][1] = RandomFloat(0.0f, c_randomRangeY);
samples[i][1] += float(i % regionsY) / float(regionsY);
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "2DSubRandomA_%zu_%zu.bmp", regionsX, regionsY);
Test2D(fileName, samples);
}

//======================================================================================
void TestSubRandomB2D ()
{
// calculate the sample points
float samplex = RandomFloat(0.0f, 0.5f);
float sampley = RandomFloat(0.0f, 0.5f);
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samplex = std::fmodf(samplex + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
sampley = std::fmodf(sampley + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
samples[i][0] = samplex;
samples[i][1] = sampley;
}

// save bitmap etc
Test2D("2DSubRandomB.bmp", samples);
}

//======================================================================================
void TestHalton (size_t basex, size_t basey)
{
// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
const float c_halfJitter = 1.0f / float((c_oneSide + 1) * 2);
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
// x axis
samples[i][0] = 0.0f;
{
float denominator = float(basex);
size_t n = i;
while (n > 0)
{
size_t multiplier = n % basex;
samples[i][0] += float(multiplier) / denominator;
n = n / basex;
denominator *= basex;
}
}

// y axis
samples[i][1] = 0.0f;
{
float denominator = float(basey);
size_t n = i;
while (n > 0)
{
size_t multiplier = n % basey;
samples[i][1] += float(multiplier) / denominator;
n = n / basey;
denominator *= basey;
}
}
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "2DHalton_%zu_%zu.bmp", basex, basey);
Test2D(fileName, samples);
}

//======================================================================================
void TestSobol2D ()
{
// calculate the sample points

// x axis
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
size_t sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
size_t ruler = Ruler(i + 1);
size_t direction = size_t(size_t(1) << size_t(31 - ruler));
sampleInt = sampleInt ^ direction;
samples[i][0] = float(sampleInt) / std::pow(2.0f, 32.0f);
}

// y axis
// uses numbers: new-joe-kuo-6.21201

// Direction numbers
std::vector<size_t> V;
V.resize((size_t)ceil(log((double)NUM_SAMPLES) / log(2.0)));
V[0] = size_t(1) << size_t(31);
for (size_t i = 1; i < V.size(); ++i)
V[i] = V[i - 1] ^ (V[i - 1] >> 1);

// Samples
sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i) {
size_t ruler = Ruler(i + 1);
sampleInt = sampleInt ^ V[ruler];
samples[i][1] = float(sampleInt) / std::pow(2.0f, 32.0f);
}

// save bitmap etc
Test2D("2DSobol.bmp", samples);
}

//======================================================================================
void TestHammersley2D (size_t truncateBits)
{
// figure out how many bits we are working in.
size_t value = 1;
size_t numBits = 0;
while (value < NUM_SAMPLES)
{
value *= 2;
++numBits;
}

// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
size_t sampleInt = 0;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
// x axis
samples[i][0] = 0.0f;
{
size_t n = i >> truncateBits;
float base = 1.0f / 2.0f;
while (n)
{
if (n & 1)
samples[i][0] += base;
n /= 2;
base /= 2.0f;
}
}

// y axis
samples[i][1] = 0.0f;
{
size_t n = i >> truncateBits;
size_t mask = size_t(1) << (numBits - 1 - truncateBits);

float base = 1.0f / 2.0f;
while (mask)
{
if (n & mask)
samples[i][1] += base;
mask /= 2;
base /= 2.0f;
}
}
}

// save bitmap etc
char fileName[256];
sprintf(fileName, "2DHammersley_%zu.bmp", truncateBits);
Test2D(fileName, samples);
}

//======================================================================================
void TestRooks2D ()
{
// make and shuffle rook positions
std::random_device rd;
std::mt19937 mt(rd());
std::array<size_t, NUM_SAMPLES> rookPositions;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
rookPositions[i] = i;
std::shuffle(rookPositions.begin(), rookPositions.end(), mt);

// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
// x axis
samples[i][0] = float(rookPositions[i]) / float(NUM_SAMPLES-1);

// y axis
samples[i][1] = float(i) / float(NUM_SAMPLES - 1);
}

// save bitmap etc
Test2D("2DRooks.bmp", samples);
}

//======================================================================================
void TestIrrational2D (float irrationalx, float irrationaly, float seedx, float seedy)
{
// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
float samplex = seedx;
float sampley = seedy;
for (size_t i = 0; i < NUM_SAMPLES; ++i)
{
samplex = std::fmodf(samplex + irrationalx, 1.0f);
sampley = std::fmodf(sampley + irrationaly, 1.0f);

samples[i][0] = samplex;
samples[i][1] = sampley;
}

// save bitmap etc
char irrationalxStr[256];
sprintf(irrationalxStr, "%f", irrationalx);
char irrationalyStr[256];
sprintf(irrationalyStr, "%f", irrationaly);
char seedxStr[256];
sprintf(seedxStr, "%f", seedx);
char seedyStr[256];
sprintf(seedyStr, "%f", seedy);
char fileName[256];
sprintf(fileName, "2DIrrational_%s_%s_%s_%s.bmp", &irrationalxStr[2], &irrationalyStr[2], &seedxStr[2], &seedyStr[2]);
Test2D(fileName, samples);
}

//======================================================================================
float MinimumDistance2D (const std::array<std::array<float, 2>, NUM_SAMPLES>& samples, size_t numSamples, float x, float y)
{
// Used by poisson.
// This returns the minimum distance that point (x,y) is away from the sample points, from [0, numSamples).
float minimumDistance = 0.0f;
for (size_t i = 0; i < numSamples; ++i)
{
float distance = Distance(samples[i][0], samples[i][1], x, y);
if (i == 0 || distance < minimumDistance)
minimumDistance = distance;
}
return minimumDistance;
}

//======================================================================================
void TestPoisson2D ()
{
// every time we want to place a point, we generate this many points and choose the one farthest away from all the other points (largest minimum distance)
const size_t c_bestOfAttempts = 100;

// calculate the sample points
std::array<std::array<float, 2>, NUM_SAMPLES> samples;
for (size_t sampleIndex = 0; sampleIndex < NUM_SAMPLES; ++sampleIndex)
{
// generate some random points and keep the one that has the largest minimum distance from any of the existing points
float bestX = 0.0f;
float bestY = 0.0f;
float bestMinDistance = 0.0f;
for (size_t attempt = 0; attempt < c_bestOfAttempts; ++attempt)
{
float attemptX = RandomFloat(0.0f, 1.0f);
float attemptY = RandomFloat(0.0f, 1.0f);
float minDistance = MinimumDistance2D(samples, sampleIndex, attemptX, attemptY);

if (minDistance > bestMinDistance)
{
bestX = attemptX;
bestY = attemptY;
bestMinDistance = minDistance;
}
}
samples[sampleIndex][0] = bestX;
samples[sampleIndex][1] = bestY;
}

// save bitmap etc
Test2D("2DPoisson.bmp", samples);
}

//======================================================================================
int main (int argc, char **argv)
{
// 1D tests
{
TestUniform1D(false);
TestUniform1D(true);

TestUniformRandom1D();

TestSubRandomA1D(2);
TestSubRandomA1D(4);
TestSubRandomA1D(8);
TestSubRandomA1D(16);
TestSubRandomA1D(32);

TestSubRandomB1D();

TestVanDerCorput(2);
TestVanDerCorput(3);
TestVanDerCorput(4);
TestVanDerCorput(5);

// golden ratio mod 1 aka (sqrt(5) - 1)/2
TestIrrational1D(0.618034f, 0.0f);
TestIrrational1D(0.618034f, 0.385180f);
TestIrrational1D(0.618034f, 0.775719f);
TestIrrational1D(0.618034f, 0.287194f);

// sqrt(2) - 1
TestIrrational1D(0.414214f, 0.0f);
TestIrrational1D(0.414214f, 0.385180f);
TestIrrational1D(0.414214f, 0.775719f);
TestIrrational1D(0.414214f, 0.287194f);

// PI mod 1
TestIrrational1D(0.141593f, 0.0f);
TestIrrational1D(0.141593f, 0.385180f);
TestIrrational1D(0.141593f, 0.775719f);
TestIrrational1D(0.141593f, 0.287194f);

TestSobol1D();

TestHammersley1D(0);
TestHammersley1D(1);
TestHammersley1D(2);

TestPoisson1D();
}

// 2D tests
{
TestUniform2D(false);
TestUniform2D(true);

TestUniformRandom2D();

TestSubRandomA2D(2, 2);
TestSubRandomA2D(2, 3);
TestSubRandomA2D(3, 11);
TestSubRandomA2D(3, 97);

TestSubRandomB2D();

TestHalton(2, 3);
TestHalton(5, 7);
TestHalton(13, 9);

TestSobol2D();

TestHammersley2D(0);
TestHammersley2D(1);
TestHammersley2D(2);

TestRooks2D();

// X axis = golden ratio mod 1 aka (sqrt(5)-1)/2
// Y axis = sqrt(2) mod 1
TestIrrational2D(0.618034f, 0.414214f, 0.0f, 0.0f);
TestIrrational2D(0.618034f, 0.414214f, 0.775719f, 0.264045f);

// X axis = sqrt(2) mod 1
// Y axis = sqrt(3) mod 1
TestIrrational2D(std::fmodf((float)std::sqrt(2.0f), 1.0f), std::fmodf((float)std::sqrt(3.0f), 1.0f), 0.0f, 0.0f);
TestIrrational2D(std::fmodf((float)std::sqrt(2.0f), 1.0f), std::fmodf((float)std::sqrt(3.0f), 1.0f), 0.775719f, 0.264045f);

TestPoisson2D();
}

#if CALCULATE_DISCREPANCY
printf("\n");
system("pause");
#endif
}


Solving N equations and N unknowns: The Fine Print (Gauss Jordan Elimination)

In basic algebra we were taught that if we have three unknowns (variables), it takes three equations to solve for them.

There’s some fine print though that isn’t talked about until quite a bit later.

Let’s have a look at three unknowns in two equations:

$A + B + C = 2 \\ B = 5$

If we just need a third equation to solve this, why not just modify the second equation to make a third?

$-B = -5$

That obviously doesn’t work, because it doesn’t add any new information! If you try it out, you’ll find that adding that equation doesn’t get you any closer to solving for the variables.

So, it takes three equations to solve for three unknowns, but the three equations have to provide unique, meaningful information. That is the fine print.

How can we know if an equation provides unique, meaningful information though?

It turns out that linear algebra gives us a neat technique for simplifying a system of equations. It actually solves for individual variables if it’s able to, and also gets rid of redundant equations that don’t add any new information.

This simplest form is called the Reduced Row Echelon Form (Wikipedia) which you may also see abbreviated as “rref” (perhaps a bit of a confusing term for programmers) and it involves you putting the equations into a matrix and then performing an algorithm, such as Gauss–Jordan elimination (Wikipedia) to get the rref.

Equations as a Matrix

Putting a set of equations into a matrix is really straightforward.

Each row of a matrix is a separate equation, and each column represents the coefficient of a variable.

Let’s see how with this set of equations:

$3x + y = 5\\ 2y = 7\\ y + z = 14$

Not every equation has every variable in it, so let’s fix that by putting in zero terms for the missing variables, and let’s make the one terms explicit as well:

$3x + 1y + 0z = 5\\ 0x + 2y + 0z = 7\\ 0x + 1y + 1z = 14$

Putting those equations into a matrix looks like this:

$\left[\begin{array}{rrr} 3 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 1 & 1 \end{array}\right]$

If you also include the constants on the right side of the equation, you get what is called an augmented matrix, which looks like this:

$\left[\begin{array}{rrr|r} 3 & 1 & 0 & 5 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]$

Reduced Row Echelon Form

Wikipedia explains the reduced row echelon form this way:

• all nonzero rows (rows with at least one nonzero element) are above any rows of all zeroes (all zero rows, if any, belong at the bottom of the matrix), and
• the leading coefficient (the first nonzero number from the left, also called the pivot) of a nonzero row is always strictly to the right of the leading coefficient of the row above it.
• Every leading coefficient is 1 and is the only nonzero entry in its column.

This is an example of a 3×5 matrix in reduced row echelon form:
$\left[\begin{array}{rrrrr} 1 & 0 & a_1 & 0 & b_1 \\ 0 & 1 & a_2 & 0 & b_2 \\ 0 & 0 & 0 & 1 & b_3 \end{array}\right]$

Basically, the leading number of each row needs to be a 1, everything below and to the left of those leading 1s needs to be zero, and each leading 1 needs to be the only nonzero value in its column.

Looking back at the augmented matrix we made:

$\left[\begin{array}{rrr|r} 3 & 1 & 0 & 5 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]$

If we put it into reduced row echelon form, we get this:

$\left[\begin{array}{rrr|r} 1 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 3.5 \\ 0 & 0 & 1 & 10.5 \end{array}\right]$

There’s something really neat about the reduced row echelon form. If we take the above augmented matrix and turn it back into equations, look what we get:

$1x + 0y + 0z = 0.5\\ 0x + 1y + 0z = 3.5\\ 0x + 0y + 1z = 10.5$

Or if we simplify that:

$x = 0.5\\ y = 3.5\\ z = 10.5$

Putting it into reduced row echelon form simplified our set of equations so much that it actually solved for our variables. Neat!

How do we put a matrix into rref? We can use Gauss–Jordan elimination.

Gauss–Jordan Elimination

Gauss-Jordan elimination is a way of performing operations on the rows of a matrix to manipulate it into the desired form.

It’s often explained that there are three row operations you can do:

• Type 1: Swap the positions of two rows.
• Type 2: Multiply a row by a nonzero scalar.
• Type 3: Add to one row a scalar multiple of another.

You might notice that the first operation can technically be built out of a sequence of the other two. I find that easier to remember, maybe you will too.

The algorithm for getting the rref is actually pretty simple.

1. Starting with the first column of the matrix, find a row which has a non zero in that column, and make that row be the first row by swapping it with the first row.
2. Multiply the first row by a value so that the first column has a 1 in it.
3. Subtract a multiple of the first row from every other row in the matrix so that they have a zero in the first column.

You’ve now handled one column (one variable) so move onto the next.

1. Continuing on, we consider the second column. Find a row which has a non zero in that column and make that row be the second row by swapping it with the second row.
2. Multiply the second row by a value so that the second column has a 1 in it.
3. Subtract a multiple of the second row from every other row in the matrix so that they have a zero in the second column.

You repeat this process until you either run out of rows or columns, at which point you are done.

Note that if you ever find a column that has only zeros in it, you just skip that column and move on to the next one.

Let’s work through the example augmented matrix to see how we got it into rref. Starting with this:

$\left[\begin{array}{rrr|r} 3 & 1 & 0 & 5 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]$

We already have a non zero in the first column, so we multiply the top row by 1/3 to get this:

$\left[\begin{array}{rrr|r} 1 & 0.3333 & 0 & 1.6666 \\ 0 & 2 & 0 & 7 \\ 0 & 1 & 1 & 14 \end{array}\right]$

All the other rows have a zero in the first column so we move to the second row and the second column. The second row already has a non zero in the second column, so we multiply the second row by 1/2 to get this:

$\left[\begin{array}{rrr|r} 1 & 0.3333 & 0 & 1.6666 \\ 0 & 1 & 0 & 3.5 \\ 0 & 1 & 1 & 14 \end{array}\right]$

To make sure the second row is the only row that has a non zero in the second column, we subtract the second row times 1/3 from the first row. We also subtract the second row from the third row. That gives us this:

$\left[\begin{array}{rrr|r} 1 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 3.5 \\ 0 & 0 & 1 & 10.5 \end{array}\right]$

Since the third row has a 1 in the third column, and all other rows have a 0 in that column we are done.

That’s all there is to it! We put the matrix into rref, and we also solved the set of equations. Neat huh?

You may notice that the ultimate rref of a square matrix is just the identity matrix. This is true unless the equations can’t be fully solved.

Overdetermined, Underdetermined & Inconsistent Equations

Systems of equations are overdetermined when they have more equations than unknowns, like the below which has three equations and two unknowns:

$x + y = 3 \\ x = 1 \\ y = 2 \\$

Putting that into (augmented) matrix form gives you this:

$\left[\begin{array}{rr|r} 1 & 1 & 3 \\ 1 & 0 & 1 \\ 0 & 1 & 2 \end{array}\right]$

If you put that into rref, you end up with this:

$\left[\begin{array}{rr|r} 1 & 0 & 1 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{array}\right]$

The last row became zeroes, which shows us that there was redundant info in the system of equations that disappeared. We can easily see that x = 1 and y = 2, and that satisfies all three equations.

Just like we talked about in the opening of this post, an equation that doesn’t add useful information beyond what the other equations already give will disappear when you put the system into rref. That made our overdetermined system into just a determined system.

What happens though if we change the third row in the overdetermined system to be something else? For instance, we can say y=10 instead of y=2:

$x + y = 3 \\ x = 1 \\ y = 10 \\$

The augmented matrix for that is this:

$\left[\begin{array}{rr|r} 1 & 1 & 3 \\ 1 & 0 & 1 \\ 0 & 1 & 10 \end{array}\right]$

If we put that in rref, we get the identity matrix out which seems like everything is ok:

$\left[\begin{array}{rr|r} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right]$

However, if we turn it back into a set of equations, we can see that we have a problem:

$x = 0 \\ y = 0 \\ 0 = 1 \\$

The result says that 0 = 1, which is not true. Having a row of “0 = 1” in rref is how you detect that a system of equations is inconsistent, or in other words, that the equations give contradictory information.

A system of equations can also be underdetermined, meaning there isn’t enough information to solve the equations. Let’s use the example from the beginning of the post:

$A + B + C = 2 \\ B = 5 \\$

In an augmented matrix, that looks like this:

$\left[\begin{array}{rrr|r} 1 & 1 & 1 & 2 \\ 0 & 1 & 0 & 5 \\ \end{array}\right]$

Putting that in rref we get this:

$\left[\begin{array}{rrr|r} 1 & 0 & 1 & -3 \\ 0 & 1 & 0 & 5 \\ \end{array}\right]$

Converting the matrix back into equations we get this:

$A + C = -3 \\ B = 5 \\$

This says there isn’t enough information to fully solve the equations, and shows how A and C are related, even though B is completely determined.

Note that another way of looking at this is that “A” and “C” are “free variables”. If your equations specify constraints, you are free to choose a value for either A or C; once you choose a value for one, the other becomes defined. B is not a free variable because its value is already determined.

Let’s finish the example from the beginning of the post, showing what happens when we “make up” an equation by transforming one of the equations we already have:

$A + B + C = 2 \\ B = 5\\ -B = -5$

The augmented matrix looks like this:

$\left[\begin{array}{rrr|r} 1 & 1 & 1 & 2 \\ 0 & 1 & 0 & 5 \\ 0 & -1 & 0 & -5 \\ \end{array}\right]$

Putting it in rref, we get this:

$\left[\begin{array}{rrr|r} 1 & 0 & 1 & -3 \\ 0 & 1 & 0 & 5 \\ 0 & 0 & 0 & 0 \\ \end{array}\right]$

As you can see, the rref matrix is the same as it was without the extra “made up” equation, apart from the extra row of zeros in the result.

The number of non zero rows in a matrix in rref is known as the rank of the matrix. In these last two examples, the rank of the matrix was two in both cases. That means that you can tell if adding an equation to a system of equations adds any new, meaningful information or not by seeing if it changes the rank of the matrix for the set of equations. If the rank is the same before and after adding the new equation, it doesn’t add anything new. If the rank does change, that means it does add new information.

This concept of “adding new, meaningful information” actually has a formalized term: linear independence. If a new equation is linearly independent from the other equations in the system, it will change the rank of the rref matrix, else it won’t.

The rank of the matrix for a system of equations tells you how many linearly independent equations there actually are, and the rref gives you those equations in their simplest form.

Lastly, I wanted to mention that the idea of a system of equations being inconsistent is completely separate from the idea of it being underdetermined or overdetermined. A system can be overdetermined and inconsistent, underdetermined and inconsistent, overdetermined and consistent, or underdetermined and consistent. The two properties are independent of each other.

Inverting a Matrix

Interestingly, Gauss-Jordan elimination is also a common way for efficiently inverting a matrix!

How you do that is make an augmented matrix where on the left side you have the matrix you want to invert, and on the right side you have the identity matrix.

Let’s invert a matrix I made up pretty much at random:

$\left[\begin{array}{rrr|rrr} 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 3 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1\\ \end{array}\right]$

Putting that matrix in rref, we get this:

$\left[\begin{array}{rrr|rrr} 1 & 0 & 0 & 1 & 0 & -1 \\ 0 & 1 & 0 & 0 & 0.3333 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1\\ \end{array}\right]$

The matrix on the right side is the inverse of the original matrix we had on the left!

You can double check by using an online matrix inverse calculator if you want: Inverse Matrix Calculator

Note that not all matrices are invertible though! If the left side doesn’t reduce to the identity matrix, the original matrix wasn’t invertible.

Solving Mx = b

Let’s say that you have two vectors x and b, and a matrix M. Let’s say that we know the matrix M and the vector b, and that we are trying to solve for the vector x.

This comes up more often than you might suspect, including when doing “least squares fitting” of an equation to a set of data points (more info on that: Incremental Least Squares Curve Fitting).

One way to solve this equation would be to calculate the inverse matrix of M and multiply that by vector b to get vector x:

$Mx = b\\ x = M^{-1}b$

However, Gauss-Jordan elimination can help us here too.

If we make an augmented matrix where on the left we have M, and on the right we have b, we can put the matrix into rref, which will essentially multiply vector b by the inverse of M, leaving us with the vector x.

For instance, on the left is our matrix M that scales x,y,z by 2. On the right is our vector b, which is the matrix M times our unknown vector x:

$\left[\begin{array}{rrr|r} 2 & 0 & 0 & 2 \\ 0 & 2 & 0 & 4 \\ 0 & 0 & 2 & 8 \\ \end{array}\right]$

Putting that into rref form we get this:

$\left[\begin{array}{rrr|r} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 4 \\ \end{array}\right]$

From this, we know that the value of vector x is the right side of the augmented matrix: (1,2,4)

This only works when the matrix is invertible (aka when the rref goes to an identity matrix).

Source Code

Here is some C++ source code which does Gauss-Jordan elimination. It’s written mainly to be readable, not performant!

#include <stdio.h>
#include <array>
#include <vector>
#include <assert.h>

// Define a vector as an array of floats
template<size_t N>
using TVector = std::array<float, N>;

// Define a matrix as an array of vectors
template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

// Helper function to fill out a matrix
template <size_t M, size_t N>
TMatrix<M, N> MakeMatrix (std::initializer_list<std::initializer_list<float>> matrixData)
{
TMatrix<M, N> matrix;

size_t m = 0;
assert(matrixData.size() == M);
for (const std::initializer_list<float>& rowData : matrixData)
{
assert(rowData.size() == N);
size_t n = 0;
for (float value : rowData)
{
matrix[m][n] = value;
++n;
}
++m;
}

return matrix;
}

// Make a specific row have a 1 in the colIndex, and make all other rows have 0 there
template <size_t M, size_t N>
bool MakeRowClaimVariable (TMatrix<M, N>& matrix, size_t rowIndex, size_t colIndex)
{
// Find a row that has a non zero value in this column and swap it with this row
{
// Find a row that has a non zero value
size_t nonZeroRowIndex = rowIndex;
while (nonZeroRowIndex < M && matrix[nonZeroRowIndex][colIndex] == 0.0f)
++nonZeroRowIndex;

// If there isn't one, nothing to do
if (nonZeroRowIndex == M)
return false;

// Otherwise, swap the row
if (rowIndex != nonZeroRowIndex)
std::swap(matrix[rowIndex], matrix[nonZeroRowIndex]);
}

// Scale this row so that it has a leading one
float scale = 1.0f / matrix[rowIndex][colIndex];
for (size_t normalizeColIndex = colIndex; normalizeColIndex < N; ++normalizeColIndex)
matrix[rowIndex][normalizeColIndex] *= scale;

// Make sure all rows except this one have a zero in this column.
// Do this by subtracting this row from other rows, multiplied by a multiple that makes the column disappear.
for (size_t eliminateRowIndex = 0; eliminateRowIndex < M; ++eliminateRowIndex)
{
if (eliminateRowIndex == rowIndex)
continue;

float scale = matrix[eliminateRowIndex][colIndex];
for (size_t eliminateColIndex = 0; eliminateColIndex < N; ++eliminateColIndex)
matrix[eliminateRowIndex][eliminateColIndex] -= matrix[rowIndex][eliminateColIndex] * scale;
}

return true;
}

// make matrix into reduced row echelon form
template <size_t M, size_t N>
void GaussJordanElimination (TMatrix<M, N>& matrix)
{
size_t rowIndex = 0;
for (size_t colIndex = 0; colIndex < N; ++colIndex)
{
if (MakeRowClaimVariable(matrix, rowIndex, colIndex))
{
++rowIndex;
if (rowIndex == M)
return;
}
}
}

int main (int argc, char **argv)
{
auto matrix = MakeMatrix<3, 4>(
{
{ 2.0f, 0.0f, 0.0f, 2.0f },
{ 0.0f, 2.0f, 0.0f, 4.0f },
{ 0.0f, 0.0f, 2.0f, 8.0f },
});

GaussJordanElimination(matrix);

return 0;
}


I hope you enjoyed this post and/or learned something from it. This is a precursor to an interesting (but maybe obscure) topic for my next blog post, which involves a graphics / gamedev thing.

Any comments, questions or corrections, let me know in the comments below or on twitter at @Atrix256

Neural Network Recipe: Recognize Handwritten Digits With 95% Accuracy

This post is a recipe for making a neural network which is able to recognize handwritten numeric digits (0-9) with 95% accuracy.

The intent is that you can use this recipe (and included simple C++ code, and interactive web demo!) as a starting point for some hands on experimentation.

A recent post of mine talks about all the things used in this recipe so give it a read if you want more info about anything: How to Train Neural Networks With Backpropagation.

This recipe is also taken straight from this amazing website (but coded from scratch in C++ by myself), where it’s implemented in python: Using neural nets to recognize handwritten digits.

Recipe

The neural network takes as input 28×28 greyscale images, so there will be 784 input neurons.

There is one hidden layer that has 30 neurons.

The final layer is the output layer which has 10 neurons.

The output neuron with the highest activation is the digit that was recognized. For instance if output neuron 0 had the highest activation, the network detected a 0. If output neuron 2 was highest, the network detected a 2.

The neurons use the sigmoid activation function, and the cost function used is half mean squared error.

Training uses a learning rate of 3.0 and the training data is processed by the network 30 times (aka 30 training epochs), using a minibatch size of 10.

A minibatch size of 10 just means that we calculate the gradient for 10 training samples at a time and adjust the weights and biases using that gradient. We do that for the entire (shuffled) 60,000 training items and call that a single epoch. 30 epochs mean we do this full process 30 times.

There are 60,000 items in the training data, mapping 28×28 greyscale images to what digit 0-9 they actually represent.

Besides the 60,000 training data items, there are also 10,000 separate items that are the test data. These test data items are items never seen by the network during training and are just used as a way to see how well the network has learned about the problem in general, versus learning about the specific training data items.

The test and training data is the MNIST data set. I have a link to zip file I made with the data in it below, but this is where I got the data from: The MNIST database of handwritten digits.

That is the entire recipe!

Results

The 30 training epochs took 1 minute 22 seconds on my machine in release x64 (with REPORT_ERROR_WHILE_TRAINING() set to 0 to speed things up), but the code could be made to run faster by using SIMD, putting it on the GPU, getting multithreading involved or other things.

Below is a graph of the accuracy during the training epochs.

Notice that most learning happened very early on and then only slowly improved from there. This is due to our neuron activation functions and also our cost function. There are better choices for both, but this is also an ongoing area of research to improve in neural networks.

The end result of my training run is 95.32% accuracy but you may get slightly higher or lower due to random initialization of weights and biases. That sounds pretty high, but if you were actually using this, 4 or 5 numbers wrong out of 100 is a pretty big deal! The record for MNIST is 99.77% accuracy using “a committee of convolutional networks” where they distorted the input data during training to make it learn in a more generalized way (described as “committee of 35 conv. net, 1-20-P-40-P-150-10 [elastic distortions]”).

A better cost function would probably be the cross entropy cost function, a better activation function than sigmoid would probably be an ELU (Exponential Linear Unit). A soft max layer could be used instead of just taking the maximum output neuron as the answer. The weights could be initialized to smarter values. We could also use a convolutional layer to help let the network learn features in a way that didn’t also tie the features to specific locations in the images.

Many of these things are described in good detail at http://neuralnetworksanddeeplearning.com/, particularly in chapter 3 where they make a python implementation of a convolutional neural network which performs better than this one. I highly recommend checking that website out!

HTML5 Demo

You can play with a network created with this recipe here: Recognize Handwritten Digit 95% Accuracy

Here is an example of it correctly detecting that I drew a 4.

The demo works “pretty well” but it does have a little less than 95% accuracy.

The reason for this though is that it isn’t comparing apples to apples.

A handwritten digit isn’t quite the same as a digit drawn with a mouse. Check out the image below to see 100 of the training images and see what I mean.

The demo finds the bounding box of the drawn image and rescales that bounding box to a 20×20 image, preserving the aspect ratio. It then puts that into a 28×28 image, using the center of mass of the pixels to center the smaller image in the larger one. This is how the MNIST data was generated, so it makes the demo more accurate, and it also has the nice side effect that you can draw a number of any size, in any part of the box, and it will be treated the same as if you drew it at a different size, or in a different part of the box.

The code that goes with this post outputs the weights, biases and network structure in a json format that is very easy to drop into the html5 demo. This way, if you want to try different things in the network, it should be fairly low effort to adjust the demo to try your adjustments there as well.

Lastly, it might be interesting to get the derivatives of the inputs and play around with the input you gave it. Some experiments I can think of:

1. When it misclassifies what number you drew, have it adjust what you drew (the input) to be more like what the network would expect to see for that digit. This could help show why it misclassified your number.
2. Start with a well classified number and make it morph into something recognized by the network as a different number.
3. Start with a random static (noise) image and adjust it until the network recognizes it as a digit. It would be interesting to see if it looked anything like a number, or if it was still just static.

Source Code

The source code and mnist data is on github at MNIST1, but is also included below for your convenience.

If grabbing the source code from below instead of github, you will need to extract this zip file into the working directory of the program as well. It contains the test data used for training the network.
mnist.zip

Note: it’s been reported that on some compilers/OSes, std::random_device will use /dev/urandom, which can result in lower quality random numbers, which in turn can result in the network not learning as quickly or to as high an accuracy. You can force it to use /dev/random instead, which will fix that problem. Details can be found here: http://en.cppreference.com/w/cpp/numeric/random/random_device/random_device

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <random>
#include <array>
#include <vector>
#include <algorithm>
#include <chrono>

typedef uint32_t uint32;
typedef uint16_t uint16;
typedef uint8_t uint8;

// Set to 1 to have it show error after each training and also writes it to an Error.csv file.
// Slows down the process a bit (+~50% time on my machine)
#define REPORT_ERROR_WHILE_TRAINING() 1

const size_t c_numInputNeurons = 784;
const size_t c_numHiddenNeurons = 30;  // NOTE: setting this to 100 hidden neurons can give better results, but also can be worse other times.
const size_t c_numOutputNeurons = 10;

const size_t c_trainingEpochs = 30;
const size_t c_miniBatchSize = 10;
const float c_learningRate = 3.0f;

// ============================================================================================
//                                     SBlockTimer
// ============================================================================================
// times a block of code
struct SBlockTimer
{
SBlockTimer (const char* label)
{
m_start = std::chrono::high_resolution_clock::now();
m_label = label;
}

~SBlockTimer ()
{
std::chrono::duration<float> seconds = std::chrono::high_resolution_clock::now() - m_start;
printf("%s%0.2f seconds\n", m_label, seconds.count());
}

std::chrono::high_resolution_clock::time_point m_start;
const char* m_label;
};

// ============================================================================================
// ============================================================================================

inline uint32 EndianSwap (uint32 a)
{
return (a<<24) | ((a<<8) & 0x00ff0000) |
((a>>8) & 0x0000ff00) | (a>>24);
}

// MNIST data and file format description is from http://yann.lecun.com/exdb/mnist/
class CMNISTData
{
public:
CMNISTData ()
{
m_labelData = nullptr;
m_imageData = nullptr;

m_imageCount = 0;
m_labels = nullptr;
m_pixels = nullptr;
}

bool Load (bool training)
{
// set the expected image count
m_imageCount = training ? 60000 : 10000;

const char* labelsFileName = training ? "train-labels.idx1-ubyte" : "t10k-labels.idx1-ubyte";
FILE* file = fopen(labelsFileName,"rb");
if (!file)
{
printf("could not open %s for reading.\n", labelsFileName);
return false;
}
fseek(file, 0, SEEK_END);
long fileSize = ftell(file);
fseek(file, 0, SEEK_SET);
m_labelData = new uint8[fileSize];
fread(m_labelData, fileSize, 1, file);
fclose(file);

const char* imagesFileName = training ? "train-images.idx3-ubyte" : "t10k-images.idx3-ubyte";
file = fopen(imagesFileName, "rb");
if (!file)
{
printf("could not open %s for reading.\n", imagesFileName);
return false;
}
fseek(file, 0, SEEK_END);
fileSize = ftell(file);
fseek(file, 0, SEEK_SET);
m_imageData = new uint8[fileSize];
fread(m_imageData, fileSize, 1, file);
fclose(file);

// endian swap label file if needed, just first two uint32's.  The rest is uint8's.
uint32* data = (uint32*)m_labelData;
if (data[0] == 0x01080000)
{
data[0] = EndianSwap(data[0]);
data[1] = EndianSwap(data[1]);
}

// verify that the label file has the right header
if (data[0] != 2049 || data[1] != m_imageCount)
{
return false;
}
m_labels = (uint8*)&(data[2]);

// endian swap the image file if needed, just first 4 uint32's. The rest is uint8's.
data = (uint32*)m_imageData;
if (data[0] == 0x03080000)
{
data[0] = EndianSwap(data[0]);
data[1] = EndianSwap(data[1]);
data[2] = EndianSwap(data[2]);
data[3] = EndianSwap(data[3]);
}

// verify that the image file has the right header
if (data[0] != 2051 || data[1] != m_imageCount || data[2] != 28 || data[3] != 28)
{
return false;
}
m_pixels = (uint8*)&(data[4]);

// convert the pixels from uint8 to float
m_pixelsFloat.resize(28 * 28 * m_imageCount);
for (size_t i = 0; i < 28 * 28 * m_imageCount; ++i)
m_pixelsFloat[i] = float(m_pixels[i]) / 255.0f;

// success!
return true;
}

~CMNISTData ()
{
delete[] m_labelData;
delete[] m_imageData;
}

size_t NumImages () const { return m_imageCount; }

const float* GetImage (size_t index, uint8& label) const
{
label = m_labels[index];
return &m_pixelsFloat[index * 28 * 28];
}

private:
uint8* m_labelData;
uint8* m_imageData;

size_t m_imageCount;
uint8* m_labels;
uint8* m_pixels;

std::vector<float> m_pixelsFloat;
};

// ============================================================================================
//                                    NEURAL NETWORK
// ============================================================================================

template <size_t INPUTS, size_t HIDDEN_NEURONS, size_t OUTPUT_NEURONS>
class CNeuralNetwork
{
public:
CNeuralNetwork ()
{
// initialize weights and biases to a gaussian distribution random number with mean 0, stddev 1.0
std::random_device rd;
std::mt19937 e2(rd());
std::normal_distribution<float> dist(0, 1);

for (float& f : m_hiddenLayerBiases)
f = dist(e2);

for (float& f : m_outputLayerBiases)
f = dist(e2);

for (float& f : m_hiddenLayerWeights)
f = dist(e2);

for (float& f : m_outputLayerWeights)
f = dist(e2);
}

void Train (const CMNISTData& trainingData, size_t miniBatchSize, float learningRate)
{
// shuffle the order of the training data for our mini batches
if (m_trainingOrder.size() != trainingData.NumImages())
{
m_trainingOrder.resize(trainingData.NumImages());
size_t index = 0;
for (size_t& v : m_trainingOrder)
{
v = index;
++index;
}
}
static std::random_device rd;
static std::mt19937 e2(rd());
std::shuffle(m_trainingOrder.begin(), m_trainingOrder.end(), e2);

// process all minibatches until we are out of training examples
size_t trainingIndex = 0;
while (trainingIndex < trainingData.NumImages())
{
// Clear out minibatch derivatives.  We sum them up and then divide at the end of the minibatch
std::fill(m_miniBatchHiddenLayerBiasesDeltaCost.begin(), m_miniBatchHiddenLayerBiasesDeltaCost.end(), 0.0f);
std::fill(m_miniBatchOutputLayerBiasesDeltaCost.begin(), m_miniBatchOutputLayerBiasesDeltaCost.end(), 0.0f);
std::fill(m_miniBatchHiddenLayerWeightsDeltaCost.begin(), m_miniBatchHiddenLayerWeightsDeltaCost.end(), 0.0f);
std::fill(m_miniBatchOutputLayerWeightsDeltaCost.begin(), m_miniBatchOutputLayerWeightsDeltaCost.end(), 0.0f);

// process the minibatch
size_t miniBatchIndex = 0;
while (miniBatchIndex < miniBatchSize && trainingIndex < trainingData.NumImages())
{
// get the training item
uint8 imageLabel = 0;
const float* pixels = trainingData.GetImage(m_trainingOrder[trainingIndex], imageLabel);

// run the forward pass of the network
uint8 labelDetected = ForwardPass(pixels, imageLabel);

// run the backward pass to get derivatives of the cost function
BackwardPass(pixels, imageLabel);

// add the current derivatives into the minibatch derivative arrays so we can average them at the end of the minibatch via division.
for (size_t i = 0; i < m_hiddenLayerBiasesDeltaCost.size(); ++i)
m_miniBatchHiddenLayerBiasesDeltaCost[i] += m_hiddenLayerBiasesDeltaCost[i];
for (size_t i = 0; i < m_outputLayerBiasesDeltaCost.size(); ++i)
m_miniBatchOutputLayerBiasesDeltaCost[i] += m_outputLayerBiasesDeltaCost[i];
for (size_t i = 0; i < m_hiddenLayerWeightsDeltaCost.size(); ++i)
m_miniBatchHiddenLayerWeightsDeltaCost[i] += m_hiddenLayerWeightsDeltaCost[i];
for (size_t i = 0; i < m_outputLayerWeightsDeltaCost.size(); ++i)
m_miniBatchOutputLayerWeightsDeltaCost[i] += m_outputLayerWeightsDeltaCost[i];

// note that we've added another item to the minibatch, and that we've consumed another training example
++trainingIndex;
++miniBatchIndex;
}

// divide minibatch derivatives by how many items were in the minibatch, to get the average of the derivatives.
// NOTE: instead of doing this explicitly like in the commented code below, we'll do it implicitly
// by dividing the learning rate by miniBatchIndex.
/*
for (float& f : m_miniBatchHiddenLayerBiasesDeltaCost)
f /= float(miniBatchIndex);
for (float& f : m_miniBatchOutputLayerBiasesDeltaCost)
f /= float(miniBatchIndex);
for (float& f : m_miniBatchHiddenLayerWeightsDeltaCost)
f /= float(miniBatchIndex);
for (float& f : m_miniBatchOutputLayerWeightsDeltaCost)
f /= float(miniBatchIndex);
*/

float miniBatchLearningRate = learningRate / float(miniBatchIndex);

// apply training to biases and weights
for (size_t i = 0; i < m_hiddenLayerBiases.size(); ++i)
m_hiddenLayerBiases[i] -= m_miniBatchHiddenLayerBiasesDeltaCost[i] * miniBatchLearningRate;
for (size_t i = 0; i < m_outputLayerBiases.size(); ++i)
m_outputLayerBiases[i] -= m_miniBatchOutputLayerBiasesDeltaCost[i] * miniBatchLearningRate;
for (size_t i = 0; i < m_hiddenLayerWeights.size(); ++i)
m_hiddenLayerWeights[i] -= m_miniBatchHiddenLayerWeightsDeltaCost[i] * miniBatchLearningRate;
for (size_t i = 0; i < m_outputLayerWeights.size(); ++i)
m_outputLayerWeights[i] -= m_miniBatchOutputLayerWeightsDeltaCost[i] * miniBatchLearningRate;
}
}

// This function evaluates the network for the given input pixels and returns the label it thinks it is from 0-9
uint8 ForwardPass (const float* pixels, uint8 correctLabel)
{
// first do hidden layer
for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
{
float Z = m_hiddenLayerBiases[neuronIndex];

for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
Z += pixels[inputIndex] * m_hiddenLayerWeights[HiddenLayerWeightIndex(inputIndex, neuronIndex)];

m_hiddenLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
}

// then do output layer
for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
float Z = m_outputLayerBiases[neuronIndex];

for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
Z += m_hiddenLayerOutputs[inputIndex] * m_outputLayerWeights[OutputLayerWeightIndex(inputIndex, neuronIndex)];

m_outputLayerOutputs[neuronIndex] = 1.0f / (1.0f + std::exp(-Z));
}

// calculate error.
// this is the magnitude of the vector that is Desired - Actual.
// Commenting out because it's not needed.
/*
{
error = 0.0f;
for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;
float diff = (desiredOutput - m_outputLayerOutputs[neuronIndex]);
error += diff * diff;
}
error = std::sqrt(error);
}
*/

// find the maximum value of the output layer and return that index as the label
float maxOutput = m_outputLayerOutputs[0];
uint8 maxLabel = 0;
for (uint8 neuronIndex = 1; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
if (m_outputLayerOutputs[neuronIndex] > maxOutput)
{
maxOutput = m_outputLayerOutputs[neuronIndex];
maxLabel = neuronIndex;
}
}
return maxLabel;
}

// Functions to get weights/bias values. Used to make the JSON file.
const std::array<float, HIDDEN_NEURONS>& GetHiddenLayerBiases () const { return m_hiddenLayerBiases; }
const std::array<float, OUTPUT_NEURONS>& GetOutputLayerBiases () const { return m_outputLayerBiases; }
const std::array<float, INPUTS * HIDDEN_NEURONS>& GetHiddenLayerWeights () const { return m_hiddenLayerWeights; }
const std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>& GetOutputLayerWeights () const { return m_outputLayerWeights; }

private:

static size_t HiddenLayerWeightIndex (size_t inputIndex, size_t hiddenLayerNeuronIndex)
{
return hiddenLayerNeuronIndex * INPUTS + inputIndex;
}

static size_t OutputLayerWeightIndex (size_t hiddenLayerNeuronIndex, size_t outputLayerNeuronIndex)
{
return outputLayerNeuronIndex * HIDDEN_NEURONS + hiddenLayerNeuronIndex;
}

// this function uses the neuron output values from the forward pass to backpropagate the error
// of the network to calculate the gradient needed for training.  It figures out what the error
// is by comparing the label it came up with to the label it should have come up with (correctLabel).
void BackwardPass (const float* pixels, uint8 correctLabel)
{
// since we are going backwards, do the output layer first
for (size_t neuronIndex = 0; neuronIndex < OUTPUT_NEURONS; ++neuronIndex)
{
// calculate deltaCost/deltaBias for each output neuron.
// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
//
// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
//
// deltaCost/deltaO = O - desiredOutput
// deltaO/deltaZ = O * (1 - O)
//
float desiredOutput = (correctLabel == neuronIndex) ? 1.0f : 0.0f;

float deltaCost_deltaO = m_outputLayerOutputs[neuronIndex] - desiredOutput;
float deltaO_deltaZ = m_outputLayerOutputs[neuronIndex] * (1.0f - m_outputLayerOutputs[neuronIndex]);

m_outputLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

// calculate deltaCost/deltaWeight for each weight going into the neuron
//
// deltaCost/deltaWeight = deltaCost/deltaZ * deltaZ/deltaWeight
// deltaCost/deltaWeight = deltaCost/deltaBias * input
//
for (size_t inputIndex = 0; inputIndex < HIDDEN_NEURONS; ++inputIndex)
m_outputLayerWeightsDeltaCost[OutputLayerWeightIndex(inputIndex, neuronIndex)] = m_outputLayerBiasesDeltaCost[neuronIndex] * m_hiddenLayerOutputs[inputIndex];
}

// then do the hidden layer
for (size_t neuronIndex = 0; neuronIndex < HIDDEN_NEURONS; ++neuronIndex)
{
// calculate deltaCost/deltaBias for each hidden neuron.
// This is also the error for the neuron, and is the same value as deltaCost/deltaZ.
//
// deltaCost/deltaO =
//   Sum for each destination neuron this neuron feeds into:
//     destinationNeuronError * deltaTargetZ/deltaSourceO
//
// deltaTargetZ/deltaSourceO is the value of the weight connecting the source and target neuron.
//
// deltaCost/deltaZ = deltaCost/deltaO * deltaO/deltaZ
// deltaO/deltaZ = O * (1 - O)
//
float deltaCost_deltaO = 0.0f;
for (size_t destinationNeuronIndex = 0; destinationNeuronIndex < OUTPUT_NEURONS; ++destinationNeuronIndex)
deltaCost_deltaO += m_outputLayerBiasesDeltaCost[destinationNeuronIndex] * m_outputLayerWeights[OutputLayerWeightIndex(neuronIndex, destinationNeuronIndex)];
float deltaO_deltaZ = m_hiddenLayerOutputs[neuronIndex] * (1.0f - m_hiddenLayerOutputs[neuronIndex]);
m_hiddenLayerBiasesDeltaCost[neuronIndex] = deltaCost_deltaO * deltaO_deltaZ;

// calculate deltaCost/deltaWeight for each weight going into the neuron
//
// deltaCost/deltaWeight = deltaCost/deltaZ * deltaZ/deltaWeight
// deltaCost/deltaWeight = deltaCost/deltaBias * input
//
for (size_t inputIndex = 0; inputIndex < INPUTS; ++inputIndex)
m_hiddenLayerWeightsDeltaCost[HiddenLayerWeightIndex(inputIndex, neuronIndex)] = m_hiddenLayerBiasesDeltaCost[neuronIndex] * pixels[inputIndex];
}
}

private:

// biases and weights
std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiases;
std::array<float, OUTPUT_NEURONS>					m_outputLayerBiases;

std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeights;
std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeights;

// neuron activation values aka "O" values
std::array<float, HIDDEN_NEURONS>					m_hiddenLayerOutputs;
std::array<float, OUTPUT_NEURONS>					m_outputLayerOutputs;

// derivatives of biases and weights for a single training example
std::array<float, HIDDEN_NEURONS>					m_hiddenLayerBiasesDeltaCost;
std::array<float, OUTPUT_NEURONS>					m_outputLayerBiasesDeltaCost;

std::array<float, INPUTS * HIDDEN_NEURONS>			m_hiddenLayerWeightsDeltaCost;
std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_outputLayerWeightsDeltaCost;

// derivatives of biases and weights for the minibatch. Average of all items in minibatch.
std::array<float, HIDDEN_NEURONS>					m_miniBatchHiddenLayerBiasesDeltaCost;
std::array<float, OUTPUT_NEURONS>					m_miniBatchOutputLayerBiasesDeltaCost;

std::array<float, INPUTS * HIDDEN_NEURONS>			m_miniBatchHiddenLayerWeightsDeltaCost;
std::array<float, HIDDEN_NEURONS * OUTPUT_NEURONS>	m_miniBatchOutputLayerWeightsDeltaCost;

// used for minibatch generation
std::vector<size_t>									m_trainingOrder;
};

// ============================================================================================
//                                   DRIVER PROGRAM
// ============================================================================================

// training and test data
CMNISTData g_trainingData;
CMNISTData g_testData;

// neural network
CNeuralNetwork<c_numInputNeurons, c_numHiddenNeurons, c_numOutputNeurons> g_neuralNetwork;

float GetDataAccuracy (const CMNISTData& data)
{
size_t correctItems = 0;
for (size_t i = 0, c = data.NumImages(); i < c; ++i)
{
uint8 label;
const float* pixels = data.GetImage(i, label);
uint8 detectedLabel = g_neuralNetwork.ForwardPass(pixels, label);

if (detectedLabel == label)
++correctItems;
}
return float(correctItems) / float(data.NumImages());
}

void ShowImage (const CMNISTData& data, size_t imageIndex)
{
uint8 label = 0;
const float* pixels = data.GetImage(imageIndex, label);
printf("showing a %i\n", label);
for (int iy = 0; iy < 28; ++iy)
{
for (int ix = 0; ix < 28; ++ix)
{
if (*pixels < 0.125)
printf(" ");
else
printf("+");
++pixels;
}
printf("\n");
}
}

int main (int argc, char** argv)
{
// load the MNIST data if we can
if (!g_trainingData.Load(true) || !g_testData.Load(false))
{
printf("Could not load mnist data, aborting!\n");
system("pause");
return 1;
}

#if REPORT_ERROR_WHILE_TRAINING()
FILE *file = fopen("Error.csv","w+t");
if (!file)
{
printf("Could not open Error.csv for writing, aborting!\n");
system("pause");
return 2;
}
fprintf(file, "\"Training Data Accuracy\",\"Testing Data Accuracy\"\n");
#endif

{
SBlockTimer timer("Training Time:  ");

// train the network, reporting error before each training
for (size_t epoch = 0; epoch < c_trainingEpochs; ++epoch)
{
#if REPORT_ERROR_WHILE_TRAINING()
float accuracyTraining = GetDataAccuracy(g_trainingData);
float accuracyTest = GetDataAccuracy(g_testData);
printf("Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
printf("Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);
fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
#endif

printf("Training epoch %zu / %zu...\n", epoch+1, c_trainingEpochs);
g_neuralNetwork.Train(g_trainingData, c_miniBatchSize, c_learningRate);
printf("\n");
}
}

// report final error
float accuracyTraining = GetDataAccuracy(g_trainingData);
float accuracyTest = GetDataAccuracy(g_testData);
printf("\nFinal Training Data Accuracy: %0.2f%%\n", 100.0f*accuracyTraining);
printf("Final Test Data Accuracy: %0.2f%%\n\n", 100.0f*accuracyTest);

#if REPORT_ERROR_WHILE_TRAINING()
fprintf(file, "\"%f\",\"%f\"\n", accuracyTraining, accuracyTest);
fclose(file);
#endif

// Write out the final weights and biases as JSON for use in the web demo
{
FILE* file = fopen("WeightsBiasesJSON.txt", "w+t");
fprintf(file, "{\n");

// network structure
fprintf(file, "  \"InputNeurons\":%zu,\n", c_numInputNeurons);
fprintf(file, "  \"HiddenNeurons\":%zu,\n", c_numHiddenNeurons);
fprintf(file, "  \"OutputNeurons\":%zu,\n", c_numOutputNeurons);

// HiddenBiases
auto hiddenBiases = g_neuralNetwork.GetHiddenLayerBiases();
fprintf(file, "  \"HiddenBiases\" : [\n");
for (size_t i = 0; i < hiddenBiases.size(); ++i)
{
fprintf(file, "    %f", hiddenBiases[i]);
if (i < hiddenBiases.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ],\n");

// HiddenWeights
auto hiddenWeights = g_neuralNetwork.GetHiddenLayerWeights();
fprintf(file, "  \"HiddenWeights\" : [\n");
for (size_t i = 0; i < hiddenWeights.size(); ++i)
{
fprintf(file, "    %f", hiddenWeights[i]);
if (i < hiddenWeights.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ],\n");

// OutputBiases
auto outputBiases = g_neuralNetwork.GetOutputLayerBiases();
fprintf(file, "  \"OutputBiases\" : [\n");
for (size_t i = 0; i < outputBiases.size(); ++i)
{
fprintf(file, "    %f", outputBiases[i]);
if (i < outputBiases.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ],\n");

// OutputWeights
auto outputWeights = g_neuralNetwork.GetOutputLayerWeights();
fprintf(file, "  \"OutputWeights\" : [\n");
for (size_t i = 0; i < outputWeights.size(); ++i)
{
fprintf(file, "    %f", outputWeights[i]);
if (i < outputWeights.size() - 1)
fprintf(file, ",");
fprintf(file, "\n");
}
fprintf(file, "  ]\n");

// done
fprintf(file, "}\n");
fclose(file);
}

// You can use the code like the below to visualize an image if you want to.
//ShowImage(g_testData, 0);

system("pause");
return 0;
}


Thanks for reading, and if you have any questions, comments, or just want to chat, hit me up in the comments below, or on twitter at @Atrix256.

Neural Network Gradients: Backpropagation, Dual Numbers, Finite Differences

In the post How to Train Neural Networks With Backpropagation I said that you could also calculate the gradient of a neural network by using dual numbers or finite differences.

By special request, this is that post!

If you want an explanation of dual numbers, check out these posts:

If you want an explanation of finite differences, check out this post:
Finite Differences

Since the fundamentals are explained in the links above, we’ll go straight to the code.

We’ll be getting the gradient (learning values) for the network in example 4 in the backpropagation post:

Note that I am using “central differences” for the gradient, but it would be more efficient to do a forward or backward difference, at the cost of some accuracy. I’m using an epsilon of 0.001.

I didn’t compare the running times of each method as my code is meant to be readable, not fast, and the code isn’t doing enough work to make a meaningful performance test IMO. If you did do a speed test, the finite differences should be by far the slowest, backpropagation should be the fastest, and dual numbers are probably going to be closer to backpropagation than to finite differences.

The output of the program is below. Both backpropagation and dual numbers get the exact derivatives (within the tolerances of floating point math, of course!) because they use the chain rule, whereas finite differences is a numerical approximation. You can see this in the output: backpropagation and dual numbers agree for all values, while finite differences has some small error in the derivatives.

And here is the code:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>

#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

//----------------------------------------------------------------------
// Dual Number Class - CDualNumber
//----------------------------------------------------------------------

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

// constructor to make a constant
CDualNumber (float f = 0.0f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
CDualNumber (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// Set a constant value.
void Set (float f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// Set a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
void Set (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// storage for real and dual values
float                           m_real;
std::array<float, NUMVARIABLES> m_dual;
};

//----------------------------------------------------------------------
// Dual Number Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real + b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real - b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real * b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real / b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
return ret;
}

// NOTE: the "special functions" below all just use the chain rule, which you can also use to add more functions

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
float sqrtReal = sqrt(a.m_real);
ret.m_real = sqrtReal;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = pow(a.m_real, y);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> exp (const CDualNumber<NUMVARIABLES>& a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = exp(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * exp(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = sin(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = cos(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = atan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Driver Program
//----------------------------------------------------------------------

enum EWeightsBiases
{
e_weight0 = 0,
e_weight1,
e_weight2,
e_weight3,
e_weight4,
e_weight5,
e_weight6,
e_weight7,
e_bias0,
e_bias1,
e_bias2,
e_bias3,

e_count
};

// our dual number needs a "dual" for every value we want a derivative for: aka every weight and bias
typedef CDualNumber<EWeightsBiases::e_count> TDualNumber;

// templated so it can work for both the dual numbers, as well as the float finite differences
template <typename TBaseType>
void ForwardPass (const std::array<TBaseType, 2>& input, const std::array<TBaseType, 2>& desiredOutput, const std::array<TBaseType, EWeightsBiases::e_count>& weightsBiases, TBaseType& cost)
{
// calculate hidden layer neuron activations
TBaseType Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
TBaseType O0 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z0 * TBaseType(-1.0f)));

TBaseType Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
TBaseType O1 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z1 * TBaseType(-1.0f)));

// calculate output layer neuron activations
TBaseType Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
TBaseType O2 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z2 * TBaseType(-1.0f)));

TBaseType Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
TBaseType O3 = TBaseType(1.0f) / (TBaseType(1.0f) + exp(Z3 * TBaseType(-1.0f)));

// calculate the cost: 0.5 * ||target-actual||^2
// aka cost = half (error squared)
TBaseType diff1 = TBaseType(desiredOutput[0]) - O2;
TBaseType diff2 = TBaseType(desiredOutput[1]) - O3;
cost = TBaseType(0.5f) * (diff1*diff1 + diff2*diff2);
}

// backpropagation
void ForwardPassAndBackpropagation (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, EWeightsBiases::e_count>& weightsBiases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, EWeightsBiases::e_count>& deltaWeightsBiases
) {
// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weightsBiases[e_weight0] + input[1] * weightsBiases[e_weight1] + weightsBiases[e_bias0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weightsBiases[e_weight2] + input[1] * weightsBiases[e_weight3] + weightsBiases[e_bias1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// calculate Z2 and O2 for neuron2
float Z2 = O0 * weightsBiases[e_weight4] + O1 * weightsBiases[e_weight5] + weightsBiases[e_bias2];
float O2 = 1.0f / (1.0f + std::exp(-Z2));

// calculate Z3 and O3 for neuron3
float Z3 = O0 * weightsBiases[e_weight6] + O1 * weightsBiases[e_weight7] + weightsBiases[e_bias3];
float O3 = 1.0f / (1.0f + std::exp(-Z3));

// the actual output of the network is the activation of the output layer neurons
actualOutput[0] = O2;
actualOutput[1] = O3;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 2 -----

// calculate how much a change in neuron 2 activation affects the cost function
// deltaCost/deltaO2 = O2 - target0
float deltaCost_deltaO2 = O2 - desiredOutput[0];

// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
// deltaO2/deltaZ2 = O2 * (1 - O2)
float deltaO2_deltaZ2 = O2 * (1 - O2);

// calculate how much a change in neuron 2 weighted input affects the cost function.
// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
deltaWeightsBiases[e_bias2] = neuron2Error;

// calculate how much a change in weight4 affects the cost function.
// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * O0
// similar thing for weight5
deltaWeightsBiases[e_weight4] = neuron2Error * O0;
deltaWeightsBiases[e_weight5] = neuron2Error * O1;

//----- Neuron 3 -----

// calculate how much a change in neuron 3 activation affects the cost function
// deltaCost/deltaO3 = O3 - target1
float deltaCost_deltaO3 = O3 - desiredOutput[1];

// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
// deltaO3/deltaZ3 = O3 * (1 - O3)
float deltaO3_deltaZ3 = O3 * (1 - O3);

// calculate how much a change in neuron 3 weighted input affects the cost function.
// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
deltaWeightsBiases[e_bias3] = neuron3Error;

// calculate how much a change in weight6 affects the cost function.
// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * O0
// similar thing for weight7
deltaWeightsBiases[e_weight6] = neuron3Error * O0;
deltaWeightsBiases[e_weight7] = neuron3Error * O1;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
float deltaCost_deltaO0 = neuron2Error * weightsBiases[e_weight4] + neuron3Error * weightsBiases[e_weight6];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaWeightsBiases[e_bias0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaWeightsBiases[e_weight0] = neuron0Error * input[0];
deltaWeightsBiases[e_weight1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
float deltaCost_deltaO1 = neuron2Error * weightsBiases[e_weight5] + neuron3Error * weightsBiases[e_weight7];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO1/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaWeightsBiases[e_bias1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaWeightsBiases[e_weight2] = neuron1Error * input[0];
deltaWeightsBiases[e_weight3] = neuron1Error * input[1];
}

int main (int argc, char **argv)
{

// weights and biases, inputs and desired output
const std::array<float, EWeightsBiases::e_count> weightsBiases =
{
0.15f, // e_weight0
0.2f,  // e_weight1
0.25f, // e_weight2
0.3f,  // e_weight3
0.4f,  // e_weight4
0.45f, // e_weight5
0.5f,  // e_weight6
0.55f, // e_weight7
0.35f, // e_bias0
0.35f, // e_bias1
0.6f,  // e_bias2
0.6f   // e_bias3
};

const std::array<float, 2> inputs =
{
0.05f,
0.1f
};

std::array<float, 2> desiredOutput = {
0.01f,
0.99f
};

// =====================================================
// ===== FINITE DIFFERENCES CALCULATED DERIVATIVES =====
// =====================================================

std::array<float, EWeightsBiases::e_count> weightsBiasesFloat;
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
weightsBiasesFloat[i] = weightsBiases[i];

// use central differences to approximate the gradient
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
{
float costSample1 = 0.0f;
weightsBiasesFloat[i] = weightsBiases[i] - EPSILON;
ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample1);

float costSample2 = 0.0f;
weightsBiasesFloat[i] = weightsBiases[i] + EPSILON;
ForwardPass(inputs, desiredOutput, weightsBiasesFloat, costSample2);

gradientFiniteDifferences[i] = (costSample2 - costSample1) / (EPSILON * 2.0f);

weightsBiasesFloat[i] = weightsBiases[i];
}

// ==============================================
// ===== DUAL NUMBER CALCULATED DERIVATIVES =====
// ==============================================

// dual number weights and biases
std::array<TDualNumber, EWeightsBiases::e_count> weightsBiasesDual;
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
weightsBiasesDual[i].Set(weightsBiases[i], i);

// dual number inputs and desired output
std::array<TDualNumber, 2> inputsDual;
for (size_t i = 0; i < inputsDual.size(); ++i)
inputsDual[i].Set(inputs[i]);

std::array<TDualNumber, 2> desiredOutputDual;
for (size_t i = 0; i < desiredOutputDual.size(); ++i)
desiredOutputDual[i].Set(desiredOutput[i]);

// dual number derivatives

// ==================================================
// ===== BACKPROPAGATION CALCULATED DERIVATIVES =====
// ==================================================

float error;
float cost;
std::array<float, 2> actualOutput;
ForwardPassAndBackpropagation(inputs, desiredOutput, weightsBiases, error, cost, actualOutput, gradientBackPropagation);

// ==========================
// ===== Report Results =====
// ==========================

printf("Neural Network Gradient\n\nBackpropagation     Dual Numbers (Error)       Finite Differences (Error)\n");
for (size_t i = 0; i < EWeightsBiases::e_count; ++i)
{
printf("% 08f,         % 08f (% 08f),     % 08f (% 08f)\n",
gradientBackPropagation[i],
gradientDualNumbers[i],
std::abs(gradientDualNumbers[i] - gradientBackPropagation[i]),
gradientFiniteDifferences[i],
std::abs(gradientFiniteDifferences[i] - gradientBackPropagation[i])
);
}
printf("\n");

system("pause");
return 0;
}


How to Train Neural Networks With Backpropagation

This post is an attempt to demystify backpropagation, which is the most common method for training neural networks. This post is broken into a few main sections:

1. Explanation
2. Working through examples
3. Simple sample C++ source code using only standard includes
4. Links to deeper resources to continue learning

Let’s talk about the basics of neural nets to start out, specifically multi-layer perceptrons. This is a common type of neural network, and is the type we will be talking about today. There are other types of neural networks, though, such as convolutional neural networks, recurrent neural networks, Hopfield networks and more. The good news is that backpropagation applies to most other types of neural networks too, so what you learn here will carry over to those other types of networks.

Basics of Neural Networks

A neural network is made up of layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron. This network takes 3 floating point values as input, passes them through 4 neurons in a hidden layer and outputs two floating point values. The hidden layer neurons and the output layer neurons do processing of the values they are given, but the input neurons do not.

To calculate the output value of a single neuron, you multiply every input into that neuron by a weight for that input, sum them up, and add a bias that is set for the neuron. This “weighted input” value is fed into an activation function and the result is the output value of that neuron. Here is a diagram for a single neuron:

The code for calculating the output of a single neuron could look like this:

float weightedInput = bias;

for (int i = 0; i < inputs.size(); ++i)
weightedInput += inputs[i] * weights[i];

float output = Activation(weightedInput);


To evaluate an entire network of neurons, you just repeat this process for all neurons in the network, going from left to right (from input to output).

Neural networks are basically black boxes. We train them to give specific outputs when we give them specific inputs, but it is often difficult to understand what it is that they’ve learned, or what part of the data they are picking up on.

Training a neural network just means that we adjust the weight and bias values such that when we give specific inputs, we get the desired outputs from the network. Being able to figure out what weights and biases to use can be tricky, especially for networks with lots of layers and lots of neurons per layer. This post talks about how to do just that.

Regarding training, there is a funny story where some people trained a neural network to say whether or not a military tank was in a photograph. It had a very high accuracy rate with the test data they trained it with, but when they used it with new data, it had terrible accuracy. It turns out that the training data was a bit flawed. Pictures of tanks were all taken on a sunny day, and the pictures without tanks were taken on a cloudy day. The network learned how to detect whether a picture was of a sunny day or a cloudy day, not whether there was a tank in the photo or not!

This is one type of pitfall to watch out for when dealing with neural networks – having good training data – but there are many other pitfalls to watch out for too. Architecting and training neural networks is quite literally an art form. If it were painting, this post would be teaching you how to hold a brush and what the primary colors are. There are many, many techniques to learn beyond what is written here to use as tools in your toolbox. The information in this post will allow you to succeed in training neural networks, but there is a lot more to learn to get higher levels of accuracy from your nets!

Neural Networks Learn Using Gradient Descent

Let’s take a look at a simple neural network where we’ve chosen random values for the weights and the bias:

If given two floating point inputs, we’d calculate the output of the network like this:

$Output = Activation(Input0 * Weight0 + Input1 * Weight1 + Bias)$

Plugging in the specific values for the weights and biases, it looks like this:

$Output = Activation(Input0 * 0.23 + Input1 * -0.1 + 0.3)$

Let’s say that we want this network to output a zero when we give an input of 1,0, and that we don’t care what it outputs otherwise. We’ll plug 1 and 0 in for Input0 and Input1 respectively and see what the output of the network is right now:

$Output = Activation(1* 0.23 + 0 * -0.1 + 0.3) \\ Output = Activation(0.53)$

For the activation function, we are going to use a common one called the sigmoid activation function, which is also sometimes called the logistic activation function. It looks like this:

$\sigma(x) = \frac{1}{1+e^{-x}}$

Without going into too much detail, the reason why sigmoid is so commonly used is because it’s a smoother and differentiable version of the step function.

Applying that activation function to our output neuron, we get this:

$Output = Activation(0.53) \\ Output = \sigma(0.53) \\ Output = 0.6295$

So, we plugged in 1 and 0, but instead of getting a 0 out, we got 0.6295. Our weights and biases are wrong, but how do we correct them?

The secret to correcting our weights and biases is whichever of these terms seem least scary to you: slopes, derivatives, gradients.

If “slope” was the least scary term to you, you probably remember the line formula $y=mx+b$ and that the m value was the “rise over run” or the slope of the line. Well believe it or not, that’s all a derivative is. A derivative is just the slope of a function at a specific point on that function. Even if a function is curved, you can pick a point on the graph and get a slope at that point. The notation for a derivative is $\frac{dy}{dx}$, which literally means “change in y divided by change in x”, or “delta y divided by delta x”, which is literally rise over run.

In the case of a linear function (a line), it has the same derivative over the entire thing, so you can take a step size of any size on the x axis and multiply that step size by $\frac{dy}{dx}$ to figure out how much to add or subtract from y to stay on the line.

In the case of a nonlinear function, the derivative can change from one point to the next, so this slope is only guaranteed to be accurate for an infinitely small step size. In practice, people often just use “small” step sizes and call it good enough, which is what we’ll be doing momentarily.

Now that you realize you already knew what a derivative is, we have to talk about partial derivatives. There really isn’t anything very scary about them and they still mean the exact same thing – they are the slope! They are even calculated the exact same way, but they use a fancier looking d in their notation: $\frac{\partial y}{\partial x}$.

The reason partial derivatives even exist is because if you have a function of multiple variables like $z=f(x,y)=x^2+3y+2$, you have two variables that you can take the derivative of. You can calculate $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$. The first value tells you how much the z value changes for a change in x, the second value tells you how much the z value changes for a change in y.

By the way, if you are curious, the partial derivatives for that function above are below. When calculating partial derivatives, you treat any variable that isn’t the one you care about as a constant and differentiate normally.

$\frac{\partial z}{\partial x} = 2x\\ \frac{\partial z}{\partial y} = 3\\$

If you put both of those values together into a vector $(\frac{\partial z}{\partial x},\frac{\partial z}{\partial y})$ you have what is called the gradient vector.

The gradient vector has an interesting property, which is that it points in the direction that makes the function output grow the most. Basically, if you think of your function as a surface, it points up the steepest direction of the surface, from the point you evaluated the function at.

We are going to use that property to train our neural network by doing the following:

1. Calculate the gradient of a function that describes the error in our network. This means we will have the partial derivatives of all the weights and biases in the network.
2. Multiply the gradient by a small “learning rate” value, such as 0.05
3. Subtract these scaled derivatives from the weights and biases to decrease the error a small amount.

This technique is called steepest gradient descent (SGD) and when we do the above, our error will decrease by a small amount. The only exception is that if we use too large of a learning rate, it’s possible that we make the error grow, but usually the error will decrease.

We will do the above over and over, until either the error is small enough, or we’ve decided we’ve tried enough iterations that we think the neural network is never going to learn the things we want to teach it. If the network doesn’t learn, it means it needs to be re-architected with a different structure, different numbers of neurons and layers, different activation functions, etc. This is part of the “art” that I mentioned earlier.

Before moving on, there is one last thing to talk about: global minimums vs local minimums.

Imagine that the function describing the error in our network is visualized as bumpy ground. When we initialize our weights and biases to random numbers we are basically just choosing a random location on the ground to start at. From there, we act like a ball, and just roll downhill from wherever we are. We are definitely going to get to the bottom of SOME bump / hole in the ground, but there is absolutely no reason to expect that we’ll get to the bottom of the DEEPEST bump / hole.

The problem is that SGD will find a LOCAL minimum – whatever we are closest to – but it might not find the GLOBAL minimum.

In practice, this doesn’t seem to be too large of a problem, at least for people casually using neural nets like you and me, but it is one of the active areas of research in neural networks: how do we do better at finding more global minimums?

You might notice the strange language I’m using where I say we have a function that describes the error, instead of just saying we use the error itself. The function I’m talking about is called the “cost function” and the reason for this is that different ways of describing the error give us different desirable properties.

For instance, a common cost function is to use mean squared error of the actual output compared to the desired output.

For a single training example, you plug the input into the network and calculate the output. You then plug the actual output and the target output into the function below:

$Cost = \frac{1}{2}||target-output||^2$

(The $\frac{1}{2}$ is there for convenience: it cancels the 2 that appears when differentiating, and it’s the convention the worked examples below use.)

In other words, you subtract the vector of actual neuron outputs from the vector of target outputs, calculate the length of the resulting vector and square it. This gives you the squared error.

The reason we use squared error in the cost function is that this way, error in either direction is a positive number, so when gradient descent does its work, we’ll find the smallest magnitude of error, regardless of whether it’s positive or negative. We could use absolute value instead, but absolute value isn’t differentiable at zero, while squaring is differentiable everywhere.

To handle calculating the cost of multiple inputs and outputs, you just take the average of the squared error for each piece of training data. This gives you the mean squared error as the cost function across all inputs. You also average the derivatives to get the combined gradient.

More on Training

Before we go into backpropagation, I want to re-iterate this point: Neural Networks Learn Using Gradient Descent.

All you need is the gradient vector of the cost function, aka the partial derivatives of all the weights and the biases for the cost.

Backpropagation gets you the gradient vector, but it isn’t the only way to do so!

Another way to do it is to use dual numbers which you can read about on my post about them: Multivariable Dual Numbers & Automatic Differentiation.

Using dual numbers, you would evaluate the output of the network, using dual numbers instead of floating point numbers, and at the end you’d have your gradient vector. It’s not quite as efficient as backpropagation (or so I’ve heard, I haven’t tried it), but if you know how dual numbers work, it’s super easy to implement.

Another way to get the gradient vector is by doing so numerically using finite differences. You can read about numerical derivatives on my post here: Finite Differences

Basically what you would do is if you were trying to calculate the partial derivative of a weight, like $\frac{\partial Cost}{\partial Weight0}$, you would first calculate the cost of the network as usual, then you would add a small value to Weight0 and evaluate the cost again. You subtract the new cost from the old cost, and divide by the small value you added to Weight0. This will give you the partial derivative for that weight value. You’d repeat this for all your weights and biases.

Since realistic neural networks often have MANY MANY weights and biases, calculating the gradient numerically is a REALLY REALLY slow process because of how many times you have to run the network to get cost values with adjusted weights. The only upside is that this method is even easier to implement than dual numbers. You can literally stop reading and go do this right now if you want to 😛

Lastly, there is a way to train neural networks which doesn’t use derivatives or the gradient vector, but instead uses the more brute force-ish method of genetic algorithms.

Using genetic algorithms to train neural networks is a huge topic even to summarize, but basically you create a bunch of random networks, see how they do, and try combining features of networks that did well. You also let some of the “losers” reproduce as well, and add in some random mutation to help stay out of local minimums. Repeat this for many many generations, and you can end up with a well trained network!

Here’s a fun video visualizing neural networks being trained by genetic algorithms: Youtube: Learning using a genetic algorithm on a neural network

Backpropagation is Just the Chain Rule!

Going back to our talk of dual numbers for a second, dual numbers are useful for what is called “forward mode automatic differentiation”.

Backpropagation actually uses “reverse mode automatic differentiation”, so the two techniques are pretty closely tied, but they are both made possible by what is known as the chain rule.

The chain rule basically says that if you can write a derivative like this:

$dy/dx$

That you can also write it like this:

$dy/du*du/dx$

That might look weird or confusing, but since we know that derivatives are actual values, aka actual ratios, aka actual FRACTIONS, let’s think back to fractions for a moment.

$3/2 = 1.5$

So far so good? Now let’s choose some number out of the air – say, 5 – and do the same thing we did with the chain rule
$3/2 = \\ 3/5 * 5/2 = \\ 15/10 = \\ 3/2 = \\ 1.5$

By doing cross cancellation in reverse, we are able to inject multiplicative terms into fractions (and derivatives!) and come up with the same answer.

Ok, but who cares?

Well, when we are evaluating the output of a neural network for given input, we have lots of equations nested in each other. We have neurons feeding into neurons feeding into neurons etc, with the logistic activation function at each step.

Instead of trying to figure out how to calculate the derivatives of the weights and biases for the entire monster equation (it’s common to have hundreds or thousands of neurons or more!), we can instead calculate derivatives for each step we do when evaluating the network and then compose them together.

Basically, we can break the problem into small bites instead of having to deal with the equation in its entirety.

Instead of calculating the derivative of how a specific weight affects the cost directly, we can instead calculate these:

1. dCost/dOutput: The derivative of how a neuron’s output affects cost
2. dOutput/dWeightedInput: The derivative of how the weighted input of a neuron affects a neuron’s output
3. dWeightedInput/dWeight: The derivative of how a weight affects the weighted input of a neuron

Then, when we multiply them all together, we get the real value that we want:
dCost/dOutput * dOutput/dWeightedInput * dWeightedInput/dWeight = dCost/dWeight

Now that we understand all the basic parts of back propagation, I think it’d be best to work through some examples of increasing complexity to see how it all actually fits together!

Backpropagation Example 1: Single Neuron, One Training Example

This example takes one input and uses a single neuron to make one output. The neuron is only trained to output a 0 when given a 1 as input, all other behavior is undefined. This is implemented as the Example1() function in the sample code.

Backpropagation Example 2: Single Neuron, Two Training Examples

This time, we are going to teach it not only that it should output 0 when given a 1, but also that it should output 1 when given a 0.

We have two training examples, and we are training the neuron to act like a NOT gate. This is implemented as the Example2() function in the sample code.

The first thing we do is calculate the derivatives (gradient vector) for each of the inputs.

We already calculated the “input 1, output 0” derivatives in the last example:
$\frac{\partial Cost}{\partial Weight} = 0.1476 \\ \frac{\partial Cost}{\partial Bias} = 0.1476$

If we follow the same steps with the “input 0, output 1” training example we get these:
$\frac{\partial Cost}{\partial Weight} = 0.0 \\ \frac{\partial Cost}{\partial Bias} = -0.0887$

To get the actual derivatives to train the network with, we just average them!
$\frac{\partial Cost}{\partial Weight} = 0.0738 \\ \frac{\partial Cost}{\partial Bias} = 0.0294$

From there, we do the same adjustments as before to the weight and bias values to get a weight of 0.2631 and a bias of 0.4853.

If you are wondering how to calculate the cost, again you just take the cost of each training example and average them. Adjusting the weight and bias values causes the cost to drop from 0.1547 to 0.1515, so we have made progress.

It takes 10 times as many iterations with these two training examples to get the same level of error as it did with only one training example though.

As we saw in the last example, after 10,000 iterations, the error was 0.007176.

In this example, after 100,000 iterations, the error is 0.007141. At that point, weight is -9.879733 and bias is 4.837278

Backpropagation Example 3: Two Neurons in One Layer

Here is the next example, implemented as Example3() in the sample code. Two input neurons feed to two neurons in a single layer giving two outputs.

Let’s look at how we’d calculate the derivatives needed to train this network using the training example that when we give the network 01 as input that it should give out 10 as output.

First comes the forward pass where we calculate the network’s output when we give it 01 as input.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0*0.2+1*0.8+0.5 \\ Z0=1.3 \\ \\ O0=\sigma(1.3) \\ O0=0.7858\\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0*0.6+1*0.4+0.1\\ Z1=0.5\\ \\ O1=\sigma(0.5)\\ O1=0.6225$

Next we calculate a cost. We don’t strictly need to do this step since we don’t use this value during backpropagation, but this will be useful to verify that we’ve improved things after an iteration of training.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(1,0)-(0.7858,0.6225)||^2\\ Cost=0.5*||(0.2142,-0.6225)||^2\\ Cost=0.5*0.6583^2\\ Cost=0.2167$

Now we begin the backwards pass to calculate the derivatives that we’ll need for training.

Let’s calculate dCost/dZ0 aka the error in neuron 0. We’ll do this by calculating dCost/dO0, then dO0/dZ0 and then multiplying them together to get dCost/dZ0. Just like before, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias0.

$\frac{\partial Cost}{\partial O0}=O0-target0\\ \frac{\partial Cost}{\partial O0}=0.7858-1\\ \frac{\partial Cost}{\partial O0}=-0.2142\\ \\ \frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.7858 * 0.2142\\ \frac{\partial O0}{\partial Z0} = 0.1683\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = -0.2142 * 0.1683\\ \frac{\partial Cost}{\partial Z0} = -0.0360\\ \\ \frac{\partial Cost}{\partial Bias0} = -0.0360$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1 by multiplying it by dZ0/dWeight0 and dZ0/dWeight1, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0} \\ \frac{\partial Cost}{\partial Weight0} = -0.0360 * 0 \\ \frac{\partial Cost}{\partial Weight0} = 0\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1} \\ \frac{\partial Cost}{\partial Weight1} = -0.0360 * 1 \\ \frac{\partial Cost}{\partial Weight1} = -0.0360$

Next we need to calculate dCost/dZ1 aka the error in neuron 1. We’ll do this like before. We’ll calculate dCost/dO1, then dO1/dZ1 and then multiplying them together to get dCost/dZ1. Again, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias1.

$\frac{\partial Cost}{\partial O1}=O1-target1\\ \frac{\partial Cost}{\partial O1}=0.6225-0\\ \frac{\partial Cost}{\partial O1}=0.6225\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.6225 * 0.3775\\ \frac{\partial O1}{\partial Z1} = 0.235\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.6225 * 0.235\\ \frac{\partial Cost}{\partial Z1} = 0.1463\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.1463$

Just like with neuron 0, we can use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3 by multiplying it by dZ1/dWeight2 and dZ1/dWeight3, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2} \\ \frac{\partial Cost}{\partial Weight2} = 0.1463 * 0 \\ \frac{\partial Cost}{\partial Weight2} = 0\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3} \\ \frac{\partial Cost}{\partial Weight3} = 0.1463 * 1 \\ \frac{\partial Cost}{\partial Weight3} = 0.1463$

After using these derivatives to update the weights and biases with a learning rate of 0.5, they become:
Weight0 = 0.2
Weight1 = 0.818
Weight2 = 0.6
Weight3 = 0.3269
Bias0 = 0.518
Bias1 = 0.0269

Using these values, the cost becomes 0.1943, which dropped from 0.2167, so we have indeed made progress with our learning!

Interestingly, it takes about twice as many trainings as example 1 to get a similar level of error. In this case, 20,000 iterations of learning results in an error of 0.007142.

If we have the network learn the four patterns below instead:
00 = 00
01 = 10
10 = 10
11 = 11

It takes 520,000 learning iterations to get to an error of 0.007223.

Backpropagation Example 4: Two Layers, Two Neurons Each

This is the last example, implemented as Example4() in the sample code. Two input neurons feed to two neurons in a hidden layer, feeding into two neurons in the output layer giving two outputs. This is the exact same network that is walked through on this page which is also linked to at the end of this post: A Step by Step Backpropagation Example

First comes the forward pass where we calculate the network’s output. We’ll give it 0.05 and 0.1 as input, and we’ll say our desired output is 0.01 and 0.99.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0.05*0.15+0.1*0.2+0.35 \\ Z0=0.3775 \\ \\ O0=\sigma(0.3775) \\ O0=0.5933 \\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0.05*0.25+0.1*0.3+0.35\\ Z1=0.3925\\ \\ O1=\sigma(0.3925)\\ O1=0.5969\\ \\ Z2=O0*weight4+O1*weight5+bias2\\ Z2=0.5933*0.4+0.5969*0.45+0.6\\ Z2=1.106\\ \\ O2=\sigma(1.106)\\ O2=0.7514\\ \\ Z3=O0*weight6+O1*weight7+bias3\\ Z3=0.5933*0.5+0.5969*0.55+0.6\\ Z3=1.225\\ \\ O3=\sigma(1.225)\\ O3=0.7729$

Next we calculate the cost, taking O2 and O3 as our actual output, and 0.01 and 0.99 as our target (desired) output.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(0.01,0.99)-(0.7514,0.7729)||^2\\ Cost=0.5*||(-0.7414,-0.2171)||^2\\ Cost=0.5*0.7725^2\\ Cost=0.2984$

Now we start the backward pass to calculate the derivatives for training.

Neuron 2

First we’ll calculate dCost/dZ2 aka the error in neuron 2, remembering that the value is also dCost/dBias2.

$\frac{\partial Cost}{\partial O2}=O2-target0\\ \frac{\partial Cost}{\partial O2}=0.7514-0.01\\ \frac{\partial Cost}{\partial O2}=0.7414\\ \\ \frac{\partial O2}{\partial Z2} = O2 * (1-O2)\\ \frac{\partial O2}{\partial Z2} = 0.7514 * 0.2486\\ \frac{\partial O2}{\partial Z2} = 0.1868\\ \\ \frac{\partial Cost}{\partial Z2} = \frac{\partial Cost}{\partial O2} * \frac{\partial O2}{\partial Z2}\\ \frac{\partial Cost}{\partial Z2} = 0.7414 * 0.1868\\ \frac{\partial Cost}{\partial Z2} = 0.1385\\ \\ \frac{\partial Cost}{\partial Bias2} = 0.1385$

We can use dCost/dZ2 to calculate dCost/dWeight4 and dCost/dWeight5.

$\frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight4}\\ \frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * O0\\ \frac{\partial Cost}{\partial Weight4} = 0.1385 * 0.5933\\ \frac{\partial Cost}{\partial Weight4} = 0.0822\\ \\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight5}\\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * O1\\ \frac{\partial Cost}{\partial Weight5} = 0.1385 * 0.5969\\ \frac{\partial Cost}{\partial Weight5} = 0.0827\\$

Neuron 3

Next we’ll calculate dCost/dZ3 aka the error in neuron 3, which is also dCost/dBias3.

$\frac{\partial Cost}{\partial O3}=O3-target1\\ \frac{\partial Cost}{\partial O3}=0.7729-0.99\\ \frac{\partial Cost}{\partial O3}=-0.2171\\ \\ \frac{\partial O3}{\partial Z3} = O3 * (1-O3)\\ \frac{\partial O3}{\partial Z3} = 0.7729 * 0.2271\\ \frac{\partial O3}{\partial Z3} = 0.1755\\ \\ \frac{\partial Cost}{\partial Z3} = \frac{\partial Cost}{\partial O3} * \frac{\partial O3}{\partial Z3}\\ \frac{\partial Cost}{\partial Z3} = -0.2171 * 0.1755\\ \frac{\partial Cost}{\partial Z3} = -0.0381\\ \\ \frac{\partial Cost}{\partial Bias3} = -0.0381$

We can use dCost/dZ3 to calculate dCost/dWeight6 and dCost/dWeight7.

$\frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight6}\\ \frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * O0\\ \frac{\partial Cost}{\partial Weight6} = -0.0381 * 0.5933\\ \frac{\partial Cost}{\partial Weight6} = -0.0226\\ \\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight7}\\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * O1\\ \frac{\partial Cost}{\partial Weight7} = -0.0381 * 0.5969\\ \frac{\partial Cost}{\partial Weight7} = -0.0227\\$

Neuron 0

Next, we want to calculate dCost/dO0, but doing that requires us to do something new. Neuron 0 affects both neuron 2 and neuron 3, which means that it affects the cost through those two neurons as well. That means our calculation for dCost/dO0 is going to be slightly different, where we add the derivatives of both paths together. Let’s work through it:

$\frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O0} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O0}\\ \frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * Weight4 + \frac{\partial Cost}{\partial Z3} * Weight6\\ \frac{\partial Cost}{\partial O0} = 0.1385 * 0.4 - 0.0381 * 0.5\\ \frac{\partial Cost}{\partial O0} = 0.0364$

We can then continue and calculate dCost/dZ0, which is also dCost/dBias0, and the error in neuron 0.

$\frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.5933 * 0.4067\\ \frac{\partial O0}{\partial Z0} = 0.2413\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = 0.0364 * 0.2413\\ \frac{\partial Cost}{\partial Z0} = 0.0088\\ \\ \frac{\partial Cost}{\partial Bias0} = 0.0088$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0}\\ \frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * input0\\ \frac{\partial Cost}{\partial Weight0} = 0.0088 * 0.05\\ \frac{\partial Cost}{\partial Weight0} = 0.0004\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1}\\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * input1\\ \frac{\partial Cost}{\partial Weight1} = 0.0088 * 0.1\\ \frac{\partial Cost}{\partial Weight1} = 0.0009\\$

Neuron 1

We are almost done, so hang in there. For our home stretch, we need to calculate dCost/dO1 similarly to how we calculated dCost/dO0, and then use that to calculate the derivatives of bias1 and weight2 and weight3.

$\frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O1} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O1}\\ \frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * Weight5 + \frac{\partial Cost}{\partial Z3} * Weight7\\ \frac{\partial Cost}{\partial O1} = 0.1385 * 0.45 - 0.0381 * 0.55\\ \frac{\partial Cost}{\partial O1} = 0.0414\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.5969 * 0.4031\\ \frac{\partial O1}{\partial Z1} = 0.2406\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.0414 * 0.2406\\ \frac{\partial Cost}{\partial Z1} = 0.01\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.01$

Lastly, we will use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2}\\ \frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * input0\\ \frac{\partial Cost}{\partial Weight2} = 0.01 * 0.05\\ \frac{\partial Cost}{\partial Weight2} = 0.0005\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3}\\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * input1\\ \frac{\partial Cost}{\partial Weight3} = 0.01 * 0.1\\ \frac{\partial Cost}{\partial Weight3} = 0.001\\$

Backpropagation Done

Phew, we have all the derivatives we need now.

Here’s our new weights and biases using a learning rate of 0.5:

Weight0 = 0.15 – (0.5 * 0.0004) = 0.1498
Weight1 = 0.2 – (0.5 * 0.0009) = 0.1996
Weight2 = 0.25 – (0.5 * 0.0005) = 0.2498
Weight3 = 0.3 – (0.5 * 0.001) = 0.2995
Weight4 = 0.4 – (0.5 * 0.0822) = 0.3589
Weight5 = 0.45 – (0.5 * 0.0827) = 0.4087
Weight6 = 0.5 – (0.5 * -0.0226) = 0.5113
Weight7 = 0.55 – (0.5 * -0.0227) = 0.5614
Bias0 = 0.35 – (0.5 * 0.0088) = 0.3456
Bias1 = 0.35 – (0.5 * 0.01) = 0.345
Bias2 = 0.6 – (0.5 * 0.1385) = 0.5308
Bias3 = 0.6 – (0.5 * -0.0381) = 0.6191

Using these new values, the cost function value drops from 0.2984 to 0.2839, so we have made progress!

Interestingly, it only takes 5,000 iterations of learning for this network to reach an error of 0.007157, when it took 10,000 iterations of learning for example 1 to get to 0.007176.

Before moving on, take a look at the weight adjustments above. You might notice that the derivatives for weights 0,1,2,3 are much smaller than the derivatives for weights 4,5,6,7. This is because weights 0,1,2,3 appear earlier in the network, and earlier layer neurons don’t learn as fast as later layer neurons. That is caused by the nature of the neuron activation functions – specifically, that the sigmoid function has a long tail near 0 and 1 – and is called the “vanishing gradient problem”. The opposite effect can also happen, where earlier layer gradients explode to huge numbers, so the more general term is the “unstable gradient problem”. How to address this is an active area of research, and it becomes more and more of a problem the more layers you have in your network.

You can use other activation functions such as tanh, identity, relu and others to try to get around this problem. If you try a different activation function, the forward pass (evaluation of the neural network) and the backpropagation of error work the same way, but the calculation for getting O from Z changes, and so does the derivative deltaO/deltaZ. Everything else remains the same.

Sample Code

Below is the sample code which implements all the backpropagation examples we worked through above.

Note that this code is meant to be readable and understandable. The code is not meant to be re-usable or highly efficient.

A more efficient implementation would use SIMD instructions, multithreading, stochastic gradient descent, and other things.

It’s also useful to note that calculating a neuron’s Z value is actually a dot product and an addition and that the addition can be handled within the dot product by adding a “fake input” to each neuron that is a constant of 1. This lets you do a dot product to calculate the Z value of a neuron, which you can take further and combine into matrix operations to calculate multiple neuron values at once. You’ll often see neural networks described in matrix notation because of this, but I have avoided that in this post to try and make things more clear to programmers who may not be as comfortable thinking in strictly matrix notation.

#include <stdio.h>
#include <cmath>
#include <cstdlib>
#include <array>

// Nonzero value enables csv logging.
#define LOG_TO_CSV_NUMSAMPLES() 50

// ===== Example 1 - One Neuron, One training Example =====

void Example1RunNetwork (
float input, float desiredOutput,
float weight, float bias,
float& error, float& cost, float& actualOutput,
float& deltaCost_deltaWeight, float& deltaCost_deltaBias, float& deltaCost_deltaInput
) {
// calculate Z (weighted input) and O (activation function of weighted input) for the neuron
float Z = input * weight + bias;
float O = 1.0f / (1.0f + std::exp(-Z));

// the actual output of the network is the activation of the neuron
actualOutput = O;

// calculate error
error = std::abs(desiredOutput - actualOutput);

// calculate cost
cost = 0.5f * error * error;

// calculate how much a change in neuron activation affects the cost function
// deltaCost/deltaO = O - target
float deltaCost_deltaO = O - desiredOutput;

// calculate how much a change in neuron weighted input affects neuron activation
// deltaO/deltaZ = O * (1 - O)
float deltaO_deltaZ = O * (1 - O);

// calculate how much a change in a neuron's weighted input affects the cost function.
// This is deltaCost/deltaZ, which equals deltaCost/deltaO * deltaO/deltaZ
// This is also deltaCost/deltaBias and is also referred to as the error of the neuron
float neuronError = deltaCost_deltaO * deltaO_deltaZ;
deltaCost_deltaBias = neuronError;

// calculate how much a change in the weight affects the cost function.
// deltaCost/deltaWeight = deltaCost/deltaO * deltaO/deltaZ * deltaZ/deltaWeight
// deltaCost/deltaWeight = neuronError * deltaZ/deltaWeight
// deltaCost/deltaWeight = neuronError * input
deltaCost_deltaWeight = neuronError * input;

// As a bonus, calculate how much a change in the input affects the cost function.
// Follows same logic as deltaCost/deltaWeight, but deltaZ/deltaInput is the weight.
// deltaCost/deltaInput = neuronError * weight
deltaCost_deltaInput = neuronError * weight;
}

void Example1 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example1.csv","w+t");
if (file != nullptr)
fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 10000;

// training data
// input: 1, output: 0
const std::array<float, 2> c_trainingData = {1.0f, 0.0f};

// starting weight and bias values
float weight = 0.3f;
float bias = 0.5f;

// iteratively train the network
float error = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
// run the network to get error and derivatives
float output = 0.0f;
float cost = 0.0f;
float deltaCost_deltaWeight = 0.0f;
float deltaCost_deltaBias = 0.0f;
float deltaCost_deltaInput = 0.0f;
Example1RunNetwork(c_trainingData[0], c_trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, error, cost, weight, bias, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);
}
#endif

weight -= deltaCost_deltaWeight * c_learningRate;
bias -= deltaCost_deltaBias * c_learningRate;
}

printf("Example1 Final Error: %fn", error);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 2 - One Neuron, Two training Examples =====

void Example2 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example2.csv","w+t");
if (file != nullptr)
fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 100000;

// training data
// input: 1, output: 0
// input: 0, output: 1
const std::array<std::array<float, 2>, 2> c_trainingData = { {
{1.0f, 0.0f},
{0.0f, 1.0f}
} };

// starting weight and bias values
float weight = 0.3f;
float bias = 0.5f;

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
avgError = 0.0f;
float avgOutput = 0.0f;
float avgCost = 0.0f;
float avgDeltaCost_deltaWeight = 0.0f;
float avgDeltaCost_deltaBias = 0.0f;
float avgDeltaCost_deltaInput = 0.0f;

// run the network to get error and derivatives for each training example
for (const std::array<float, 2>& trainingData : c_trainingData)
{
float error = 0.0f;
float output = 0.0f;
float cost = 0.0f;
float deltaCost_deltaWeight = 0.0f;
float deltaCost_deltaBias = 0.0f;
float deltaCost_deltaInput = 0.0f;
Example1RunNetwork(trainingData[0], trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

avgError += error;
avgOutput += output;
avgCost += cost;
avgDeltaCost_deltaWeight += deltaCost_deltaWeight;
avgDeltaCost_deltaBias += deltaCost_deltaBias;
avgDeltaCost_deltaInput += deltaCost_deltaInput;
}

avgError /= (float)c_trainingData.size();
avgOutput /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
avgDeltaCost_deltaWeight /= (float)c_trainingData.size();
avgDeltaCost_deltaBias /= (float)c_trainingData.size();
avgDeltaCost_deltaInput /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, avgError, avgCost, weight, bias, avgDeltaCost_deltaWeight, avgDeltaCost_deltaBias, avgDeltaCost_deltaInput);
}
#endif

weight -= avgDeltaCost_deltaWeight * c_learningRate;
bias -= avgDeltaCost_deltaBias * c_learningRate;
}

printf("Example2 Final Error: %fn", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 3 - Two inputs, two neurons in one layer =====

struct SExample3Training
{
std::array<float, 2> m_input;
std::array<float, 2> m_output;
};

void Example3RunNetwork (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, 4>& weights, const std::array<float, 2>& biases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, 4>& deltaCost_deltaWeights, std::array<float, 2>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {

// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// the actual output of the network is the activation of the neurons
actualOutput[0] = O0;
actualOutput[1] = O1;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = O0 - target0
float deltaCost_deltaO0 = O0 - desiredOutput[0];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also referred to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaCost_deltaBiases[0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaCost_deltaWeights[0] = neuron0Error * input[0];
deltaCost_deltaWeights[1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = O1 - target1
float deltaCost_deltaO1 = O1 - desiredOutput[1];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO1/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also referred to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaCost_deltaBiases[1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaCost_deltaWeights[2] = neuron1Error * input[0];
deltaCost_deltaWeights[3] = neuron1Error * input[1];

//----- Input -----

// As a bonus, calculate how much a change in the inputs affects the cost function.
// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
// That means that...
// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
//                       = neuron0Error * weight0 + neuron1Error * weight2
// and
// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
//                       = neuron0Error * weight1 + neuron1Error * weight3
deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example3 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example3.csv","w+t");
if (file != nullptr)
fprintf(file, ""training index","error","cost"n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 520000;

// training data: OR/AND
// input: 00, output: 00
// input: 01, output: 10
// input: 10, output: 10
// input: 11, output: 11
const std::array<SExample3Training, 4> c_trainingData = { {
{{0.0f, 0.0f}, {0.0f, 0.0f}},
{{0.0f, 1.0f}, {1.0f, 0.0f}},
{{1.0f, 0.0f}, {1.0f, 0.0f}},
{{1.0f, 1.0f}, {1.0f, 1.0f}},
} };

// starting weight and bias values
std::array<float, 4> weights = { 0.2f, 0.8f, 0.6f, 0.4f };
std::array<float, 2> biases = { 0.5f, 0.1f };

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
std::array<float, 2> avgOutput = { 0.0f, 0.0f };
std::array<float, 4> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaBiases = { 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
avgError = 0.0f;
float avgCost = 0.0;

// run the network to get error and derivatives for each training example
for (const SExample3Training& trainingData : c_trainingData)
{
float error = 0.0f;
std::array<float, 2> output = { 0.0f, 0.0f };
float cost = 0.0f;
std::array<float, 4> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaBiases = { 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
Example3RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

avgError += error;
avgCost += cost;
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] += output[i];
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
}

avgError /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
}
#endif

for (size_t i = 0; i < weights.size(); ++i)
weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
for (size_t i = 0; i < biases.size(); ++i)
biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
}

printf("Example3 Final Error: %fn", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

// ===== Example 4 - Two layers with two neurons in each layer =====

void Example4RunNetwork (
const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
const std::array<float, 8>& weights, const std::array<float, 4>& biases,
float& error, float& cost, std::array<float, 2>& actualOutput,
std::array<float, 8>& deltaCost_deltaWeights, std::array<float, 4>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {
// calculate Z0 and O0 for neuron0
float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
float O0 = 1.0f / (1.0f + std::exp(-Z0));

// calculate Z1 and O1 for neuron1
float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
float O1 = 1.0f / (1.0f + std::exp(-Z1));

// calculate Z2 and O2 for neuron2
float Z2 = O0 * weights[4] + O1 * weights[5] + biases[2];
float O2 = 1.0f / (1.0f + std::exp(-Z2));

// calculate Z3 and O3 for neuron3
float Z3 = O0 * weights[6] + O1 * weights[7] + biases[3];
float O3 = 1.0f / (1.0f + std::exp(-Z3));

// the actual output of the network is the activation of the output layer neurons
actualOutput[0] = O2;
actualOutput[1] = O3;

// calculate error
float diff0 = desiredOutput[0] - actualOutput[0];
float diff1 = desiredOutput[1] - actualOutput[1];
error = std::sqrt(diff0*diff0 + diff1*diff1);

// calculate cost
cost = 0.5f * error * error;

//----- Neuron 2 -----

// calculate how much a change in neuron 2 activation affects the cost function
// deltaCost/deltaO2 = O2 - target0
float deltaCost_deltaO2 = O2 - desiredOutput[0];

// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
// deltaO2/deltaZ2 = O2 * (1 - O2)
float deltaO2_deltaZ2 = O2 * (1 - O2);

// calculate how much a change in neuron 2 weighted input affects the cost function.
// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
// This is also deltaCost/deltaBias2 and is also referred to as the error of neuron 2
float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
deltaCost_deltaBiases[2] = neuron2Error;

// calculate how much a change in weight4 affects the cost function.
// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * deltaZ2/deltaWeight4
// deltaCost/deltaWeight4 = neuron2Error * O0
// similar thing for weight5
deltaCost_deltaWeights[4] = neuron2Error * O0;
deltaCost_deltaWeights[5] = neuron2Error * O1;

//----- Neuron 3 -----

// calculate how much a change in neuron 3 activation affects the cost function
// deltaCost/deltaO3 = O3 - target1
float deltaCost_deltaO3 = O3 - desiredOutput[1];

// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
// deltaO3/deltaZ3 = O3 * (1 - O3)
float deltaO3_deltaZ3 = O3 * (1 - O3);

// calculate how much a change in neuron 3 weighted input affects the cost function.
// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
// This is also deltaCost/deltaBias3 and is also referred to as the error of neuron 3
float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
deltaCost_deltaBiases[3] = neuron3Error;

// calculate how much a change in weight6 affects the cost function.
// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * deltaZ3/deltaWeight6
// deltaCost/deltaWeight6 = neuron3Error * O0
// similar thing for weight7
deltaCost_deltaWeights[6] = neuron3Error * O0;
deltaCost_deltaWeights[7] = neuron3Error * O1;

//----- Neuron 0 -----

// calculate how much a change in neuron 0 activation affects the cost function
// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3Error * weight6
float deltaCost_deltaO0 = neuron2Error * weights[4] + neuron3Error * weights[6];

// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
// deltaO0/deltaZ0 = O0 * (1 - O0)
float deltaO0_deltaZ0 = O0 * (1 - O0);

// calculate how much a change in neuron 0 weighted input affects the cost function.
// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
// This is also deltaCost/deltaBias0 and is also referred to as the error of neuron 0
float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
deltaCost_deltaBiases[0] = neuron0Error;

// calculate how much a change in weight0 affects the cost function.
// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
// deltaCost/deltaWeight0 = neuron0Error * input0
// similar thing for weight1
deltaCost_deltaWeights[0] = neuron0Error * input[0];
deltaCost_deltaWeights[1] = neuron0Error * input[1];

//----- Neuron 1 -----

// calculate how much a change in neuron 1 activation affects the cost function
// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3Error * weight7
float deltaCost_deltaO1 = neuron2Error * weights[5] + neuron3Error * weights[7];

// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
// deltaO1/deltaZ1 = O1 * (1 - O1)
float deltaO1_deltaZ1 = O1 * (1 - O1);

// calculate how much a change in neuron 1 weighted input affects the cost function.
// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
// This is also deltaCost/deltaBias1 and is also referred to as the error of neuron 1
float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
deltaCost_deltaBiases[1] = neuron1Error;

// calculate how much a change in weight2 affects the cost function.
// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * deltaZ1/deltaWeight2
// deltaCost/deltaWeight2 = neuron1Error * input0
// similar thing for weight3
deltaCost_deltaWeights[2] = neuron1Error * input[0];
deltaCost_deltaWeights[3] = neuron1Error * input[1];

//----- Input -----

// As a bonus, calculate how much a change in the inputs affects the cost function.
// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
// That means that...
// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
//                       = neuron0Error * weight0 + neuron1Error * weight2
// and
// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
//                       = neuron0Error * weight1 + neuron1Error * weight3
deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example4 ()
{
#if LOG_TO_CSV_NUMSAMPLES() > 0
// open the csv file for this example
FILE *file = fopen("Example4.csv","w+t");
if (file != nullptr)
fprintf(file, ""training index","error","cost"n");
#endif

// learning parameters for the network
const float c_learningRate = 0.5f;
const size_t c_numTrainings = 5000;

// training data: 0.05, 0.1 in = 0.01, 0.99 out
const std::array<SExample3Training, 1> c_trainingData = { {
{{0.05f, 0.1f}, {0.01f, 0.99f}},
} };

// starting weight and bias values
std::array<float, 8> weights = { 0.15f, 0.2f, 0.25f, 0.3f, 0.4f, 0.45f, 0.5f, 0.55f};
std::array<float, 4> biases = { 0.35f, 0.35f, 0.6f, 0.6f };

// iteratively train the network
float avgError = 0.0f;
for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
{
std::array<float, 2> avgOutput = { 0.0f, 0.0f };
std::array<float, 8> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 4> avgDeltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
avgError = 0.0f;
float avgCost = 0.0;

// run the network to get error and derivatives for each training example
for (const SExample3Training& trainingData : c_trainingData)
{
float error = 0.0f;
std::array<float, 2> output = { 0.0f, 0.0f };
float cost = 0.0f;
std::array<float, 8> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 4> deltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
Example4RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

avgError += error;
avgCost += cost;
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] += output[i];
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
}

avgError /= (float)c_trainingData.size();
avgCost /= (float)c_trainingData.size();
for (size_t i = 0; i < avgOutput.size(); ++i)
avgOutput[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

#if LOG_TO_CSV_NUMSAMPLES() > 0
const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
{
// log to the csv
fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
}
#endif

for (size_t i = 0; i < weights.size(); ++i)
weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
for (size_t i = 0; i < biases.size(); ++i)
biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
}

printf("Example4 Final Error: %fn", avgError);

#if LOG_TO_CSV_NUMSAMPLES() > 0
if (file != nullptr)
fclose(file);
#endif
}

int main (int argc, char **argv)
{
Example1();
Example2();
Example3();
Example4();
system("pause");
return 0;
}


The sample code outputs csv files showing how the values of the networks change over time. One reason for this is that I want to show you error over time.

Below is example 4’s error over time, as we do its 5,000 learning iterations.

The other examples show a similarly shaped graph, where there is a lot of learning in the very beginning, and then there is a very long tail of learning very slowly.

When you train neural networks as I’ve described them, you will almost always see this, and sometimes will also see a slow learning time at the BEGINNING of the training.

This issue is also due to the activation function used, just like the unstable gradient problem, and is also an active area of research.

To help fix this issue, there is something called a “cross entropy cost function” which you can use instead of the mean squared error cost function I have been using.

That cost function essentially cancels out the non-linearity of the activation function, so you get nicer, more linear learning progress, and networks can learn more quickly and evenly. However, it only cancels out the non-linearity for the LAST layer in the network, so this is still a problem for networks that have more layers.

Lastly, there is an entirely different thing you can use backpropagation for. We adjusted the weights and biases to get our desired output for the desired inputs. What if instead we adjusted our inputs to give us the desired outputs?

You can do that by using backpropagation to calculate the dCost/dInput derivatives and using those to adjust the input, in the exact same way we adjusted the weights and biases.

You can use this to do some interesting things, including:

1. Finding images that a network will recognize as a familiar object, that a human wouldn’t. Start with static as input to the network, and adjust the inputs to give the desired output.
2. Modifying images that a network recognizes, into images it doesn’t recognize, but a human still would. Start with a well recognized image, and adjust the inputs using gradient ASCENT (add the derivatives, don’t subtract them) until the network stops recognizing it.

Believe it or not, this is how all those creepy “deep dream” images were made that came out of google as well, like the one below.

Now that you know the basics, you are ready to learn some more if you are interested. If you still have some questions about things I did or didn’t talk about, these resources might help you make sense of it too. I used these resources and they were all very helpful! You can also give me a shout in the comments below, or on twitter at @Atrix256.

Multivariable Dual Numbers & Automatic Differentiation

In a previous post I showed how to use dual numbers to be able to get both the value and derivative of a function at the same time:
Dual Numbers & Automatic Differentiation

That post mentions that you can extend it to multivariable functions but doesn’t explain how. This post is that explanation, including simple working C++ code!

Extending this to multivariable functions is useful for ray marching, calculating analytical surface normals and also likely useful for training a neural network if you want an alternative to back propagation. I’m not sure about the efficiency comparison of this versus back propagation but I intend on looking into it (:

How Does it Work?

It turns out to be really simple to use dual numbers with multivariable functions. The end result is that you want a partial derivative for each variable in the equation, so to do that, you just have a dual number per variable, and process the entire equation for each of those dual numbers separately.

We’ll work through an example. Let’s find the partial derivatives of x and y of the function $3x^2-2y^3$, at input (5,2).

We’ll start by finding the derivative of x, and then the derivative of y.

Example: df/dx

We start by making a dual number for our x value, remembering that the real part is the actual value for x, and the dual part is the derivative of x, which is 1:

$5+1\epsilon$

or:

$5+\epsilon$

We multiply that value by itself to get the $x^2$ value, keeping in mind that $\epsilon^2$ is zero:
$(5+\epsilon)*(5+\epsilon)= \\ 25+10\epsilon+\epsilon^2= \\ 25+10\epsilon \\$

Next we need to multiply that by 3 to get the $3x^2$ term:

$3*(25+10\epsilon) = 75+30\epsilon$

Putting that aside for a moment, we need to make the $2y^3$ term. We start by making our y value:

$2+0\epsilon$

or:

$2$

If you are wondering why it has a zero for the epsilon term, it’s because when we are calculating the partial derivative of x, y is a constant, so has a derivative of zero.

Next, we multiply this y value by itself twice to get the $y^3$ value:

$2*2*2=8$

We then multiply it by 2 to get the $2y^3$ term:

$8*2=16$

Now that we have our two terms, we subtract the y term from the x term to get our final result:

$75+30\epsilon-16 = \\ 59+30\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30.

That checks out, let’s move on to the derivative of y!

Example: df/dy

Calculating the derivative of y is very similar to calculating the derivative of x, except that it’s the x term that has an epsilon value (derivative) of 0, instead of the y term. The y term has the epsilon value of 1 this time as well. We’ll work through it to see how it plays out.

First up, we need to make the value for x:

$5+0\epsilon$

or:

$5$

Next we square it and multiply it by 3 to get the $3x^2$ term:

$5*5*3=75$

Next we need to make the value for y, remembering that we use an epsilon value of 1 this time, since we are taking the derivative with respect to y, and the derivative of y with respect to itself is 1.

$2+\epsilon$

We cube that value and multiply by 2 to get the $2y^3$ term:
$2*(2+\epsilon)*(2+\epsilon)*(2+\epsilon)= \\ 2*(2+\epsilon)*(4+4\epsilon+\epsilon^2)= \\ 2*(2+\epsilon)*(4+4\epsilon)= \\ 2*(8+12\epsilon+4\epsilon^2)= \\ 2*(8+12\epsilon)= \\ 16+24\epsilon$

Now we subtract the y term from the x term to get the final result:

$75 - (16+24\epsilon)= \\ 59-24\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of y at that point is -24.

That also checks out, so we got the correct value and partial derivatives for the equation.

Reducing Redundancy

There was quite a bit of redundancy when working through the x and y derivatives wasn’t there? Increasing the number of variables will increase the amount of redundancy too, so it doesn’t scale up well.

Luckily there is a way to address this. Instead of making two separate dual numbers, you make them share the real value (since it’s the same for both, as is the work to calculate it) and give each variable its own dual component. So instead of:

$x'=(a+b\epsilon) \\ y'=(a+c\epsilon)$

You have a single three component number:

$(a+b\epsilon_x+c\epsilon_y)$

Then, in your math or in your program, you treat it as if it’s two different dual numbers packed into one. This lets you do the work for the real number once instead of twice, but still lets you do your dual number work for each variable independently.

While it’s probably easiest to think of these as two dual numbers packed into one value, there is actually a mathematical basis for it as well, which may or may not surprise you.

Check out what happens when we multiply two of these together, keeping in mind that multiplying ANY two epsilon values together becomes zero, even if they are different epsilons:

$(a+b\epsilon_x+c\epsilon_y) * (d+e\epsilon_x+f\epsilon_y)= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + be\epsilon_x^2 + bf\epsilon_x\epsilon_y + cd\epsilon_y + ce\epsilon_x\epsilon_y + cf\epsilon_y^2= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + cd\epsilon_y= \\ ad + (ae+bd)\epsilon_x + (af+cd)\epsilon_y$

The interesting thing is that the above result gives you the same values as if you did the same work for two dual numbers individually.

Let’s see this three component dual number in action by re-doing the example again. Note that this pattern scales up to ANY number of variables!

Our goal is to calculate the value and partial derivatives of the function $3x^2-2y^3$ at location (5,2).

First we make our x value:

$5 + 1\epsilon_x + 0\epsilon_y$

or:

$5 + \epsilon_x$

We square that and multiply it by 3 to get our $3x^2$ term:

$3*(5 + \epsilon_x)*(5 + \epsilon_x)= \\ 3*(25+10\epsilon_x+\epsilon_x^2)= \\ 3*(25+10\epsilon_x)= \\ 75+30\epsilon_x$

Next, we make our y value:

$2 + 0\epsilon_x + 1\epsilon_y$

or:

$2 + \epsilon_y$

We cube it and multiply it by 2 to get our $2y^3$ term:

$16+24\epsilon_y$

Lastly we subtract the y term from the x term to get our final answer:

$(75+30\epsilon_x) - (16+24\epsilon_y)= \\ 59+30\epsilon_x-24\epsilon_y$

The result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30, and the derivative of y at that point is -24.

Neat, right?!

Example Code

Here is the example code output:

Here is the code that generated it:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>

#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

// constructor to make a constant
CDualNumber (float f = 0.0f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
CDualNumber (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// storage for real and dual values
float							m_real;
std::array<float, NUMVARIABLES> m_dual;
};

//----------------------------------------------------------------------
// Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real + b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real - b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real * b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real / b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
float sqrtReal = sqrt(a.m_real);
ret.m_real = sqrtReal;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = pow(a.m_real, y);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = sin(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = cos(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = atan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Test Functions
//----------------------------------------------------------------------

void TestSmoothStep (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// calculate value and derivative using dual numbers
CDualNumber<1> y = SmoothStep(x);

// calculate numeric derivative using central differences
float derivNumeric = (SmoothStep(input + EPSILON) - SmoothStep(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 6.0f * input - 6.0f * input * input;

// show value and derivatives
printf("(smoothstep) y=3x^2-2x^3  (x=%0.4f)\n", input);
printf("  y = %0.4f\n", y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

void TestTrig (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// sin
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = sin(x);

// calculate numeric derivative using central differences
float derivNumeric = (sin(input + EPSILON) - sin(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = cos(input);

// show value and derivatives
printf("sin(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// cos
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = cos(x);

// calculate numeric derivative using central differences
float derivNumeric = (cos(input + EPSILON) - cos(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = -sin(input);

// show value and derivatives
printf("cos(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// tan
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = tan(x);

// calculate numeric derivative using central differences
float derivNumeric = (tan(input + EPSILON) - tan(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.0f / (cos(input)*cos(input));

// show value and derivatives
printf("tan(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// atan
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = atan(x);

// calculate numeric derivative using central differences
float derivNumeric = (atan(input + EPSILON) - atan(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.0f / (1.0f + input * input);

// show value and derivatives
printf("atan(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}
}

void TestSimple (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// sqrt
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = CDualNumber<1>(3.0f) / sqrt(x);

// calculate numeric derivative using central differences
float derivNumeric = ((3.0f / sqrt(input + EPSILON)) - (3.0f / sqrt(input - EPSILON))) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = -3.0f / (2.0f * pow(input, 3.0f / 2.0f));

// show value and derivatives
printf("3/sqrt(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// pow
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = pow(x + CDualNumber<1>(1.0f), 1.337f);

// calculate numeric derivative using central differences
float derivNumeric = ((pow(input + 1.0f + EPSILON, 1.337f)) - (pow(input + 1.0f - EPSILON, 1.337f))) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.337f * pow(input + 1.0f, 0.337f);

// show value and derivatives
printf("(%0.4f+1)^1.337 = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}
}

void Test2D (float inputx, float inputy)
{
// create dual numbers as the value of x and y
CDualNumber<2> x(inputx, 0);
CDualNumber<2> y(inputy, 1);

// z = 3x^2 - 2y^3
{
// calculate value and partial derivatives using dual numbers
CDualNumber<2> z = CDualNumber<2>(3.0f) * x * x - CDualNumber<2>(2.0f) * y * y * y;

// calculate numeric partial derivatives using central differences
auto f = [] (float x, float y) {
return 3.0f * x * x - 2.0f * y * y * y;
};
float derivNumericX = (f(inputx + EPSILON, inputy) - f(inputx - EPSILON, inputy)) / (2.0f * EPSILON);
float derivNumericY = (f(inputx, inputy + EPSILON) - f(inputx, inputy - EPSILON)) / (2.0f * EPSILON);

// calculate actual partial derivatives
float derivActualX = 6.0f * inputx;
float derivActualY = -6.0f * inputy * inputy;

// show value and derivatives
printf("z=3x^2-2y^3 (x = %0.4f, y = %0.4f)\n", inputx, inputy);
printf("  z = %0.4f\n", z.m_real);
printf("  dual# dz/dx = %0.4f\n", z.m_dual[0]);
printf("  dual# dz/dy = %0.4f\n", z.m_dual[1]);
printf("  actual dz/dx = %0.4f\n", derivActualX);
printf("  actual dz/dy = %0.4f\n", derivActualY);
printf("  numeric dz/dx = %0.4f\n", derivNumericX);
printf("  numeric dz/dy = %0.4f\n\n", derivNumericY);
}
}

void Test3D (float inputx, float inputy, float inputz)
{
// create dual numbers as the value of x and y
CDualNumber<3> x(inputx, 0);
CDualNumber<3> y(inputy, 1);
CDualNumber<3> z(inputz, 2);

// w = sin(x*cos(2*y)) / tan(z)
{
// calculate value and partial derivatives using dual numbers
CDualNumber<3> w = sin(x * cos(CDualNumber<3>(2.0f)*y)) / tan(z);

// calculate numeric partial derivatives using central differences
auto f = [] (float x, float y, float z) {
return sin(x*cos(2.0f*y)) / tan(z);
};
float derivNumericX = (f(inputx + EPSILON, inputy, inputz) - f(inputx - EPSILON, inputy, inputz)) / (2.0f * EPSILON);
float derivNumericY = (f(inputx, inputy + EPSILON, inputz) - f(inputx, inputy - EPSILON, inputz)) / (2.0f * EPSILON);
float derivNumericZ = (f(inputx, inputy, inputz + EPSILON) - f(inputx, inputy, inputz - EPSILON)) / (2.0f * EPSILON);

// calculate actual partial derivatives
float derivActualX = cos(inputx*cos(2.0f*inputy))*cos(2.0f * inputy) / tan(inputz);
float derivActualY = cos(inputx*cos(2.0f*inputy)) *-2.0f*inputx*sin(2.0f*inputy) / tan(inputz);
float derivActualZ = sin(inputx * cos(2.0f * inputy)) / -(sin(inputz) * sin(inputz));

// show value and derivatives
printf("w=sin(x*cos(2*y))/tan(z) (x = %0.4f, y = %0.4f, z = %0.4f)\n", inputx, inputy, inputz);
printf("  w = %0.4f\n", w.m_real);
printf("  dual# dw/dx = %0.4f\n", w.m_dual[0]);
printf("  dual# dw/dy = %0.4f\n", w.m_dual[1]);
printf("  dual# dw/dz = %0.4f\n", w.m_dual[2]);
printf("  actual dw/dx = %0.4f\n", derivActualX);
printf("  actual dw/dy = %0.4f\n", derivActualY);
printf("  actual dw/dz = %0.4f\n", derivActualZ);
printf("  numeric dw/dx = %0.4f\n", derivNumericX);
printf("  numeric dw/dy = %0.4f\n", derivNumericY);
printf("  numeric dw/dz = %0.4f\n\n", derivNumericZ);
}
}

int main (int argc, char **argv)
{
TestSmoothStep(0.5f);
TestSmoothStep(0.75f);
TestTrig(PI * 0.25f);
TestSimple(3.0f);
Test2D(1.5f, 3.28f);
Test3D(7.12f, 8.93f, 12.01f);
return 0;
}


Closing

One of the neatest things about dual numbers is that they give precise results. They are not approximations and they are not numerical methods, unlike the central differences method that I compared them to in the example program (More info on numerical derivatives here: Finite Differences). Using dual numbers gives you exact derivatives, within the limitations of (eg) floating point math.

It turns out that backpropagation (the method that is commonly used to train neural networks) is just steepest gradient descent. You can read about that here: Backpropagation is Just Steepest Descent with Automatic Differentiation

That makes me wonder how dual numbers would do in run time speed compared to back propagation as well as numerical methods for getting the gradient to adjust a neural network during training.

If I had to guess, I’d say that dual numbers may be slightly slower than backpropagation, but not as slow as numerical methods which are going to be much, much slower. We’ll see though. It may be much easier to implement neural network learning using dual numbers compared to backpropagation, so may be worth an exploration and write up, even if only to make neural networks a little bit more accessible to people.

Comments, corrections, etc? Let me know in the comments below, or contact me on twitter at @Atrix256

Incremental Least Squares Surface and Hyper-Volume Fitting

The last post showed how to fit a $y=f(x)$ equation to a set of 2d data points, using least squares fitting. It allowed you to do this getting only one data point at a time, and still come up with the same answer as if you had all the data points at once, so it was an incremental, or “online” algorithm.

This post generalizes that process to equations of any dimension such as $z=f(x,y)$, $w=f(x,y,z)$ or greater.

Below is an image of a surface that is degree (2,2). This is a screenshot taken from the interactive webgl2 demo I made for this post: Least Squares Surface Fitting

How Do You Do It?

The two main players from the last post were the ATA matrix and the ATY vector. These two could be incrementally updated as new data points came in, which would allow you to do an incremental (or “online”) least squares fit of a curve.

When working with surfaces and volumes, you have the same things basically. Both the ATA matrix and the ATY vector still exist, but they contain slightly different information – which I’ll explain lower down. However, the ATY vector is renamed, since in the case of a surface it should be called ATZ, and for a volume it should be called ATW. I call it ATV to generalize it, where v stands for value, and represents the last component in a data point, which is the output value we are trying to fit given the input values. The input values are the rest of the components of the data point.

At the end, you once again need to solve the equation $A^TA*c=A^Tv$ to calculate the coefficients (named c) of the equation.

It’s all pretty similar to fitting a curve, but the details change a bit. Let’s work through an example to see how it differs.

Example: Bilinear Surface Fitting

Let’s fit 4 data points to a bilinear surface, otherwise known as a surface that is linear on each axis, or a surface of degree(1,1):
(0,0,5)
(0,1,3)
(1,0,8)
(1,1,2)

Since we are fitting those data points with a bilinear surface, we are looking for a function that takes in x,y values and gives as output the z coordinate. We want a surface that gives us the closest answer possible (minimizing the sum of the squared difference for each input data point) for the data points we do have, so that we can give it other data points and get z values as output that approximate what we expect to see for that input.

A linear equation looks like this, with coefficients A and B:
$y=Ax+B$

Since we want a bilinear equation this time around, this is the equation we are going to end up with, after solving for the coefficients A,B,C,D:
$y=Axy+Bx+Cy+D$

The first step is to make the A matrix. In the last post, this matrix was made up of powers of the x coordinates. In this post, they are actually going to be made up of the permutation of powers of the x and y coordinates.

Last time the matrix looked like this:
$A = \begin{bmatrix} x_0^0 & x_0^1 & x_0^2 \\ x_1^0 & x_1^1 & x_1^2 \\ x_2^0 & x_2^1 & x_2^2 \\ x_3^0 & x_3^1 & x_3^2 \\ \end{bmatrix}$

This time, the matrix is going to look like this:
$A = \begin{bmatrix} x_0^0y_0^0 & x_0^0y_0^1 & x_0^1y_0^0 & x_0^1y_0^1 \\ x_1^0y_1^0 & x_1^0y_1^1 & x_1^1y_1^0 & x_1^1y_1^1 \\ x_2^0y_2^0 & x_2^0y_2^1 & x_2^1y_2^0 & x_2^1y_2^1 \\ x_3^0y_3^0 & x_3^0y_3^1 & x_3^1y_3^0 & x_3^1y_3^1 \\ \end{bmatrix}$

Simplifying that matrix a bit, it looks like this:
$A = \begin{bmatrix} 1 & y_0 & x_0 & x_0y_0 \\ 1 & y_1 & x_1 & x_1y_1 \\ 1 & y_2 & x_2 & x_2y_2 \\ 1 & y_3 & x_3 & x_3y_3 \\ \end{bmatrix}$

To simplify it even further, there is one row in the A matrix per data point, where the row looks like this:
$\begin{bmatrix} 1 & y & x & xy \\ \end{bmatrix}$

You can see that every permutation of the powers of x and y for each data point is present in the matrix.

The A matrix for our data points is this:
$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix}$

Next we need to calculate the ATA matrix by multiplying the transpose of that matrix, by that matrix.

$A^TA = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} * \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix} = \begin{bmatrix} 4 & 2 & 2 & 1 \\ 2 & 2 & 1 & 1 \\ 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix}$

Taking the inverse of that matrix we get this:

$(A^TA)^{-1} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \\ \end{bmatrix}$

Next we need to calculate the ATV vector (formerly known as ATY). We calculate that by multiplying the transpose of the A matrix by the Z values:

$A^TV = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} * \begin{bmatrix} 5 \\ 3 \\ 8 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 18 \\ 5 \\ 10 \\ 2 \\ \end{bmatrix}$

Lastly we multiply the inversed ATA matrix by the ATV vector to get our coefficients.

$\begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \\ \end{bmatrix} * \begin{bmatrix} 18 \\ 5 \\ 10 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 5 \\ -2 \\ 3 \\ -4 \\ \end{bmatrix}$

In the last post, the coefficients we got out were in x power order, so the first (top) was for the $x^0$ term, the next was for the $x^1$ term etc.

This time around, the coefficients are in the same order as the permutations of the powers of x and y:
$\begin{bmatrix} 1 & y & x & xy \\ \end{bmatrix}$

That makes our final equation this:
$z = -4xy+3x-2y+5$

If you plug in the (x,y) values from the data set we fit, you’ll see that you get the corresponding z values as output. We perfectly fit the data set!

The process isn’t too different from last post and not too difficult either right?

Let’s see if we can generalize and formalize things a bit.

Some Observations

Firstly you may be wondering how we come up with the correct permutation of powers of our inputs. It actually doesn’t matter so long as you are consistent. You can have your A matrix rows have the powers in any order, so long as all orders are present, and you use the same order in all operations.

Regarding storage sizes needed, the storage of surfaces and (hyper) volumes are a bit different and generally larger than curves.

To see how, let’s look at the powers of the ATA matrix of a bilinear surface, using the ordering of powers that we used in the example:

$\begin{bmatrix} x^0y^0 & x^0y^1 & x^1y^0 & x^1y^1 \\ x^0y^1 & x^0y^2 & x^1y^1 & x^1y^2 \\ x^1y^0 & x^1y^1 & x^2y^0 & x^2y^1 \\ x^1y^1 & x^1y^2 & x^2y^1 & x^2y^2 \\ \end{bmatrix}$

Let’s rewrite it as just the powers:

$\begin{bmatrix} 00 & 01 & 10 & 11 \\ 01 & 02 & 11 & 12 \\ 10 & 11 & 20 & 21 \\ 11 & 12 & 21 & 22 \\ \end{bmatrix}$

And the permutation we used as just powers to help us find the pattern in the powers of x and y in the ATA matrix:
$\begin{bmatrix} 00 & 01 & 10 & 11 \\ \end{bmatrix}$

Can you find the pattern of the powers used at the different spots in the ATA matrix?

I had to stare at it for a while before I figured it out but it’s this: For the i,j location in the ATA matrix, the powers of x and y are the powers of x and y in the i permutation added to the powers of x and y in the j permutation.

For example, $A^TA_{0,2}$ has xy powers of 10. Permutation 0 has powers of 0,0 and permutation 2 has powers of 1,0, so we add those together to get powers 1,0.

Another example, $A^TA_{2,3}$ has xy powers of 21. Permutation 2 has powers of 1,0 and permutation 3 has powers 1,1. Adding those together we get 2,1 which is correct.

That’s a bit more complex than last post, but it’s not too much more difficult to construct the ATA matrix directly – and also to construct it incrementally as new data points come in!

How many unique values are there in the ATA matrix though? We need to know this to know how much storage we need.

In the last post, we needed (degree+1)*2-1 values to store the unique ATA matrix values. That can also be written as degree*2+1.

It turns out that when generalizing this to surfaces and volumes, that we need to take the product of that for each axis.

For instance, a surface has ((degreeX*2)+1)*((degreeY*2)+1) unique values. A volume has ((degreeX*2)+1)*((degreeY*2)+1)*((degreeZ*2)+1) unique values.

The pattern continues for higher dimensions, as well as lower, since you can see how in the curve case, it’s the same formula as it was before.

For the same ATA matrix size, a surface has more unique values than a curve does.

As far as what those values actually are, they are the full permutations of the powers of a surface (or hyper volume) that is one degree higher on each axis. For a bilinear surface, that means the 9 permutations of x and y for powers 0,1 and 2:
$x^0y^0,x^0y^1,x^0y^2,x^1y^0,x^1y^1,x^1y^2,x^2y^0,x^2y^1,x^2y^2$
Or simplified:
$1,y,y^2,x,xy,xy^2,x^2,x^2y,x^2y^2$

For the bilinear case, the ATV vector is the sums of the permutations of x,y multiplied by z, for every data point. In other words, you add this to ATV for each data point:
$\begin{bmatrix} z & yz & xz & xyz \\ \end{bmatrix}$

How much storage space do we need in general for the ATV vector then? It’s the product of (degree+1) for each axis.

For instance, a surface has (degreeX+1)*(degreeY+1) values in ATV, and a volume has (degreeX+1)*(degreeY+1)*(degreeZ+1).

You may also be wondering how many data points are required minimum to fit a curve, surface or hypervolume to a data set. The answer is that you need as many data points as there are terms in the polynomial. We are trying to solve for the polynomial coefficients, so there are as many unknowns as there are polynomial terms.

How many polynomial terms are there? There are as many terms as there are permutations of the axes powers involved. In other words, the size of ATV is also the minimum number of points you need to fit a curve, surface, or hypervolume to a data set.

Measuring Quality of a Fit

You are probably wondering if there’s a way to calculate how good of a fit you have for a given data set. It turns out that there are a few ways to calculate a value for this.

The value I use in the code below and in the demos is called $R^2$ or residue squared.

First you calculate the average (mean) output value from the input data set.

Then you calculate SSTot which is the sum of the square of the mean subtracted from each input point’s output value. Pseudo code:

SSTot = 0;
for (point p in points)
SSTot += (p.out - mean)^2;


You then calculate SSRes which is the sum of the square of the fitted function evaluated at a point, subtracted from each input points’ output value. Pseudo code:

SSRes= 0;
for (point p in points)
SSRes += (p.out - f(p.in))^2;


The final value for R^2 is 1-SSRes/SSTot.

The value is nice because it’s unitless, and since SSRes and SSTot are each sums of squares, SSRes/SSTot is essentially the quantity that the fitting algorithm minimizes. Subtracting that from 1 turns it into a fit quality metric: a value of 0 is a bad fit, a value of 1 is a perfect fit, and generally it will fall between those values.

Example Code

Here is a run from the sample code:

And here is the source code:

#include <stdio.h>
#include <array>
#include <cmath>
#include <algorithm>

#define FILTER_ZERO_COEFFICIENTS true // if false, will show terms which have a coefficient of 0

//====================================================================
template<size_t N>
using TVector = std::array<float, N>;

template<size_t M, size_t N>
using TMatrix = std::array<TVector<N>, M>;

//====================================================================
// Specify a degree per axis.
// 1 = linear, 2 = quadratic, etc
template <size_t... DEGREES>
class COnlineLeastSquaresFitter
{
public:
COnlineLeastSquaresFitter ()
{
// initialize our sums to zero
std::fill(m_SummedPowers.begin(), m_SummedPowers.end(), 0.0f);
std::fill(m_SummedPowersTimesValues.begin(), m_SummedPowersTimesValues.end(), 0.0f);
}

// Calculate how many summed powers we need.
// Product of degree*2+1 for each axis.
template <class T>
constexpr static size_t NumSummedPowers(T degree)
{
return degree * 2 + 1;
}
template <class T, class... DEGREES>
constexpr static size_t NumSummedPowers(T first, DEGREES... degrees)
{
return NumSummedPowers(first) * NumSummedPowers(degrees...);
}

// Calculate how many coefficients we have for our equation.
// Product of degree+1 for each axis.
template <class T>
constexpr static size_t NumCoefficients(T degree)
{
return (degree + 1);
}
template <class T, class... DEGREES>
constexpr static size_t NumCoefficients(T first, DEGREES... degrees)
{
return NumCoefficients(first) * NumCoefficients(degrees...);
}

// Helper function to get degree of specific axis
static size_t Degree (size_t axisIndex)
{
static const std::array<size_t, c_dimension-1> c_degrees = { DEGREES... };
return c_degrees[axisIndex];
}

// static const values
static const size_t c_dimension = sizeof...(DEGREES) + 1;
static const size_t c_numCoefficients = NumCoefficients(DEGREES...);
static const size_t c_numSummedPowers = NumSummedPowers(DEGREES...);

// Typedefs
typedef TVector<c_numCoefficients> TCoefficients;
typedef TVector<c_dimension> TDataPoint;

// Function for converting from an index to a specific power permutation
static void IndexToPowers (size_t index, std::array<size_t, c_dimension-1>& powers, size_t maxDegreeMultiply, size_t maxDegreeAdd)
{
for (int i = c_dimension-2; i >= 0; --i)
{
size_t degree = Degree(i) * maxDegreeMultiply + maxDegreeAdd;
powers[i] = index % degree;
index = index / degree;
}
}

// Function for converting from a specific power permuation back into an index
static size_t PowersToIndex (std::array<size_t, c_dimension - 1>& powers, size_t maxDegreeMultiply, size_t maxDegreeAdd)
{
size_t ret = 0;
for (int i = 0; i < c_dimension - 1; ++i)
{
ret *= Degree(i) * maxDegreeMultiply + maxDegreeAdd;
ret += powers[i];
}
return ret;
}

// Add a datapoint to our fitting
void AddDataPoint (const TDataPoint& dataPoint)
{
// Note: It'd be a good idea to memoize the powers and calculate them through repeated
// multiplication, instead of calculating them on demand each time, using std::pow.

// add the summed powers of the input values
std::array<size_t, c_dimension-1> powers;
for (size_t i = 0; i < m_SummedPowers.size(); ++i)
{
IndexToPowers(i, powers, 2, 1);
float valueAdd = 1.0f;
for (size_t j = 0; j < c_dimension - 1; ++j)
valueAdd *= (float)std::pow(dataPoint[j], powers[j]);
m_SummedPowers[i] += valueAdd;
}

// add the summed powers of the input value, multiplied by the output value
for (size_t i = 0; i < m_SummedPowersTimesValues.size(); ++i)
{
IndexToPowers(i, powers, 1, 1);
float valueAdd = dataPoint[c_dimension - 1];
for (size_t j = 0; j < c_dimension-1; ++j)
valueAdd *= (float)std::pow(dataPoint[j], powers[j]);
m_SummedPowersTimesValues[i] += valueAdd;
}
}

// Get the coefficients of the equation fit to the points
bool CalculateCoefficients (TCoefficients& coefficients) const
{
// make the ATA matrix
std::array<size_t, c_dimension - 1> powersi;
std::array<size_t, c_dimension - 1> powersj;
std::array<size_t, c_dimension - 1> summedPowers;
TMatrix<c_numCoefficients, c_numCoefficients> ATA;
for (size_t j = 0; j < c_numCoefficients; ++j)
{
IndexToPowers(j, powersj, 1, 1);

for (size_t i = 0; i < c_numCoefficients; ++i)
{
IndexToPowers(i, powersi, 1, 1);

for (size_t k = 0; k < c_dimension - 1; ++k)
summedPowers[k] = powersi[k] + powersj[k];

size_t summedPowersIndex = PowersToIndex(summedPowers, 2, 1);
ATA[j][i] = m_SummedPowers[summedPowersIndex];
}
}

// solve: ATA * coefficients = m_SummedPowersTimesValues
// for the coefficients vector, using Gaussian elimination.
coefficients = m_SummedPowersTimesValues;
for (size_t i = 0; i < c_numCoefficients; ++i)
{
for (size_t j = 0; j < c_numCoefficients; ++j)
{
if (ATA[i][i] == 0.0f)
return false;

float c = ((i == j) - ATA[j][i]) / ATA[i][i];
coefficients[j] += c*coefficients[i];
for (size_t k = 0; k < c_numCoefficients; ++k)
ATA[j][k] += c*ATA[i][k];
}
}

// Note: this is the old, "bad" way to solve the equation using matrix inversion.
// It's a worse choice for larger matrices, and surfaces and volumes use larger matrices than curves in general.
/*
// Inverse the ATA matrix
TMatrix<c_numCoefficients, c_numCoefficients> ATAInverse;
if (!InvertMatrix(ATA, ATAInverse))
return false;

// calculate the coefficients
for (size_t i = 0; i < c_numCoefficients; ++i)
coefficients[i] = DotProduct(ATAInverse[i], m_SummedPowersTimesValues);
*/

return true;
}

private:
//Storage Requirements:
// Summed Powers = Product of degree*2+1 for each axis.
// Summed Powers Times Values = Product of degree+1 for each axis.
TVector<c_numSummedPowers>		m_SummedPowers;
TVector<c_numCoefficients>		m_SummedPowersTimesValues;
};

//====================================================================
char AxisIndexToLetter (size_t axisIndex)
{
// x,y,z,w,v,u,t,....
if (axisIndex < 3)
return 'x' + char(axisIndex);
else
return 'x' + 2 - char(axisIndex);
}

//====================================================================
template <class T, size_t M, size_t N>
float EvaluateFunction (const T& fitter, const TVector<M>& dataPoint, const TVector<N>& coefficients)
{
float ret = 0.0f;
for (size_t i = 0; i < coefficients.size(); ++i)
{
float term = coefficients[i];

// then the powers of the input variables
std::array<size_t, T::c_dimension - 1> powers;
fitter.IndexToPowers(i, powers, 1, 1);
for (size_t j = 0; j < powers.size(); ++j)
term *= (float)std::pow(dataPoint[j], powers[j]);

// add this term to our return value
ret += term;
}
return ret;
}

//====================================================================
template <size_t... DEGREES>
void DoTest (const std::initializer_list<TVector<sizeof...(DEGREES)+1>>& data)
{
// say what we are are going to do
printf("Fitting a function of degree (");
for (size_t i = 0; i < COnlineLeastSquaresFitter<DEGREES...>::c_dimension - 1; ++i)
{
if (i > 0)
printf(",");
printf("%zi", COnlineLeastSquaresFitter<DEGREES...>::Degree(i));
}
printf(") to %zi data points: n", data.size());

// show input data points
for (const typename COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
{
printf("  (");
for (size_t i = 0; i < dataPoint.size(); ++i)
{
if (i > 0)
printf(", ");
printf("%0.2f", dataPoint[i]);
}
printf(")n");
}

// fit data
COnlineLeastSquaresFitter<DEGREES...> fitter;
for (const typename COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
fitter.AddDataPoint(dataPoint);

// calculate coefficients if we can
typename COnlineLeastSquaresFitter<DEGREES...>::TCoefficients coefficients;
bool success = fitter.CalculateCoefficients(coefficients);
if (!success)
{
printf("Could not calculate coefficients!nn");
return;
}

// print the polynomial
bool firstTerm = true;
printf("%c = ", AxisIndexToLetter(sizeof...(DEGREES)));
bool showedATerm = false;
for (int i = (int)coefficients.size() - 1; i >= 0; --i)
{
// don't show zero terms
if (FILTER_ZERO_COEFFICIENTS && std::abs(coefficients[i]) < 0.00001f)
continue;

showedATerm = true;

// show an add or subtract between terms
float coefficient = coefficients[i];
if (firstTerm)
firstTerm = false;
else if (coefficient >= 0.0f)
printf(" + ");
else
{
coefficient *= -1.0f;
printf(" - ");
}

printf("%0.2f", coefficient);

std::array<size_t, COnlineLeastSquaresFitter<DEGREES...>::c_dimension - 1> powers;
fitter.IndexToPowers(i, powers, 1, 1);

for (size_t j = 0; j < powers.size(); ++j)
{
if (powers[j] > 0)
printf("%c", AxisIndexToLetter(j));
if (powers[j] > 1)
printf("^%zi", powers[j]);
}
}
if (!showedATerm)
printf("0");
printf("n");

// Calculate and show R^2 value.
float rSquared = 1.0f;
if (data.size() > 0)
{
float mean = 0.0f;
for (const typename COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
mean += dataPoint[sizeof...(DEGREES)];
mean /= data.size();
float SSTot = 0.0f;
float SSRes = 0.0f;
for (const typename COnlineLeastSquaresFitter<DEGREES...>::TDataPoint& dataPoint : data)
{
float value = dataPoint[sizeof...(DEGREES)] - mean;
SSTot += value*value;

value = dataPoint[sizeof...(DEGREES)] - EvaluateFunction(fitter, dataPoint, coefficients);
SSRes += value*value;
}
if (SSTot != 0.0f)
rSquared = 1.0f - SSRes / SSTot;
}
printf("R^2 = %0.4fnn", rSquared);
}

//====================================================================
int main (int argc, char **argv)
{
// bilinear - 4 data points
DoTest<1, 1>(
{
TVector<3>{ 0.0f, 0.0f, 5.0f },
TVector<3>{ 0.0f, 1.0f, 3.0f },
TVector<3>{ 1.0f, 0.0f, 8.0f },
TVector<3>{ 1.0f, 1.0f, 2.0f },
}
);

// biquadratic - 9 data points
DoTest<2, 2>(
{
TVector<3>{ 0.0f, 0.0f, 8.0f },
TVector<3>{ 0.0f, 1.0f, 4.0f },
TVector<3>{ 0.0f, 2.0f, 6.0f },
TVector<3>{ 1.0f, 0.0f, 5.0f },
TVector<3>{ 1.0f, 1.0f, 2.0f },
TVector<3>{ 1.0f, 2.0f, 1.0f },
TVector<3>{ 2.0f, 0.0f, 7.0f },
TVector<3>{ 2.0f, 1.0f, 9.0f },
TVector<3>{ 2.0f, 2.5f, 12.0f },
}
);

// trilinear - 8 data points
DoTest<1, 1, 1>(
{
TVector<4>{ 0.0f, 0.0f, 0.0f, 8.0f },
TVector<4>{ 0.0f, 0.0f, 1.0f, 4.0f },
TVector<4>{ 0.0f, 1.0f, 0.0f, 6.0f },
TVector<4>{ 0.0f, 1.0f, 1.0f, 5.0f },
TVector<4>{ 1.0f, 0.0f, 0.0f, 2.0f },
TVector<4>{ 1.0f, 0.0f, 1.0f, 1.0f },
TVector<4>{ 1.0f, 1.0f, 0.0f, 7.0f },
TVector<4>{ 1.0f, 1.0f, 1.0f, 9.0f },
}
);

// trilinear - 9 data points
DoTest<1, 1, 1>(
{
TVector<4>{ 0.0f, 0.0f, 0.0f, 8.0f },
TVector<4>{ 0.0f, 0.0f, 1.0f, 4.0f },
TVector<4>{ 0.0f, 1.0f, 0.0f, 6.0f },
TVector<4>{ 0.0f, 1.0f, 1.0f, 5.0f },
TVector<4>{ 1.0f, 0.0f, 0.0f, 2.0f },
TVector<4>{ 1.0f, 0.0f, 1.0f, 1.0f },
TVector<4>{ 1.0f, 1.0f, 0.0f, 7.0f },
TVector<4>{ 1.0f, 1.0f, 1.0f, 9.0f },
TVector<4>{ 0.5f, 0.5f, 0.5f, 12.0f },
}
);

// Linear - 2 data points
DoTest<1>(
{
TVector<2>{ 1.0f, 2.0f },
TVector<2>{ 2.0f, 4.0f },
}
);

// Quadratic - 4 data points
DoTest<2>(
{
TVector<2>{ 1.0f, 5.0f },
TVector<2>{ 2.0f, 16.0f },
TVector<2>{ 3.0f, 31.0f },
TVector<2>{ 4.0f, 16.0f },
}
);

// Cubic - 4 data points
DoTest<3>(
{
TVector<2>{ 1.0f, 5.0f },
TVector<2>{ 2.0f, 16.0f },
TVector<2>{ 3.0f, 31.0f },
TVector<2>{ 4.0f, 16.0f },
}
);

system("pause");
return 0;
}


Closing

The next logical step here for me would be to figure out how to break the equation for a surface or hypervolume up into multiple equations, like you’d have with a tensor product surface/hypervolume equation. It would also be interesting to see how to convert from these multidimensional polynomials to multidimensional Bernstein basis functions, which are otherwise known as Bezier rectangles (and Bezier hypercubes i guess).

The last post inverted the ATA matrix and multiplied by ATY to get the coefficients. Thanks to some feedback on reddit, I found out that is NOT how you want to solve this sort of equation. I ended up going with Gaussian elimination for this post, which is more numerically robust and also cheaper to compute. There are other options out there too that may be even better choices. I’ve found out that, in general, if you are inverting a matrix in code, or even just using an inverted matrix that has already been given to you, you are probably doing it wrong. You can read more about that here: John D. Cook: Don’t invert that matrix.

I didn’t go over what to do if you don’t have enough data points, because if you find yourself in that situation you can either decrease the degree of one of the axes, or remove an axis completely. Which parameter to decrease is situational, but it’s always possible to decay the fit to a lower degree or dimension, because you will already have all the values you need in the ATA matrix and the ATY vector. I leave it to you to decide how to handle that in your own use cases. Something interesting to note is that ATA[0][0] is STILL the count of how many data points you have, so you can use that value to know how much you need to decay your fit to be able to handle the data set.

In the WebGL2 demo I mention, I use a finite difference method to calculate the normals of the surface, however since the surface is described by a polynomial, it’d be trivial to calculate the coefficients for the equations that described the partial derivatives of the surface for each axis and use those instead.
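For the 1D case, getting the derivative polynomial’s coefficients is a one-liner per term; a sketch of the idea (DifferentiatePolynomial is a hypothetical helper, not part of the sample code; for a surface you would apply this along one axis of the coefficient grid per partial derivative):

```cpp
#include <vector>
#include <cstddef>

// Given coefficients for c[0] + c[1]*x + c[2]*x^2 + ..., return the
// coefficients of the derivative polynomial.
// d/dx of c*x^i is (i*c)*x^(i-1), so each term shifts down one slot.
std::vector<float> DifferentiatePolynomial (const std::vector<float>& coefficients)
{
    std::vector<float> result;
    for (size_t i = 1; i < coefficients.size(); ++i)
        result.push_back(float(i) * coefficients[i]);
    return result;
}
```

For example, the derivative of 1 + 2x + 3x^2 has coefficients {2, 6}, i.e. 2 + 6x.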

I also wanted to mention that in the case of surfaces and hypervolumes it’s still possible to get an imperfect fit to your data set, even if you give the exact minimum required number of data points. The reason for this is that not all axes are necessarily created equal. If you have a surface of degree (1,2), it’s linear on the x axis but quadratic on the y axis, and requires a minimum of 6 data points to fit a data set. As you can imagine, it’s possible to give data points such that the data points ARE NOT LINEAR on the x axis. When this happens, the surface will not be a perfect fit.
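As a quick sanity check on that minimum, the required data point count equals the coefficient count, which is the product of (degree+1) per axis, just like NumCoefficients in the sample code. A tiny sketch of that for the two-axis case (MinDataPoints is a hypothetical helper):

```cpp
#include <cstddef>

// Minimum data points needed = number of coefficients = product of (degree+1) per axis.
// Degree (1,2) gives (1+1)*(2+1) = 6 terms: 1, x, y, y^2, xy, xy^2.
constexpr size_t MinDataPoints (size_t degreeX, size_t degreeY)
{
    return (degreeX + 1) * (degreeY + 1);
}
```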

Lastly, you may be wondering how to fit data sets where there is more than one output value, like an equation of the form $(z,w)=f(x,y)$.

I’m not aware of any ways to least square fit that as a whole, but apparently a common practice is to fit one equation to z and another to w and treat them independently. There is a math stack exchange question about that here: Math Stack Exchange: Least square fitting multiple values
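As a sketch of that "fit each output independently" idea, here is the closed-form 1D linear case. FitLine is a hypothetical helper, not the fitter from this post; to fit (z,w)=f(x) you would just run the solver once per output:

```cpp
#include <vector>
#include <cstddef>

struct LinearFit { float a, b; }; // y = a + b*x

// Ordinary least squares for a line, via the standard closed-form solution.
LinearFit FitLine (const std::vector<float>& x, const std::vector<float>& y)
{
    float sx = 0.0f, sy = 0.0f, sxx = 0.0f, sxy = 0.0f;
    size_t n = x.size();
    for (size_t i = 0; i < n; ++i)
    {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    float b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    float a = (sy - b * sx) / n;
    return { a, b };
}

// Two outputs, two independent fits against the same inputs:
//   LinearFit zFit = FitLine(x, z);
//   LinearFit wFit = FitLine(x, w);
```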

Here is the webgl demo that goes with this post again:
Least Squares Surface Fitting

Thanks for reading, and please speak up if you have anything to add or correct, or a comment to make!