Calculating Information Entropy

This post has an overview of information entropy then moves onto some technical details and experiments.

If you have any comments, questions, important additions, etc, please hit me up by commenting on this post, or on twitter at https://twitter.com/Atrix256.

The code that goes with this post is at https://github.com/Atrix256/Entropy/

Thanks to all the people on twitter who answered questions & helped me learn this stuff. Links to threads and resources are at the bottom of the post!

Information Entropy Overview

Information entropy is a measure of how much information there is in some specific data. It isn’t the length of the data, but the actual amount of information it contains.

For example, one text file could contain “Apples are red.” and another text file could contain “Apples are red. Apples are red. Apples are red.”. The second text file is 3 times longer than the first but doesn’t really give any extra information. These two text files would have nearly the same amount of information entropy, despite them having different lengths.

It isn’t really possible to calculate the actual amount of information entropy a specific piece of data has in it because it’s related to Kolmogorov complexity: the length of the shortest program that can generate the data (https://en.wikipedia.org/wiki/Kolmogorov_complexity), which turns out to be an incalculable value. That’s why i said that those two text files would have nearly the same amount of information entropy instead of saying they had the same amount. A program to generate the second text file is probably going to be very slightly more complex than the text file to generate the first one: it’ll have a for loop on the outside!

You can get a pretty good upper bound on information entropy though by making a histogram of how often each symbol appears, treating that histogram like a weighted N sided dice (a probability mass function or PMF), and calculating the entropy of that dice.

Interestingly, Huffman compression is related to this (https://en.wikipedia.org/wiki/Huffman_coding). In Huffman compression, you take a histogram of the data, and give shorter length bit patterns to the symbols that occur more. This makes the overall data smaller, but doesn’t change the amount of information there is in the data. In fact, it’s part of a family of compressors called “Entropy encoders” (https://en.wikipedia.org/wiki/Entropy_encoding).

Because of this, you can get a pretty good idea of how much information is in some data by how much you can compress it. If it shrinks a lot, the information density was low. If it doesn’t shrink much, or even gets larger (which happens when the compressed data plus the compression header is larger than the original data!), then the information density is pretty high.

However, there are counter examples to that. Encrypting data hides the patterns of the data, making it look more like uniformly distributed white noise (independent random numbers) while keeping the data the same size. Encrypting data doesn’t actually change the amount of entropy in the data, but it changes the APPARENT amount of entropy. It makes the “shortest program that can generate the data” harder to find, and in fact, encrypted data will not compress!

This actually feels opposite to some things i’ve read about machine learning type algorithms. In machine learning, finding the solution to a problem involves finding the global minimum, and you want a global minimum to be wide and flat so that it’s easy to find and hard to fall out of. There are techniques to make this be more true, such as the SVM kernel trick, or neural networks which make the problem space be higher dimensional and be saddle shaped to help not get stuck in local minima. Encryption seems to do the opposite, making the minimums (global OR local!) of the Kolmogorov complexity be very hard to find, like a dirac delta, where there is no hint where they are so that you have to know where they are in advance (aka you have to know the encryption key!) to be able to find the minimum.

Before moving onto details and experiments, here is a really cool read “Entropy and Redundancy in English” which talks about how Claude Shannon calculated letters in the english language to have about 2.62 bits of information per letter.
https://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm

Interestingly, that last link also talks about something I wrote up previously, which is using markov chains to generate text.
https://blog.demofox.org/2019/05/11/markov-chain-text-generation/

There is also a really interesting tweet from Donald Mitchell (who is famous for research which includes Mitchell’s best candidate algorithm for generating blue noise sample points!), which talks about encryption and compression and generating valid looking language. It’s a very cool tangle of ideas I’m still sorting out in my head 🙂

Consider a hypothetical cipher I call "Babble". It transforms English sentences into other English sentences. So a message might be "MEET ME AT THE THEATER ON TUESDAY NIGHT" and the cyphertext might be "I SPENT AN INTERESTING EVENING RECENTLY WITH A GRAIN OF SALT"

— Donald Mitchell (@DonaldM38768041) November 22, 2019

That “transforms English sentences to other English sentences” part also sounds a lot like format preserving encryption.

The Formula

To calculate information entropy, you need to calculate the entropy for each possible event or symbol and then sum them all up.

To calculate the entropy of a specific event X with probability P(X) you calculate this: $-P(X)*log_2(P(X))$

As an example, let’s calculate the entropy of a fair coin.

The probability of heads is 50%. Here’s the entropy we get when plugging that 0.5 into the equation:
$E_{\text{heads}} = -0.5*log_2(0.5) \\ E_{\text{heads}} = -0.5*-1 \\ E_{\text{heads}} = 0.5$
Since tails has the same probability it has the same entropy as heads (which is 0.5) so we add the entropy of heads and tails to get 1 bit of entropy for a fair coin flip. You can alternately say it’s 1 Shannon of entropy, and if you use different log bases, there are other unit of measurements. The important take away though is that a fair coin gives 1 bit of entropy.

What if the coin isn’t fair though? Here’s the entropy if there’s a 75% chance of heads and a 25% chance of tails.
$E_{\text{heads}} = -0.75 * log_2(0.75) \\ E_{\text{heads}} = 0.311 \\ E_{\text{tails}} = -0.25 * log_2(0.25) \\ E_{\text{tails}} = 0.5 \\ E = 0.311 + 0.5 = 0.811$
So, in this case, each flip of this unfair coin only gives you 0.811 bits of information. It kind of makes sense because each flip you are pretty sure it’s going to be heads, so when it is heads, it doesn’t give you a whole lot of information.

What if the coin somehow is incapable of landing tails. Maybe it is just a 2 headed coin.
$E_{\text{heads}} = -1 * log_2(1) \\ E_{\text{heads}} = 0 \\ E_{\text{tails}} = 0 * log_2(0) \\ E_{\text{tails}} = 0 \\ E = 0 + 0 = 0$
Flipping this maximally unfair coin gives us zero bits of information which makes sense. We know that it will always land heads so the coin flip gives us absolutely no new information.

This equation works the same way no matter how many events or what their probabilities are. It works with 6 sided dice. It works with byte values in a text file, it also works with bits in a text file (going back to heads or tails).

When you have N discrete symbols or events, and a probability of encountering each, it’s called a probability mass function or PMF, which again, can be made from a histogram.

Interestingly, entropy calculations have a full suite of calculations similar to probability calculations. There is joint entropy, conditional entropy, the chain rule of entropy and even a Bayes rule of entropy!

https://en.wikipedia.org/wiki/Joint_entropy
https://en.wikipedia.org/wiki/Conditional_entropy

Text Experiments

The C++ code and source data for these experiments can be found on github at https://github.com/Atrix256/Entropy

We are going to start with a 26KB text file of “The Last Question” by Isaac Asimov. (from https://hell.pl//szymon/Baen/The%20best%20of%20Jim%20Baens%20Universe/The%20World%20Turned%20Upside%20Down/0743498747__18.htm)

We can make a histogram of how many times each 8 bit letter occurred, then turn that into a PMF by calculating the percentage occurrence of each character. From there we can calculate the entropy and divide by 8 to get the entropy per bit. Doing that we get the value 0.582. So, there are 0.582 bits of information entropy per bit of data in that text file. (Technically: or less)

If we compress it with the standard zip file compressor in windows, making an 11KB zip file, then do the same to that file, we get a value of 0.962 bits of information per bit of data in that text file. The file shrank to 42.3% of the size (the old file is 2.36 times larger) and we got 1.65 times as much information entropy.

If instead of compressing the text file, we encrypt it with openssl (command: openssl enc -aes-256-cbc -salt -in lastquestion.txt -out lastquestion.enc -pass pass:moreentropyplease), then calculate the entropy of that file, we get a value of 0.926 bits of information entropy per bit of data.

It’s weird to me that the encrypted data reports a lower entropy than the compressed data, and i’d expect the opposite. The issue might be that our calculation of entropy is naive, and that a more robust entropy calculation would show the encrypted data to have more entropy.

An example of being less naive would like making a histogram of 4 bit, 6 bit, 12 bit values etc instead of only 8. Another thing to do would be to try calculating entropy of re-arrangements of the data. Ultimately, you would report the lowest entropy seen from different “views” of the data.

I’m not sure if that’s what’s at play in the entropy between encrypted and compressed data though.

Interestingly, if you try to compress the encrypted data, it doesn’t really compress and stays at 26KB. The encryption has hidden the structure of the source data pretty well! It now reports an information entropy per bit of 0.923 which is very slightly down from 0.926 bits. It might be the added useless compression header causing the information entropy to go down.

What if we take the zip file of the plain text file and base 64 encode it? We are taking the same data and encoding it in fewer 8 bit symbols. That makes the file larger without changing the amount of information in the data. It’s like the reverse of compression so we’d expect the amount of information entropy per bit to go down and that is what happens! The file increases from 11KB to 14KB and we get an information entropy per bit value of 0.749, which is down from 0.962.

Noise Distributions

What if we look at the entropy of uniform white noise (plain vanilla random numbers)? Taking 100,000 white noise bytes, building a histogram of each 8 bit value, making a PMF and calculating entropy, we get 0.999733. That makes sense because white noise gives maximal entropy so we’d expect a value of 1 and got something very close.

In a recent post i talked about using dice to generate different colors and distributions of noises (https://blog.demofox.org/2019/07/30/dice-distributions-noise-colors/). Do different distributions of noise have different amounts of entropy?

First, let’s look at a triangle distribution vs a uniform distribution.

Rolling a 6 sided die gives us a uniform distribution with possible values 1 through 6. The entropy of a 6 sided die is 2.585 bits.

If we wanted a triangle distribution with possible values 1 through 6, we could roll a 4 sided die and a 3 sided die (strangely, they exist! https://www.amazon.com/Bescon-Polyhedral-3-sided-Gaming-Assorted/dp/B06XHF4Y6J) and add them to get values 2 through 7, then we could subtract 1 to get values 1 through 6.

We can make a histogram by looking at how many ways each possible number can come up, and turn them into percentages:

one= 1 = 8%
two = 2 = 17%
three = 3 = 25%
four = 3 = 25%
five = 2 = 17%
six = 1 = 8%

Summing the entropy of each of those percentages gives about 2.454 vs a uniform 6 sided die giving 2.585 bits. So, triangular distributed values have less entropy than uniform distributed values. That makes a bit of sense because we already knew that a fair coin gave more entropy than an unfair coin, and a triangle distribution is like an extension of an unfair coin to dice.

Noise Colors

Next, let’s look at the color of noise (frequency composition / correlation). Let’s use blue noise because that’s my favorite.

I made some floating point 1d blue noise data (with values between 0 and 1) using Mitchell’s best candidate algorithm (https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/), and converted the floating point values to uint8.

If you look at a histogram of those uint8 values, it will show a flat histogram, meaning a uniform distribution and will give maximal entropy.

That isn’t quite right though, because we know that the values have randomized high frequency components and no low frequency components.

That means that each roll is based on the previous rolls, and so the rolls aren’t uncorrelated, and they aren’t independent. That means they should have less entropy.

One idea to capture this relationship is to take each pair of uint8s in the data, make them into a uint16, and make a histogram of those uint16s (Thanks Nathan! https://twitter.com/Reedbeta). If the uint8s were uncorrelated and evenly distributed, then the uint16s should be evenly distributed. That would mean that a value didn’t care about what it’s previous value was.

Since we know there is correlation in the blue noise, the result in our case shouldn’t be evenly distributed, and should result in less entropy than white noise.

That’s the result we get in fact, and so for 100,000 uint8 blue noise values, we get an entropy of 0.92608, where white noise gives 0.965912.

So interestingly, blue noise is more predictable and less random than white noise, which shouldn’t come as a huge surprise, since we know the PMF isn’t a uniform distribution at each new value.

Making a histogram of 16 bit values of the 8 bit stream like the above is equivalent to making an order 1 Markov chain and only has a “memory” of the previous value. You could make higher order Markov chains by making 24 or 32 bit values and beyond, to find correlation that went beyond immediate neighbors, but keeping a histogram of 2^24 or 2^32 values and larger is a pretty large amount of memory. You’d also need a lot more source data, to give the histogram a chance to stabilize / converge.

There might also be a delay of correlation, like each value could be correlated with the value that is 100 values ago. At one byte per value, you’d need an order 100 Markov chain to capture that relationship, and so would need a 2^800 bytes of memory which is astronomically large.

Other Info

If you look at the csv in the output folder of the repository for this post (https://github.com/Atrix256/Entropy), you’ll see some oddities.

One is how when looking at 11 bits at a time, the plain text has more entropy than when looking at it with 8 or 12 bits. The reason for that specific thing is that the data inherently is 8 bit, which is coprime to the 11 bits we are looking at it with, so makes the entropy look a lot higher by making it seem like there are more unique bit patterns than there are. Ultimately, when calculating entropy in a variety of ways, you should take the minimum value you find as the final answer, because it’s related to the shortest program you can reproduce it with. There are plenty of longer programs, but you only care about the shortest. You should also be careful not to get too crazy in whatever code you use to make your “view” of the data (re-arranging values / looking forward & backward a far amount) because that stuff is part of the program needed to reproduce the data, but isn’t represented in the entropy calculation you are doing.

Something else is that if you don’t have enough data, the histogram / PMF won’t converge to the actual values. You can see this manifesting in the csv data in the difference between the “small white noise” which is 8 bytes of white noise, and the “white noise” which is 100,000 bytes of white noise. They really do have the same characteristics and entropy amount per bit, but the small white noise doesn’t have enough data available to show the correct values.

Links

A very related, but more formalized, write up on information theory.
https://www.blackhc.net/blog/2019/better-intuition-for-information-theory/

A stack exchange question about how to calculate entropy of correlated samples.
https://math.stackexchange.com/questions/3428693/how-to-calculate-entropy-from-a-set-of-correlated-samples

Here’s a neat post that shows how to visualize entropy in a binary file, to find where encryption keys are since they are high entropy. Thanks Ashley! (https://twitter.com/ACEfanatic02)
https://corte.si/posts/visualisation/entropy/index.html

The “Breaking Math” podcast episode 24 is really good and related to this stuff:

“The Complexity Barrier” blog post by John Baez
https://johncarlosbaez.wordpress.com/2011/10/28/the-complexity-barrier/

A thread showing how Oskar Stålberg is bitbacking data very efficiently, so the data is not compressing very well at all.

Here's how I pack the bits now:

5 bits : A = bit count for offset
A bits : offsetX
A bits : offsetY
5 bits : B = bit count for pos
repeat
{
B bits : posX
B bits : posY
repeat
{
3 bits : 0 – 3 = color, 4 = empty, 5 = 2 empties, 6 = tower done, 7 = all done
}
}

— Oskar Stålberg (@OskSta) November 27, 2019

Some related twitter megathreads:

Is there a way to calculate the actual amount of information in a bit string? Like if you have a 1MB file that losslessly compresses to 1KB, the actual amount of info is <= 1KB. (1/N)

— Alan Wolfe (@Atrix256) October 31, 2019

Ok so, my naive unserstanding is that if you want to know how much entropy something actually has, you shove it into a black hole and the difference in surface area of Planck lengths squared will be the bits of entropy it contained. However you cant measure that without changing

— Alan Wolfe (@Atrix256) October 31, 2019

The difference between uncorrelated and independent:

maybe something along the lines of ICP (independent component analysis)?https://t.co/BEuWMf4gCl

independent is stronger than just uncorrelated

find a transform where your data looks maximally "interesting", ie. non-gaussian

— Christian Schüler (@aries_code) November 27, 2019

Generating Random Numbers From a Specific Distribution With The Metropolis Algorithm (MCMC)

There is ~400 lines of standalone C++ code that implements the main ideas in this post. You can find it at: https://github.com/Atrix256/MetropolisMCMC

In previous posts I showed how to generate random numbers from a specific distributing by using two techniques:

Rejection Sampling: https://blog.demofox.org/2017/08/08/generating-random-numbers-from-a-specific-distribution-with-rejection-sampling/

Inverting the CDF: https://blog.demofox.org/2017/08/05/generating-random-numbers-from-a-specific-distribution-by-inverting-the-cdf/

This post will show how to do it using a Markov Chain Monte Carlo method called “The Metropolis Algorithm”. This post also talks about using it for numerical integration.

If you want an intro or review to either Markov Chains or Monte Carlo, these two posts can help you out.

Monte Carlo: https://blog.demofox.org/2018/06/12/monte-carlo-integration-explanation-in-1d/

Markov Chains: https://blog.demofox.org/2019/05/11/markov-chain-text-generation/

Overview

The Metropolis algorithm lets you generate random numbers that follow a distribution given by any function y=f(x), where y is the probability of choosing x.

Rejection sampling does this as well, but you have to throw away an unknown number of bad samples before getting each good sample.

Inverting the CDF doesn’t throw out any samples, but it’s limited in the type of distributions it can do: It can be mathematically complex, or impossible, to find the inverted CDF for a given function analytically.

In these ways, the Metropolis algorithm is the best of both worlds. You can sample from a distribution defined from any function, and you don’t have to throw out any samples while doing it.

Another interesting thing about the Metropolis algorithm is that it can work blindly. It never actually has to KNOW what function describes the probability distribution. If there is a black box you can give an x to get a y, the Metropolis algorithm can generate random numbers from the distribution hidden in that black box.

The Metropolis algorithm also works in any dimension: you can use it with functions like z=f(x,y) or t = f(x,y,z,w,s).

In higher dimensions such as z=f(x,y), the z is the probability of a 2d random number (x,y) being chosen. It sounds weird but this situation is like if you rolled two dice, the value that one die came up as affected the probability of the other die. Maybe when the first die was a larger number, it made it be more probable for the other die to be a larger number too.

As weird as that is, if you can describe the relationship as a z=f(x,y) function, this can generate (x,y) random numbers from that distribution for you.

It’s worth noting that the Metropolis algorithm is a simpler special case of the Metropolis-Hastings algorithm, and these are just two of many Markov Chain Monte Carlo algorithms.

The Metropolis Algorithm

The metropolis algorithm is pretty simple.

You start with an x value and calculate y which is just f(x). This is the initial sample and hopefully is a location where y is greater than 0.

To get the next sample, we first need to calculate a candidate sample, and then choose whether to take it or not.

To make a candidate sample, take a small random step from the current x point to get a new x, either in the negative or positive direction. Calculate y which is just f(x) using the new x.

Calculate a value A (for acceptance value) which is the candidate y divided by the last y value. This is the percentage chance you should take the new sample as the current sample, otherwise you take the old sample as the current sample.

To make the decision, you just generate a random number between 0 and 1 and accept the new sample if it’s below the A value.

Rinse and repeat for as many samples as you want.

Here’s a simple but fully featured implementation that you can find in the code that goes with this post(https://github.com/Atrix256/MetropolisMCMC)

The x axis of each sample is a random number drawn from the distribution described by y=f(x). If you keep track of the average x value seen, you’ll get the expected value of the PDF.

The y axis of each sample isn’t that useful directly. It’s the pdf(x) but scaled up by an unknown amount – the normalization constant for the function.

The convergence rate of Metropolis MCMC isn’t as well understood as monte carlo integration, since Metropolis has dependent samples (a random walk that knows where it was last step) vs independent samples (a stateless random sample).

In higher dimensions, you just take a random step on each axis, instead of only the x axis. The rest of the algorithm remains the same.

Regarding the random walk, it’s possible to use many different types of random numbers to take a step, but it’s most common to use a normal distribution.

Metropolis Burn in and 0.234

Metropolis is stateless, and has no memory of the past. Despite this, it’s common to do a “burn in” with MCMC where you throw out some number of samples before you start.

This might sound weird, but the reason for it is that there are good places to sample from and bad places to sample from, and this is an attempt to have the random walk find a better place to sample from.

For instance, if you were sampling from y=sin(x)*sin(x)*x from 0 to 2 pi, you’d have a pdf that had a shape like the bimodal function below:

If you started the random walk at 2, you’d be doing a random walk in the left, less probable hump, and it would be hard to break out into the right side which was supposed to be more probable.

You should eventually get into the larger hump and stay there longer, but your initial guesses may get stuck in a local minimum and not do a very good job of following the distribution.

While burn in can help situations like these, this is also an example of how the Metropolis algorithm can fail.

It’s worth noting that using a normal distribution for the random walk can help it not get stuck in bad places. If you have a uniform random distribution that can generate numbers between -k and +k, you can get stuck in situations where you need to take a larger than k step to get to a better (more probable) location. If instead you use a normal distribution, the sigma allows you to have a good idea of how big most of the steps will be, but there will always be a greater than zero chance that you can take a step of ANY size, which could get you out of a local minimum.

There is a rule of thumb that the step size you use (the sigma in the case of a normal distribution) should make it so you are accepting a new step on your random walk 23.4% of the time. This supposedly is a good balance between exploration (finding better places elsewhere) and exploitation (staying in a good place) in many situations.

This isn’t fully settled though, as this is only true some of the time, and counter research has come up. (https://www.sciencedirect.com/science/article/pii/S0304414907002177)

I experimented at having a “Burn in” phase where it also adjusted the sigma to try and reach that 23.4% acceptance rate over the previous N samples. I didn’t play with it very long though, and was unable to get it to reliably reach that acceptance rate.

Even beyond the 0.234 acceptance rate goal, tuning the step size for your specific situation can help you get better or worse results. I didn’t play around with that much in my experimentation though, and found a sigma of 0.2 worked pretty well when working with functions in the 0-pi and 0-2pi range.

The initial starting point of your random walk can affect performance too obviously, since the burn in stage is supposed to find a better starting position.

Limiting the Function’s Range

In one of the tests I did, I used the function y=sin(x) with x going from 0 to pi/2.

At first i tried clamping x to 0 to pi/2, to keep it in range but doing that made the technique fall apart. There were plenty of times the random walk would try to step out of bounds, but instead of taking that step, it would clamp to the border. This meant that the random walk had a significantly higher chance of reaching the boundary than anywhere else on the graph, and that broke the algorithm.

The correct thing to do was to just make the function return 0 when x was out of range. In this way, it would end up taking any out of range location with 0% probability, but there was no bias about more or less likely places visited by the random walk, other than it preferring higher (more probable) parts of the function.

Something else worth noting about the function is that if the function ever returns a negative number, it ends up being the same as if it returned zero probability.

If use Metropolis MCMC for integration, this can be an important fact, because it will basically ignore the negative values, and treat them as zero.

Discrete Case

As described, this algorithm works with continuous random numbers, drawing them from a PDF.

The same concepts work for discrete states though too (a more traditional looking markov chain), drawing from a PMF instead.

When handling the discrete case, it needs to be possible to be in any state at any point in time. A usual way to avoid the edge case of probabilities being such that you can only be in some nodes on even steps and others on odd steps, is to have a “self loop” on at least one node, which has a greater than zero probability of staying at the same node.

This page has some great info about the discrete case:
Markov Chain Monte Carlo Without all the Bullshit

Integration

Using the Metropolis algorithm for numerical integration is possible, but is not as straightforward as Monte Carlo integration.

In Monte Carlo integration, to get a single estimate of an integral you calculate f(x) / pdf(x). f(x) is the function value at the random location x, and pdf(x) is the probability of that x value being chosen. You take the average of N such estimates and as N approaches infinity, the error of the average of estimates approaches 0.

In Metropolis MCMC we do have N number of samples, and it almost seems like we have enough data to do this, but it turns out that we don’t.

For f(x) which is the function value at the random location, you literally do have f(x). It’s the y component of each sample generated.

The problem is that we don’t have pdf(x).

The probability of choosing x is in fact based on the function we are evaluating f(x), but the function is essentially an un-normalized pdf, but we are able to draw random numbers from the pdf without knowing the normalization constant.

So, pdf(x) is some scalar multiple of f(x), but we have no idea what the multiplier is. That multiplier, the normalization constant, turns out to be the integration value we want to search for.

So we’ve gone in a circle and are no closer to being able to integrate with Metropolis.

There are some ways to deal with this though.

One way is to do mathematical tricks to make it so things “cancel out” and leave the normalization constant.

There is something called the “Harmonic Mean Estimator” that does this, but has infinite variance so is called “The Worst Monte Carlo Method Ever”.
https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/

There is another way though, that I use in the code that goes with this post.

Imagine that while you are doing your Metropolis MCMC you have some interval [a,b] that whenever you get a random number drawn in that range, you increment a counter.

After N total samples, you’ll have M samples that fell in this interval. An estimate of the integral of the normalized pdf over this interval is M/N.

Now, you can do regular Monte Carlo integration of the function over this range to get the integral of the UN-normalized pdf over this interval.

When you divide the unnormallized pdf value by the normalized pdf value, you’ll get the normalization constant aka the integral of the function.

A smaller interval size is better for the Monte Carlo integration because it will converge faster (better results in fewer samples), but it’s worse for the Metropolis integration because a smaller interval is less likely to be accurate with the random walk.

My intuition tells me that if you keep a histogram of the x values you’ve seen be generated from the Metropolis algorithm, that the ones with higher counts are more likely to be accurate. So I just integrate over whatever histogram bucket has the highest count. I haven’t done any real analysis of whether or not this is true, or how good this integration estimate is in general.

For the Monte Carlo integration, I used white noise (regular old random numbers) to integrate, but in reality you’d get much better results from something like sobol. I used white noise because i made the code generic for N dimensions and white noise generalizes to any dimension.

Experiment Results

I didn’t play around much with initial guesses, sigmas, trying to reach the 0.234 acceptance rate, or burn in, but here’s some results from the code that goes with the post.

In the below, the blue line – normalized function value – is the actual desired PDF . The red line – Percentage – is how many samples we actually got in that histogram bucket. When these lines match up, we are happy and everything worked like we wanted it to.

y=sin(x) x in [0,pi]

y=|sin(x)| x in [0, 2pi]

y=sin(x)*sin(x) x in [0, 2pi]

It’s interesting to see the last one be so far from the real PDF. That function must trap the random walk in one side or the other a bit too much.

There is also a 2d function z=f(x,y) that is tested in the c++ code that goes with this post. I don’t know of any easy ways to make a 2d histogram so don’t have any results to show.

Links and Closing

The Metropolis algorithm is pretty neat but it’s just the beginning of MCMC methods. I’ve heard that Hamiltonian Monte Carlo can give much better results by using derivatives to make more intelligently sized step sizes.

Something I find interesting is that plain Monte Carlo uses white noise, quasi Monte Carlo uses low discrepancy sequences (and i think blue noise would fit in here), while Metropolis MCMC uses a random walk, which is red noise.

I’m not sure what to make of that, but my brief reading about Hamiltonian Monte Carlo was that it allows the samples to be less dependent, and that’s why it improves things. Maybe there are some secrets to red noise, like there are for blue noise? I’m not really sure but will keep looking 😛

A great write up on Metropolis MCMC
https://stephens999.github.io/fiveMinuteStats/MH_intro.html

Another small but useful write up
http://www.pmean.com/07/MetropolisAlgorithm.html

A 35 minute video about Metropolis MCMC

A mathier set of videos about Metropolis MCMC that is actually very easy to understand:

“A Zero-Math Introduction to Markov Chain Monte Carlo Methods”
https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50

A more mathy overview of Metropolis
https://ermongroup.github.io/cs323-notes/probabilistic/mh/

A series of posts aimed at being a gentle introduction to MCMC
https://theclevermachine.wordpress.com/tag/monte-carlo-integration/

A mutli branch twitter thread talking about some interesting MCMC related things

What i've been learning about metropolis MCMC:
Everyone starts by trying to integrate sin(x) from 0 to pi, and WE ALL get 2.5 as an answer, instead of 2.0.
🤣
I will share the right way when i figure it out 😛

— Alan Wolfe ☕ (@Atrix256) May 18, 2019

“Introduction to MCMC”

Introduction to MCMC

“MCMC Burn In”

MCMC burn-in

Markov Chain Text Generation

This post includes a standalone (only standard headers, no external libs) ~400 line C++ source file that can analyze text and use an order N Markov chain to randomly generate new text in the same style. The Markov code itself is fairly generic / re-usable and a template parameter to the class lets you specify the order of the chain as well as the type of state data to use. That code is on github at: https://github.com/Atrix256/TextMarkovChain

When I see material on Markov chains, it usually comes in two flavors:

Very Mathy
Pretty impressive results light on explanation

It turns out the reason for this is because they CAN be very mathy but they can also be extremely simple.

Without knowing this, I decided it was time to learn about Markov chains. I leveled up my linear algebra knowledge a bit, finally getting a solid grasp on eigen vectors, and learning things like how to put a matrix into an eigen basis form to be able to make matrix exponentiation a trivial operation. There are links at bottom of post if you want to learn this stuff too.

Then, I sat down to learn Markov chains and nearly flipped my table over! Yes, Markov chains can be mathy (and matrix exponentiation is one way to find a Markov chain steady state, but not the best), but that stuff isn’t really required for most uses.

Markov Chains

A Markov chain is just any situation where you have some number of states, and each state has percentage chances to change to 0 or more other states.

You can get these percentages by looking at actual data, and then you can use these probabilities to GENERATE data of similar types / styles.

Example

This post uses Markov chains to generate text in the style of provided source text.

The first step it does is analyze source text.

To analyze the source text, it goes through text, and for each word it finds, it keeps track of what words came next, and how many times those words came next.

When analyzing the story “The Tell-Tale Heart” by Edgar Allan Poe for instance (https://poestories.com/read/telltaleheart , also is data/telltale.txt in the code that goes with this post), here are the words that came after “when” and their counts.

all – 1
enveloped – 2
he – 1
i – 4
my – 2
overcharged – 1
the – 1

Here are the counts for the words that appear after “is”:

but – 1
impossible – 1
merely – 1
nothing – 1
only – 1
the – 2

After all these counts have been gathered up, the next step is to convert them into probabilities. You do this by summing up the words that come after a specific word, and dividing the count of each word by that total sum.

The above examples then turn from counts to probabilities. Here is “when”:

all – 8%
enveloped – 16%
he – 8%
i – 33%
my – 16%
overcharged – 8%
the – 8%

Here is “is”:

but – 14%
impossible – 14%
merely – 14%
nothing – 14%
only – 14%
the – 28%

Note: The code that goes with this post spits out these counts and percentages in the “out/stats.txt” file if you ever want to see the data.

Once the probabilities are known, you can start generating text. The first thing you do is pick a word purely at random, this is the first word in the text.

Next, you use the probabilities of what words come after that word to randomly choose the next word.

You then use the probabilities of what words come after that word to randomly choose the next word.

This repeats until you’ve generate as much text as you want.

The code with this post generates 1000 words into the “out/generated.txt” file.

That is literally all there is to it. You could do this same process with sheet music to generate more music in the same style, you could do it with weather forecasts to generate realistic weather forecasts (or even try to use it to predict what weather is next). You can do this with any data you can imagine.

Example Generated Output

Here is 100 words of generated text from various sources.

First is text generated from “The Tell-Tale Heart” by Edgar Allan Poe (https://poestories.com/read/telltaleheart):

…About trifles, and with perfect distinctness — very slowly, my sagacity. I then took me, louder — you cannot imagine how stealthily — with what caution — cautiously — would have told you may think that no longer i knew that no blood – spot. He would not even his room, to do the hour had made up my whole week before him. I knew what dissimulation i showed them causeless, undisturbed. Now a hideous heart, no — wide open — all and the old man, and he would have…

Here is text generated from “The Last Question” by Isaac Asimov (http://hell.pl/szymon/Baen/The%20best%20of%20Jim%20Baens%20Universe/The%20World%20Turned%20Upside%20Down/0743498747__18.htm):

…Glory that. Man said, it into a meaningful answer. Granted, said, might be kept from the entire known to restore the universe for meaningful answer. Mq – talkie robot, ac learned how many stars are dying. The boys appreciated that not. Cosmic ac that, how may be able to reach the small station, said at half the same. He shrugged. We’ll have enough to be alone. And lose itself aloof. When any other kind of universal ac. He consisted of individuals were self – contact…

Here is text generated from a research paper “Projective Blue-Noise Sampling” (http://resources.mpi-inf.mpg.de/ProjectiveBlueNoise/ProjectiveBlueNoise.pdf):

…Numerical integration. Mj patterns to vector multiplication to achieve a way that the above question whether there exist distributions have addressed anisotropic classic lloyd relaxation green and rotated pattern significantly worse than the j 1, where each site: our projective blue – noise point distributions along both axes. Previous work sampling when undergoing one after a certain number of common blue noise patterns, but at the publisher s ., cohen – left constructs a quality of latinizing the non – sample counts however, as a set only in a theory 28, this shrinkage…

Here is text generated from an example (not real, but representative) psych report from my wife who is a school psychologist:

…Brother had to mildly impaired body movement, the school and placement after a 90 probability that student: adapting to struggle as video games. Student’s planning and he request, spelling subtest scores. This time. The student: this time and accurately with both, including morphology, 2013. Administrators should consider participation in the following are student as intellectually disabled specific auditory comprehension of reading: mr. Mrs. The two subtest is designed to use of or economic disadvantages, gestures, vitality or economic disadvantages, picking at approximately 5th grade prior…

Here we generate a markov chain using ALL the above source texts, to get a mash up of all of them.

…Restore the sphere packing radius is likely an adaptive skills. Please see inset in the conner s problems, we’ll just have well and visualization and he is computed on 1 2 was contacted by things, and restricted number of his abilities. We can simply like them, as well as a s difficulty interacting with a closer to cry, the process based on the standards – appropriate to spurious aliasing artefacts mit87, making a meaningful answer. Finally, 11 months through hyperspace to try his eye contact. Jerrodine’s eyes were going out if…

Lastly, here is only Poe and Asimov combined:

…Could not forever, and continually increased. And stood for a sudden springing to get back and the eighth night i to that man, 2061, but the original star and made trips. A very, and fell full youthfulness even to feel — i then stop someday in five words on a while i heard all the noise steadily for us, calling him to pluto and now a galaxy alone pours out, quick sound would think of individuals. He stirred his hideous veil over the ceiling. Twenty billion years ago, man, …

Nth Order Markov Chains

Using one word to generate the next word works somewhat well – the generated Poe text definitely seemed like Poe for instance – but there are plenty of times when things don’t make much sense.

A markov chain can become higher order when you don’t just look at the current state to transition to the next state, but you look at the last N states to transition to the next state.

In the text generation case, it means that a 2nd order Markov chain would look at the previous 2 words to make the next word. An order 3 markov chain would look at the previous 3 words to make the next word.

Interestingly, an order 0 Markov chain looks at NO WORDS to generate the next word, so is purely random word generation, with similar word counts (by percentage) as the original text.

The code that goes along with this post lets you specify the order on the Markov chain.

Here is “The Tell-Tale Heart” with an order two markov chain.

…Dark as midnight. As the bell sounded the hour, there came to my ears: but he had been too wary for that. A tub had caught all — ha ha when i describe the wise precautions i took for the concealment of the old man sprang up in bed, crying out — no blood – spot whatever. I removed the bed and examined the corpse. Yes, he was stone, stone dead. I knew that he had been lodged at the police. A watch’s minute hand moves more quickly than did…

If you compare that to the actual story, you can find fairly large sections of that are taken verbatim from the source text, but the arrangement of those larger chunks are different.

The reason for this is that when you have two words mapping to the next word, the number of these go up, which makes it so on average, there are going to be fewer choices for “next words”, which make the results less random, and more deterministic.

If you gave it more text (like, maybe, all of Edgar Allan Poe’s work), there would be more options for the next word after specific 2 word pairs, but with a single short story, it doesn’t have very many choices. If you look at the out/stats.txt file and compare order 1 vs order 2, you can see that order 2 has a lot more situations where a current state maps to a single next state.

At order 3 there are even fewer choices, and it hits a pattern loop:

…Had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to fear there entered three men, who introduced themselves, with perfect suavity, as officers of the police. A shriek had been heard by a neighbor during the night; suspicion of foul play had been aroused; information had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to…

Here is an order 2 mashup of Poe and Asimov:

…Crossing the floor, and still chatted. The universal ac interrupted zee prime’s own. It had to be contrary, and jerrodette i. Ask multivac. As the passage through hyperspace was completed in its place, each cared for by perfect automatons, equally incorruptible, each with its dreadful echo, the real essence of men was to be contrary now, now, honeys. I’ll ask microvac. Don’t shout. When the sun, and their only concern at the visiplate change as the frightened technicians felt they could hold their breath no…

Lastly, here’s an order 2 mashup of all 4 source texts:

…Mathematics: student does not require special education and related services, the radius of each other, indistinguishable. Man said, ac organized the program. The purpose of this report provides information about the child s educational performance. Other pertinent future work includes the extension of our projective lloyd patterns against other patterns on a role not based on his scores on this scale is different for the sake of visual clarity, we specify all spaces via a set x. In a way, man, i undid it just so much that a single…

Other Implementation Details

When combining the texts, it might make sense to “normalize” the percentages for each source text. How it works now with raw counts makes it so longer documents have more of their style preserved in the final output document.

You may also want to give weightings to different text so you can have a sliding scale between Poe and Asimov for instance, by basically scaling the counts from their files higher or lower to give more or less representation in the results.

When analyzing the text, I had to think about what to do with punctuation. I chose to treat punctuation as words in themselves, but ignored some punctuation that was giving weird results – like double quotes. I’ve only just now realized that I incorrectly ignore question marks. Oops.

When generating text, i made it so some words don’t put a space before themselves (like, a period!), and i also made it so words would have their first letter capitalized after a period or similar. There seems to need ad hoc, domain specific massaging to get reasonable results.

It’s possible (especially with higher order markov chains) that you can get into a situation where your current state has nothing to transition to. You’d have to figure out what to do in this case. One idea would be to choose a next word at random. Another idea would be to fall back to a lower order markov chain maybe?

I feel like once you understand the algorithm, it’s an art form to teach and tune the Markov chain to get good results. I bet there are some interesting techniques beyond the simple things I’ve done here.

Links

Mathy Markov Chain Info

If you want to dive into the mathy side of markov chains, here are some great resources you can follow to get there…

A great linear algebra online “text book”, that is very easy to read and understand: http://immersivemath.com/ila/index.html

Some great videos on linear algebra: https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

A 9 part series on markov chains. It’s this long because it’s very explicit and works through the details by hand. I watched it at like 1.5x speed and was fine 😛

Some “mathy” notes about Markov chains, including higher order ones:

Click to access chap4.pdf

Q Learning

Related to markov chains, Q learning is essentially is a way to learn a Markov chain from data – for instance learning how to play tic tac toe, or how to traverse a maze.

I would like to learn Q learning better and make a post (and code!) at some point.

Q Learning Explained With HTML5
https://blockulator.github.io/Q-Learning-Explained-With-HTML5/

An introduction to Q-Learning: reinforcement learning
https://medium.freecodecamp.org/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc

Reinforcement Learning Tutorial Part 1: Q-Learning
https://blog.valohai.com/reinforcement-learning-tutorial-part-1-q-learning

Reinforcement Learning Tutorial Part 2: Cloud Q-learning
https://blog.valohai.com/reinforcement-learning-tutorial-cloud-q-learning

Reinforcement Learning Tutorial Part 3: Basic Deep Q-Learning
https://towardsdatascience.com/reinforcement-learning-tutorial-part-3-basic-deep-q-learning-186164c3bf4

Other

Here is a twitter conversation about some compelling uses of Markov chains

Dear awesome twitter friends: What are some cool and/or compelling usage cases for markov chains? (bonus points if you give a link) 😎

— Alan Wolfe ☕ (@Atrix256) May 8, 2019

Here’s a video “Markov Chain Monte Carlo and the Metropolis Algorithm” which uses Markov chains to help calculate integrals numerically.

Code

Again, the code for this post is up on github at https://github.com/Atrix256/TextMarkovChain

The code is written for readability and runs plenty fast for this demo (nearly instant in release, a couple seconds in debug) but There are lots of string copies etc that you would want to fix up if using this code seriously.

Thanks for reading!

Linear Fit Search

Binary search looks in the middle of a list to make a guess about where a search value is. If that guess is wrong, it can eliminate half of the list (based on whether the search value is less than or greater than the guess location) and try again. It repeats until it’s either found the search value, or runs out of list.

This algorithm works well but it is blind to the actual values it got when making guesses, beyond just checking if they were greater or less than the search value.

I recently wondered: If we knew the min and max value stored in the list, couldn’t we make a more intelligent guess as to where the search value might be? We could fit the data with a line, figure out where our guess would be on that line, and make that be our initial guess. As we iterate, we could use our incorrect guesses as new min or max values of the line as appropriate, updating our line fit as we went, and perhaps arrive at an answer more quickly.

Another way of looking at this: If the guess a binary search made is VERY far from the search value, maybe it should go farther than the midpoint when making the next guess? Or, if it was pretty close to the search value, maybe it shouldn’t go as far as the midpoint? Close vs far measurements depend on the overall magnitude of the numbers in the list, so you’d need to know what sort of values are stored. A min and a max value of the list can give you a rough idea of that, especially if you update those min / max values as you repeatedly cut the list with guesses.

This post explores that idea. The result is something that could be more attractive than binary search, depending on what kind of trade offs are being looked for. While I haven’t heard of this technique , I wouldn’t be surprised if it’s been tried before and written about. (Know of a source? let me know!).

UPDATE: @thouis from twitter mentioned the basic idea is called “interpolation search”. This post goes beyond that basic idea but you can read more about it here if you’d like 🙂 https://www.techiedelight.com/interpolation-search/. He has a paper about interpolation search that you can read here (it has some relation to discrepancy, as in low discrepancy sequences, oddly!) https://erikdemaine.org/papers/InterpolationSearch_SODA2004/

The post goes a step further to address a problem that is encountered when using this algorithm, and also talks about other ways this algorithm might be extended or generalized.

An implementation, and the code that generated all the data for this post, can be found here: https://github.com/Atrix256/LinearFitSearch

Initial Problem / Other Possible Avenues

(Feel free to skip this section if you get lost. You won’t miss anything important about the algorithm itself)

If you are wise in the ways of numbers, you might be saying to yourself that this only works if you have roughly evenly distributed numbers – basically, a flat PDF, or a flat histogram. This is because by only knowing the min and max, you are doing a linear fit of the data, and making guesses as if your data is well represented by that line. The less like a line your data actually is, the less good this ought to work.

That is true, and I thought up this idea while trying to think of how to generate 1d blue noise more quickly, which is random but roughly evenly spaced values. For that usage case it does well, but there are many types of non linear data out there that you might want to search through.

Really what you want to do is learn the distribution of the values in the list, and use that knowledge to know where the value you are searching for is likely to be.

I didn’t go that direction in these experiments, but it seems like a data scientist would have plenty of tools in their tool box to attempt something like that. Markov chain Monte Carlo type algorithms come to mind.

There’s another way to look at the problem of searching for a value in a list, and that’s to look at it as strictly a function inversion problem.

If you look at your sorted list as a lookup table, where the index is the x value, and the value stored is the y value, a search tries to tell you the x value for a specific y value that you are searching for.

In this context you only care about integer values of x, and there might be duplicate values in the list, making it not a strictly monotonic function – not having each y value be larger than the last y value – but has a more relaxed version where each y value is >= the last y value.

Thinking about the search problem as a function inversion problem, ignoring the monotocity issue, there are far too many data points to do an analytic inverse, so you would be looking at numerical inverse solutions.

I also didn’t really explore that direction, so it’s another way to go that might yield some better fruit.

Lastly, you could see searching a sorted list as a root finding problem. If you are looking for where the function minus the search value equals zero, numerical root finding functions could maybe help you here. I also did not try anything in that direction.

If anyone ends up exploring any of the alternative avenues, I’d love to hear what kind of techniques you used and what your results were!

Linear Fit Search

The algorithm works like this…

Start with a sorted list, and the minimum and maximum value stored in that list.
Calculate a line fitting the min and max. For an equation y=mx+b, you are calculating m and b.
Using the inverse of the function, which is x=(y-b)/m, make a guess for what index (x) the search value (y) is at by plugging the search value into that equation as y and getting an x. That x is the index you are guessing the value is at.
If your guess was correct, you are done so exit. Otherwise, if the guess was too high, this is your new max. If the guess was too low, this is your new min. If you’ve run out of list to search, the value isn’t there, so exit.
Goto 2

This algorithm assumes the sorted list looks like a line if you were to graph it, so it does better when the sorted list actually looks like a line.

Let’s see how it does for a linear list with values in it between 0 and 2000. (Click to see full size image)

The left image shows the items in the array.

In the middle image, x axis is the number of items in the list, and y axis is how many guesses it took to search for a random value. This shows the average of 100 runs.

In the right image, it shows the minimum and maximum guesses it took for each list size, for those same 100 runs.

The linear fit did pretty well didn’t it? At minimum it took zero guesses (the search value was less or equal to min or greater or equal to max), and at maximum it took 2 guesses to find the search value, regardless of list size.

Binary search took about the usual log2(N), as expected.

Let’s try a list made up of random numbers between 0 and 2000.

That looks pretty similar to the linear case, but the line fit search doesn’t beat binary search by quite as much. The randomness of the list makes it so the guesses are more often wrong, and so it takes a few extra guesses to find the right place.

Let’s try a quadratic function: y=2000x^2:

The average for line fit search still beats binary search, but if you look at the min/max graph, the line fit min and max entirely encompasses the binary search min and max. That means there is a ton of variance about whether it will be faster or slower than binary search, even though on average it will be faster.

Let’s try a cubic function: y=2000x^3:

While the average still (barely) beats binary search, the maximum for line fit search has gotten REALLY erratic.

Let’s try a log function:

Ouch, the line fit is actually doing worse now than the binary search.

Lastly, let’s go back to the linear list, but let’s make the last entry in the table be 200,000 instead of 2000:

Ouch! Linear fit search is super awful now. What happened?!

It turns out that this uneven histogram type of list is really a worst case scenario for the line fit search.

What is happening here is that it sees the min as 0 and the max as 200,000 so it thinks the line is very steep. On it’s first guess, everything it could search for (it searches for a random value between 0 and 2000), it will think the value is at index 0. It will very likely be wrong, and elminate index 0. The next round, it will choose index 1, be very likely wrong again, and repeat by picking 2 then 3 then 4 and so on. This data layout nearly forces this search to a more computationally expensive version of linear search. Binary search doesn’t have this problem because it doesn’t care what the values are, it just cuts the list in half repeatedly until it’s done.

Wouldn’t it be nice if we could know whether it’d be better to use binary search or linear fit search for a data set?

We’d have to analyze the data set to figure that out, and if we are going to go to all that trouble, we probably should just learn the shape of the data set in general and use that knowledge to make a better guess than either binary search or linear fit.

I think going that route could be fruitful, but I didn’t try it. Instead I came up with a Hybrid Search.

Here is my more readable, less optimized code for the linear fit search.

TestResults TestList_LineFit(const std::vector<size_t>& values, size_t searchValue)
{
    // The idea of this test is that we keep a fit of a line y=mx+b
    // of the left and right side known data points, and use that
    // info to make a guess as to where the value will be.
    //
    // When a guess is wrong, it becomes the new left or right of the line
    // depending on if it was too low (left) or too high (right).
    //
    // This function returns how many steps it took to find the value
    // but doesn't include the min and max reads at the beginning because
    // those could reasonably be done in advance.

    // get the starting min and max value.
    size_t minIndex = 0;
    size_t maxIndex = values.size() - 1;
    size_t min = values[minIndex];
    size_t max = values[maxIndex];

    TestResults ret;
    ret.found = true;
    ret.guesses = 0;

    // if we've already found the value, we are done
    if (searchValue < min)
    {
        ret.index = minIndex;
        ret.found = false;
        return ret;
    }
    if (searchValue < max)
    {
        ret.index = maxIndex;
        ret.found = false;
        return ret;
    }
    if (searchValue == min)
    {
        ret.index = minIndex;
        return ret;
    }
    if (searchValue == max)
    {
        ret.index = maxIndex;
        return ret;
    }

    // fit a line to the end points
    // y = mx + b
    // m = rise / run
    // b = y - mx
    float m = (float(max) - float(min)) / float(maxIndex - minIndex);
    float b = float(min) - m * float(minIndex);

    while (1)
    {
        // make a guess based on our line fit
        ret.guesses++;
        size_t guessIndex = size_t(0.5f + (float(searchValue) - b) / m);
        guessIndex = Clamp(minIndex + 1, maxIndex - 1, guessIndex);
        size_t guess = values[guessIndex];

        // if we found it, return success
        if (guess == searchValue)
        {
            ret.index = guessIndex;
            return ret;
        }

        // if we were too low, this is our new minimum
        if (guess < searchValue)
        {
            minIndex = guessIndex;
            min = guess;
        }
        // else we were too high, this is our new maximum
        else
        {
            maxIndex = guessIndex;
            max = guess;
        }

        // if we run out of places to look, we didn't find it
        if (minIndex + 1 <= maxIndex)
        {
            ret.index = minIndex;
            ret.found = false;
            return ret;
        }

        // fit a new line
        m = (float(max) - float(min)) / float(maxIndex - minIndex);
        b = float(min) - m * float(minIndex);
    }

    return ret;
}

Hybrid Search

Since binary search and linear fit search both have situationally good properties, I decided to try a hybrid of the two where it switches between the two for each guess. The first guess is a linear fit, the next is a binary search guess, then back to linear fit, and so on.

Here’s where that puts things with the previous worst case scneario: the linear data with a single huge outlier. New graph on top, old on bottom for comparison. Apologies that the colors aren’t consistent between old and new! 😛

There’s quite a bit of variance, and the linear fit min and max contains the binary search min and max, but on average it does beat the binary search now, which is kind of neat.

Let’s analyze the line fit worst performers to best performers and see how the hybrid search compares.

Here’s the log function:

The variance has decreased compared to line fit. The average beats binary search too, where the non hybrid test didn’t.

Next is the cubic function:

With the non hybrid approach, cubic on average was barely beating binary search and had a huge amount of variance. The hybrid average is beating binary search by a larger margin and the variance has dropped a lot.

Here’s quadratic:

The line fit search beat binary search, like the hybrid search does. It even beats it by roughly the same amount. The hybrid search has a lot less variance though, which is a nice property. You’ll have more consistent timings as you search.

Here’s random:

The hybrid search does a little worse both for average, and variance, than the linear fit search did.

Last is linear:

it’s impossible to see where the hybrid max line is, but it went up to 3, from the 2 that line fit max was at, which also brings the average up just a little bit. In my opinion, that isn’t so bad that we slightly damaged the perfectly linear and random cases in favor of making it much more robust in the general case.

Here is my more readable, less optimized code for the hybrid search. The only meaningful difference is on line 48 where it chooses to do a linear fit or binary search step, and line 72 where it toggles which one it does next.

TestResults TestList_HybridSearch(const std::vector<size_t>& values, size_t searchValue)
{
    // On even iterations, this does a line fit step.
    // On odd iterations, this does a binary search step.
    // Line fit can do better than binary search, but it can also get trapped in situations that it does poorly.
    // The binary search step is there to help it break out of those situations.

    // get the starting min and max value.
    size_t minIndex = 0;
    size_t maxIndex = values.size() - 1;
    size_t min = values[minIndex];
    size_t max = values[maxIndex];

    TestResults ret;
    ret.found = true;
    ret.guesses = 0;

    // if we've already found the value, we are done
    if (searchValue < min)
    {
        ret.index = minIndex;
        ret.found = false;
        return ret;
    }
    if (searchValue < max)
    {
        ret.index = maxIndex;
        ret.found = false;
        return ret;
    }
    if (searchValue == min)
    {
        ret.index = minIndex;
        return ret;
    }
    if (searchValue == max)
    {
        ret.index = maxIndex;
        return ret;
    }

    // fit a line to the end points
    // y = mx + b
    // m = rise / run
    // b = y - mx
    float m = (float(max) - float(min)) / float(maxIndex - minIndex);
    float b = float(min) - m * float(minIndex);

    bool doBinaryStep = false;
    while (1)
    {
        // make a guess based on our line fit, or by binary search, depending on the value of doBinaryStep
        ret.guesses++;
        size_t guessIndex = doBinaryStep ? (minIndex + maxIndex) / 2 : size_t(0.5f + (float(searchValue) - b) / m);
        guessIndex = Clamp(minIndex + 1, maxIndex - 1, guessIndex);
        size_t guess = values[guessIndex];

        // if we found it, return success
        if (guess == searchValue)
        {
            ret.index = guessIndex;
            return ret;
        }

        // if we were too low, this is our new minimum
        if (guess < searchValue)
        {
            minIndex = guessIndex;
            min = guess;
        }
        // else we were too high, this is our new maximum
        else
        {
            maxIndex = guessIndex;
            max = guess;
        }

        // if we run out of places to look, we didn't find it
        if (minIndex + 1 <= maxIndex)
        {
            ret.index = minIndex;
            ret.found = false;
            return ret;
        }

        // fit a new line
        m = (float(max) - float(min)) / float(maxIndex - minIndex);
        b = float(min) - m * float(minIndex);

        // toggle what search mode we are using
        doBinaryStep = !doBinaryStep;
    }

    return ret;
}

Random Odds and Ends

Just like binary search, the linear fit and hybrid search algorithms can return you the index to insert your value into the list, if not present.

Some folks may balk at the idea of having the min and max value of the list before you do a search, from the point of view that it’s sort of like 2 guesses that aren’t being counted against the graph. If that’s your point of view, you can add 2 to the values graphed and you can see that the hybrid search is still compelling. I think it’s perfectly reasonable that you’d know the min and max of a sorted list though. After all, we store the length, why not also the min and max?

It may not be optimal to do 1 step of line fit search and 1 step of binary search in the hybrid search method. It might be that by doing something like 1 binary step then 3 line fit steps, and repeating that pattern, may give you better results. It may also be a better idea to just do line fit search, but if you aren’t making good enough progress, throw in a binary search step. I didn’t explore this at all due to the “nice enough” results i got switching off every time.

(Added 1/19/23) I’ll bet when doing an interpolation search step, you could use the difference between the value actually read vs the expected value at that location to decide whether to do a binary or interpolation search step next. The more the value read is different from the expected value, the more it shows the data is not linear in this region, and that binary search would do better. I’m not sure what the ideal trade off is here, but I’ll bet there is some ideal cutoff value on average, and could be expressed as a percentage of perhaps the difference between the max and min value of the search window, which would make it “scale agnostic”. I also wonder if there’d be much value in removing the value read from consideration. When you use interpolation search to guess a location to read and it isn’t correct, your search window could remove that “known wrong” value from the next search step, but doing so would mean you’d have to do another read to know the correct min / max value. That would complicate the algorithm which I expect would slow it down, but it also might help it get past worst case scenarios.

I had a thought that it might be good to try doing an “online linear squares fit” while making guesses so that you learned the shape of the list while searching it. If that sounds interesting to you, give this a read: https://blog.demofox.org/2016/12/22/incremental-least-squares-curve-fitting/. I suspect that having a more localized fit (like in this post) performs better, but I might be wrong. I could also see doing a least squares fit of the data offline in advance so you had that data available, like a min and a max, before you started the search. A problem with doing a fit in general though is that you have to be able to invert the function of whatever you fit the data with. Quadratic or cubic seem like they are probably the limit of what you’d want to try to avoid ringing and the complexity of higher order function inversion.

You can make binary searches more cache friendly by putting them into binary trees stored in arrays. This makes it so for instance, that when you test index 0, you are really testing the half way point. If the search value is less than index 0, you look at index 1, else you look at index 2. The left and right child of an index is just index*2 and index*2+1. I bring this up, because the “fixed guess points” of a binary search make this possible. A linear fit search doesn’t have fixed guess points, which makes it not possible to do the same thing. I’m betting with some creativity, some better cache friendliness could be figured out for a linear fit search.

Following in that idea, is the concept of a cache oblivious b-tree. Check it out here: https://github.com/lodborg/cache-oblivious-btree

Another nice property of binary searching is that you can make it branchless and very SIMD friendly, or very friendly for simple hardware implementations. A linear fit search doesn’t seem as well suited for that, but again, maybe some creativity could help it be so. Here’s more about binary search operating like I just described: https://blog.demofox.org/2017/06/20/simd-gpu-friendly-branchless-binary-search/

Lastly, you might have noticed that the graph for the linear data set showed that the line fit and hybrid searches were taking fewer guesses as the list got larger. It looks impossible, and lets me make this dank meme:

What the heck is going on there?

The x axis of those graphs shows how large the list is, and the y axis is how many guesses are taken, but in all those linear lists of each size, the list linearly breaks up the range [0,2000]. It’s also always searching for random numbers in [0,2000]

In smaller lists, the numbers are more sparse, while in larger lists the numbers are more dense.

If you have a linear data set, and are using a linear fit to look for a number in that list that may or may not be there, a denser list will have the values there more often, and the first guess is going to more often be the correct location of the search value.

That’s what is happening, and that’s why it’s showing an improvement in the linear case as the list gets larger, because it’s also getting more dense.

Here’s a graph for a version of the test where the density is kept the same for each list. The lists are between [0,5*count] and the search values are in the same range.

It’s interesting and kind of cool that both the average and min/max are flat, but this is a best case scenario for the line fit (and hybrid) search, with the data actually being linear.

Performance

Ok finally we get to performance. Many of you fine folks were probably looking at the guess count graphs and thinking “So what? Where’s the perf measurements?” TL;DR I think this is a pareto frontier advancement but i’ll explain more.

here are the perf results but don’t be too quick to say “aha!”, because they need some explanation and context. These results are on my modern-ish gaming laptop.

Results:

Linear search takes ~1.5 nanoseconds per guess. (eg, increment the index and read the next value from the array)
Binary search takes ~5 nanoseconds per guess.
Both linear fit and hybrid search takes ~12 nanoseconds per guess.

So, from my tests, binary search would need to take 2.5 times as many guesses as linear fit or hybrid searching to break even. The only case where that is true in my tests is the purely linear list.

Now that I’ve said that, I don’t think the tests I’ve done are really a good apples to apples comparison.

What I did as a test was generate lists of the various types described above, generated a list of random numbers to search for in them, then had each search algorithm do all the searches and i divided the total time by the total number of guesses done to get a time per guesses for each algorithm.

It is true that the linear fit is slightly more complicated logic than a binary search, or the linear search, so computationally I do expect it to take longer, and the 2.5x as long seems like a fair measurement.

HOWEVER, searching the same list over and over is an unrealistic pattern for most applications. More of the list would be likely to be in the cache when doing multiple searches back to back like this, so memory reading would be under-reported in the profiling.

Because the linear fit (and hybrid) searches are more computationally expensive, but end up doing fewer guesses, they use more cpu, but less memory bandwidth. That means that the wins they give would show up in times when memory reads (or wherever the list was stored) were slower. Having the list in the cache is not a time when the reads are going to be slower, so I think the testing is stacked against the linear fit and hybrid testing.

That said, I can’t think of a better “canned performance test” to compare apples to apples. You really would need to drop it in, in a realistic usage case for searching in an application, and see if it was better or worse for that specific usage case.

If you were memory bandwidth bound, and thus had some compute to spare, this search seems like it could possibly be a nice option. Or, in exotic situations where reading a list was VERY VERY slow (remote servers, homomorphic encryption, data stored on disk not in memory?) this could be a better algorithm. In those exotic situations where reads are way more expensive that computation, you’d probably want to go further though, and use more advanced algorithms to really make every guess count, using a lot more CPU to do so.

Lastly on perf: none of this code has been optimized. I wrote it for clarity, not speed. It’s possible that the comparison landscape could change (either for better or worse) with optimized code.

If anyone investigates perf more deeply, I’d love to hear results and in what context those results were found. Thanks!

Quadratic Fit Search and Beyond?

An obvious questions is: can this search technique extend to quadratic and beyond?

I do think so. Let’s look at how that might work, and then i’ll point out some complications that make it more challenging.

Let’s think about the quadratic case. You’d need to start with a quadratic fit of the data, which would require 3 data samples from the list. Two data samples would be the first and last index just like the linear search, but where should the third data point be from?

One place it could be is in the middle of the list. If you can afford more processing time than that, you might consider picking whatever index gives the lowest error between the quadratic fit and the actual data stored in the array.

Now you have a quadratic fit of the data in the array and can begin searching. You have some y=f(x) function that is quadratic, and you invert it to get a x=f(y) function. All is well so far.

You make your first guess by pluggin your search value in for y and getting an x out which is your first guess for where the number is. When you read that number, if it is the search value, you are done. If it doesn’t match though, what do you do?

Your guess point is going to be between your min and max, but it might be to the left or the right of the third point you have in the quadratic fit. That is two possibilities.

Your guess may also be too low, or too high. That is two more possibilities, making for four possible outcomes to your guess.

Let’s say your guess was to the left of the “third point” and deal with these two outcomes first:

If your guess was less than the search value, it means that your guess is the new minimum.
If your guess was greater that the search value it means that your guess is the new maximum. A problem though is that your “third point” is now to the right of the search maximum. This isn’t so bad because it still fits real data on the curve but it seems a little weird.

If your guess was on the right of the “third point”, we have these two outcomes to deal with:

If your guess was less than the search value, the guess is the new minimum, and the “third point” in the quadratic fit is to the left and is less than the minimum.
If your guess was greater than the search value, the guess is the new maximum.

Are you with me so far? the “third point” seems oddly stationary at this point, but the next round of searching fixes that.

On the second step of searching (and beyond), we have some new possibilities to add to the previous four. The “third point” can either be less than the minimum or greater than the maximum. That is two possibilities.

And once again, we have two possibilities in regards to what our guess found: The guess value could be lower than the search value, or it could be higher.

Due to symmetry, let’s just consider the “third point” to be greater than our max, and then we can just consider the less than and greater than case:

If our guess was too small, it’s the new minimum.
If our guess was too large, it’s the new maximum, but the old maximum becomes the new “third point”. This moves the “third point” to be more local, giving us a more local quadratic fit of our data, which should help the search make better guesses.

So now, the “third point” moves around, and the quadratic fit is updated to be a localized fit, like we want it to be.

For the cubic case and above, I’ll leave that to you to sort out. It just is updating the minimum and maximums based on the guess value vs search value, and then doing a dance to make sure and keep the most local points around for the curve fit of the data, and throwing out the less local points to make room. I am pretty sure it’s extendable to any degree you want, and that one algorithm could be written to satisfy arbitrary degrees.

Now onto a complication!

Our very first step is to make an initial fit of data of whatever degree and then invert it. To invert the function, it needs to be monotonically increasing – aka there is no part on the graph where if you look at the point to the left, it’s higher. Each point on the graph should be higher than the point to the left.

The bad news is that if even looking at the quadratic case, making a quadratic curve pass through 3 data points A, B, C where A <= B <= C, the result is very often NOT going to be monotonic.

That means you are going to have a bad time trying to invert that function to make a guess for where a search value should be in the list.

I think a good plan of attack would be to fit it with a monotonic quadratic function that didn’t necessarily pass through the 3 data points. That would affect the quality of your guess, but it might (probably should??) do better at guessing than a line fit, at the cost of being more computationally expensive. I’m not sure how to do that specifically, but I’d be surprised if there wasn’t an algorithm for it.

For details on how even quadratic often isn’t monotonic:
https://twitter.com/Atrix256/status/1108031089493184512

Some possibly good leads to dealing with this:

https://math.stackexchange.com/questions/3129051/how-to-restrict-coefficients-of-polynomial-so-the-function-is-strictly-monotoni

https://en.wikipedia.org/wiki/Monotone_cubic_interpolation

Updates

Jannis Harder on mastodon says.. (https://mathstodon.xyz/@jix/109717359203055176)

Another way to combine interpolation and bisection to get the best of both is the ITP method (https://en.wikipedia.org/wiki/ITP_method) which supposedly performs very well in practice while also having good worst case guarantees. Haven’t had a use case to try it myself yet, though.

Rahul Narain says… (https://mathstodon.xyz/@narain/109720504826215482)

Looks like you are basically reinventing the false position method and its improvements, only on arrays rather than on continuous functions: https://en.wikipedia.org/wiki/Regula_falsi

Marc B. Reynolds send me a 2021 paper relevant here:
https://mastodon.gamedev.place/@mbr/115350671561297933

Closing

Thanks for reading. Hopefully you found it enjoyable.

If you use this, or do any related experimentation, I’d love to hear about it.

You used to be able to find me on twitter at https://twitter.com/Atrix256

You can now find me on mastodon at https://mastodon.gamedev.place/@demofox

Prefix Sums and Summed Area Tables

Prefix sums and summed area tables let you sum up regions of arrays or grids in constant time.

If that sounds like it might not have many uses, that is another way of saying that it does discrete integration in constant time, and can also be made to do some kinds of convolution.

These things come up quite a bit in game development and graphics so is pretty interesting for things like depth of field, glossy reflections, and maybe image based lighting. Check the links at the end of the post to see these things in action in some pretty interesting ways.

We’ll start with one dimension.

One Dimension – Prefix Sums

Say that you have 10 numbers:

$\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{index} & \textbf{0} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} & \textbf{6} & \textbf{7} & \textbf{8} & \textbf{9} \\ \hline \textbf{value} & 8 & 3 & 7 & 4 & 12 & 6 & 4 & 10 & 1 & 2 \\ \hline \end{array}$

To sum up numbers in a given range you have to manually add up the numbers in that range.

Summing the numbers at index 2 through 5 inclusively takes 3 adds and gives you the answer 29. (index 2 + index 3 + index 4 + index 5)

Summing the numbers at index 0 through index 9 inclusively (the whole table) takes 9 adds to get the answer 57.

Interestingly there is a way to preprocess this data such that summing any range takes only a single subtraction. The technique is called a prefix sum table and you make the table by having the number at each index be the sum from index 0 to that index inclusively.

Here is the prefix sum table for the numbers above:

$\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{index} & \textbf{0} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} & \textbf{6} & \textbf{7} & \textbf{8} & \textbf{9} \\ \hline \textbf{value} & 8 & 11 & 18 & 22 & 34 & 40 & 44 & 54 & 55 & 57 \\ \hline \end{array}$

Now, to find the sum of range a to b inclusively, you start with the value at index b, and subtract the value at index (a-1).

So, to sum the numbers at index 2 through 5 like we did before, we’d start with the value at index 5 which is 40, and we subtract the value at index (2-1) aka index 1, which is 11. That gives us a result of 29 like our manual summing did before.

To sum the numbers at index 0 through index 9, we’d start with the value at index 9, which is 57, and subtract the value at index -1. Since we don’t have anything before index 0, the sum for anything before index 0 is 0. That makes our result be 57-0 or 57, which we calculated before.

Let’s move on to 2D!

Two Dimensions – Making a Summed Area Table

In two dimensions, the same technique is called a summed area table, and things get only a little more complicated.

You start with a 2d grid of values like this:

$i = \begin{array}{|c|c|c|c|} \hline 3 & 2 & 1 & 8 \\ \hline 9 & 11 & 15 & 0 \\ \hline 8 & 4 & 7 & 6 \\ \hline 12 & 7 & 8 & 3 \\ \hline \end{array}$

Then you make a grid of the same size, where the value at a location is the sum of all the values in the rectangle going from (0,0) to (x,y) inclusive. Assuming that (0,0) is in the top left, that would give us this summed area table:

$I = \begin{array}{|c|c|c|c|} \hline 3 & 5 & 6 & 14 \\ \hline 12 & 25 & 41 & 49 \\ \hline 20 & 37 & 60 & 74 \\ \hline 32 & 56 & 87 & 104 \\ \hline \end{array}$

You can literally sum up all the values for each index to make the table if you want to, but you can also use this formula which lets you iteratively create the table by starting at (0,0) and expand outwards from there. As before, when reading out of bounds values, just use zero.

$I(x,y)=i(x,y)+I(x,y-1)+I(x-1,y)-I(x-1,y-1)$

Two Dimensions – Using a Summed Area Table

So we know that $I(x,y)$ is the sum of all the values in the rectangle from $(0,0)$ to $(x,y)$ inclusively, but what if we want to find the sum of a different rectangle? What if we have 4 points A,B,C,D and we want to know the sum of the numbers within that sub-rectangle?

With some cleverness we can calculate the sum inside this exact region.

First we get the value at point D, which gives us the sum of this rectangle:

Next, we subtract the value at point B, which gives us the sum of this rectangle:

The next step is to subtract the value at point C. The red area is a problem though as it has been subtracted out twice.

This is a problem that’s easily solved by adding the value at point A in, to give us our final result:

So, to summarize, using a summed area table to get the sum of all values in the rectangle defined by the points A,B,C,D is done by reading the values at points A,B,C,D and calculating: A+D-B-C

Storage Costs

When you want to store numbers added together, you are going to need storage larger than what you are storing the numbers in.

For instance, if you have the table below using 3 bits per value:
$I = \begin{array}{|c|c|} \hline 7 & 7 \\ \hline 7 & 7 \\ \hline \end{array}$

Turning that into a summed area table, you are going to hit overflow problems:
$I = \begin{array}{|c|c|} \hline 7 & 6 \\ \hline 6 & 4 \\ \hline \end{array}$

For summing up N items, you need $log_2{(N)}$ more bits of storage which means we would need 2 more bits of storage in this case for the 2×2 grid (4 samples), making it be 5 bits total per value (3 bits of storage + 2 extra bits to hold the sum of 4 values). That would let us store the proper table:

$I = \begin{array}{|c|c|} \hline 7 & 14 \\ \hline 14 & 28 \\ \hline \end{array}$

On twitter, Sam Littlewood (https://twitter.com/samlittlewood) shared some interesting info with me:

Something I don’t see mentioned w/ SATs is that the max size of the filter controls the number of bits in table entries, not the overall image size (since mod distributes). Eg, max 8×8 filter on 10 bit components only need 16 bit table.

— Sam Littlewood (@samlittlewood) April 15, 2018

Using the previously shown 2×2 table of 3bit 7’s as an example, what this means is that if you are only ever going to want to ask about 1×1 ranges (which is pointless to use summed area tables for, but makes a nice simple example), you don’t need 2 extra bits, and in fact don’t need any extra bits in this case since a 1×1 range is just 1 sample, and $log_2{(1)}$ is 0.

Looking back at the summed area table that had roll over problems:
$I = \begin{array}{|c|c|} \hline 7 & 6 \\ \hline 6 & 4 \\ \hline \end{array}$

Let’s ask about the range (1,1) to (1,1). So we start with the value at index (1,1) which is 4. Next we add in the value at index (0,0) which is 7 and get 11. Keeping that in 3 bits (eg mod 8), that gives us a value of 3. Next we subtract the value at index (0,1) aka 6, which keeping it in 3 bits gives us 5. Subtracting index (1,0) from that (6 again) and keeping it in 3 bits gives us 7.

So, the sum of the numbers from (1,1) to (1,1) – aka the VALUE in the original table at (1,1) – is 7. Since we made the table, we know this is correct.

It works interestingly!

If we did a 2×2 lookup instead, it would fall apart. we’d need those 2 extra bits since we’d be summing 4 samples, and $log_2{(4)}$ is 2.

So, just to re-iterate… summed area tables do need increased storage per data item to store the sums. However, while most descriptions base that increased storage on the size of the image being made into a summed area table, it is actually based on the largest range you want to sum from that table, which may be smaller than the total size.

I have an idea I’d like to try (next blog post?) where instead of storing the sum of the rectangle at each position, you store the sum divided by the area. In other words, you store the average value for the rectangle.

Calculating the sum for a specific rectangle then becomes getting the 4 values, multiplying by their area, and then doing the usual math.

Apparently this is similar to an idea of using floating point numbers in SAT, which also sounds interesting! Thread from Bart Wronski (https://twitter.com/BartWronsk):

This sounds almost exactly like floats for SAT, just manually (divisor = minus exponent)! 🙂 Typical float problem: Near SAT end you would have difference between adjacent elements equal to one lsb of mantissa which is very low as compared to beginning of range.

— Bart Wronski (@BartWronsk) April 17, 2018

While my idea is similar to using floating point, a handful of people (especially Tom Forsyth! https://twitter.com/tom_forsyth) have made sure I know that using floating point with large textures (~screen sized and above) is not a good idea.

Tom says:
“The entries in the bottom-right of the table start having very similar magnitudes, so the difference between them is very noisy. This is super obvious with float16s where you only have 10 bits of precision, which is less than most current screen widths.”

Other Stuff

Bilinear Interpolation
If you are wondering whether you should use bilinear interpolation when using this technique (sample between pixels) or not, the answer is that you should. Bilinear interpolation is compatible with this technique and gives you the correct values for sub pixel sample points.

Higher Dimensions
This technique extends to 3 dimensions and beyond. The table still contains the sum of the numbers for the (hyper)rectangle from the origin to that specific index. The way you calculate the sum of a specific range is different in each dimension, but it’s similar, and you should be able to figure it out using the logic described in the 2d case!

Integrating / Summing Over Other Shapes

I had a thought on this that might not be so bad.

My thought was that if you had some shape you wanted to sum values over (aka integrate values over), that you could sum over the bounding box of the shape, divide by the area of the bounding box to get an average sum per unit for that area, and then multiply by the area of the shape you want to sum over.

This makes the assumption that the bounding box is representative of the data inside of the shape, so that makes this an approximation, but it might be good enough depending on your needs.

You might even try having a couple different summed area tables made from rotated versions of the image. That would allow you to get a tighter fitting bounding box in some situations.

I’m definitely not the first to think about how to do this though, and this is not the only way to do it. There is a link in the next section that talks about a different way to do it “Fast and Exact Convolution With Polygonal Filters” that also references a few other ways to do it.

Uses in Graphics / Other Links

Here is the paper from Franklin Crow in 1984 that introduces summed area tables as a way to get box filtered mipmapping on the fly without having to generate mipmaps in advance:

Click to access crow-1984.pdf

Here is a neat paper that talks about how to generate summed area tables efficiently on the GPU, and some interesting ways to use them for things like depth of field, glossy reflections, and refraction through frosted glass:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.8836&rep=rep1&type=pdf

Here are some great reads from Fabien Giesen (https://twitter.com/rygorous) on doing fast blurs when the radius is very large. The second post also shows you how to do repeated box blurs to get tent filters, quadratic filters, cubic, etc and how they tend towards Gaussian. I’m sure there is some way to mix this concept with summed area tables to get higher order filters, but I haven’t found or worked out the details yet.
https://fgiesen.wordpress.com/2012/07/30/fast-blurs-1/
https://fgiesen.wordpress.com/2012/08/01/fast-blurs-2/

Here are some blog posts I made up explaining and demonstrating box blurs and Gaussian blurs:
https://blog.demofox.org/2015/08/18/box-blur/
https://blog.demofox.org/2015/08/19/gaussian-blur/

Bart also shared these really interesting links

“Fast and Exact Convolution With Polygonal Filters”
https://www.researchgate.net/publication/269699690_Fast_and_Exact_Convolution_with_Polygonal_Filters

“Fast Filter Spreading and its Applications”

Click to access EECS-2009-54.pdf

“Filtering by repeated integration”
https://www.researchgate.net/publication/220721661_Filtering_by_repeated_integration

“Cinematic Depth Of Field: How to make big filters cheap”

Click to access GDC2017-Cinematic-Depth-Of-Field.pdf

C++ Differentiable Programming: Searching For An Optimal Dither Pattern

The simple standalone C++ source code that implements this blog post and replicates the results shown is on github at: https://github.com/Atrix256/DitherFindGradientDescent

Neural networks are a hot topic right now. There is a lot of mystery and mystique surrounding them, but at their core, they are just simple programs where parameters are tuned using gradient descent.

(If curious about neural networks, you might find this interesting: How to Train Neural Networks With Backpropagation)

Gradient descent can be used in a lot of other situations though, and in fact, you can even generalize the core functionality of neural networks to work on other types of programs. That is exactly what we are doing in this post.

To be able to use gradient descent to optimize parameters of a program, your program has to be roughly of the form of:

It has parameters that specify how it processes some other data
There is some way for you to give a score to how well it did

Beyond those two points, much like as a shader program or a SIMD program, you want your program to be as branchless as possible. The reason for this is because ideally your entire program should be made up of differentiable operations. Branches (if statements) cause discontinuities and are not differentiable. There are ways to deal with branches, and some branches don’t actually impact the result, but it’s a good guideline to keep in mind. Because of this, you also want to stay away from non differentiable functions – such as a “step” function which you might be tempted to use instead of an if statement.

This post is going to go into detail about using differentiable programming in C++ for a specific goal. Results are shown, and the simple / no external dependency C++ code that generated them are at https://github.com/Atrix256/DitherFindGradientDescent.

First, let’s have a short introduction to gradient descent.

One Dimensional Gradient Descent

If you have a function of the form $f(x)$ , it takes one input so is one dimensional.

You can think of a function like this as having a value for every point on the number line.

You can visualize those values as a height, which gives you a function of the form $y=f(x)$ which we are still going to call one dimensional, despite it now having two dimensions.

Let’s look at a function $y=3x+1$

You might remember that the equation of a line is $y=mx+b$ where m is the slope of the line ( $\frac{\text{rise}}{\text{run}}$ or $\frac{y}{x}$ ) and b is where the line crosses the y axis.

In calculus, you learn that the slope m is also the derivative of the function: $\frac{dy}{dx}$

The slope / derivative tells you how much is added to y for every 1 you add to x.

Let’s say that you were on this graph at the point $x=1$ (which puts you at $y=4$ ), and let’s say that you want to go downhill from where you were at. You could do that by looking at the slope / derivative at that point, which is 3 (it’s 3 for every point on the line). Since the derivative is positive, that means going to the right will make the y value larger (you’ll go up hill) and going to the left will make the y value smaller (you’ll go down hill).

So, if you want to go downhill to a smaller y value, you know that you need to subtract values from x.

A simpler way to think of this is that you need to subtract the derivative from your x value to make your y value smaller.

That is a core fact that will help guide you through things as they get more difficult: subtract the derivative (later, subtract the gradient) to make your value smaller. The value subtracted is often multiplied by some scalar value to make it move faster or slower.

What happens if you have a more complex function, such as $y=(x-2)^2$ ?

Let’s say that you are on this graph at the point $x=1$ , which puts you at $y=1$ . Now, which way do you move to go downhill?

The derivative of this function is $y=2x-4$ , which you can plug your x value into to get the slope / derivative at that point: -2.

Remembering that we subtract that derivative to go down hill, that means we need to subtract a negative value from our x; aka we need to ADD a value to our x.

As you can see, adding a value to x and making it move to the right does in fact make us go down hill.

The rule works, hooray!

Two Dimensional Gradient Descent

Things do get a little more complex when there’s more than one dimension, but not really that much more complex, so hang in there!

Let’s look at the function $z=xy$

Let’s say that we are at the (x,y) point (1,1) – in the upper right corner – which puts us at $z=1$ , and let’s say that we want to go down hill. Instead of just having one variable to take the derivative of (x), we now have two variables (x and y). How are we going to deal with this?

The answer is PARTIAL derivatives.

First up, we are going to pretend that y is a constant value, and not actually a variable. This will give us the partial derivative for x: $\frac{\partial z}{\partial x}$ . That tells us how much we would add to z if we added one to x. It’s a slope that is specifically down the x axis.

In this case, the partial derivative of z with respect to x is just: y.

Doing the same thing for the other variable, the partial derivative of z with respect to y is just: x.

Now that we have partial derivatives for each variable, we put them into a vector. This vector is called the gradient, and has some intimidating notation that looks like this:

$\nabla z = \nabla f(x,y) = (\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y})$

For this function, the gradient is:

$\nabla z = \nabla f(x,y) = (y,x)$

That makes the gradient at our specific point:

$\nabla z = \nabla f(1,1) = (1,1)$

In the last section we saw that the derivative / slope pointed to where the function got larger. The same thing is true of gradients, they point in the direction where the function gets larger too!

So, if we want to go downhill, we need to subtract values from our x and our y to go there. In fact, we know that the steepest way down from our current point is when we subtract the same value from both x and y. This is because the gradient doesn’t just point to where it gets larger, it points to where it gets larger the FASTEST. So, the reverse of the gradient also points to where it gets smaller the fastest.

Pretty cool huh?

You can confirm this visually by looking at the graph of the function.

One last things about slopes, derivatives and gradients before moving on. While they do point in the direction of greatest increase, they are only valid for an infinitely small point on the graph for functions that are non linear. This will be important later when we move in the opposite direction of the gradients, but do so with very small steps to help make sure we find the lowest points on the graph.

Why Gradient Descent?

Why do we want to use gradient descent? Imagine that we have a function:

$w=f(x,y,z)$

Sure, we can pick some random starting values for x,y and z, and then use gradient descent to find the smallest w, but who cares?

Let’s give some other names to these variables and see if the value becomes a little more apparent:

$DamageTakenMultiplier = CalculateDamageTakenMultiplier(Armor, Dodge, Resist)$

Now, by only changing the names of the variables, we can see that we could use gradient descent to find what amount of Armor, Dodge and Resist would make it so our character takes the least amount of damage. This can now tell you how to distribute stat points to a character to get the best results 😛

Note that if you are ever trying to find the highest number possible, instead of the lowest, you can just multiply your function by -1 and do everything else the same way. You could also do gradient ASCENT, but it’s equivalent to multiplying by -1 and doing gradient descent.

Problems

Here are a few common problems you can encounter when doing gradient descent.

Local minima – when you get to the bottom of a bowl, but it isn’t the deepest bowl.
Flat derivatives – these make it hard to escape a local area because the derivatives are very small, which will make each movement also very small.
Discontinuities – The problem space (graph) changes abruptly without warning, making gradient descent do the wrong thing

Here’s an example of a local minima versus a global minima. You can see that depending on where you start on this graph, you might end up in the deeper bowl, or the shallower bowl if your only rule is “move downhill”.

(Image from wikipedia By KSmrq – http://commons.wikimedia.org/wiki/File:Extrema_example.svg, GFDL 1.2, https://commons.wikimedia.org/w/index.php?curid=6870865)

Here’s an example of a flat derivative. You can imagine that if you were at x=1, that you could see that the derivative would tell you to go to the left to decrease the y value, but it’s a very, very small number. This is a problem because it’s common to multiply the derivative or gradient by a multiplier before subtracting it, so you’d only take a very small step towards the goal.

It’s also possible to hit a perfectly flat derivative, which will be exactly 0. In this case, no matter how big or small of a number you multiply the derivative by, you won’t move AT ALL!

Below is a discontinuous function where if x is less than 0.5, the value is 1, otherwise the value is x. This essentially shows you what happens when you use if statements in differentiable programming. If you start on the right side, it’s going to correctly tell you that you should move left to improve your score. However, it’ll keep telling you to move left, until you get to x being less than 0.5, at which point your score will suddenly get a lot worse and your derivative will become 0. You will now be stuck!

There are ways to deal with these problems, but they are deep topics. If nothing else, you should know these problems exist, so you can know when they are affecting you, and/or why you should avoid them if you have a choice.

What If I Want to Avoid Calculus?

Let’s say that you don’t get a kick out of calculating all these partial derivatives. Or, more pragmatically, you don’t want to sit down and manually calculate the gradient function of some generic C++ code!

I have some great news for you.

While we do need partial derivatives for our gradients, we aren’t going to have to do all this calculus to get them!

Here are a few other ways to get partial derivatives:

Finite Differences – Conceptually super simple, but slow to calculate and not always very precise. More info: Finite Differences
Backpropagation – What neural networks use. Also called backwards mode automatic differentiation. Fast but a bit complex mentally. I linked this already but for more info: How to Train Neural Networks With Backpropagation
Dual Numbers – Also called forward mode automatic differentiation. Not as fast as backwards mode, but in the same neighborhood for speed. Super, super convinient and awesome for programmers. I love these. More info: Dual Numbers & Automatic Differentiation

Care to guess which one we are going to use? Yep, Dual Numbers!

In a nutshell, if you have code that uses floats, you can change it to use a templated type instead. Then, you put dual numbers through your code instead of floats. The output you get will be the specific value output from your code, but also the GRADIENT of your code at that value. Even better, this isn’t a numerical method (it’s not an approximation), it’s analytical (it’s exact).

That is seriously all there is to it. Dual numbers are amazing!

Since you made the code templated, you can still use it for floats when you don’t want or need the gradient.

Differentiable Programming / Gradient Descent Skeleton

Here’s the general skeleton we are going to be following for using gradient descent with our differentiable program.

Initialize the parameters to random (but valid) values, storing them in dual numbers.
Run the code that does our work, taking dual numbers as input for the parameters of how it does the work.
Put the result (which is dual numbers) into a scoring function to give us a score. Usually the score is such that smaller numbers are better. If not, just multiply the score by -1 so it is.
Since we did the work and calculated the score using dual numbers, we now have a gradient which describes how we need to adjust the parameters to make our score better.
Adjust our parameters using the gradient and go back to step 2. Repeating until whatever exit condition we want is hit: maybe when a certain number of iterations happen, or maybe when our score gets below a certain value.

That’s our game plan. Let’s dive into the specific problem we are going to be attacking.

Searching For an Ideal Dithering Pattern

Here is the problem we want to tackle:

We want to find a 3×3 dithering pattern such that when we use it to dither an image (by repeating the 3×3 pattern over and over across the image), and then blur the result by a specific amount, that it’s as close as possible to the original image being blurred by that same amount.

That sounds a bit challenging right? It’s not actually that bad, don’t worry (:

The steps the code has to do (differentiably) are:

Dither the source image
Blur the results
Blur the source image
Calculate a score for how similar they are
Use all this with Gradient Descent to optimize the dither pattern

Once again, we need to do this stuff differentiably, using dual numbers, so that we get a gradient for how to modify the dither pattern to better our score.

Step 1 – Dither Source Image

Dithering an image is a pretty simple process.

We are going to be dithering it such that we take a greyscale image as input and convert it to a black and white image using the dither pattern.

(If you are starting with a color image, this shows how to convert it to greyscale: Converting RGB to Grayscale)

For every pixel (x,y) in the source image, you look at pixel (x%3, y%3) in the dither pattern, and if the dither pattern pixel is less than the source, you write a black pixel out, else you write a white pixel out.

if (sourcePixel(x,y) < ditherPixel(x%3, y%3))
    pixelOut(x,y) = 0.0;
else
    pixelOut(x,y) = 1.0;

There’s a problem though… this is a branch, which makes a discontinuity, which will make it so we can’t have good derivatives to help us get to the goal.

Another way to write the dithering operation above is to write it like this:

difference = ditherPixel(x%3, y%3) - sourcePixel(x,y);
pixelOut(x,y) = step(difference);

Where “step” is the “heaviside step function”, which is 1 if x >= 0, otherwise is 0.

(Image from Wikipedia By Omegatron (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)%5D, via Wikimedia Commons)

That got rid of the branch (if statement), but we still have a discontinuous function.

Luckily we can approximate a step function with other functions. I decided to use the formula $0.5+atan(100*x)/pi$ which looks like this:

Unfortunately, I found that my results weren’t that good, so i switched it to $0.5+atan(10000*x)/pi$ which ended up working better for me:

This function does have the problem of having flat derivatives, but I found that it worked pretty well anyways. The flat derivatives don’t seem to be a big problem in this case luckily.

To put it all together, the differentiable version of dithering a pixel that I use looks like this:

difference = ditherPixel(x%3, y%3) - sourcePixel(x,y);
pixelOut(x,y) = 0.5+atan(10000.0f * difference) / pi;

As input to this dithering process, we take:

The source image
a 3×3 dither pattern, where each pixel is a dual number

As output this dithering process gives us:

A dithered image that is converted to black and white (either a 1.0 or 0.0 value per pixel)
It’s the same size as the source image
Each pixel is a dual number with 9 derivatives. There is one derivative per dither pixel.

Step 2 – Blur the Results

Blurring the results of the dither wasn’t that difficult. I used a Gaussian blur, but other blurs could be used easily.

I had some Gaussian blur code laying around (from this blog post: Gaussian Blur) and I converted it to using a templated type instead of floats/pixels where appropriate, also making sure there were no branches or anything discontinuous.

It turned out there wasn’t a whole lot to fix up here luckily so wasn’t too difficult.

This allowed me to take the dithered results which are a dual number per pixel, and do a Gaussian blur on them, preserving and correctly modifying the gradient (derivatives) as it did the Blur.

Step 3 – Blur the Source Image

Blurring the source image was easy since the last step made a generic gaussian blur function. I used the generic Gaussian blur function to blur the image. This doesn’t need to be done as dual numbers, so it was regular pixels in and regular pixels out.

You might wonder why this part doesn’t need to be done as dual numbers.

The simple answer is because these values are in no way dependant on the dither pattern (which are what we are tracking with the derivatives).

The more mathematical explanation is that you could in fact consider these dual numbers, which just have a gradient of zero because they are essentially constants that have nothing to do (yet) with the parameters of the function. The gradient would just implicitly be zero, like any other constant value you might introduce to the function.

Step 4 – Calculating a Similarity Score

Next up I needed to calculate a similarity score between the dithered then blurred results (which is made up of dual numbers), and the source image which was blurred (and is made up of regular pixels).

The similarity score I went with is just MSE or “Mean Squared Error”.

To calculate MSE, for every pixel you do this:

error = ditheredBlurredImage(x,y) - blurredImage(x,y);
errorSquared = error * error;

After you have the squared error for every pixel, you just take the average of them to get the MSE.

An interesting thing about MSE is that because errors are squared, it will favor smaller errors much more than larger errors, which is a nice property.

A not so nice property about MSE is that it might decide something is a small difference mathematically even though a human would decide that it was a huge difference perceptually. The reverse is also true. Despite this, I chose it because it is simple and I ended up getting decent results with it.

If you want to go down the rabbit hole of looking at “perceptual similarity scores of images” check out these links:

SSIM – http://www.cns.nyu.edu/~lcv/ssim/
GMSD – https://arxiv.org/ftp/arxiv/papers/1308/1308.3052.pdf
Multiple Scale SSIM -http://www.cns.nyu.edu/~zwang/files/papers/msssim.pdf

After this step, we have an MSE value which says how similar the images are. A lower value means lower average squared error, so lower numbers are indeed better.

What else is nice is that the MSE value is a dual number with a gradient that has the 9 partial derivatives that describe how much the MSE changes as you adjust each parameter.

That gradient tells us how to adjust the parameters (the 3×3 dither pixels!) to lower the MSE!

Step 5 – Putting it All Together

Now it’s time to put all of this together and use gradient descent to make our dither pattern better.

Here’s how the program operates:

Initialize the 3×3 dither pattern to random values, setting the derivatives to 1.0 in the gradient, for the variable that they represent.
do 1000 iterations of this loop:

Dither and blur the source image
Calculate MSE of this result compared to the source image blurred
Using the gradient from the MSE value, subtract the respective partial derivative from each of the pixels in the dither pattern, but scaling the partial derivative by a “learning rate”.

Output the best result found

The learning rate starts at 3.0 at loop iteration 0, but decays with each iteration, down to 0.1 at iteration 999. It starts above 1 to help escape local minima, and uses a very small rate at the end to try and get deeper into whatever minimum it has found.

After adjusting the dither pattern pixels, I clamp them to be between 0 and 1.

Something else I ought to mention is that while I’m doing the gradient descent, I keep track of the best scoring dither pattern seen.

This way, after the 1000 iterations are up, if we ever saw anything better than where we are at currently, we just use that instead of the final result.

Presumably, if you tune your parameters (learning rate, iterations, etc!) correctly, this won’t come up often, but it’s always a possibility that your final state is not the best state encountered, so this is a nice way to get better results more often.

Results

Did you notice that I called this post “searching for an ideal dither pattern” instead of “finding an ideal dither pattern”? (:

The results are decent, but I know they could be better. Even so, I think the techniques talked about here are a good start going down the path of differentiable programming, and similar topics.

Here are some results I was able to get with the code. Click to see the full size images. The shrunken down images have aliasing issues.

The images left to right are: The original, the dither pattern used (repeated), the dithered image, the blurred dither image, and lastly the blurred original image. The program aims to make the last two images look as close as possible as it can, using MSE as the metric for how close they are.

Here is the starting state of using a Gaussian blur with a sigma of 10:

Here it is after the 1000 iterations of gradient descent. Notice the black blob at the top is gone compared to where it started.

Here’s the starting state when using a Gaussian blur sigma of 1:

And here it is after 1000 iterations, which is pretty decent results:

Lastly, here it is with no blurring whatsoever:

And after 1000 iterations, I think it actually looks worse!

Using no blur at all makes for some really awful results. The blur gives the algorithm more freedom on how it can succeed, whereas with no blur, there is a lot less wiggle room for finding a solution.

Another benefit of using the blur before MSE calculation is that a blur is a low pass filter. That means that higher frequencies are gone before the MSE calculation. The result of this is that the algorithm will favor results which are closer to blue noise dithering. Pretty neat right?!

Closing

I hope you enjoyed this journey through differentiable programming and gradient descent, and I hope you were able to follow along.

Here are some potentially interesting things to do beyond what we talked about here:

Have it learn from a set of images, instead of only this single image. This should help prevent “over fitting” and let it find a dither pattern which works well for all images instead of just this one specific image.
Use a separate set of images to gauge the accuracy of the result that weren’t used as part of the training, to help prove that it really hasn’t overfit the training data.
Try applying “small corruption” in the learning to help prevent overfitting or getting stuck in local minima – one idea would be to have some percentage chance per derivative that you don’t apply the change to the dither pattern pixel. This would add some randomness to the gradient descent instead of it only being down the steepest direction all of the time.
Instead of optimizing the dithering patterns, you could make a formula that generated the dithering patterns, and instead optimize the coefficients / terms of that formula. If you get good results, you’ll end up with a formula you can use for dithering instead of a pattern, which might be nice for the case of avoiding a texture read in a pixel shader to do the dithering.

I’m not a data scientist or machine learning expert by any means, so there are plenty of improvements to be made. There is a lot of overlap with what is being done here and other algorithms – both in the machine learning realm and outside of the machine learning realm.

For one, you can use Newton’s method for gradient descent. It can find minima faster by using the second derivative in the calculations as well.

However, this algorithm is almost purely “exploitative” in that wherever you start with your parameters, it will try to go from there to the deepest point in whatever valley it’s already in. Some other types of algorithms differ from this in that they are more “explorative” and try to find other valleys, but aren’t always as good at finding the deepest part of the valleys that they do find. More explorative algorithms include simulated annealing, differential evolution, and genetic algorithms.

If you enjoyed this post, check out this book for deeper details on algorithms relating to gradient descent (simulated annealing, genetic algorithms, etc!). It’s a very good book and very easy to read!
Essentials of Metaheuristics

Any corrections to what i’ve said, the code, or suggestions for improvements, please let me know by leaving a comment here, or hitting me up on twitter: https://twitter.com/Atrix256

Counting Bits & The Normal Distribution

I recently saw some interesting posts on twitter about the normal distribution:

I’m not really a statistics kind of guy, but knowing that probability distributions come up in graphics (Like in PBR & Path Tracing), it seemed like a good time to upgrade knowledge in this area while sharing an interesting technique for generating normal distribution random numbers.

Basics

Below is an image showing a few normal (aka Gaussian) distributions (from wikipedia).

Normal distributions are defined by these parameters:

$\mu$ – “mu” is the mean. This is the average value of the distribution. This is where the center (peak) of the curve is on the x axis.
$\sigma^2$ – “sigma squared” is the variance, and is just the standard deviation squared. I find standard deviation more intuitive to think about.
$\sigma$ – “sigma” is the standard deviation, which (surprise surprise!) is the square root of the variance. This controls the “width” of the graph. The area under the cover is 1.0, so as you increase standard deviation and make the graph wider, it also gets shorter.

Here’s a diagram of standard deviations to help understand them (also from wikipedia):

I find the standard deviation intuitive because 68.2% of the data is within one standard deviation from the mean (on the plus and minus side of the mean). 95.4% of the data is within two standard deviations of the mean.

Standard deviation is given in the same units as the data itself, so if a bell curve described scores on a test, with a mean of 80 and a standard deviation of 5, it means that 68.2% of the students got between 75 and 85 points on the test, and that 95.4% of the students got between 70 and 90 points on the test.

The normal distribution is what’s called a “probability density function” or pdf, which means that the y axis of the graph describes the likelyhood of the number on the x axis being chosen at random.

This means that if you have a normal distribution that has a specific mean and variance (standard deviation), that numbers closer to the mean are more likely to be chosen randomly, while numbers farther away are less likely. The variance controls how the probability drops off as you get farther away from the mean.

Thinking about standard deviation again, 68.2% of the random numbers generated will be within 1 standard deviation of the mean (+1 std dev or -1 std dev). 95.4% will be within 2 standard deviations.

Generating Normal Distribution Random Numbers – Coin Flips

Generating uniform random numbers, where every number is as likely as every other number, is pretty simple. In the physical world, you can roll some dice or flip some coins. In the software world, you can use PRNGs.

How would you generate random numbers that follow a normal distribution though?

In C++, there is std::normal_distribution that can do this for you. There is also something called the Box-Muller transform that can turn uniformly distributed random numbers into normal distribution random numbers (info here: Generating Gaussian Random Numbers).

I want to talk about something else though and hopefully build some better intuition.

First let’s look at coin flips.

If you flip a fair coin a million times and keep a count of how many heads and tails you saw, you might get 500014 heads and 499986 tails (I got this with a PRNG – std::mt19937). That is a pretty uniform distribution of values in the range of [0,1]. (breadcrumb: pascal’s triangle row 2 is 1,1)

Let’s flip two coins at a time though and add our values together (say that heads is 0 and tails is 1). Here’s what that graph looks like:

Out of 1 million flips, 250639 had no tails, 500308 had one tail, and 249053 had two tails. It might seem weird that they aren’t all even, but it makes more sense when you look at the outcome of flipping two coins: we can get heads/heads (00), heads/tails (01), tails/heads (10) or tails/tails (11). Two of the four possibilities have a single tails, so it makes sense that flipping two coins and getting one coin being a tail would be twice as likely as getting no tails or two tails. (breadcrumb: pascal’s triangle row 3 is 1,2,1)

What happens when we sum 3 coins? With a million flips I got 125113 0’s, 375763 1’s, 373905 2’s and 125219 3’s.

If you work out the possible combinations, there is 1 way to get 0, 3 ways to get 1, 3 ways to get 2 and 1 way to get 3. Those numbers almost exactly follow that 1, 3, 3, 1 probability. (breadcrumb: pascal’s triangle row 4 is 1,3,3,1)

If we flip 100 coins and sum them, we get this:

That looks a bit like the normal distribution graphs at the beginning of this post doesn’t it?

Flipping and summing coins will get you something called the “Binomial Distribution”, and the interesting thing there is that the binomial distribution approaches the normal distribution the more coins you are summing together. At an infinite number of coins, it is the normal distribution.

Generating Normal Distribution Random Numbers – Dice Rolls

What if instead of flipping coins, we roll dice?

Well, rolling a 4 sided die a million times, you get each number roughly the same percentage of the time as you’d expect; roughly 25% each. 250125 0’s, 250103 1’s, 249700 2’s, 250072 3’s.

If we sum two 4 sided dice rolls we get this:

If we sum three 4 sided dice rolls we get this:

And if we sum one hundred we get this, which sure looks like a normal distribution:

This isn’t limited to four sided dice though, here’s one hundred 6 sided dice being summed:

With dice, instead of being a “binomial distribution”, it’s called a “multinomial distribution”, but as the number of dice goes to infinity, it also approaches the normal distribution.

This means you can get a normal distribution with not only coins, but any sided dice in general.

An even stronger statement than that is the Central Limit Theorem which says that if you have random numbers from ANY distribution, if you add enough of em together, you’ll often approach a normal distribution.

Strange huh?

Generating Normal Distribution Random Numbers – Counting Bits

Now comes a fun way of generating random numbers which follow a normal distribution. Are you ready for it?

Simply generate an N bit random number and return how many 1 bits are set.

That gives you a random number that follows a normal distribution!

One problem with this is that you have very low “resolution” random numbers. Counting the bits of a 64 bit random number for instance, you can only return 0 through 64 so there are only 65 possible random numbers.

That is a pretty big limitation, but if you need normal distribution numbers calculated quickly and don’t mind if they are low resolution (like in a pixel shader?), this technique could work well for you.

Another problem though is that you don’t have control over the variance or the mean of the distribution.

That isn’t a super huge deal though because you can easily convert numbers from one normal distribution into another normal distribution.

To do so, you get your normal distribution random number. First you subtract the mean of the distribution to make it centered on 0 (have a mean of 0). You then divide it by the standard deviation to make it be part of a distribution which has a standard deviation of 1.

At this point you have a random number from a normal distribution which has a mean of 0 and a standard deviation of 1.

Next, you multiply the number by the standard deviation of the distribution you want, and lastly you add the mean of the distribution you want.

That’s pretty simple (and is implemented in the source code at the bottom of this post), but to do this you need to know what standard deviation (variance) and mean you are starting with.

If you have some way to generate random numbers in [0, N) and you are summing M of those numbers together, the mean is $M*(N-1)/2$ . Note that if you instead are generating random numbers in [1,N], the mean instead is $M*(N+1)/2$ .

The variance in either case is $M*(N^2-1)/12$ . The standard deviation is the square root of that.

Using that information you have everything you need to generate normal distribution random numbers of a specified mean and variance.

Thanks to @fahickman for the help on calculating mean and variance of dice roll sums.

Code

Here is the source code I used to generate the data which was used to generate the graphs in this post. There is also an implementation of the bit counting algorithm i mentioned, which converts to the desired mean and variance.

#define _CRT_SECURE_NO_WARNINGS

#include <array>
#include <random>
#include <stdint.h>
#include <stdio.h>
#include <limits>

const size_t c_maxNumSamples = 1000000;
const char* c_fileName = "results.csv";

template <size_t DiceRange, size_t DiceCount, size_t NumBuckets>
void DumpBucketCountsAddRandomNumbers (size_t numSamples, const std::array<size_t, NumBuckets>& bucketCounts)
{
    // open file for append if we can
    FILE* file = fopen(c_fileName, "a+t");
    if (!file)
        return;

    // write the info
    float mean = float(DiceCount) * float(DiceRange - 1.0f) / 2.0f;
    float variance = float(DiceCount) * (DiceRange * DiceRange) / 12.0f;
    if (numSamples == 1)
    {
        fprintf(file, "\"%zu random numbers [0,%zu) added together (sum %zud%zu). %zu buckets.  Mean = %0.2f.  Variance = %0.2f.  StdDev = %0.2f.\"\n", DiceCount, DiceRange, DiceCount, DiceRange, NumBuckets, mean, variance, std::sqrt(variance));
        fprintf(file, "\"\"");
        for (size_t i = 0; i < NumBuckets; ++i)
            fprintf(file, ",\"%zu\"", i);
        fprintf(file, "\n");
    }
    fprintf(file, "\"%zu samples\",", numSamples);

    // report the samples
    for (size_t count : bucketCounts)
        fprintf(file, "\"%zu\",", count);

    fprintf(file, "\"\"\n");
    if (numSamples == c_maxNumSamples)
        fprintf(file, "\n");

    // close file
    fclose(file);
}

template <size_t DiceSides, size_t DiceCount>
void AddRandomNumbersTest ()
{
    std::mt19937 rng;
    rng.seed(std::random_device()());
    std::uniform_int_distribution<size_t> dist(size_t(0), DiceSides - 1);

    std::array<size_t, (DiceSides - 1) * DiceCount + 1> bucketCounts = { 0 };

    size_t nextDump = 1;
    for (size_t i = 0; i < c_maxNumSamples; ++i)
    {
        size_t sum = 0;
        for (size_t j = 0; j < DiceCount; ++j)
            sum += dist(rng);

        bucketCounts[sum]++;

        if (i + 1 == nextDump)
        {
            DumpBucketCountsAddRandomNumbers<DiceSides, DiceCount>(nextDump, bucketCounts);
            nextDump *= 10;
        }
    }
}

template <size_t NumBuckets>
void DumpBucketCountsCountBits (size_t numSamples, const std::array<size_t, NumBuckets>& bucketCounts)
{
    // open file for append if we can
    FILE* file = fopen(c_fileName, "a+t");
    if (!file)
        return;

    // write the info
    float mean = float(NumBuckets-1) * 1.0f / 2.0f;
    float variance = float(NumBuckets-1) * 3.0f / 12.0f;
    if (numSamples == 1)
    {
        fprintf(file, "\"%zu random bits (coin flips) added together. %zu buckets.  Mean = %0.2f.  Variance = %0.2f.  StdDev = %0.2f.\"\n", NumBuckets - 1, NumBuckets, mean, variance, std::sqrt(variance));
        fprintf(file, "\"\"");
        for (size_t i = 0; i < NumBuckets; ++i)
            fprintf(file, ",\"%zu\"", i);
        fprintf(file, "\n");
    }
    fprintf(file, "\"%zu samples\",", numSamples);

    // report the samples
    for (size_t count : bucketCounts)
        fprintf(file, "\"%zu\",", count);

    fprintf(file, "\"\"\n");
    if (numSamples == c_maxNumSamples)
        fprintf(file, "\n");

    // close file
    fclose(file);
}

template <size_t NumBits> // aka NumCoinFlips!
void CountBitsTest ()
{

    size_t maxValue = 0;
    for (size_t i = 0; i < NumBits; ++i)
        maxValue = (maxValue << 1) | 1;

    std::mt19937 rng;
    rng.seed(std::random_device()());
    std::uniform_int_distribution<size_t> dist(0, maxValue);

    std::array<size_t, NumBits + 1> bucketCounts = { 0 };

    size_t nextDump = 1;
    for (size_t i = 0; i < c_maxNumSamples; ++i)
    {
        size_t sum = 0;
        size_t number = dist(rng);
        while (number)
        {
            if (number & 1)
                ++sum;
            number = number >> 1;
        }

        bucketCounts[sum]++;

        if (i + 1 == nextDump)
        {
            DumpBucketCountsCountBits(nextDump, bucketCounts);
            nextDump *= 10;
        }
    }
}

float GenerateNormalRandomNumber (float mean, float variance)
{
    static std::mt19937 rng;
    static std::uniform_int_distribution<uint64_t> dist(0, (uint64_t)-1);

    static bool seeded = false;
    if (!seeded)
    {
        seeded = true;
        rng.seed(std::random_device()());
    }

    // generate our normal distributed random number from 0 to 65.
    // 
    float sum = 0.0f;
    uint64_t number = dist(rng);
    while (number)
    {
        if (number & 1)
            sum += 1.0f;
        number = number >> 1;
    }

    // convert from: mean 32, variance 16, stddev 4
    // to: mean 0, variance 1, stddev 1
    float ret = sum;
    ret -= 32.0f;
    ret /= 4.0f;

    // convert to the specified mean and variance
    ret *= std::sqrt(variance);
    ret += mean;
    return ret;
}

void VerifyGenerateNormalRandomNumber (float mean, float variance)
{
    // open file for append if we can
    FILE* file = fopen(c_fileName, "a+t");
    if (!file)
        return;

    // write info
    fprintf(file, "\"Normal Distributed Random Numbers. mean = %0.2f.  variance = %0.2f.  stddev = %0.2f\"\n", mean, variance, std::sqrt(variance));

    // write some random numbers
    fprintf(file, "\"100 numbers\"");
    for (size_t i = 0; i < 100; ++i)
        fprintf(file, ",\"%f\"", GenerateNormalRandomNumber(mean, variance));
    fprintf(file, "\n\n");

    // close file
    fclose(file);
}

int main (int argc, char **argv)
{
    // clear out the file
    FILE* file = fopen(c_fileName, "w+t");
    if (file)
        fclose(file);

    // coin flips
    {
        // flip a fair coin 
        AddRandomNumbersTest<2, 1>();

        // flip two coins and sum them
        AddRandomNumbersTest<2, 2>();

        // sum 3 coin flips
        AddRandomNumbersTest<2, 3>();

        // sum 100 coin flips
        AddRandomNumbersTest<2, 100>();
    }

    // dice rolls
    {
        // roll a 4 sided die
        AddRandomNumbersTest<4, 1>();

        // sum two 4 sided dice
        AddRandomNumbersTest<4, 2>();

        // sum three 4 sided dice
        AddRandomNumbersTest<4, 3>();

        // sum one hundred 4 sided dice
        AddRandomNumbersTest<4, 100>();

        // sum one hundred 6 sided dice
        AddRandomNumbersTest<6, 100>();
    }

    CountBitsTest<8>();
    CountBitsTest<16>();
    CountBitsTest<32>();
    CountBitsTest<64>();

    VerifyGenerateNormalRandomNumber(0.0f, 20.0f);

    VerifyGenerateNormalRandomNumber(0.0f, 10.0f);

    VerifyGenerateNormalRandomNumber(5.0f, 10.0f);

    return 0;
}

SIMD / GPU Friendly Branchless Binary Search

The other day I was thinking about how you might do a binary search branchlessly. I came up with a way, and I’m pretty sure I’m not the first to come up with it, but it was fun to think about and I wanted to share my solution.

All code in this post is public domain.

Here it is searching a list of 8 items in 3 steps:

size_t BinarySearch8 (size_t needle, const size_t haystack[8])
{
    size_t ret = (haystack[4] <= needle) ? 4 : 0;
    ret += (haystack[ret + 2] <= needle) ? 2 : 0;
    ret += (haystack[ret + 1] <= needle) ? 1 : 0;
    return ret;
}

The three steps it does are:

The list has 8 items in it. We test index 4 to see if we need to continue searching index 0 to 3, or index 4 to 7. The returned index becomes either 0xx or 1xx.
The list has 4 items in it now. We test index 2 to see if we need to continue searching index 0 to 1, or index 2 to 3. The returned index becomes one of the following: 00x, 01x, 10x or 11x.
The list has 2 items in it. We test index 1 to see if we need to take the left item or the right item. The returned index becomes: 000, 001, 010, 011, 100, 101, 110, or 111.

But Big O Complexity is Worse!

Usually a binary search can take up to $O(log N)$ steps, where $N$ is the number of items in the list. In this post’s solution, it always takes $log_2N$ steps.

It probably seems odd that this branchless version could be considered an improvement when it has big O complexity that is always the worst case of a regular binary search. That is strange, but in the world of SIMD and shader programs, going branchless can be a big win that is not captured by looking at big O complexity. (Note that cache coherancy and thread contention are two other things not captured by looking at big O complexity).

Also, when working in something like video games or other interactive simulations, an even frame rate is more important than a high frame rate for making a game look and feel smooth. Because of this, if you have algorithms that have very fast common cases but much slower worst cases, you may actually prefer to use an algorithm that is slower in the common case but faster in the worst case just to keep performance more consistent. Using an algorithm such as this, which has a constant amount of work regardless of input can be a good trade off there.

Lastly, in cryptographic applications, attackers can gather secret information by seeing how long certain operations take. For instance, if you use a shorter password than other people, an attacker may be able to detect that by seeing that it consistently takes you a little bit less time to login than other people. They now have an idea of the length of your password, and maybe will brute force you, knowing that you are low hanging fruit!

These timing based attacks can be thwarted by algorithms which run at a constant time regardless of input. This algorithm is one of those algorithms.

As an example of another algorithm that runs in constant time regardless of input, check out CORDIC math. I really want to write up a post on that someday, it’s pretty cool stuff.

Traditional Binary Searching

You might have noticed that if the item you are searching for isn’t in the list, the function doesn’t return anything indicating that, and you might think that’s strange.

This function actually just returns the largest index that isn’t greater than the value you are searching for. If all the numbers are greater than the value you are searching for, it returns zero.

This might seem odd but this can actually come in handy if the list you are searching represents something like animation data, where there are keyframes sorted by time, and you want to find which two keyframes you are between so that you can interpolate.

To actually test if your value was in the list, you could do an extra check:

    size_t searchValue = 3;
    size_t index = BinarySearch8(searchValue, list);
    bool found = (list[index] == searchValue);

If you need that extra check, it’s easy enough to add, and if you don’t need that extra check, it’s nice to not have it.

Without Ternary Operator

If in your setup you don’t have a ternary operator, or if the ternary operator isn’t branchless for you, you get the same results using multiplication:

size_t BinarySearch8 (size_t needle, const size_t haystack[8])
{
    size_t ret = (haystack[4] <= needle) * 4;
    ret += (haystack[ret + 2] <= needle) * 2;
    ret += (haystack[ret + 1] <= needle) * 1;
    return ret;
}

Note that on some platforms, the less than or equal test will be a branch! None of the platforms or compilers I tested had that issue but if you find yourself hitting that issue, you can do a branchless test via subtraction or similar.

Here is a godbolt link that lets you view the assembly for various compilers. When you open the link you’ll see clang doing this work branchlessly.
View Assembly

@adamjmiles from twitter also verified that GCN does it branchlessly, which you can see at the link below. Thanks for that!
View GCN Assembly

Something to keep in mind for the non GPU case though is that if you were doing this in SIMD, you’d be using SIMD intrinsics.

Larger Lists

It’s trivial to search larger numbers of values. Here it is searching 16 items in 4 steps:

size_t BinarySearch16 (size_t needle, const size_t haystack[16])
{
    size_t ret = (haystack[8] <= needle) ? 8 : 0;
    ret += (haystack[ret + 4] <= needle) ? 4 : 0;
    ret += (haystack[ret + 2] <= needle) ? 2 : 0;
    ret += (haystack[ret + 1] <= needle) ? 1 : 0;
    return ret;
}

And here it is searching 32 items in 5 steps:

size_t BinarySearch32 (size_t needle, const size_t haystack[32])
{
    size_t ret = (haystack[16] <= needle) ? 16 : 0;
    ret += (haystack[ret + 8] <= needle) ? 8 : 0;
    ret += (haystack[ret + 4] <= needle) ? 4 : 0;
    ret += (haystack[ret + 2] <= needle) ? 2 : 0;
    ret += (haystack[ret + 1] <= needle) ? 1 : 0;
    return ret;
}

Non Power of 2 Lists

Let’s say that your list is not a perfect power of two in length. GASP!

You can still use the technique, but you treat it as if it has the next power of 2 up items, and then make sure your indices stay in range. The nice part here is that you don’t have to do extra work on the index at each step of the way, only in the places where it’s possible for the index to go out of range.

Here it is searching an array of size 7 in 3 steps:

size_t BinarySearch7 (size_t needle, const size_t haystack[7])
{
size_t ret = 0;
size_t testIndex = 0;

// test index is at most 4, so is within range.
testIndex = ret + 4;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 6, so is within range.
testIndex = ret + 2;
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

// test index is at most 7, so could be out of range.
// use min() to make sure the index stays in range.
testIndex = std::min<size_t>(ret + 1, 6);
ret = (haystack[testIndex] <= needle) ? testIndex : ret;

return ret;

}

There are some other techniques for dealing with non power of 2 sized lists that you can find in the links at the bottom, but there was one particularly interesting that my friend and ex boss James came up with.

Basically, you start out with something like this if you were searching a list of 7 items:

    // 7 because the list has 7 items in it.
    // 4 because it's half of the next power of 2 that is <= 7.
    ret = (haystack[4] <= needle) * (7-4);

The result is that instead of having ret go to either 0 or 4, it goes to 0 or 3.

From there, in both cases you have 4 items in your sublist remaining, so you don’t need to worry about the index going out of bounds from that point on.

Code

Here’s some working code demonstrating the ideas above, as well as it’s output.

#include <algorithm>
#include <stdlib.h>

size_t BinarySearch8 (size_t needle, const size_t haystack[8])
{
    // using ternary operator
    size_t ret = (haystack[4] <= needle) ? 4 : 0;
    ret += (haystack[ret + 2] <= needle) ? 2 : 0;
    ret += (haystack[ret + 1] <= needle) ? 1 : 0;
    return ret;
}

size_t BinarySearch8b (size_t needle, const size_t haystack[8])
{
    // using multiplication
    size_t ret = (haystack[4] <= needle) * 4;
    ret += (haystack[ret + 2] <= needle) * 2;
    ret += (haystack[ret + 1] <= needle) * 1;
    return ret;
}

size_t BinarySearch7 (size_t needle, const size_t haystack[7])
{
    // non perfect power of 2.  use min() to keep it from going out of bounds.
    size_t ret = 0;
    size_t testIndex = 0;

    // test index is 4, so is within range.
    testIndex = ret + 4;
    ret = (haystack[testIndex] <= needle) ? testIndex : ret;

    // test index is at most 6, so is within range.
    testIndex = ret + 2;
    ret = (haystack[testIndex] <= needle) ? testIndex : ret;

    // test index is at most 7, so could be out of range.
    // use min() to make sure the index stays in range.
    testIndex = std::min<size_t>(ret + 1, 6);
    ret = (haystack[testIndex] <= needle) ? testIndex : ret;

    return ret;
}

int main (int argc, char **argv)
{
    // search a list of size 8
    {
        // show the data
        printf("Seaching through a list with 8 items:\n");
        size_t data[8] = { 1, 3, 5, 6, 9, 11, 15, 21 };
        printf("data = [");
        for (size_t i = 0; i < sizeof(data)/sizeof(data[0]); ++i)
        {
            if (i > 0)
                printf(", ");
            printf("%zu", data[i]);
        }
        printf("]\n");

        // do some searches on it using ternary operation based function
        printf("\nTernary based searches:\n");
        #define FIND(needle) printf("Find " #needle ": index = %zu, value = %zu, found = %s\n", BinarySearch8(needle, data), data[BinarySearch8(needle, data)], data[BinarySearch8(needle, data)] == needle ? "true" : "false");
        FIND(2);
        FIND(3);
        FIND(0);
        FIND(22);
        FIND(16);
        FIND(15);
        FIND(21);
        #undef FIND

        // do some searches on it using multiplication based function
        printf("\nMultiplication based searches:\n");
        #define FIND(needle) printf("Find " #needle ": index = %zu, value = %zu, found = %s\n", BinarySearch8b(needle, data), data[BinarySearch8b(needle, data)], data[BinarySearch8b(needle, data)] == needle ? "true" : "false");
        FIND(2);
        FIND(3);
        FIND(0);
        FIND(22);
        FIND(16);
        FIND(15);
        FIND(21);
        #undef FIND

        printf("\n\n\n\n");
    }

    // search a list of size 7
    {
        // show the data
        printf("Seaching through a list with 7 items:\n");
        size_t data[7] = { 1, 3, 5, 6, 9, 11, 15};
        printf("data = [");
        for (size_t i = 0; i < sizeof(data)/sizeof(data[0]); ++i)
        {
            if (i > 0)
                printf(", ");
            printf("%zu", data[i]);
        }
        printf("]\n");

        // do some searches on it using ternary operation based function
        printf("\nTernary based searches:\n");
        #define FIND(needle) printf("Find " #needle ": index = %zu, value = %zu, found = %s\n", BinarySearch7(needle, data), data[BinarySearch7(needle, data)], data[BinarySearch7(needle, data)] == needle ? "true" : "false");
        FIND(2);
        FIND(3);
        FIND(0);
        FIND(22);
        FIND(16);
        FIND(15);
        FIND(21);
        #undef FIND

        printf("\n\n\n\n");
    }

    system("pause");
    return 0;
}

Closing

Another facet of binary searching is that it isn’t the most cache friendly algorithm out there. There might be some value in combining the above with the information in the link below.

Cache-friendly binary search

If you like this sort of thing, here is an interesting paper from this year (2017):
Array Layouts For Comparison-Based Searching

And further down the rabbit hole a bit, this talks about re-ordering the search array to fit things into a cache line better:
https://people.mpi-inf.mpg.de/~rgemulla/publications/schlegel09search.pdf

Taking the next step is Intel and Oracle’s FAST paper:
http://www.timkaldewey.de/pubs/FAST__TODS11.pdf

Florian Gross from twitch made me aware of the last two links and also mentioned his master’s these in this area (thank you Florian!):
https://www.researchgate.net/profile/Florian_Gross/publication/275971053_Index_Search_Algorithms_for_Databases_and_Modern_CPUs/links/554cffca0cf29f836c9cd539.pdf

@rygorous mentioned on twitter some improvements such as ternary and quaternary search, as well as a way to handle the case of non power of 2 sized lists without extra index checks:
https://twitter.com/rygorous/status/877418592752488449/photo/1

Thanks to everyone who gave feedback. It’s a very interesting topic, of which this post only seems to scratch this surface!

Hopefully you found this interesting. Questions, comments, corrections, let me know!

When Random Numbers Are Too Random: Low Discrepancy Sequences

Random numbers can be useful in graphics and game development, but they have a pesky and sometimes undesirable habit of clumping together.

This is a problem in path tracing and monte carlo integration when you take N samples, but the samples aren’t well spread across the sampling range.

This can also be a problem for situations like when you are randomly placing objects in the world or generating treasure for a treasure chest. You don’t want your randomly placed trees to only be in one part of the forest, and you don’t want a player to get only trash items or only godly items when they open a treasure chest. Ideally you want to have some randomness, but you don’t want the random number generator to give you all of the same or similar random numbers.

The problem is that random numbers can be TOO random, like in the below where you can see clumps and large gaps between the 100 samples.

For cases like that, when you want random numbers that are a little bit more well distributed, you might find some use in low discrepancy sequences.

The standalone C++ code (one source file, standard headers, no libraries to link to) I used to generate the data and images are at the bottom of this post, as well as some links to more resources.

What Is Discrepancy?

In this context, discrepancy is a measurement of the highest or lowest density of points in a sequence. High discrepancy means that there is either a large area of empty space, or that there is an area that has a high density of points. Low discrepancy means that there are neither, and that your points are more or less pretty evenly distributed.

The lowest discrepancy possible has no randomness at all, and in the 1 dimensional case means that the points are evenly distributed on a grid. For monte carlo integration and the game dev usage cases I mentioned, we do want some randomness, we just want the random points to be spread out a little more evenly.

If more formal math notation is your thing, discrepancy is defined as:
$D_{N}(P)=\sup _{{B\in J}}\left|{\frac {A(B;P)}{N}}-\lambda _{s}(B)\right|$

You can read more about the formal definition here: Wikipedia:
Equidistributed sequence

For monte carlo integration specifically, this is the behavior each thing gives you:

High Discrepancy: Random Numbers / White Noise aka Uniform Distribution – At lower sample counts, convergance is slower (and have higher variance) due to the possibility of not getting good coverage over the area you integrating. At higher sample counts, this problem disappears. (Hint: real time graphics and preview renderings use a smaller number of samples)
Lowest Discrepancy: Regular Grid – This will cause aliasing, unlike the other “random” based sampling, which trade aliasing for noise. Noise is preferred over aliasing.
Low Discrepancy: Low Discrepancy Sequences – In lower numbers of samples, this will have faster convergence by having better coverage of the sampling space, but will use randomness to get rid of aliasing by introducing noise.

Also interesting to note, Quasi Monte Carlo has provably better asymptotic convergence than regular monte carlo integration.

1 Dimensional Sequences

We’ll first look at 1 dimensional sequences.

Grid

Here are 100 samples evenly spaced:

Random Numbers (White Noise)

This is actually a high discrepancy sequence. To generate this, you just use a standard random number generator to pick 100 points between 0 and 1. I used std::mt19937 with a std::uniform_real_distribution from 0 to 1:

Subrandom Numbers

Subrandom numbers are ways to decrease the discrepancy of white noise.

One way to do this is to break the sampling space in half. You then generate even numbered samples in the first half of the space, and odd numbered samples in the second half of the space.

There’s no reason you can’t generalize this into more divisions of space though.

This splits the space into 4 regions:

8 regions:

16 regions:

32 regions:

There are other ways to generate subrandom numbers though. One way is to generate random numbers between 0 and 0.5, and add them to the last sample, plus 0.5. This gives you a random walk type setup.

Here is that:

Uniform Sampling + Jitter

If you take the first subrandom idea to the logical maximum, you break your sample space up into N sections and place one point within those N sections to make a low discrepancy sequence made up of N points.

Another way to look at this is that you do uniform sampling, but add some random jitter to the samples, between +/- half a uniform sample size, to keep the samples in their own areas.

This is that:

I have heard that Pixar invented this technique interestingly.

Irrational Numbers

Rational numbers are numbers which can be described as fractions, such as 0.75 which can be expressed as 3/4. Irrational numbers are numbers which CANNOT be described as fractions, such as pi, or the golden ratio, or the square root of a prime number.

Interestingly you can use irrational numbers to generate low discrepancy sequences. You start with some value (could be 0, or could be a random number), add the irrational number, and modulus against 1.0. To get the next sample you add the irrational value again, and modulus against 1.0 again. Rinse and repeat until you get as many samples as you want.

Some values work better than others though, and apparently the golden ratio is provably the best choice (1.61803398875…), says Wikipedia.

Here is the golden ratio, using 4 different random (white noise) starting values:

Here I’ve used the square root of 2, with 4 different starting random numbers again:

Lastly, here is pi, with 4 random starting values:

Van der Corput Sequence

The Van der Corput sequence is the 1d equivelant of the Halton sequence which we’ll talk about later.

How you generate values in the Van der Corput sequence is you convert the index of your sample into some base.

For instance if it was base 2, you would convert your index to binary. If it was base 16, you would convert your index to hexadecimal.

Now, instead of treating the digits as if they are $B^0$ , $B^1$ , $B^2$ , etc (where B is the base), you instead treat them as $B^{-1}$ , $B^{-2}$ , $B^{-3}$ and so on. In other words, you multiply each digit by a fraction and add up the results.

To show a couple quick examples, let’s say we wanted sample 6 in the sequence of base 2.

First we convert 6 to binary which is 110. From right to left, we have 3 digits: a 0 in the 1’s place, a 1 in the 2’s place, and a 1 in the 4’s place. $0*1 + 1*2 + 1*4 = 6$ , so we can see that 110 is in fact 6 in binary.

To get the Van der Corput value for this, instead of treating it as the 1’s, 2’s and 4’s digit, we treat it as the 1/2, 1/4 and 1/8’s digit.

$0 * 1/2 + 1 * 1/4 + 1 * 1/8 = 3/8$ .

So, sample 6 in the Van der Corput sequence using base 2 is 3/8.

Let’s try sample 21 in base 3.

First we convert 21 to base 3 which is 210. We can verify this is right by seeing that $0 * 1 + 1 * 3 + 2 * 9 = 21$ .

Instead of a 1’s, 3’s and 9’s digit, we are going to treat it like a 1/3, 1/9 and 1/27 digit.

$0 * 1/3 + 1 * 1/9 + 2 * 1/27 = 5/27$

So, sample 21 in the Van der Corput sequence using base 3 is 5/27.

Here is the Van der Corput sequence for base 2:

Here it is for base 3:

Base 4:

Base 5:

Sobol

One dimensional Sobol is actually just the Van der Corput sequence base 2 re-arranged a little bit, but it’s generated differently.

You start with 0 (either using it as sample 0 or sample -1, doesn’t matter which), and for each sample you do this:

Calculate the Ruler function value for the current sample’s index(more info in a second)
Make the direction vector by shifting 1 left (in binary) 31 – ruler times.
XOR the last sample by the direction vector to get the new sample
To interpret the sample as a floating point number you divide it by $2^{32}$

That might sound completely different than the Van der Corput sequence but it actually is the same thing – just re-ordered.

In the final step when dividing by $2^{32}$ , we are really just interpreting the binary number as a fraction just like before, but it’s the LEFT most digit that is the 1/2 spot, not the RIGHT most digit.

The Ruler Function goes like: 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, …

It’s pretty easy to calculate too. Calculating the ruler function for an index (starting at 1) is just the zero based index of the right most 1’s digit after converting the number to binary.

1 in binary is 001 so Ruler(1) is 0.
2 in binary is 010 so Ruler(2) is 1.
3 in binary is 011 so Ruler(3) is 0.
4 in binary is 100 so Ruler(4) is 2.
5 in binary is 101 so Ruler(5) is 0.
and so on.

Here is 1D Sobol:

Hammersley

In one dimension, the Hammersley sequence is the same as the base 2 Van der Corput sequence, and in the same order. If that sounds strange that it’s the same, it’s a 2d sequence I broke down into a 1d sequence for comparison. The one thing Hammersley has that makes it unique in the 1d case is that you can truncate bits.

It doesn’t seem that useful for 1d Hammersley to truncate bits but knowing that is useful info too I guess. Look at the 2d version of Hammersley to get a fairer look at it, because it’s meant to be a 2d sequence.

Here is Hammersley:

With 1 bit truncated:

With 2 bits truncated:

Poisson Disc

Poisson disc points are points which are densely packed, but have a minimum distance from each other.

Computer scientists are still working out good algorithms to generate these points efficiently.

I use “Mitchell’s Best-Candidate” which means that when you want to generate a new point in the sequence, you generate N new points, and choose whichever point is farthest away from the other points you’ve generated so far.

Here it is where N is 100:

2 Dimensional Sequences

Next up, let’s look at some 2 dimensional sequences.

Grid

Below is 2d uniform samples on a grid.

Note that uniform grid is not particularly low discrepancy for the 2d case! More info here: Is it expected that uniform points would have non zero discrepancy?

Random

Here are 100 random points:

Uniform Grid + Jitter

Here is a uniform grid that has random jitter applied to the points. Jittered grid is a pretty commonly used low discrepancy sampling technique that has good success.

Subrandom

Just like in 1 dimensions, you can apply the subrandom ideas to 2 dimensions where you divide the X and Y axis into so many sections, and randomly choose points in the sections.

If you divide X and Y into the same number of sections though, you are going to have a problem because some areas are not going to have any points in them.

@Reedbeta pointed out that instead of using i%x and i%y, that you could use i%x and (i/x)%y to make it pick points in all regions.

Picking different numbers for X and Y can be another way to give good results. Here’s dividing X and Y into 2 and 3 sections respectively:

If you choose co-prime numbers for divisions for each axis you can get maximal period of repeats. 2 and 3 are coprime so the last example is a good example of that, but here is 3 and 11:

Here is 3 and 97. 97 is large enough that with only doing 100 samples, we are almost doing jittered grid on the y axis.

Here is the other subrandom number from 1d, where we start with a random value for X and Y, and then add a random number between 0 and 0.5 to each, also adding 0.5, to make a “random walk” type setup again:

Halton

The Halton sequence is just the Van der Corput sequence, but using a different base on each axis.

Here is the Halton sequence where X and Y use bases 2 and 3:

Here it is using bases 5 and 7:

Here are bases 13 and 9:

Irrational Numbers

The irrational numbers technique can be used for 2d as well but I wasn’t able to find out how to make it give decent looking output that didn’t have an obvious diagonal pattern in them. Bart Wronski shared a neat paper that explains how to use the golden ratio in 2d with great success: Golden Ratio Sequences For Low-Discrepancy Sampling

This uses the golden ratio for the X axis and the square root of 2 for the Y axis. Below that is the same, with a random starting point, to make it give a different sequence.

Here X axis uses square root of 2 and Y axis uses square root of 3. Below that is a random starting point, which gives the same discrepancy.

Hammersley

In 2 dimensions, the Hammersley sequence uses the 1d Hammersley sequence for the X axis: Instead of treating the binary version of the index as binary, you treat it as fractions like you do for Van der Corput and sum up the fractions.

For the Y axis, you just reverse the bits and then do the same!

Here is the Hammersley sequence. Note we would have to take 128 samples (not just the 100 we did) if we wanted it to fill the entire square with samples.

Truncating bits in 2d is a bit useful. Here is 1 bit truncated:

2 bits truncated:

Poisson Disc

Using the same method we did for 1d, we can generate points in 2d space:

N Rooks

There is a sampling pattern called N-Rooks where you put N rooks onto a chess board and arrange them such that no two are in the same row or column.

A way to generate these samples is to realize that there will be only one rook per row, and that none of them will ever be in the same column. So, you make an array that has numbers 0 to N-1, and then shuffle the array. The index into the array is the row, and the value in the array is the column.

Here are 100 rooks:

Sobol

Sobol in two dimensions is more complex to explain so I’ll link you to the source I used: Sobol Sequences Made Simple.

The 1D sobol already covered is used for the X axis, and then something more complex was used for the Y axis:

Links

Bart Wronski has a really great series on a related topic: Dithering in Games

Wikipedia: Low Discrepancy Sequence

Wikipedia: Halton Sequence

Wikipedia: Van der Corput Sequence

Using Fibonacci Sequence To Generate Colors

Deeper info and usage cases for low discrepancy sequences

Poisson-Disc Sampling

Low discrepancy sequences are related to blue noise. Where white noise contains all frequencies evenly, blue noise has more high frequencies and fewer low frequencies. Blue noise is essentially the ultimate in low discrepancy, but can be expensive to compute. Here are some pages on blue noise:

Free Blue Noise Textures

The problem with 3D blue noise

Stippling and Blue Noise

Vegetation placement in “The Witness”

Here are some links from @marc_b_reynolds:
Sobol (low-discrepancy) sequence in 1-3D, stratified in 2-4D.
Classic binary-reflected gray code.
Sobol.h

Weyl Sequence

Code

#define _CRT_SECURE_NO_WARNINGS

#include <windows.h>  // for bitmap headers and performance counter.  Sorry non windows people!
#include <vector>
#include <stdint.h>
#include <random>
#include <array>
#include <algorithm>
#include <stdlib.h>
#include <set>

typedef uint8_t uint8;

#define NUM_SAMPLES 100  // to simplify some 2d code, this must be a square
#define NUM_SAMPLES_FOR_COLORING 100

// Turning this on will slow things down significantly because it's an O(N^5) operation for 2d!
#define CALCULATE_DISCREPANCY 0

#define IMAGE1D_WIDTH 600
#define IMAGE1D_HEIGHT 50
#define IMAGE2D_WIDTH 300
#define IMAGE2D_HEIGHT 300
#define IMAGE_PAD   30

#define IMAGE1D_CENTERX ((IMAGE1D_WIDTH+IMAGE_PAD*2)/2)
#define IMAGE1D_CENTERY ((IMAGE1D_HEIGHT+IMAGE_PAD*2)/2)
#define IMAGE2D_CENTERX ((IMAGE2D_WIDTH+IMAGE_PAD*2)/2)
#define IMAGE2D_CENTERY ((IMAGE2D_HEIGHT+IMAGE_PAD*2)/2)

#define AXIS_HEIGHT 40
#define DATA_HEIGHT 20
#define DATA_WIDTH 2

#define COLOR_FILL SColor(255,255,255)
#define COLOR_AXIS SColor(0, 0, 0)

//======================================================================================
struct SImageData
{
    SImageData ()
        : m_width(0)
        , m_height(0)
    { }
  
    size_t m_width;
    size_t m_height;
    size_t m_pitch;
    std::vector<uint8> m_pixels;
};

struct SColor
{
    SColor (uint8 _R = 0, uint8 _G = 0, uint8 _B = 0)
        : R(_R), G(_G), B(_B)
    { }

    uint8 B, G, R;
};

//======================================================================================
bool SaveImage (const char *fileName, const SImageData &image)
{
    // open the file if we can
    FILE *file;
    file = fopen(fileName, "wb");
    if (!file) {
        printf("Could not save %s\n", fileName);
        return false;
    }
  
    // make the header info
    BITMAPFILEHEADER header;
    BITMAPINFOHEADER infoHeader;
  
    header.bfType = 0x4D42;
    header.bfReserved1 = 0;
    header.bfReserved2 = 0;
    header.bfOffBits = 54;
  
    infoHeader.biSize = 40;
    infoHeader.biWidth = (LONG)image.m_width;
    infoHeader.biHeight = (LONG)image.m_height;
    infoHeader.biPlanes = 1;
    infoHeader.biBitCount = 24;
    infoHeader.biCompression = 0;
    infoHeader.biSizeImage = (DWORD) image.m_pixels.size();
    infoHeader.biXPelsPerMeter = 0;
    infoHeader.biYPelsPerMeter = 0;
    infoHeader.biClrUsed = 0;
    infoHeader.biClrImportant = 0;
  
    header.bfSize = infoHeader.biSizeImage + header.bfOffBits;
  
    // write the data and close the file
    fwrite(&header, sizeof(header), 1, file);
    fwrite(&infoHeader, sizeof(infoHeader), 1, file);
    fwrite(&image.m_pixels[0], infoHeader.biSizeImage, 1, file);
    fclose(file);
 
    return true;
}

//======================================================================================
void ImageInit (SImageData& image, size_t width, size_t height)
{
    image.m_width = width;
    image.m_height = height;
    image.m_pitch = 4 * ((width * 24 + 31) / 32);
    image.m_pixels.resize(image.m_pitch * image.m_width);
    std::fill(image.m_pixels.begin(), image.m_pixels.end(), 0);
}

//======================================================================================
void ImageClear (SImageData& image, const SColor& color)
{
    uint8* row = &image.m_pixels[0];
    for (size_t rowIndex = 0; rowIndex < image.m_height; ++rowIndex)
    {
        SColor* pixels = (SColor*)row;
        std::fill(pixels, pixels + image.m_width, color);

        row += image.m_pitch;
    }
}

//======================================================================================
void ImageBox (SImageData& image, size_t x1, size_t x2, size_t y1, size_t y2, const SColor& color)
{
    for (size_t y = y1; y < y2; ++y)
    {
        uint8* row = &image.m_pixels[y * image.m_pitch];
        SColor* start = &((SColor*)row)[x1];
        std::fill(start, start + x2 - x1, color);
    }
}

//======================================================================================
float Distance (float x1, float y1, float x2, float y2)
{
    float dx = (x2 - x1);
    float dy = (y2 - y1);

    return std::sqrtf(dx*dx + dy*dy);
}

//======================================================================================
SColor DataPointColor (size_t sampleIndex)
{
    SColor ret;
    float percent = (float(sampleIndex) / (float(NUM_SAMPLES_FOR_COLORING) - 1.0f));

    ret.R = uint8((1.0f - percent) * 255.0f);
    ret.G = 0;
    ret.B = uint8(percent * 255.0f);

    float mag = (float)sqrt(ret.R*ret.R + ret.G*ret.G + ret.B*ret.B);
    ret.R = uint8((float(ret.R) / mag)*255.0f);
    ret.G = uint8((float(ret.G) / mag)*255.0f);
    ret.B = uint8((float(ret.B) / mag)*255.0f);

    return ret;
}

//======================================================================================
float RandomFloat (float min, float max)
{
    static std::random_device rd;
    static std::mt19937 mt(rd());
    std::uniform_real_distribution<float> dist(min, max);
    return dist(mt);
}

//======================================================================================
size_t Ruler (size_t n)
{
    size_t ret = 0;
    while (n != 0 && (n & 1) == 0)
    {
        n /= 2;
        ++ret;
    }
    return ret;
}

//======================================================================================
float CalculateDiscrepancy1D (const std::array<float, NUM_SAMPLES>& samples)
{
    // some info about calculating discrepancy
    // https://math.stackexchange.com/questions/1681562/how-to-calculate-discrepancy-of-a-sequence

    // Calculates the discrepancy of this data.
    // Assumes the data is [0,1) for valid sample range
    std::array<float, NUM_SAMPLES> sortedSamples = samples;
    std::sort(sortedSamples.begin(), sortedSamples.end());

    float maxDifference = 0.0f;
    for (size_t startIndex = 0; startIndex <= NUM_SAMPLES; ++startIndex)
    {
        // startIndex 0 = 0.0f.  startIndex 1 = sortedSamples[0]. etc

        float startValue = 0.0f;
        if (startIndex > 0)
            startValue = sortedSamples[startIndex - 1];

        for (size_t stopIndex = startIndex; stopIndex <= NUM_SAMPLES; ++stopIndex)
        {
            // stopIndex 0 = sortedSamples[0].  startIndex[N] = 1.0f. etc

            float stopValue = 1.0f;
            if (stopIndex < NUM_SAMPLES)
                stopValue = sortedSamples[stopIndex];

            float length = stopValue - startValue;

            // open interval (startValue, stopValue)
            size_t countInside = 0;
            for (float sample : samples)
            {
                if (sample > startValue &&
                    sample < stopValue)
                {
                    ++countInside;
                }
            }
            float density = float(countInside) / float(NUM_SAMPLES);
            float difference = std::abs(density - length);
            if (difference > maxDifference)
                maxDifference = difference;

            // closed interval [startValue, stopValue]
            countInside = 0;
            for (float sample : samples)
            {
                if (sample >= startValue &&
                    sample <= stopValue)
                {
                    ++countInside;
                }
            }
            density = float(countInside) / float(NUM_SAMPLES);
            difference = std::abs(density - length);
            if (difference > maxDifference)
                maxDifference = difference;
        }
    }
    return maxDifference;
}

//======================================================================================
float CalculateDiscrepancy2D (const std::array<std::array<float, 2>, NUM_SAMPLES>& samples)
{
    // some info about calculating discrepancy
    // https://math.stackexchange.com/questions/1681562/how-to-calculate-discrepancy-of-a-sequence

    // Calculates the discrepancy of this data.
    // Assumes the data is [0,1) for valid sample range.

    // Get the sorted list of unique values on each axis
    std::set<float> setSamplesX;
    std::set<float> setSamplesY;
    for (const std::array<float, 2>& sample : samples)
    {
        setSamplesX.insert(sample[0]);
        setSamplesY.insert(sample[1]);
    }
    std::vector<float> sortedXSamples;
    std::vector<float> sortedYSamples;
    sortedXSamples.reserve(setSamplesX.size());
    sortedYSamples.reserve(setSamplesY.size());
    for (float f : setSamplesX)
        sortedXSamples.push_back(f);
    for (float f : setSamplesY)
        sortedYSamples.push_back(f);

    // Get the sorted list of samples on the X axis, for faster interval testing
    std::array<std::array<float, 2>, NUM_SAMPLES> sortedSamplesX = samples;
    std::sort(sortedSamplesX.begin(), sortedSamplesX.end(),
        [] (const std::array<float, 2>& itemA, const std::array<float, 2>& itemB)
        {
            return itemA[0] < itemB[0];
        }
    );

    // calculate discrepancy
    float maxDifference = 0.0f;
    for (size_t startIndexY = 0; startIndexY <= sortedYSamples.size(); ++startIndexY)
    {
        float startValueY = 0.0f;
        if (startIndexY > 0)
            startValueY = *(sortedYSamples.begin() + startIndexY - 1);

        for (size_t startIndexX = 0; startIndexX <= sortedXSamples.size(); ++startIndexX)
        {
            float startValueX = 0.0f;
            if (startIndexX > 0)
                startValueX = *(sortedXSamples.begin() + startIndexX - 1);

            for (size_t stopIndexY = startIndexY; stopIndexY <= sortedYSamples.size(); ++stopIndexY)
            {
                float stopValueY = 1.0f;
                if (stopIndexY < sortedYSamples.size())
                    stopValueY = sortedYSamples[stopIndexY];

                for (size_t stopIndexX = startIndexX; stopIndexX <= sortedXSamples.size(); ++stopIndexX)
                {
                    float stopValueX = 1.0f;
                    if (stopIndexX < sortedXSamples.size())
                        stopValueX = sortedXSamples[stopIndexX];

                    // calculate area
                    float length = stopValueX - startValueX;
                    float height = stopValueY - startValueY;
                    float area = length * height;

                    // open interval (startValue, stopValue)
                    size_t countInside = 0;
                    for (const std::array<float, 2>& sample : samples)
                    {
                        if (sample[0] > startValueX &&
                            sample[1] > startValueY &&
                            sample[0] < stopValueX &&
                            sample[1] < stopValueY)
                        {
                            ++countInside;
                        }
                    }
                    float density = float(countInside) / float(NUM_SAMPLES);
                    float difference = std::abs(density - area);
                    if (difference > maxDifference)
                        maxDifference = difference;

                    // closed interval [startValue, stopValue]
                    countInside = 0;
                    for (const std::array<float, 2>& sample : samples)
                    {
                        if (sample[0] >= startValueX &&
                            sample[1] >= startValueY &&
                            sample[0] <= stopValueX &&
                            sample[1] <= stopValueY)
                        {
                            ++countInside;
                        }
                    }
                    density = float(countInside) / float(NUM_SAMPLES);
                    difference = std::abs(density - area);
                    if (difference > maxDifference)
                        maxDifference = difference;
                }
            }
        }
    }

    return maxDifference;
}

//======================================================================================
void Test1D (const char* fileName, const std::array<float, NUM_SAMPLES>& samples)
{
    // create and clear the image
    SImageData image;
    ImageInit(image, IMAGE1D_WIDTH + IMAGE_PAD * 2, IMAGE1D_HEIGHT + IMAGE_PAD * 2);

    // setup the canvas
    ImageClear(image, COLOR_FILL);

    // calculate the discrepancy
    #if CALCULATE_DISCREPANCY
        float discrepancy = CalculateDiscrepancy1D(samples);
        printf("%s Discrepancy = %0.2f%%\n", fileName, discrepancy*100.0f);
    #endif

    // draw the sample points
    size_t i = 0;
    for (float f: samples)
    {
        size_t pos = size_t(f * float(IMAGE1D_WIDTH)) + IMAGE_PAD;
        ImageBox(image, pos, pos + 1, IMAGE1D_CENTERY - DATA_HEIGHT / 2, IMAGE1D_CENTERY + DATA_HEIGHT / 2, DataPointColor(i));
        ++i;
    }

    // draw the axes lines. horizontal first then the two vertical
    ImageBox(image, IMAGE_PAD, IMAGE1D_WIDTH + IMAGE_PAD, IMAGE1D_CENTERY, IMAGE1D_CENTERY + 1, COLOR_AXIS);
    ImageBox(image, IMAGE_PAD, IMAGE_PAD + 1, IMAGE1D_CENTERY - AXIS_HEIGHT / 2, IMAGE1D_CENTERY + AXIS_HEIGHT / 2, COLOR_AXIS);
    ImageBox(image, IMAGE1D_WIDTH + IMAGE_PAD, IMAGE1D_WIDTH + IMAGE_PAD + 1, IMAGE1D_CENTERY - AXIS_HEIGHT / 2, IMAGE1D_CENTERY + AXIS_HEIGHT / 2, COLOR_AXIS);

    // save the image
    SaveImage(fileName, image);
}

//======================================================================================
void Test2D (const char* fileName, const std::array<std::array<float,2>, NUM_SAMPLES>& samples)
{
    // create and clear the image
    SImageData image;
    ImageInit(image, IMAGE2D_WIDTH + IMAGE_PAD * 2, IMAGE2D_HEIGHT + IMAGE_PAD * 2);
    
    // setup the canvas
    ImageClear(image, COLOR_FILL);

    // calculate the discrepancy
    #if CALCULATE_DISCREPANCY
        float discrepancy = CalculateDiscrepancy2D(samples);
        printf("%s Discrepancy = %0.2f%%\n", fileName, discrepancy*100.0f);
    #endif

    // draw the sample points
    size_t i = 0;
    for (const std::array<float, 2>& sample : samples)
    {
        size_t posx = size_t(sample[0] * float(IMAGE2D_WIDTH)) + IMAGE_PAD;
        size_t posy = size_t(sample[1] * float(IMAGE2D_WIDTH)) + IMAGE_PAD;
        ImageBox(image, posx - 1, posx + 1, posy - 1, posy + 1, DataPointColor(i));
        ++i;
    }

    // horizontal lines
    ImageBox(image, IMAGE_PAD - 1, IMAGE2D_WIDTH + IMAGE_PAD + 1, IMAGE_PAD - 1, IMAGE_PAD, COLOR_AXIS);
    ImageBox(image, IMAGE_PAD - 1, IMAGE2D_WIDTH + IMAGE_PAD + 1, IMAGE2D_HEIGHT + IMAGE_PAD, IMAGE2D_HEIGHT + IMAGE_PAD + 1, COLOR_AXIS);

    // vertical lines
    ImageBox(image, IMAGE_PAD - 1, IMAGE_PAD, IMAGE_PAD - 1, IMAGE2D_HEIGHT + IMAGE_PAD + 1, COLOR_AXIS);
    ImageBox(image, IMAGE_PAD + IMAGE2D_WIDTH, IMAGE_PAD + IMAGE2D_WIDTH + 1, IMAGE_PAD - 1, IMAGE2D_HEIGHT + IMAGE_PAD + 1, COLOR_AXIS);

    // save the image
    SaveImage(fileName, image);
}

//======================================================================================
void TestUniform1D (bool jitter)
{
    // calculate the sample points
    const float c_cellSize = 1.0f / float(NUM_SAMPLES+1);
    std::array<float, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i] = float(i+1) / float(NUM_SAMPLES+1);
        if (jitter)
            samples[i] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);
    }

    // save bitmap etc
    if (jitter)
        Test1D("1DUniformJitter.bmp", samples);
    else
        Test1D("1DUniform.bmp", samples);
}

//======================================================================================
void TestUniformRandom1D ()
{
    // calculate the sample points
    const float c_halfJitter = 1.0f / float((NUM_SAMPLES + 1) * 2);
    std::array<float, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
        samples[i] = RandomFloat(0.0f, 1.0f);

    // save bitmap etc
    Test1D("1DUniformRandom.bmp", samples);
}

//======================================================================================
void TestSubRandomA1D (size_t numRegions)
{
    const float c_randomRange = 1.0f / float(numRegions);

    // calculate the sample points
    const float c_halfJitter = 1.0f / float((NUM_SAMPLES + 1) * 2);
    std::array<float, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i] = RandomFloat(0.0f, c_randomRange);
        samples[i] += float(i % numRegions) / float(numRegions);
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "1DSubRandomA_%zu.bmp", numRegions);
    Test1D(fileName, samples);
}

//======================================================================================
void TestSubRandomB1D ()
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    float sample = RandomFloat(0.0f, 0.5f);
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        sample = std::fmodf(sample + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
        samples[i] = sample;
    }

    // save bitmap etc
    Test1D("1DSubRandomB.bmp", samples);
}

//======================================================================================
void TestVanDerCorput (size_t base)
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i] = 0.0f;
        float denominator = float(base);
        size_t n = i;
        while (n > 0)
        {
            size_t multiplier = n % base;
            samples[i] += float(multiplier) / denominator;
            n = n / base;
            denominator *= base;
        }
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "1DVanDerCorput_%zu.bmp", base);
    Test1D(fileName, samples);
}

//======================================================================================
void TestIrrational1D (float irrational, float seed)
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    float sample = seed;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        sample = std::fmodf(sample + irrational, 1.0f);
        samples[i] = sample;
    }

    // save bitmap etc
    char irrationalStr[256];
    sprintf(irrationalStr, "%f", irrational);
    char seedStr[256];
    sprintf(seedStr, "%f", seed);
    char fileName[256];
    sprintf(fileName, "1DIrrational_%s_%s.bmp", &irrationalStr[2], &seedStr[2]);
    Test1D(fileName, samples);
}

//======================================================================================
void TestSobol1D ()
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    size_t sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        size_t ruler = Ruler(i + 1);
        size_t direction = size_t(size_t(1) << size_t(31 - ruler));
        sampleInt = sampleInt ^ direction;
        samples[i] = float(sampleInt) / std::pow(2.0f, 32.0f);
    }

    // save bitmap etc
    Test1D("1DSobol.bmp", samples);
}

//======================================================================================
void TestHammersley1D (size_t truncateBits)
{
    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    size_t sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        size_t n = i >> truncateBits;
        float base = 1.0f / 2.0f;
        samples[i] = 0.0f;
        while (n)
        {
            if (n & 1)
                samples[i] += base;
            n /= 2;
            base /= 2.0f;
        }
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "1DHammersley_%zu.bmp", truncateBits);
    Test1D(fileName, samples);
}

//======================================================================================
float MinimumDistance1D (const std::array<float, NUM_SAMPLES>& samples, size_t numSamples, float x)
{
    // Used by poisson.
    // This returns the minimum distance that point (x) is away from the sample points, from [0, numSamples).
    float minimumDistance = 0.0f;
    for (size_t i = 0; i < numSamples; ++i)
    {
        float distance = std::abs(samples[i] - x);
        if (i == 0 || distance < minimumDistance)
            minimumDistance = distance;
    }
    return minimumDistance;
}

//======================================================================================
void TestPoisson1D ()
{
    // every time we want to place a point, we generate this many points and choose the one farthest away from all the other points (largest minimum distance)
    const size_t c_bestOfAttempts = 100;

    // calculate the sample points
    std::array<float, NUM_SAMPLES> samples;
    for (size_t sampleIndex = 0; sampleIndex < NUM_SAMPLES; ++sampleIndex)
    {
        // generate some random points and keep the one that has the largest minimum distance from any of the existing points
        float bestX = 0.0f;
        float bestMinDistance = 0.0f;
        for (size_t attempt = 0; attempt < c_bestOfAttempts; ++attempt)
        {
            float attemptX = RandomFloat(0.0f, 1.0f);
            float minDistance = MinimumDistance1D(samples, sampleIndex, attemptX);

            if (minDistance > bestMinDistance)
            {
                bestX = attemptX;
                bestMinDistance = minDistance;
            }
        }
        samples[sampleIndex] = bestX;
    }

    // save bitmap etc
    Test1D("1DPoisson.bmp", samples);
}

//======================================================================================
void TestUniform2D (bool jitter)
{
    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
    const float c_cellSize = 1.0f / float(c_oneSide+1);
    for (size_t iy = 0; iy < c_oneSide; ++iy)
    {
        for (size_t ix = 0; ix < c_oneSide; ++ix)
        {
            size_t sampleIndex = iy * c_oneSide + ix;

            samples[sampleIndex][0] = float(ix + 1) / (float(c_oneSide + 1));
            if (jitter)
                samples[sampleIndex][0] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);

            samples[sampleIndex][1] = float(iy + 1) / (float(c_oneSide) + 1.0f);
            if (jitter)
                samples[sampleIndex][1] += RandomFloat(-c_cellSize*0.5f, c_cellSize*0.5f);
        }
    }

    // save bitmap etc
    if (jitter)
        Test2D("2DUniformJitter.bmp", samples);
    else
        Test2D("2DUniform.bmp", samples);
}

//======================================================================================
void TestUniformRandom2D ()
{
    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
    const float c_halfJitter = 1.0f / float((c_oneSide + 1) * 2);
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i][0] = RandomFloat(0.0f, 1.0f);
        samples[i][1] = RandomFloat(0.0f, 1.0f);
    }

    // save bitmap etc
    Test2D("2DUniformRandom.bmp", samples);
}

//======================================================================================
void TestSubRandomA2D (size_t regionsX, size_t regionsY)
{
    const float c_randomRangeX = 1.0f / float(regionsX);
    const float c_randomRangeY = 1.0f / float(regionsY);

    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samples[i][0] = RandomFloat(0.0f, c_randomRangeX);
        samples[i][0] += float(i % regionsX) / float(regionsX);

        samples[i][1] = RandomFloat(0.0f, c_randomRangeY);
        samples[i][1] += float(i % regionsY) / float(regionsY);
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "2DSubRandomA_%zu_%zu.bmp", regionsX, regionsY);
    Test2D(fileName, samples);
}

//======================================================================================
void TestSubRandomB2D ()
{
    // calculate the sample points
    float samplex = RandomFloat(0.0f, 0.5f);
    float sampley = RandomFloat(0.0f, 0.5f);
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samplex = std::fmodf(samplex + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
        sampley = std::fmodf(sampley + 0.5f + RandomFloat(0.0f, 0.5f), 1.0f);
        samples[i][0] = samplex;
        samples[i][1] = sampley;
    }
    
    // save bitmap etc
    Test2D("2DSubRandomB.bmp", samples);
}

//======================================================================================
void TestHalton (size_t basex, size_t basey)
{
    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    const size_t c_oneSide = size_t(std::sqrt(NUM_SAMPLES));
    const float c_halfJitter = 1.0f / float((c_oneSide + 1) * 2);
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        // x axis
        samples[i][0] = 0.0f;
        {
            float denominator = float(basex);
            size_t n = i;
            while (n > 0)
            {
                size_t multiplier = n % basex;
                samples[i][0] += float(multiplier) / denominator;
                n = n / basex;
                denominator *= basex;
            }
        }

        // y axis
        samples[i][1] = 0.0f;
        {
            float denominator = float(basey);
            size_t n = i;
            while (n > 0)
            {
                size_t multiplier = n % basey;
                samples[i][1] += float(multiplier) / denominator;
                n = n / basey;
                denominator *= basey;
            }
        }
    }

    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "2DHalton_%zu_%zu.bmp", basex, basey);
    Test2D(fileName, samples);
}

//======================================================================================
void TestSobol2D ()
{
    // calculate the sample points

    // x axis
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    size_t sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        size_t ruler = Ruler(i + 1);
        size_t direction = size_t(size_t(1) << size_t(31 - ruler));
        sampleInt = sampleInt ^ direction;
        samples[i][0] = float(sampleInt) / std::pow(2.0f, 32.0f);
    }

    // y axis
    // Code adapted from http://web.maths.unsw.edu.au/~fkuo/sobol/
    // uses numbers: new-joe-kuo-6.21201

    // Direction numbers
    std::vector<size_t> V;
    V.resize((size_t)ceil(log((double)NUM_SAMPLES) / log(2.0)));
    V[0] = size_t(1) << size_t(31);
    for (size_t i = 1; i < V.size(); ++i)
        V[i] = V[i - 1] ^ (V[i - 1] >> 1);

    // Samples
    sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i) {
        size_t ruler = Ruler(i + 1);
        sampleInt = sampleInt ^ V[ruler];
        samples[i][1] = float(sampleInt) / std::pow(2.0f, 32.0f);
    }

    // save bitmap etc
    Test2D("2DSobol.bmp", samples);
}

//======================================================================================
void TestHammersley2D (size_t truncateBits)
{
    // figure out how many bits we are working in.
    size_t value = 1;
    size_t numBits = 0;
    while (value < NUM_SAMPLES)
    {
        value *= 2;
        ++numBits;
    }

    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    size_t sampleInt = 0;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        // x axis
        samples[i][0] = 0.0f;
        {
            size_t n = i >> truncateBits;
            float base = 1.0f / 2.0f;
            while (n)
            {
                if (n & 1)
                    samples[i][0] += base;
                n /= 2;
                base /= 2.0f;
            }
        }

        // y axis
        samples[i][1] = 0.0f;
        {
            size_t n = i >> truncateBits;
            size_t mask = size_t(1) << (numBits - 1 - truncateBits);

            float base = 1.0f / 2.0f;
            while (mask)
            {
                if (n & mask)
                    samples[i][1] += base;
                mask /= 2;
                base /= 2.0f;
            }
        }
    }


    // save bitmap etc
    char fileName[256];
    sprintf(fileName, "2DHammersley_%zu.bmp", truncateBits);
    Test2D(fileName, samples);
}

//======================================================================================
void TestRooks2D ()
{
    // make and shuffle rook positions
    std::random_device rd;
    std::mt19937 mt(rd());
    std::array<size_t, NUM_SAMPLES> rookPositions;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
        rookPositions[i] = i;
    std::shuffle(rookPositions.begin(), rookPositions.end(), mt);

    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        // x axis
        samples[i][0] = float(rookPositions[i]) / float(NUM_SAMPLES-1);

        // y axis
        samples[i][1] = float(i) / float(NUM_SAMPLES - 1);
    }

    // save bitmap etc
    Test2D("2DRooks.bmp", samples);
}

//======================================================================================
void TestIrrational2D (float irrationalx, float irrationaly, float seedx, float seedy)
{
    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    float samplex = seedx;
    float sampley = seedy;
    for (size_t i = 0; i < NUM_SAMPLES; ++i)
    {
        samplex = std::fmodf(samplex + irrationalx, 1.0f);
        sampley = std::fmodf(sampley + irrationaly, 1.0f);

        samples[i][0] = samplex;
        samples[i][1] = sampley;
    }

    // save bitmap etc
    char irrationalxStr[256];
    sprintf(irrationalxStr, "%f", irrationalx);
    char irrationalyStr[256];
    sprintf(irrationalyStr, "%f", irrationaly);
    char seedxStr[256];
    sprintf(seedxStr, "%f", seedx);
    char seedyStr[256];
    sprintf(seedyStr, "%f", seedy);
    char fileName[256];
    sprintf(fileName, "2DIrrational_%s_%s_%s_%s.bmp", &irrationalxStr[2], &irrationalyStr[2], &seedxStr[2], &seedyStr[2]);
    Test2D(fileName, samples);
}

//======================================================================================
float MinimumDistance2D (const std::array<std::array<float, 2>, NUM_SAMPLES>& samples, size_t numSamples, float x, float y)
{
    // Used by poisson.
    // This returns the minimum distance that point (x,y) is away from the sample points, from [0, numSamples).
    float minimumDistance = 0.0f;
    for (size_t i = 0; i < numSamples; ++i)
    {
        float distance = Distance(samples[i][0], samples[i][1], x, y);
        if (i == 0 || distance < minimumDistance)
            minimumDistance = distance;
    }
    return minimumDistance;
}

//======================================================================================
void TestPoisson2D ()
{
    // every time we want to place a point, we generate this many points and choose the one farthest away from all the other points (largest minimum distance)
    const size_t c_bestOfAttempts = 100;

    // calculate the sample points
    std::array<std::array<float, 2>, NUM_SAMPLES> samples;
    for (size_t sampleIndex = 0; sampleIndex < NUM_SAMPLES; ++sampleIndex)
    {
        // generate some random points and keep the one that has the largest minimum distance from any of the existing points
        float bestX = 0.0f;
        float bestY = 0.0f;
        float bestMinDistance = 0.0f;
        for (size_t attempt = 0; attempt < c_bestOfAttempts; ++attempt)
        {
            float attemptX = RandomFloat(0.0f, 1.0f);
            float attemptY = RandomFloat(0.0f, 1.0f);
            float minDistance = MinimumDistance2D(samples, sampleIndex, attemptX, attemptY);

            if (minDistance > bestMinDistance)
            {
                bestX = attemptX;
                bestY = attemptY;
                bestMinDistance = minDistance;
            }
        }
        samples[sampleIndex][0] = bestX;
        samples[sampleIndex][1] = bestY;
    }

    // save bitmap etc
    Test2D("2DPoisson.bmp", samples);
}

//======================================================================================
int main (int argc, char **argv)
{
    // 1D tests
    {
        TestUniform1D(false);
        TestUniform1D(true);

        TestUniformRandom1D();

        TestSubRandomA1D(2);
        TestSubRandomA1D(4);
        TestSubRandomA1D(8);
        TestSubRandomA1D(16);
        TestSubRandomA1D(32);

        TestSubRandomB1D();

        TestVanDerCorput(2);
        TestVanDerCorput(3);
        TestVanDerCorput(4);
        TestVanDerCorput(5);

        // golden ratio mod 1 aka (sqrt(5) - 1)/2
        TestIrrational1D(0.618034f, 0.0f);
        TestIrrational1D(0.618034f, 0.385180f);
        TestIrrational1D(0.618034f, 0.775719f);
        TestIrrational1D(0.618034f, 0.287194f);

        // sqrt(2) - 1
        TestIrrational1D(0.414214f, 0.0f);
        TestIrrational1D(0.414214f, 0.385180f);
        TestIrrational1D(0.414214f, 0.775719f);
        TestIrrational1D(0.414214f, 0.287194f);

        // PI mod 1
        TestIrrational1D(0.141593f, 0.0f);
        TestIrrational1D(0.141593f, 0.385180f);
        TestIrrational1D(0.141593f, 0.775719f);
        TestIrrational1D(0.141593f, 0.287194f);
        
        TestSobol1D();

        TestHammersley1D(0);
        TestHammersley1D(1);
        TestHammersley1D(2);

        TestPoisson1D();
    }

    // 2D tests
    {
        TestUniform2D(false);
        TestUniform2D(true);

        TestUniformRandom2D();

        TestSubRandomA2D(2, 2);
        TestSubRandomA2D(2, 3);
        TestSubRandomA2D(3, 11);
        TestSubRandomA2D(3, 97);

        TestSubRandomB2D();

        TestHalton(2, 3);
        TestHalton(5, 7);
        TestHalton(13, 9);

        TestSobol2D();

        TestHammersley2D(0);
        TestHammersley2D(1);
        TestHammersley2D(2);

        TestRooks2D();

        // X axis = golden ratio mod 1 aka (sqrt(5)-1)/2
        // Y axis = sqrt(2) mod 1
        TestIrrational2D(0.618034f, 0.414214f, 0.0f, 0.0f);
        TestIrrational2D(0.618034f, 0.414214f, 0.775719f, 0.264045f);

        // X axis = sqrt(2) mod 1
        // Y axis = sqrt(3) mod 1
        TestIrrational2D(std::fmodf((float)std::sqrt(2.0f), 1.0f), std::fmodf((float)std::sqrt(3.0f), 1.0f), 0.0f, 0.0f);
        TestIrrational2D(std::fmodf((float)std::sqrt(2.0f), 1.0f), std::fmodf((float)std::sqrt(3.0f), 1.0f), 0.775719f, 0.264045f);

        TestPoisson2D();
    }

    #if CALCULATE_DISCREPANCY
        printf("\n");
        system("pause");
    #endif
}

How to Train Neural Networks With Backpropagation

This post is an attempt to demystify backpropagation, which is the most common method for training neural networks. This post is broken into a few main sections:

Explanation
Working through examples
Simple sample C++ source code using only standard includes
Links to deeper resources to continue learning

Let’s talk about the basics of neural nets to start out, specifically multi layer perceptrons. This is a common type of neural network, and is the type we will be talking about today. There are other types of neural networks though such as convolutional neural networks, recurrent neural networks, Hopfield networks and more. The good news is that backpropagation applies to most other types of neural networks too, so what you learn here will be applicable to other types of networks.

Basics of Neural Networks

A neural network is made up layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of wikipedia. Every circle is a neuron. This network takes 3 floating point values as input, passes them through 4 neurons in a hidden layer and outputs two floating point values. The hidden layer neurons and the output layer neurons do processing of the values they are giving, but the input neurons do not.

To calculate the output value of a single neuron, you multiply every input into that neuron by a weight for that input, sum them up, and add a bias that is set for the neuron. This “weighted input” value is fed into an activation function and the result is the output value of that neuron. Here is a diagram for a single neuron:

The code for calculating the output of a single neuron could look like this:

float weightedInput = bias;

for (int i = 0; i < inputs.size(); ++i)
  weightedInput += inputs[i] * weights[i];

float output = Activation(weightedInput);

To evaluate an entire network of neurons, you just repeat this process for all neurons in the network, going from left to right (from input to output).

Neural networks are basically black boxes. We train them to give specific ouputs when we give them specific inputs, but it is often difficult to understand what it is that they’ve learned, or what part of the data they are picking up on.

Training a neural network just means that we adjust the weight and bias values such that when we give specific inputs, we get the desired outputs from the network. Being able to figure out what weights and biases to use can be tricky, especially for networks with lots of layers and lots of neurons per layer. This post talks about how to do just that.

Regarding training, there is a funny story where some people trained a neural network to say whether or not a military tank was in a photograph. It had a very high accuracy rate with the test data they trained it with, but when they used it with new data, it had terrible accuracy. It turns out that the training data was a bit flawed. Pictures of tanks were all taken on a sunny day, and the pictures without tanks were taken on a cloudy day. The network learned how to detect whether a picture was of a sunny day or a cloudy day, not whether there was a tank in the photo or not!

This is one type of pitfall to watch out for when dealing with neural networks – having good training data – but there are many other pitfalls to watch out for too. Architecting and training neural networks is quite literally an art form. If it were painting, this post would be teaching you how to hold a brush and what the primary colors are. There are many, many techniques to learn beyond what is written here to use as tools in your toolbox. The information in this post will allow you to succeed in training neural networks, but there is a lot more to learn to get higher levels of accuracy from your nets!

Neural Networks Learn Using Gradient Descent

Let’s take a look at a simple neural network where we’ve chosen random values for the weights and the bias:

If given two floating point inputs, we’d calculate the output of the network like this:

$Output = Activation(Input0 * Weight0 + Input1 * Weight1 + Bias)$

Plugging in the specific values for the weights and biases, it looks like this:

$Output = Activation(Input0 * 0.23 + Input1 * -0.1 + 0.3)$

Let’s say that we want this network to output a zero when we give an input of 1,0, and that we don’t care what it outputs otherwise. We’ll plug 1 and 0 in for Input0 and Input1 respectively and see what the output of the network is right now:

$Output = Activation(1* 0.23 + 0 * -0.1 + 0.3) \\ Output = Activation(0.53)$

For the activation function, we are going to use a common one called the sigmoid activation function, which is also sometimes called the logistic activation function. It looks like this:

$\sigma(x) = \frac{1}{1+e^{-x}}$

Without going into too much detail, the reason why sigmoid is so commonly used is because it’s a smoother and differentiable version of the step function.

Applying that activation function to our output neuron, we get this:

$Output = Activation(0.53) \\ Output = \sigma(0.53) \\ Output = 0.6295$

So, we plugged in 1 and 0, but instead of getting a 0 out, we got 0.6295. Our weights and biases are wrong, but how do we correct them?

The secret to correcting our weights and biases is whichever of these terms seem least scary to you: slopes, derivatives, gradients.

If “slope” was the least scary term to you, you probably remember the line formula $y=mx+b$ and that the m value was the “rise over run” or the slope of the line. Well believe it or not, that’s all a derivative is. A derivative is just the slope of a function at a specific point on that function. Even if a function is curved, you can pick a point on the graph and get a slope at that point. The notation for a derivative is $\frac{dy}{dx}$ , which literally means “change in y divided by change in x”, or “delta y divided by delta x”, which is literally rise over run.

In the case of a linear function (a line), it has the same derivative over the entire thing, so you can take a step size of any size on the x axis and multiply that step size by $\frac{dy}{dx}$ to figure out how much to add or subtract from y to stay on the line.

In the case of a non linear function, the derivative can change from one point to the next, so this slope is only guaranteed to be accurate for an infinitely small step size. In practice, people just often use “small” step sizes and calling it good enough, which is what we’ll be doing momentarily.

Now that you realize you already knew what a derivative is, we have to talk about partial derivatives. There really isn’t anything very scary about them and they still mean the exact same thing – they are the slope! They are even calculated the exact same way, but they use a fancier looking d in their notation: $\frac{\partial y}{\partial x}$ .

The reason partial derivatives even exist is because if you have a function of multiple variables like $z=f(x,y)=x^2+3y+2$ , you have two variables that you can take the derivative of. You can calculate $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ . The first value tells you how much the z value changes for a change in x, the second value tells you how much the z value changes for a change in y.

By the way, if you are curious, the partial derivatives for that function above are below. When calculating partial derivatives, any variable that isn’t the one you care about, you just treat as a constant and do normal derivation.

$\frac{\partial z}{\partial x} = 2x\\ \frac{\partial z}{\partial y} = 3\\$

If you put both of those values together into a vector $(\frac{\partial z}{\partial x},\frac{\partial z}{\partial y})$ you have what is called the gradient vector.

The gradient vector has an interesting property, which is that it points in the direction that makes the function output grow the most. Basically, if you think of your function as a surface, it points up the steepest direction of the surface, from the point you evaluated the function at.

We are going to use that property to train our neural network by doing the following:

Calculate the gradient of a function that describes the error in our network. This means we will have the partial derivatives of all the weights and biases in the network.
Multiply the gradient by a small “learning rate” value, such as 0.05
Subtract these scaled derivatives from the weights and biases to decrease the error a small amount.

This technique is called steepest gradient descent (SGD) and when we do the above, our error will decrease by a small amount. The only exception is that if we use too large of a learning rate, it’s possible that we make the error grow, but usually the error will decrease.

We will do the above over and over, until either the error is small enough, or we’ve decided we’ve tried enough iterations that we think the neural network is never going to learn the things we want to teach it. If the network doesn’t learn, it means it needs to be re-architected with a different structure, different numbers of neurons and layers, different activation functions, etc. This is part of the “art” that I mentioned earlier.

Before moving on, there is one last thing to talk about: global minimums vs local minimums.

Imagine that the function describing the error in our network is visualized as bumpy ground. When we initialize our weights and biases to random numbers we are basically just choosing a random location on the ground to start at. From there, we act like a ball, and just roll down hill from wherever we are. We are definitely going to get to the bottom of SOME bump / hole in the ground, but there is absolutely no reason to except that we’ll get to the bottom of the DEEPEST bump / hole.

The problem is that SGD will find a LOCAL minimum – whatever we are closest too – but it might not find the GLOBAL minimum.

In practice, this doesn’t seem to be too large of a problem, at least for people casually using neural nets like you and me, but it is one of the active areas of research in neural networks: how do we do better at finding more global minimums?

You might notice the strange language I’m using where I say we have a function that describes the error, instead of just saying we use the error itself. The function I’m talking about is called the “cost function” and the reason for this is that different ways of describing the error give us different desirable properties.

For instance, a common cost function is to use mean squared error of the actual output compared to the desired output.

For a single training example, you plug the input into the network and calculate the output. You then plug the actual output and the target output into the function below:

$Cost = ||target-output||^2$

In other words, you take the vector of the neuron outputs, subtract it from the actual output that we wanted, calculate the length of the resulting vector and square it. This gives you the squared error.

The reason we use squared error in the cost function is because this way error in either direction is a positive number, so when gradient descent does it’s work, we’ll find the smallest magnitude of error, regardless of whether it’s positive or negative amounts. We could use absolute value, but absolute value isn’t differentiable, while squaring is.

To handle calculating the cost of multiple inputs and outputs, you just take the average of the squared error for each piece of training data. This gives you the mean squared error as the cost function across all inputs. You also average the derivatives to get the combined gradient.

More on Training

Before we go into backpropagation, I want to re-iterate this point: Neural Networks Learn Using Gradient Descent.

All you need is the gradient vector of the cost function, aka the partial derivatives of all the weights and the biases for the cost.

Backpropagation gets you the gradient vector, but it isn’t the only way to do so!

Another way to do it is to use dual numbers which you can read about on my post about them: Multivariable Dual Numbers & Automatic Differentiation.

Using dual numbers, you would evaluate the output of the network, using dual numbers instead of floating point numbers, and at the end you’d have your gradient vector. It’s not quite as efficient as backpropagation (or so I’ve heard, I haven’t tried it), but if you know how dual numbers work, it’s super easy to implement.

Another way to get the gradient vector is by doing so numerically using finite differences. You can read about numerical derivatives on my post here: Finite Differences

Basically what you would do is if you were trying to calculate the partial derivative of a weight, like $\frac{\partial Cost}{\partial Weight0}$ , you would first calculate the cost of the network as usual, then you would add a small value to Weight0 and evaluate the cost again. You subtract the new cost from the old cost, and divide by the small value you added to Weight0. This will give you the partial derivative for that weight value. You’d repeat this for all your weights and biases.

Since realistic neural networks often have MANY MANY weights and biases, calculating the gradient numerically is a REALLY REALLY slow process because of how many times you have to run the network to get cost values with adjusted weights. The only upside is that this method is even easier to implement than dual numbers. You can literally stop reading and go do this right now if you want to 😛

Lastly, there is a way to train neural networks which doesn’t use derivatives or the gradient vector, but instead uses the more brute force-ish method of genetic algorithms.

Using genetic algorithms to train neural networks is a huge topic even to summarize, but basically you create a bunch of random networks, see how they do, and try combining features of networks that did well. You also let some of the “losers” reproduce as well, and add in some random mutation to help stay out of local minimums. Repeat this for many many generations, and you can end up with a well trained network!

Here’s a fun video visualizing neural networks being trained by genetic algorithms: Youtube: Learning using a genetic algorithm on a neural network

Backpropagation is Just the Chain Rule!

Going back to our talk of dual numbers for a second, dual numbers are useful for what is called “forward mode automatic differentiation”.

Backpropagation actually uses “reverse mode automatic differentiation”, so the two techniques are pretty closely tied, but they are both made possible by what is known as the chain rule.

The chain rule basically says that if you can write a derivative like this:

$dy/dx$

That you can also write it like this:

$dy/du*du/dx$

That might look weird or confusing, but since we know that derivatives are actual values, aka actual ratios, aka actual FRACTIONS, let’s think back to fractions for a moment.

$3/2 = 1.5$

So far so good? Now let’s choose some number out of the air – say, 5 – and do the same thing we did with the chain rule
$3/2 = \\ 3/5 * 5/2 = \\ 15/10 = \\ 3/2 = \\ 1.5$

Due to doing the reverse of cross cancellation, we are able to inject multiplicative terms into fractions (and derivatives!) and come up with the same answer.

Ok, but who cares?

Well, when we are evaluating the output of a neural network for given input, we have lots of equations nested in each other. We have neurons feeding into neurons feeding into neurons etc, with the logistic activation function at each step.

Instead of trying to figure out how to calculate the derivatives of the weights and biases for the entire monster equation (it’s common to have hundreds or thousands of neurons or more!), we can instead calculate derivatives for each step we do when evaluating the network and then compose them together.

Basically, we can break the problem into small bites instead of having to deal with the equation in it’s entirety.

Instead of calculating the derivative of how a specific weight affects the cost directly, we can instead calculate these:

dCost/dOutput: The derivative of how a neuron’s output affects cost
dOutput/dWeightedInput: The derivative of how the weighted input of a neuron affects a neuron’s output
dWeightedInput/dWeight: The derivative of how a weight affects the weighted input of a neuron

Then, when we multiply them all together, we get the real value that we want:
dCost/dOutput * dOutput/dWeightedInput * dWeightedInput/dWeight = dCost/dWeight

Now that we understand all the basic parts of back propagation, I think it’d be best to work through some examples of increasing complexity to see how it all actually fits together!

Backpropagation Example 1: Single Neuron, One Training Example

This example takes one input and uses a single neuron to make one output. The neuron is only trained to output a 0 when given a 1 as input, all other behavior is undefined. This is implemented as the Example1() function in the sample code.

Backpropagation Example 2: Single Neuron, Two Training Examples

Let’s start with the same neural network from last time:

This time, we are going to teach it not only that it should output 0 when given a 1, but also that it should output 1 when given a 0.

We have two training examples, and we are training the neuron to act like a NOT gate. This is implemented as the Example2() function in the sample code.

The first thing we do is calculate the derivatives (gradient vector) for each of the inputs.

We already calculated the “input 1, output 0” derivatives in the last example:
$\frac{\partial Cost}{\partial Weight} = 0.1476 \\ \frac{\partial Cost}{\partial Bias} = 0.1476$

If we follow the same steps with the “input 0, output 1” training example we get these:
$\frac{\partial Cost}{\partial Weight} = 0.0 \\ \frac{\partial Cost}{\partial Bias} = -0.0887$

To get the actual derivatives to train the network with, we just average them!
$\frac{\partial Cost}{\partial Weight} = 0.0738 \\ \frac{\partial Cost}{\partial Bias} = 0.0294$

From there, we do the same adjustments as before to the weight and bias values to get a weight of 0.2631 and a bias of 0.4853.

If you are wondering how to calculate the cost, again you just take the cost of each training example and average them. Adjusting the weight and bias values causes the cost to drop from 0.1547 to 0.1515, so we have made progress.

It takes 10 times as many iterations with these two training examples to get the same level of error as it did with only one training example though.

As we saw in the last example, after 10,000 iterations, the error was 0.007176.

In this example, after 100,000 iterations, the error is 0.007141. At that point, weight is -9.879733 and bias is 4.837278

Backpropagation Example 3: Two Neurons in One Layer

Here is the next example, implemented as Example3() in the sample code. Two input neurons feed to two neurons in a single layer giving two outputs.

Let’s look at how we’d calculate the derivatives needed to train this network using the training example that when we give the network 01 as input that it should give out 10 as output.

First comes the forward pass where we calculate the network’s output when we give it 01 as input.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0*0.2+1*0.8+0.5 \\ Z0=1.3 \\ \\ O0=\sigma(1.3) \\ O0=0.7858\\ \\ Z1=input0*weight2+input0*weight3+bias1\\ Z1=0*0.6+1*0.4+0.1\\ Z1=0.5\\ \\ O1=\sigma(0.5)\\ O1=0.6225$

Next we calculate a cost. We don’t strictly need to do this step since we don’t use this value during backpropagation, but this will be useful to verify that we’ve improved things after an iteration of training.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(1,0)-(0.7858,0.6225)||^2\\ Cost=0.5*||(0.2142,-0.6225)||^2\\ Cost=0.5*0.6583^2\\ Cost=0.2167$

Now we begin the backwards pass to calculate the derivatives that we’ll need for training.

Let’s calculate dCost/dZ0 aka the error in neuron 0. We’ll do this by calculating dCost/dO0, then dO0/dZ0 and then multiplying them together to get dCost/dZ0. Just like before, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias0.

$\frac{\partial Cost}{\partial O0}=O0-target0\\ \frac{\partial Cost}{\partial O0}=0.7858-1\\ \frac{\partial Cost}{\partial O0}=-0.2142\\ \\ \frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.7858 * 0.2142\\ \frac{\partial O0}{\partial Z0} = 0.1683\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = -0.2142 * 0.1683\\ \frac{\partial Cost}{\partial Z0} = -0.0360\\ \\ \frac{\partial Cost}{\partial Bias0} = -0.0360$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1 by multiplying it by dZ0/dWeight0 and dZ0/dWeight1, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0} \\ \frac{\partial Cost}{\partial Weight0} = -0.0360 * 0 \\ \frac{\partial Cost}{\partial Weight0} = 0\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1} \\ \frac{\partial Cost}{\partial Weight1} = -0.0360 * 1 \\ \frac{\partial Cost}{\partial Weight1} = -0.0360$

Next we need to calculate dCost/dZ1 aka the error in neuron 1. We’ll do this like before. We’ll calculate dCost/dO1, then dO1/dZ1 and then multiplying them together to get dCost/dZ1. Again, this is also the derivative for the bias of the neuron, so this value is also dCost/dBias1.

$\frac{\partial Cost}{\partial O1}=O1-target1\\ \frac{\partial Cost}{\partial O1}=0.6225-0\\ \frac{\partial Cost}{\partial O1}=0.6225\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.6225 * 0.3775\\ \frac{\partial O1}{\partial Z1} = 0.235\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.6225 * 0.235\\ \frac{\partial Cost}{\partial Z1} = 0.1463\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.1463$

Just like with neuron 0, we can use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3 by multiplying it by dZ1/dWeight2 and dZ1/dWeight2, which are input0 and input1 respectively.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2} \\ \frac{\partial Cost}{\partial Weight2} = 0.1463 * 0 \\ \frac{\partial Cost}{\partial Weight2} = 0\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3} \\ \frac{\partial Cost}{\partial Weight3} = 0.1463 * 1 \\ \frac{\partial Cost}{\partial Weight3} = 0.1463$

After using these derivatives to update the weights and biases with a learning rate of 0.5, they become:
Weight0 = 0.2
Weight1 = 0.818
Weight2 = 0.6
Weight3 = 0.3269
Bias0 = 0.518
Bias1 = 0.0269

Using these values, the cost becomes 0.1943, which dropped from 0.2167, so we have indeed made progress with our learning!

Interestingly, it takes about twice as many trainings as example 1 to get a similar level of error. In this case, 20,000 iterations of learning results in an error of 0.007142.

If we have the network learn the four patterns below instead:
00 = 00
01 = 10
10 = 10
11 = 11

It takes 520,000 learning iterations to get to an error of 0.007223.

Backpropagation Example 4: Two Layers, Two Neurons Each

This is the last example, implemented as Example4() in the sample code. Two input neurons feed to two neurons in a hidden layer, feeding into two neurons in the output layer giving two outputs. This is the exact same network that is walked through on this page which is also linked to at the end of this post: A Step by Step Backpropagation Example

First comes the forward pass where we calculate the network’s output. We’ll give it 0.05 and 0.1 as input, and we’ll say our desired output is 0.01 and 0.99.

$Z0=input0*weight0+input1*weight1+bias0 \\ Z0=0.05*0.15+0.1*0.2+0.35 \\ Z0=0.3775 \\ \\ O0=\sigma(0.3775) \\ O0=0.5933 \\ \\ Z1=input0*weight2+input1*weight3+bias1\\ Z1=0.05*0.25+0.1*0.3+0.35\\ Z1=0.3925\\ \\ O1=\sigma(0.3925)\\ O1=0.5969\\ \\ Z2=O0*weight4+O1*weight5+bias2\\ Z2=0.5933*0.4+0.5969*0.45+0.6\\ Z2=1.106\\ \\ O2=\sigma(1.106)\\ O2=0.7514\\ \\ Z3=O0*weight6+O1*weight7+bias3\\ Z3=0.5933*0.5+0.5969*0.55+0.6\\ Z3=1.225\\ \\ O3=\sigma(1.225)\\ O3=0.7729$

Next we calculate the cost, taking O2 and O3 as our actual output, and 0.01 and 0.99 as our target (desired) output.

$Cost=0.5*||target-actual||^2\\ Cost=0.5*||(0.01,0.99)-(0.7514,0.7729)||^2\\ Cost=0.5*||(-0.7414,-0.2171)||^2\\ Cost=0.5*0.7725^2\\ Cost=0.2984$

Now we start the backward pass to calculate the derivatives for training.

Neuron 2

First we’ll calculate dCost/dZ2 aka the error in neuron 2, remembering that the value is also dCost/dBias2.

$\frac{\partial Cost}{\partial O2}=O2-target0\\ \frac{\partial Cost}{\partial O2}=0.7514-0.01\\ \frac{\partial Cost}{\partial O2}=0.7414\\ \\ \frac{\partial O2}{\partial Z2} = O2 * (1-O2)\\ \frac{\partial O2}{\partial Z2} = 0.7514 * 0.2486\\ \frac{\partial O2}{\partial Z2} = 0.1868\\ \\ \frac{\partial Cost}{\partial Z2} = \frac{\partial Cost}{\partial O2} * \frac{\partial O2}{\partial Z2}\\ \frac{\partial Cost}{\partial Z2} = 0.7414 * 0.1868\\ \frac{\partial Cost}{\partial Z2} = 0.1385\\ \\ \frac{\partial Cost}{\partial Bias2} = 0.1385$

We can use dCost/dZ2 to calculate dCost/dWeight4 and dCost/dWeight5.

$\frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight4}\\ \frac{\partial Cost}{\partial Weight4} = \frac{\partial Cost}{\partial Z2} * O0\\ \frac{\partial Cost}{\partial Weight4} = 0.1385 * 0.5933\\ \frac{\partial Cost}{\partial Weight4} = 0.0822\\ \\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial Weight5}\\ \frac{\partial Cost}{\partial Weight5} = \frac{\partial Cost}{\partial Z2} * O1\\ \frac{\partial Cost}{\partial Weight5} = 0.1385 * 0.5969\\ \frac{\partial Cost}{\partial Weight5} = 0.0827\\$

Neuron 3

Next we’ll calculate dCost/dZ3 aka the error in neuron 3, which is also dCost/dBias3.

$\frac{\partial Cost}{\partial O3}=O3-target1\\ \frac{\partial Cost}{\partial O3}=0.7729-0.99\\ \frac{\partial Cost}{\partial O3}=-0.2171\\ \\ \frac{\partial O3}{\partial Z3} = O3 * (1-O3)\\ \frac{\partial O3}{\partial Z3} = 0.7729 * 0.2271\\ \frac{\partial O3}{\partial Z3} = 0.1755\\ \\ \frac{\partial Cost}{\partial Z3} = \frac{\partial Cost}{\partial O3} * \frac{\partial O3}{\partial Z3}\\ \frac{\partial Cost}{\partial Z3} = -0.2171 * 0.1755\\ \frac{\partial Cost}{\partial Z3} = -0.0381\\ \\ \frac{\partial Cost}{\partial Bias3} = -0.0381$

We can use dCost/dZ3 to calculate dCost/dWeight6 and dCost/dWeight7.

$\frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight6}\\ \frac{\partial Cost}{\partial Weight6} = \frac{\partial Cost}{\partial Z3} * O0\\ \frac{\partial Cost}{\partial Weight6} = -0.0381 * 0.5933\\ \frac{\partial Cost}{\partial Weight6} = -0.0226\\ \\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial Weight7}\\ \frac{\partial Cost}{\partial Weight7} = \frac{\partial Cost}{\partial Z3} * O1\\ \frac{\partial Cost}{\partial Weight7} = -0.0381 * 0.5969\\ \frac{\partial Cost}{\partial Weight7} = -0.0227\\$

Neuron 0

Next, we want to calculate dCost/dO0, but doing that requires us to do something new. Neuron 0 affects both neuron 2 and neuron 3, which means that it affects the cost through those two neurons as well. That means our calculation for dCost/dO0 is going to be slightly different, where we add the derivatives of both paths together. Let’s work through it:

$\frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O0} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O0}\\ \frac{\partial Cost}{\partial O0} = \frac{\partial Cost}{\partial Z2} * Weight4 + \frac{\partial Cost}{\partial Z3} * Weight6\\ \frac{\partial Cost}{\partial O0} = 0.1385 * 0.4 - 0.0381 * 0.5\\ \frac{\partial Cost}{\partial O0} = 0.0364$

We can then continue and calculate dCost/dZ0, which is also dCost/dBias0, and the error in neuron 0.

$\frac{\partial O0}{\partial Z0} = O0 * (1-O0)\\ \frac{\partial O0}{\partial Z0} = 0.5933 * 0.4067\\ \frac{\partial O0}{\partial Z0} = 0.2413\\ \\ \frac{\partial Cost}{\partial Z0} = \frac{\partial Cost}{\partial O0} * \frac{\partial O0}{\partial Z0}\\ \frac{\partial Cost}{\partial Z0} = 0.0364 * 0.2413\\ \frac{\partial Cost}{\partial Z0} = 0.0088\\ \\ \frac{\partial Cost}{\partial Bias0} = 0.0088$

We can use dCost/dZ0 to calculate dCost/dWeight0 and dCost/dWeight1.

$\frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight0}\\ \frac{\partial Cost}{\partial Weight0} = \frac{\partial Cost}{\partial Z0} * input0\\ \frac{\partial Cost}{\partial Weight0} = 0.0088 * 0.05\\ \frac{\partial Cost}{\partial Weight0} = 0.0004\\ \\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * \frac{\partial Z0}{\partial Weight1}\\ \frac{\partial Cost}{\partial Weight1} = \frac{\partial Cost}{\partial Z0} * input1\\ \frac{\partial Cost}{\partial Weight1} = 0.0088 * 0.1\\ \frac{\partial Cost}{\partial Weight1} = 0.0009\\$

Neuron 1

We are almost done, so hang in there. For our home stretch, we need to calculate dCost/dO1 similarly as we did for dCost/dO0, and then use that to calculate the derivatives of bias1 and weight2 and weight3.

$\frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * \frac{\partial Z2}{\partial O1} + \frac{\partial Cost}{\partial Z3} * \frac{\partial Z3}{\partial O1}\\ \frac{\partial Cost}{\partial O1} = \frac{\partial Cost}{\partial Z2} * Weight5 + \frac{\partial Cost}{\partial Z3} * Weight7\\ \frac{\partial Cost}{\partial O1} = 0.1385 * 0.45 - 0.0381 * 0.55\\ \frac{\partial Cost}{\partial O1} = 0.0414\\ \\ \frac{\partial O1}{\partial Z1} = O1 * (1-O1)\\ \frac{\partial O1}{\partial Z1} = 0.5969 * 0.4031\\ \frac{\partial O1}{\partial Z1} = 0.2406\\ \\ \frac{\partial Cost}{\partial Z1} = \frac{\partial Cost}{\partial O1} * \frac{\partial O1}{\partial Z1}\\ \frac{\partial Cost}{\partial Z1} = 0.0414 * 0.2406\\ \frac{\partial Cost}{\partial Z1} = 0.01\\ \\ \frac{\partial Cost}{\partial Bias1} = 0.01$

Lastly, we will use dCost/dZ1 to calculate dCost/dWeight2 and dCost/dWeight3.

$\frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight2}\\ \frac{\partial Cost}{\partial Weight2} = \frac{\partial Cost}{\partial Z1} * input0\\ \frac{\partial Cost}{\partial Weight2} = 0.01 * 0.05\\ \frac{\partial Cost}{\partial Weight2} = 0.0005\\ \\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * \frac{\partial Z1}{\partial Weight3}\\ \frac{\partial Cost}{\partial Weight3} = \frac{\partial Cost}{\partial Z1} * input1\\ \frac{\partial Cost}{\partial Weight3} = 0.01 * 0.1\\ \frac{\partial Cost}{\partial Weight3} = 0.001\\$

Backpropagation Done

Phew, we have all the derivatives we need now.

Here’s our new weights and biases using a learning rate of 0.5:

Weight0 = 0.15 – (0.5 * 0.0004) = 0.1498
Weight1 = 0.2 – (0.5 * 0.0009) = 0.1996
Weight2 = 0.25 – (0.5 * 0.0005) = 0.2498
Weight3 = 0.3 – (0.5 * 0.001) = 0.2995
Weight4 = 0.4 – (0.5 * 0.0822) = 0.3589
Weight5 = 0.45 – (0.5 * 0.0827) = 0.4087
Weight6 = 0.5 – (0.5 * -0.0226) = 0.5113
Weight7 = 0.55 – (0.5 * -0.0227) = 0.5614
Bias0 = 0.35 – (0.5 * 0.0088) = 0.3456
Bias1 = 0.35 – (0.5 * 0.01) = 0.345
Bias2 = 0.6 – (0.5 * 0.1385) = 0.5308
Bias3 = 0.6 – (0.5 * -0.0381) = 0.6191

Using these new values, the cost function value drops from 0.2984 to 0.2839, so we have made progress!

Interestingly, it only takes 5,000 iterations of learning for this network to reach an error of 0.007157, when it took 10,000 iterations of learning for example 1 to get to 0.007176.

Before moving on, take a look at the weight adjustments above. You might notice that the derivatives for the weights are much smaller for weights 0,1,2,3 compared to weights 4,5,6,7. The reason for this is because weights 0,1,2,3 appear earlier in the network. The problem is that earlier layer neurons don’t learn as fast as later layer neurons and this is caused by the nature of the neuron activation functions – specifically, that the sigmoid function has a long tail near 0 and 1 – and is called the “vanishing gradient problem”. The opposite effect can also happen however, where earlier layer gradients explode to super huge numbers, so the more general term is called the “unstable gradient problem”. This is an active area of research on how to address, and this becomes more and more of a problem the more layers you have in your network.

You can use other activation functions such as tanh, identity, relu and others to try and get around this problem. If trying different activation functions, the forward pass (evaluation of a neural network) as well as the backpropagation of error pass remain the same, but of course the calculation for getting O from Z changes, and of course, calculating the derivative deltaO/deltaZ becomes different. Everything else remains the same.

Check the links at the bottom of the post for more information about this!

Sample Code

Below is the sample code which implements all the back propagation examples we worked through above.

Note that this code is meant to be readable and understandable. The code is not meant to be re-usable or highly efficient.

A more efficient implementation would use SIMD instructions, multithreading, stochastic gradient descent, and other things.

It’s also useful to note that calculating a neuron’s Z value is actually a dot product and an addition and that the addition can be handled within the dot product by adding a “fake input” to each neuron that is a constant of 1. This lets you do a dot product to calculate the Z value of a neuron, which you can take further and combine into matrix operations to calculate multiple neuron values at once. You’ll often see neural networks described in matrix notation because of this, but I have avoided that in this post to try and make things more clear to programmers who may not be as comfortable thinking in strictly matrix notation.

#include <stdio.h>
#include <array>

// Nonzero value enables csv logging.
#define LOG_TO_CSV_NUMSAMPLES() 50

// ===== Example 1 - One Neuron, One training Example =====

void Example1RunNetwork (
	float input, float desiredOutput,
	float weight, float bias,
	float& error, float& cost, float& actualOutput,
	float& deltaCost_deltaWeight, float& deltaCost_deltaBias, float& deltaCost_deltaInput
) {
	// calculate Z (weighted input) and O (activation function of weighted input) for the neuron
	float Z = input * weight + bias;
	float O = 1.0f / (1.0f + std::exp(-Z));

	// the actual output of the network is the activation of the neuron
	actualOutput = O;

	// calculate error
	error = std::abs(desiredOutput - actualOutput);

	// calculate cost
	cost = 0.5f * error * error;

	// calculate how much a change in neuron activation affects the cost function
	// deltaCost/deltaO = O - target
	float deltaCost_deltaO = O - desiredOutput;

	// calculate how much a change in neuron weighted input affects neuron activation
	// deltaO/deltaZ = O * (1 - O)
	float deltaO_deltaZ = O * (1 - O);


	// calculate how much a change in a neuron's weighted input affects the cost function.
	// This is deltaCost/deltaZ, which equals deltaCost/deltaO * deltaO/deltaZ
	// This is also deltaCost/deltaBias and is also refered to as the error of the neuron
	float neuronError = deltaCost_deltaO * deltaO_deltaZ;
	deltaCost_deltaBias = neuronError;

	// calculate how much a change in the weight affects the cost function.
	// deltaCost/deltaWeight = deltaCost/deltaO * deltaO/deltaZ * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * deltaZ/deltaWeight
	// deltaCost/deltaWeight = neuronError * input
	deltaCost_deltaWeight = neuronError * input;


	// As a bonus, calculate how much a change in the input affects the cost function.
	// Follows same logic as deltaCost/deltaWeight, but deltaZ/deltaInput is the weight.
	// deltaCost/deltaInput = neuronError * weight
	deltaCost_deltaInput = neuronError * weight;
}

void Example1 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example1.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 10000;

	// training data
	// input: 1, output: 0
	const std::array<float, 2> c_trainingData = {1.0f, 0.0f};

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float error = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		// run the network to get error and derivatives
		float output = 0.0f;
		float cost = 0.0f;
		float deltaCost_deltaWeight = 0.0f;
		float deltaCost_deltaBias = 0.0f;
		float deltaCost_deltaInput = 0.0f;
		Example1RunNetwork(c_trainingData[0], c_trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, error, cost, weight, bias, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= deltaCost_deltaWeight * c_learningRate;
		bias -= deltaCost_deltaBias * c_learningRate;
	}

	printf("Example1 Final Error: %fn", error);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 2 - One Neuron, Two training Examples =====

void Example2 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example2.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost","weight","bias","dCost/dWeight","dCost/dBias","dCost/dInput"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 100000;

	// training data
	// input: 1, output: 0
	// input: 0, output: 1
	const std::array<std::array<float, 2>, 2> c_trainingData = { {
		{1.0f, 0.0f},
		{0.0f, 1.0f}
	} };

	// starting weight and bias values
	float weight = 0.3f;
	float bias = 0.5f;

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		avgError = 0.0f;
		float avgOutput = 0.0f;
		float avgCost = 0.0f;
		float avgDeltaCost_deltaWeight = 0.0f;
		float avgDeltaCost_deltaBias = 0.0f;
		float avgDeltaCost_deltaInput = 0.0f;

		// run the network to get error and derivatives for each training example
		for (const std::array<float, 2>& trainingData : c_trainingData)
		{
			float error = 0.0f;
			float output = 0.0f;
			float cost = 0.0f;
			float deltaCost_deltaWeight = 0.0f;
			float deltaCost_deltaBias = 0.0f;
			float deltaCost_deltaInput = 0.0f;
			Example1RunNetwork(trainingData[0], trainingData[1], weight, bias, error, cost, output, deltaCost_deltaWeight, deltaCost_deltaBias, deltaCost_deltaInput);

			avgError += error;
			avgOutput += output;
			avgCost += cost;
			avgDeltaCost_deltaWeight += deltaCost_deltaWeight;
			avgDeltaCost_deltaBias += deltaCost_deltaBias;
			avgDeltaCost_deltaInput += deltaCost_deltaInput;
		}

		avgError /= (float)c_trainingData.size();
		avgOutput /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		avgDeltaCost_deltaWeight /= (float)c_trainingData.size();
		avgDeltaCost_deltaBias /= (float)c_trainingData.size();
		avgDeltaCost_deltaInput /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f","%f","%f","%f","%f","%f",n", trainingIndex, avgError, avgCost, weight, bias, avgDeltaCost_deltaWeight, avgDeltaCost_deltaBias, avgDeltaCost_deltaInput);
			}
		#endif

		// adjust weights and biases
		weight -= avgDeltaCost_deltaWeight * c_learningRate;
		bias -= avgDeltaCost_deltaBias * c_learningRate;
	}

	printf("Example2 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 3 - Two inputs, two neurons in one layer =====

struct SExample3Training
{
	std::array<float, 2> m_input;
	std::array<float, 2> m_output;
};

void Example3RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 4>& weights, const std::array<float, 2>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 4>& deltaCost_deltaWeights, std::array<float, 2>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {

	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// the actual output of the network is the activation of the neurons
	actualOutput[0] = O0;
	actualOutput[1] = O1;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = O0 - target0
	float deltaCost_deltaO0 = O0 - desiredOutput[0];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = O1 - target1
	float deltaCost_deltaO1 = O1 - desiredOutput[1];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO0/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO/deltaZ1 * deltaZ0/deltaWeight1
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ/deltaWeight1
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example3 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example3.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 520000;

	// training data: OR/AND
	// input: 00, output: 00
	// input: 01, output: 10
	// input: 10, output: 10
	// input: 11, output: 11
	const std::array<SExample3Training, 4> c_trainingData = { {
		{{0.0f, 0.0f}, {0.0f, 0.0f}},
		{{0.0f, 1.0f}, {1.0f, 0.0f}},
		{{1.0f, 0.0f}, {1.0f, 0.0f}},
		{{1.0f, 1.0f}, {1.0f, 1.0f}},
	} };

	// starting weight and bias values
	std::array<float, 4> weights = { 0.2f, 0.8f, 0.6f, 0.4f };
	std::array<float, 2> biases = { 0.5f, 0.1f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		//float avgCost = 0.0f;
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaBiases = { 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 4> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaBiases = { 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example3RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example3 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

// ===== Example 4 - Two layers with two neurons in each layer =====

void Example4RunNetwork (
	const std::array<float, 2>& input, const std::array<float, 2>& desiredOutput,
	const std::array<float, 8>& weights, const std::array<float, 4>& biases,
	float& error, float& cost, std::array<float, 2>& actualOutput,
	std::array<float, 8>& deltaCost_deltaWeights, std::array<float, 4>& deltaCost_deltaBiases, std::array<float, 2>& deltaCost_deltaInputs
) {
	// calculate Z0 and O0 for neuron0
	float Z0 = input[0] * weights[0] + input[1] * weights[1] + biases[0];
	float O0 = 1.0f / (1.0f + std::exp(-Z0));

	// calculate Z1 and O1 for neuron1
	float Z1 = input[0] * weights[2] + input[1] * weights[3] + biases[1];
	float O1 = 1.0f / (1.0f + std::exp(-Z1));

	// calculate Z2 and O2 for neuron2
	float Z2 = O0 * weights[4] + O1 * weights[5] + biases[2];
	float O2 = 1.0f / (1.0f + std::exp(-Z2));

	// calculate Z3 and O3 for neuron3
	float Z3 = O0 * weights[6] + O1 * weights[7] + biases[3];
	float O3 = 1.0f / (1.0f + std::exp(-Z3));

	// the actual output of the network is the activation of the output layer neurons
	actualOutput[0] = O2;
	actualOutput[1] = O3;

	// calculate error
	float diff0 = desiredOutput[0] - actualOutput[0];
	float diff1 = desiredOutput[1] - actualOutput[1];
	error = std::sqrt(diff0*diff0 + diff1*diff1);

	// calculate cost
	cost = 0.5f * error * error;

	//----- Neuron 2 -----

	// calculate how much a change in neuron 2 activation affects the cost function
	// deltaCost/deltaO2 = O2 - target0
	float deltaCost_deltaO2 = O2 - desiredOutput[0];

	// calculate how much a change in neuron 2 weighted input affects neuron 2 activation
	// deltaO2/deltaZ2 = O2 * (1 - O2)
	float deltaO2_deltaZ2 = O2 * (1 - O2);

	// calculate how much a change in neuron 2 weighted input affects the cost function.
	// This is deltaCost/deltaZ2, which equals deltaCost/deltaO2 * deltaO2/deltaZ2
	// This is also deltaCost/deltaBias2 and is also refered to as the error of neuron 2
	float neuron2Error = deltaCost_deltaO2 * deltaO2_deltaZ2;
	deltaCost_deltaBiases[2] = neuron2Error;

	// calculate how much a change in weight4 affects the cost function.
	// deltaCost/deltaWeight4 = deltaCost/deltaO2 * deltaO2/deltaZ2 * deltaZ2/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * deltaZ/deltaWeight4
	// deltaCost/deltaWeight4 = neuron2Error * O0
	// similar thing for weight5
	deltaCost_deltaWeights[4] = neuron2Error * O0;
	deltaCost_deltaWeights[5] = neuron2Error * O1;

	//----- Neuron 3 -----

	// calculate how much a change in neuron 3 activation affects the cost function
	// deltaCost/deltaO3 = O3 - target1
	float deltaCost_deltaO3 = O3 - desiredOutput[1];

	// calculate how much a change in neuron 3 weighted input affects neuron 3 activation
	// deltaO3/deltaZ3 = O3 * (1 - O3)
	float deltaO3_deltaZ3 = O3 * (1 - O3);

	// calculate how much a change in neuron 3 weighted input affects the cost function.
	// This is deltaCost/deltaZ3, which equals deltaCost/deltaO3 * deltaO3/deltaZ3
	// This is also deltaCost/deltaBias3 and is also refered to as the error of neuron 3
	float neuron3Error = deltaCost_deltaO3 * deltaO3_deltaZ3;
	deltaCost_deltaBiases[3] = neuron3Error;

	// calculate how much a change in weight6 affects the cost function.
	// deltaCost/deltaWeight6 = deltaCost/deltaO3 * deltaO3/deltaZ3 * deltaZ3/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * deltaZ/deltaWeight6
	// deltaCost/deltaWeight6 = neuron3Error * O0
	// similar thing for weight7
	deltaCost_deltaWeights[6] = neuron3Error * O0;
	deltaCost_deltaWeights[7] = neuron3Error * O1;

	//----- Neuron 0 -----

	// calculate how much a change in neuron 0 activation affects the cost function
	// deltaCost/deltaO0 = deltaCost/deltaZ2 * deltaZ2/deltaO0 + deltaCost/deltaZ3 * deltaZ3/deltaO0
	// deltaCost/deltaO0 = neuron2Error * weight4 + neuron3error * weight6
	float deltaCost_deltaO0 = neuron2Error * weights[4] + neuron3Error * weights[6];

	// calculate how much a change in neuron 0 weighted input affects neuron 0 activation
	// deltaO0/deltaZ0 = O0 * (1 - O0)
	float deltaO0_deltaZ0 = O0 * (1 - O0);

	// calculate how much a change in neuron 0 weighted input affects the cost function.
	// This is deltaCost/deltaZ0, which equals deltaCost/deltaO0 * deltaO0/deltaZ0
	// This is also deltaCost/deltaBias0 and is also refered to as the error of neuron 0
	float neuron0Error = deltaCost_deltaO0 * deltaO0_deltaZ0;
	deltaCost_deltaBiases[0] = neuron0Error;

	// calculate how much a change in weight0 affects the cost function.
	// deltaCost/deltaWeight0 = deltaCost/deltaO0 * deltaO0/deltaZ0 * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * deltaZ0/deltaWeight0
	// deltaCost/deltaWeight0 = neuron0Error * input0
	// similar thing for weight1
	deltaCost_deltaWeights[0] = neuron0Error * input[0];
	deltaCost_deltaWeights[1] = neuron0Error * input[1];

	//----- Neuron 1 -----

	// calculate how much a change in neuron 1 activation affects the cost function
	// deltaCost/deltaO1 = deltaCost/deltaZ2 * deltaZ2/deltaO1 + deltaCost/deltaZ3 * deltaZ3/deltaO1
	// deltaCost/deltaO1 = neuron2Error * weight5 + neuron3error * weight7
	float deltaCost_deltaO1 = neuron2Error * weights[5] + neuron3Error * weights[7];

	// calculate how much a change in neuron 1 weighted input affects neuron 1 activation
	// deltaO1/deltaZ1 = O1 * (1 - O1)
	float deltaO1_deltaZ1 = O1 * (1 - O1);

	// calculate how much a change in neuron 1 weighted input affects the cost function.
	// This is deltaCost/deltaZ1, which equals deltaCost/deltaO1 * deltaO1/deltaZ1
	// This is also deltaCost/deltaBias1 and is also refered to as the error of neuron 1
	float neuron1Error = deltaCost_deltaO1 * deltaO1_deltaZ1;
	deltaCost_deltaBiases[1] = neuron1Error;

	// calculate how much a change in weight2 affects the cost function.
	// deltaCost/deltaWeight2 = deltaCost/deltaO1 * deltaO1/deltaZ1 * deltaZ1/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * deltaZ2/deltaWeight2
	// deltaCost/deltaWeight2 = neuron1Error * input0
	// similar thing for weight3
	deltaCost_deltaWeights[2] = neuron1Error * input[0];
	deltaCost_deltaWeights[3] = neuron1Error * input[1];

	//----- Input -----

	// As a bonus, calculate how much a change in the inputs affect the cost function.
	// A complication here compared to Example1 and Example2 is that each input affects two neurons instead of only one.
	// That means that...
	// deltaCost/deltaInput0 = deltaCost/deltaZ0 * deltaZ0/deltaInput0 + deltaCost/deltaZ1 * deltaZ1/deltaInput0
	//                       = neuron0Error * weight0 + neuron1Error * weight2
	// and
	// deltaCost/deltaInput1 = deltaCost/deltaZ0 * deltaZ0/deltaInput1 + deltaCost/deltaZ1 * deltaZ1/deltaInput1
	//                       = neuron0Error * weight1 + neuron1Error * weight3
	deltaCost_deltaInputs[0] = neuron0Error * weights[0] + neuron1Error * weights[2];
	deltaCost_deltaInputs[1] = neuron0Error * weights[1] + neuron1Error * weights[3];
}

void Example4 ()
{
	#if LOG_TO_CSV_NUMSAMPLES() > 0
		// open the csv file for this example
		FILE *file = fopen("Example4.csv","w+t");
		if (file != nullptr)
			fprintf(file, ""training index","error","cost"n");
	#endif

	// learning parameters for the network
	const float c_learningRate = 0.5f;
	const size_t c_numTrainings = 5000;

	// training data: 0.05, 0.1 in = 0.01, 0.99 out
	const std::array<SExample3Training, 1> c_trainingData = { {
		{{0.05f, 0.1f}, {0.01f, 0.99f}},
	} };

	// starting weight and bias values
	std::array<float, 8> weights = { 0.15f, 0.2f, 0.25f, 0.3f, 0.4f, 0.45f, 0.5f, 0.55f};
	std::array<float, 4> biases = { 0.35f, 0.35f, 0.6f, 0.6f };

	// iteratively train the network
	float avgError = 0.0f;
	for (size_t trainingIndex = 0; trainingIndex < c_numTrainings; ++trainingIndex)
	{
		std::array<float, 2> avgOutput = { 0.0f, 0.0f };
		std::array<float, 8> avgDeltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 4> avgDeltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
		std::array<float, 2> avgDeltaCost_deltaInputs = { 0.0f, 0.0f };
		avgError = 0.0f;
		float avgCost = 0.0;

		// run the network to get error and derivatives for each training example
		for (const SExample3Training& trainingData : c_trainingData)
		{
			float error = 0.0f;
			std::array<float, 2> output = { 0.0f, 0.0f };
			float cost = 0.0f;
			std::array<float, 8> deltaCost_deltaWeights = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 4> deltaCost_deltaBiases = { 0.0f, 0.0f, 0.0f, 0.0f };
			std::array<float, 2> deltaCost_deltaInputs = { 0.0f, 0.0f };
			Example4RunNetwork(trainingData.m_input, trainingData.m_output, weights, biases, error, cost, output, deltaCost_deltaWeights, deltaCost_deltaBiases, deltaCost_deltaInputs);

			avgError += error;
			avgCost += cost;
			for (size_t i = 0; i < avgOutput.size(); ++i)
				avgOutput[i] += output[i];
			for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
				avgDeltaCost_deltaWeights[i] += deltaCost_deltaWeights[i];
			for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
				avgDeltaCost_deltaBiases[i] += deltaCost_deltaBiases[i];
			for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
				avgDeltaCost_deltaInputs[i] += deltaCost_deltaInputs[i];
		}

		avgError /= (float)c_trainingData.size();
		avgCost /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgOutput.size(); ++i)
			avgOutput[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaWeights.size(); ++i)
			avgDeltaCost_deltaWeights[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaBiases.size(); ++i)
			avgDeltaCost_deltaBiases[i] /= (float)c_trainingData.size();
		for (size_t i = 0; i < avgDeltaCost_deltaInputs.size(); ++i)
			avgDeltaCost_deltaInputs[i] /= (float)c_trainingData.size();

		#if LOG_TO_CSV_NUMSAMPLES() > 0
			const size_t trainingInterval = (c_numTrainings / (LOG_TO_CSV_NUMSAMPLES() - 1));
			if (file != nullptr && (trainingIndex % trainingInterval == 0 || trainingIndex == c_numTrainings - 1))
			{
				// log to the csv
				fprintf(file, ""%zi","%f","%f"n", trainingIndex, avgError, avgCost);
			}
		#endif

		// adjust weights and biases
		for (size_t i = 0; i < weights.size(); ++i)
			weights[i] -= avgDeltaCost_deltaWeights[i] * c_learningRate;
		for (size_t i = 0; i < biases.size(); ++i)
			biases[i] -= avgDeltaCost_deltaBiases[i] * c_learningRate;
	}

	printf("Example4 Final Error: %fn", avgError);

	#if LOG_TO_CSV_NUMSAMPLES() > 0
		if (file != nullptr)
			fclose(file);
	#endif
}

int main (int argc, char **argv)
{
	Example1();
	Example2();
	Example3();
	Example4();
	system("pause");
	return 0;
}

Closing & Links

The sample code outputs csv files showing how the values of the networks change over time. One of the reasons for this is because I want to show you error over time.

Below is example 4’s error over time, as we do it’s 5,000 learning iterations.

The other examples show a similarly shaped graph, where there is a lot of learning in the very beginning, and then there is a very long tail of learning very slowly.

When you train neural networks as I’ve described them, you will almost always see this, and sometimes will also see a slow learning time at the BEGINNING of the training.

This issue is also due to the activation function used, just like the unstable gradient problem, and is also an active area of research.

To help fix this issue, there is something called a “cross entropy cost function” which you can use instead of the mean squared error cost function I have been using.

That cost function essentially cancels out the non linearity of the activation function so that you get nicer linear learning progress, and can get networks to learn more quickly and evenly. However, it only cancels out the non linearity for the LAST layer in the network. This means it’s still a problem for networks that have more layers.

Lastly, there is an entirely different thing you can use backpropagation for. We adjusted the weights and biases to get our desired output for the desired inputs. What if instead we adjusted our inputs to give us the desired outputs?

You can do that by using backpropagation to calculate the dCost/dInput derivatives and using those to adjust the input, in the exact same way we adjusted the weights and biases.

You can use this to do some interesting things, including:

finding images that a network will recognize as a familiar object, that a human wouldn’t. Start with static as input to the network, and adjust inputs to give the desired output.
Modifying images that a network recognizes, into images it doesn’t recognize, but a human would. Start with a well recognized image, and adjust inputs using gradient ASCENT (add the derivatives, don’t subtract them) until the network stops recognizing it.

Believe it or not, this is how all those creepy “deep dream” images were made that came out of google as well, like the one below.

Now that you know the basics, you are ready to learn some more if you are interested. If you still have some questions about things I did or didn’t talk about, these resources might help you make sense of it too. I used these resources and they were all very helpful! You can also give me a shout in the comments below, or on twitter at @Atrix256.

A Step by Step Backpropagation Example
Neural Networks and Deep Learning
Backpropogation is Just Steepest Descent with Automatic Differentiation
Chain Rule
Deep Vis