# Dice, Distributions & Noise Colors

Thank you Uncle Chris and Aunt Lily for sending me this gigantic 120 sided dice!

A while back I came across an interesting page that talks about some fundamental concepts relating to noise. It isn’t very long or hard to read, but has some great info.

Dither noise probability density explained
https://www.digido.com/ufaqs/dither-noise-probability-density-explained/

The main points it makes are:

1. White noise just means the noise values are drawn independently and have no correlation to each other. Blue and red noise (and other noise colors) are correlated samples.
2. The distribution (histogram) the values are generated / drawn from is completely unrelated to the “color”.

I wanted to go through the experiments it mentioned to understand what it was talking about a bit better. This post does that and explores a few other related experiments and algorithms.

You can find the code that goes with this at https://github.com/Atrix256/NoiseDims

# Intro Image Explanation

Thanks to my Uncle Chris and Aunt Lily for sending me that gigantic 120 sided dice. That thing is epic.

You’d need to roll those 20 dice next to it to equal the massive THAC0 power than it has. But as you’ll see in this post, doing so would give you a Gaussian distribution, and not all numbers would come up with equal probability or even be possible. Try rolling a 1 with 20 six sided dice!

The black dice in the background are some cool diablo themed dice I got one year from Blizzcon.

Seeing as the dice are not magnetic, or quantumly entangled, rolling these dice should give you uncorrelated values (white noise), but I can’t vouch for their histograms (are they fair dice or not?).

BTW i am taking this picture from inside my hotel room in LA during SIGGRAPH, instead of going out and being social. This is what real programmers do right? Ditch social engagements to take pictures of dice and compare histograms and DFTs? This post has been sitting in my drafts for far too long and just really needed to get finished 😛

Onto the post!

# Testing Methods

There are two main tools used in this post to understand noise patterns better.

The first is the histogram:

You do your process N times to get N random numbers and keep track of how many times each number came up. For a small number of rolls, the histogram will likely be erratic, but for a larger number of rolls, it will take the shape of the distribution of the numbers.

The second is the magnitude component of the Fourier transform of the rolls:

You do your process N times to get N random numbers. Do a discrete Fourier transform of that sequence of numbers to see the frequencies involved and extract the magnitude of each frequency (ignoring the phase). Just like the histogram, A small number of rolls will likely be erratic, but a larger number of rolls will show more clearly the actual frequencies present in the number sequence.

So in short: The histogram shows the distribution, and the Fourier transform shows the noise color.

# White Rectangular Noise

White rectangular noise is “regular old random numbers”. Roll a die, write down the value. Rinse and repeat.

It has the properties that…

1. The samples have no correlation to each other.
2. Every possible value has the same probability of being chosen.

The example code rolls a 6 sided dice a number of different times. Here is the result it gets for rolling the dice 16 times:
2, 0, 2, 2, 0, 3, 2, 4, 2, 2, 0, 5, 0, 3, 1, 3

There sure are a lot of 2’s aren’t there? You might not look at this sequence and think it’s very random, but it is.

This is just how white noise works; it has clumps and voids and is in general pretty uneven.

At the limit (large sample counts) it evens out. Check out the image below to see how the histogram gets flatter as the number of rolls gets higher. This shows that it is a rectangular distribution. (Note: The 8192 x 100 line is the average of 100 different 8192 runs)

From a frequency point of view, since it’s white noise, it should have all frequencies present in equal amounts.

At 16 samples, the frequency magnitudes are a bit wild:

At 1024, they are flattening a bit:

At 8192, even more so:

When averaging 100 runs of 8192, it cleans up a lot and shows roughly the same magnitude for all frequencies, which shows that the noise color is white and the rolls are uncorrelated.

# White Triangular and Gaussian Noise

If you roll two dice and add them, you get white triangular noise. You can also subtract one from the other and get white triangular noise.

If you involve more than two dice and do additions and subtractions of various amounts, the triangle rounds out and starts to look more like a normal aka Gaussian distribution. If you rolled an infinite number of dice, it would be Gaussian. This is due to the central limit theorem (https://en.wikipedia.org/wiki/Central_limit_theorem)

Doing these things, the dice rolls are not correlated with each other so they are still white noise.

Adding two dice (0-5) together, gives a result 0-10. Here’s the histogram of 1024 rolls:

Averaging 100 different 8192 sets of rolls gives this much cleaner histogram:

And when looking at the frequencies (averaging 100 different 8192 rolls), it still looks white, despite having a different histogram / distribution.

Adding three dice together gives results 0-15. The histogram for 1024 rolls still looks pretty lumpy:

But it cleans up when averaging 100 different 8192 rolls and looks pretty gaussian, since there are 3 dice involved instead of only 2.

The frequencies look the same (white) so I won’t even bother showing them.

Here’s the histogram of 100 different 8192 rolls, using 20 dice each roll and summing them up. The frequencies look white still, so i won’t bother showing them.

Rolling two dice and subtracting one from the other looks the exact same as adding them together, except the histogram goes from -5 to +5, instead of from 0 to 10.

You can generalize this subtraction to a larger number of dice by summing up all the dice, but multiplying half of them by -1 before doing the summing.

If doing an odd number of dice in this setup, it looks different in that one side of the Gaussian tail is chopped off, like below, but still looks white in frequency space. 3 dice were used here, where two dice were multiplied by -1 before summing.

Sure you can roll a lot of dice and add them together to get white Gaussian noise, but you can also generate Gaussian numbers directly (check this out to see one way: https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform). If you do this and write down the numbers, the histogram looks Gaussian as you’d expect, and amazingly (or not amazingly), it still looks perfectly white in frequency space. I won’t bother showing the images, but I did write the code for it and verify, so you can check out the output of the code that goes with this post for yourself if you’d like!

# Red Triangular and Gaussian Noise

Let’s say you roll two dice and add them up. This is triangular white noise like we saw above.

If you name your dice A and B, you can now reroll A and add them again to get the second number. You then reroll B and add them again to get the third number. You then reroll A and add them again to get the fourth number, and so on. This is triangular red noise!

By only rerolling one of the dice, you are low pass filtering the white noise and getting red noise.

In more plain terms, by only re-rolling one of the dice each time, the sum is going to tend to be more similar to the last sum. That makes for random numbers that have stronger low frequency components.

Here are the results from the program of doing this 16 times:
2, 4, 2, 3, 5, 6, 6, 4, 2, 5, 5, 3, 4, 4, 8, 7

As you can see, the numbers are usually pretty similar to the numbers around them, which is different than the white noise case. This is red noise.

Doing this 8192 times, the histogram looks like a wiggly triangle as the image below shows, but averaging 100 runs of 8192 samples, it looks like a perfect triangle (not shown).

Looking at it in frequency space, here’s 16 samples:

1024:

8192:

8192 x100 averaged:

So, the histogram shows a triangular distribution, and the frequency space shows low frequency content – the part in the middle is the low frequencies and that’s where all the amplitude is. We have triangular red noise!

What if we use a different number of dice?

Here is the histogram and DFT for having 3 dice, and rerolling one each time (cycling through which die is rerolled). These are the averaged 8192 x100 rolls. The histogram shows a more gaussian distribution, and something odd is going on with the frequencies.

Here is the same for 4 dice.

Here it is for 10 dice.

And here it is for 20 dice.

So, as we use more dice, the histogram gets more Gaussian, but for every dice added, we also get “half a hump” in the frequency space. The noise is undeniably red though, as the largest peaks are in the middle, where the low frequencies are.

If you are curious, here’s 16 rolls of 20 dice being used to make red noise, which means only re-rolling one dice each roll. It is very “random walk”-ish and the numbers don’t stray too far from 50, which is the expected value.
49, 49, 51, 51, 53, 55, 56, 55, 54, 54, 57, 57, 62, 61, 61, 62

You might think that those numbers are on their way up, but here are the next 16 values it gives to show you that isn’t the case.
59, 59, 57, 52, 51, 55, 55, 54, 55, 51, 52, 53, 57, 58, 59, 58

All this time, we’ve only been rerolling one die. What if we re-roll a different number of dice?

Here we roll 20 dice, but reroll 5 each time (cycling through the dice). We did this 8192 times, and averaged the results of doing that 100 times. In frequency space it looks a lot like our 4 dice case. I guess that isn’t too surprising if you think about us re-rolling 1/4th of the dice in both cases.

Here we reroll 10 dice each time. In frequency space it looks like when we rolled 2 dice. Again, maybe not surprising, because we are rerolling 1/2 of the dice now just like before. We do get the nicer Gaussian distribution though of using 20 dice though.

Rerolling 15 dice each time, the histogram is unaffected, but here are the frequencies:

Here we reroll 19 dice:

So, taking some empirical observations it looks like when summing dice…

Using more dice makes the distribution more Gaussian (thanks central limit theorem!)

Rerolling less of the dice each time makes for stronger low frequency content, while rerolling more dice makes the result look more like white noise. That makes some intuitive sense i think.

The color of the noise here seems to be directly related to what PERCENTAGE of the dice you re-roll, without caring how many actual dice you used.

Using these 2 things together, you can craft the distribution (histogram) separately from how red you want the noise to be, which is pretty neat.

It’s strange too though, that as you re-roll more dice, it looks sort of like you are zooming into the center of the graph, in frequency space. In this way, you only get a flat line (white noise) at the limit (infinity). That makes sense to me because if you are only rerolling one dice out of N dice total, it’s only at the limit of N approaching infinity, that your noise will be purely white. Anything short of that will make slightly red noise.

Before moving on, I wanted to mention that this works very similarly when using Gaussian distributions. Here is 20 Gaussian numbers summed, with 1 “rerolled” each time. 8192 rolls, results are the average of 100 runs. Mean of 0 and Sigma of 1.

By the way, if you ever wanted to generate red rectangular noise, you should look into Brownian motion (aka Brown noise. Named after Robert Brown. Total coincidence that his name is a color too. Brown noise is red), which is just a random walk. You’d use a rectangular distribution random number generator to make small perturbations on a path. In effect, you would be low pass filtering the white noise, which gives you red noise.

# Blue Triangular and Gaussian Noise

Let’s say you roll two dice and subtract one from the other. This is triangular white noise like we saw above.

If you name your dice A and B, you can now reroll A and and take your second number as A-B. You can then reroll B and take B-A as your third number. You then reroll A and take A-B as your fourth number, and so on. This is triangular blue noise!

By only rerolling one of the dice, and flipping which dice is subtracted from the other, you are high pass filtering the white noise and getting blue noise.

In more plain terms, by only re-rolling one of the dice each time, but flipping the sign of the dice, the sum is going to tend to be more different to the last value. That makes for random numbers that have stronger high frequency components.

Here are the results from the program of doing this 16 times:
2, 0, -2, 3, -1, 2, -2, 0, -2, 5, -5, 3, -2, 2, 2, -3

As you can see, the numbers are usually pretty different to the numbers around them, which is different than the white noise case. This is blue noise.

Just like red noise, where you add the dice instead of subtract, this also gives you a triangle distribution, but the triangle is centered on zero. At 8192 rolls, the histogram looks like a wobbly triangle like below, but averaging 100 different runs of 8192 rolls it looks like a perfect triangle (not shown).

Here it is in frequency space for 16 samples:

Here is 1024:

Here is 8192:

Here is the average of 100 different 8192 runs:

So, the histogram shows a triangular distribution, and the frequency space shows high frequency content. We have triangular blue noise!

What if we use a different number of dice, like we did for red?

Here’s what we are going to do… we’ll roll N dice to start out, multiply every even numbered die by -1 and sum them up to get our first number. We will then reroll die 0, multiply every odd numbered die by -1 and sum them up to get our second number. We will then reroll die 1, multiply every even numbered die by -1 and sum them up to get our third number. Repeat this process to get as many numbers as you want.

The histogram gets more Gaussian for higher numbers of dice, just like it did for red noise, so I will only show the frequency information for different numbers of dice. All results are the average of 100 different 8192 sequences.

Here it is with 3 dice:

Here it is with 4 dice:

Here it is with 10 dice:

Here it is with 20 dice:

Just like with red noise, for every dice added, we also get “half a hump” in the frequency space. The noise is undeniably blue though, as the largest peaks are on the sides, where the high frequencies are.

If you are curious, here’s 16 rolls of 20 dice being used to make blue noise, which means only re-rolling one dice each roll. Every other number flips sign just about, except near the end where a 3 and a 2 are next to each other.
-9, 9, -7, 7, -5, 7, -6, 5, -6, 6, -3, 3, 2, -3, 3, -2

If you think the numbers are trending toward zero, you are right, but it’s just a short term anomaly. Here are the next 16:
-1, 1, -3, -2, 1, 3, -3, 2, -1, -3, 4, -3, 7, -6, 7, -8

If you wanted the numbers to be strictly positive, or to be within specific values, you could shift (add) and scale them to be in the range you wanted.

All this time, we’ve only been rerolling one die. What if we re-roll a different number of dice, like we looked at with red noise?

Here we roll 20 dice, but reroll 5 each time (cycling through the dice). We did this 8192 times, and averaged the results of doing that 100 times. In frequency space it looks a lot like our 4 dice case. We had that same result with red noise.

Here it is rerolling 10 dice each time, which looks a lot like our 2 dice case.

Here we reroll 15 dice each time.

Here we reroll 19 out of the 20 dice each time, which looks nearly like white noise.

The histogram was unaffected by all of this – at 20 dice it is very Gaussian –
but the frequency make up changed quite a bit. As we rerolled more dice, it started to look more like white noise, which is the same as we saw with the red noise.

I won’t show it, because it’s redundant and there are a lot of images on this post already, but this method of making blue noise also works when sampling from a Gaussian distribution, just like it did with red noise.

If you ever wanted to generate blue rectangular noise, you should look into Mitchell’s best candidate algorithm, using it in 1D. (https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/)

You could also look into low discrepancy sequences, which have some similar properties to blue noise (depending on what you are looking for) but are deterministic instead of randomized.

# Purple Triangular and Gaussian Noise

What happens if we make blue noise and add red noise to it? Would we get purple noise? Yes we would 😛

Purple noise is band stop filtered white noise. It has low and high frequency components but doesn’t have much in the way of middle frequency components.

We’ll be doing 20 dice red noise with 1 reroll, and 20 dice blue noise with 1 reroll, and adding them together to get purple noise.

Here are 32 purple noise values made with the process above:
52, 47, 50, 52, 53, 59, 52, 59, 50, 55, 55, 59, 61, 58, 66, 57, 59, 59, 57, 51, 51, 54, 59, 52, 54, 50, 53, 50, 61, 53, 63, 56

Now we’ll do this 8192 times to get that many numbers, and average the result of 100 such runs, to show you the histograms and frequency amplitudes.

Here is the histogram and frequency amplitudes of 16 rolls:

Here is 1024 rolls:

Here is 8192 rolls:

Here is the average of 100 different 8192 rolls:

The histogram is very rough, but does settle into a Gaussian curve like it should. The frequency components show strong low and high frequencies, but not much in the middle.

We could play around with re-rolling different amounts of dice, and different amounts of dice for red vs blue, but I’ll leave that to you if you are curious about it 😛

# Purplish Noise

So interestingly, if you generate N blue noise numbers and N red noise numbers, you can lerp between the number sets. Like the first number could be 25% blue and 75% red.

Doing this lets you control how red or how blue the numbers are.

Following are the histograms and DFT magnitude data for red noise made with 20 dice and 1 reroll, lerped towards blue noise made with 20 dice and 1 reroll. These are the average of 100 different 8192 runs.

First, 1% red, 99% blue. There is a strange spike near zero in the histogram I can’t explain, but I don’t think is a bug in the program. If you know what is causing it, let me know!

Next is 25% red, 75% blue.

Next is 50/50, which looks like regular old purple noise.

Here is 75% red, 25% blue.

Lastly here is 99% red, 1% blue.

Before moving on, what happens if we take the max of blue and red noise? This is kind of a throw back to the post that talks about what happens when you take the max of two (uniform white) random numbers. https://blog.demofox.org/2019/06/01/taking-the-max-of-uniform-random-numbers/

Sadly, it just makes red noise with a Gaussian distribution. Nothing special there are far as I can see.

# Green Triangular and Gaussian Noise

Green noise is the opposite of purple noise and is band pass filtered white noise. Green noise doesn’t have much high or low frequency to it, but it has middle frequencies.

I wasn’t sure how to generate green noise so asked online to see if anyone knew how or had any ideas.

https://twitter.com/eigenbom (Ben Porter) had an idea of basically mixing blue and red techniques together.

Rolling 20 dice (0 to 5), rerolling one die each roll, and flipping the signs of dice every TWO rolls instead of every roll, you do get a nice green noise as you can see in this DFT below. The histogram was Gaussian.

Trying to generalize this idea, I tried changing the “sign flip period” to other values. That didn’t generalize it very well and I got some strange results. If you are curious, check out the output of the program!

https://twitter.com/thingcreator (@Ray) also had some thoughts on how to generate blue noise, which was more based on some deep DSP ideas.

The first idea is to interleave zeros into blue noise, to decrease the high frequency amplitudes of blue noise to lower frequencies.

Doing that screws with the histogram, since half of the values are zeros, and also screws with the randomness since every other value is zero.

But, it does make some nice green noise!

Ray had another idea. Instead of zeros, you could just interleave two independent blue noise streams. This doesn’t screw with the histogram or the randomness like adding zeros does.

I do that by generating blue noise, chopping it in half, and then interleaving the halves. Doing that gives the expected Gaussian histrogram, and still gives a very nice green noise DFT.

What if you do that “chop in half and interleave” multiple times? The histogram will stay the same because the numbers used are all the same, but the frequencies should change.

Here it is being done twice instead of once:

Here it is 3 times:

Here it is 10 times:

Here are 32 green noise numbers using Ray’s technique of interleaving two blue noise streams. You can see how it has both rapid change (blue noise qualities) and similar numbers (red noise qualities).
-9, 1, 9, 4, -7, -5, 7, 1, -5, 0, 7, -2, -6, 1, 5, -5, -6, 5, 6, -6, -3, 2, 3, -3, 2, 7, -3, -9, 3, 8, -2, -8

Green noise apparently has some usage in dithering physical objects (like shirt designs). The idea seems to be that green noise is more stable in the physical world without tiny high frequency blue noise features that are more likely to be damaged. https://www.semanticscholar.org/paper/Green-noise-digital-halftoning-Lau-Arce/3e9b673900f86546db94b0922932ce13db84bf75

# Pink Noise

I’ve known pink noise as “red noise mixed with white noise” for a long while but didn’t think much of it as a noise color. I didn’t know of a usage case for it.

It turns out that pink noise is a pretty important noise color and that many natural and biological processes have this situation where there is some low frequency (red noise) trend over time, but also a small scale randomness (white noise) added on top.

There’s a famous paper by Voss that describes an algorithm to generate pink noise.

You start by rolling some number of dice and adding them up.

From there, you re-roll dice 0 every roll, dice 1 every 2 rolls, dice 2 every 4 rolls, dice 3 every 8 rolls and so on. If you know how to count in binary you should see a pattern emerging.

Doing that, the histogram is just Gaussian, but the DFT looks like a bit like a fuzzy red noise, due to the white noise added into the red noise to make pink noise:

Here are 32 pink noise values made with that algorithm:
6, 4, 9, 10, 6, 9, 5, 6, 15, 17, 10, 11, 14, 15, 14, 13, 15, 13, 13, 11, 7, 8, 9, 8, 14, 16, 17, 18, 16, 11, 11, 14

McCartney has a variation on this algorithm where you only reroll one die each roll. The number of leading zeros in your roll number in binary is the number of the dice to reroll. That makes it a more even workload for how many random numbers you generate each time.

It doesn’t affect the histogram, but here is the DFT, which is a little bit sharper in the middle:

Here are 32 pink noise values made with this algorithm:
6, 4, 7, 9, 11, 11, 10, 8, 11, 11, 12, 13, 12, 16, 15, 15, 15, 13, 11, 12, 11, 9, 14, 15, 13, 11, 8, 10, 13, 15, 15, 11

The last variation of the voss algorithm for generating pink noise I’ll show is a stochastic one. It uses a geometric distribution random number each roll to figure out which die to reroll.

This makes for a very sharp frequency spike:

Here are 32 values made with that algorithm:
6, 8, 8, 6, 7, 9, 4, 6, 9, 10, 9, 9, 7, 9, 6, 10, 9, 9, 8, 9, 9, 14, 10, 9, 11, 9, 8, 8, 5, 4, 4, 3

https://www.dsprelated.com/showarticle/908.php

# Closing

This stuff may seem a bit strange, ad hoc, and magical, but thanks to Ray for pointing this out, this is all just “Finite Impulse Response” DSP stuff. Rerolling dice is like numbers falling out of the delay line, and changing the signs of numbers and summing them (etc) is just filter coefficients for the N order filter used for the N dice.

There’s definitely some deeper DSP stuff here that unifies it all and makes it make a lot more sense that I don’t understand. (yet. Hopefully will one day.)

Here is some interesting info linking this stuff to convolution
https://stats.stackexchange.com/questions/331973/why-is-the-sum-of-two-random-variables-a-convolution/331983#331983

# Generating Blue Noise Textures With Void And Cluster

Blue noise comes in two main flavors: sample points and masks. I’ll be calling blue noise masks “blue noise textures” in this post.

The code that accompanies this post, implementing the void and cluster algorithm, and running tests on it, can be found at:
https://github.com/Atrix256/VoidAndCluster

Update: Jonathan Dupuy has a GPU implementation that runs much more quickly than this CPU implementation. https://github.com/jdupuy/BlueNoiseDitherMaskTiles

# Blue Noise Sample Points

Blue noise sample points can be used when you are taking samples for integration, like when sampling shadow maps, taking AO samples, or when doing path tracing. Blue noise doesn’t converge as fast as other sampling patterns, but it is really good at hiding error by evenly distributing it, as if error diffusion had been done.

Here are some blue noise samples and their frequency magnitudes:

Here’s an example result of blue noise vs white noise sampling, from https://www.arnoldrenderer.com/research/dither_abstract.pdf

If you can afford enough samples to converge to the right result, you should do that, and you should use a sequence that converges quickly, such as Martin Robert’s R2 sequence (a generalization of the golden ratio, which is competitive with sobol!): http://extremelearning.com.au/unreasonable-effectiveness-of-quasirandom-sequences/

If you can’t afford enough samples to converge, that’s when you should reach for blue noise, to minimize the appearance of the error that will be in the resulting image.

To get blue noise samples, my go to method is “Mitchell’s Best Candidate Algorithm”:
https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/

That algorithm is easy to understand and quick to implement, and while it makes decent results, it isn’t the best algorithm. The best blue noise sample point algorithms all seem to be based on capacity constrained Voronoi diagrams (CCVD). One benefit of Mitchell’s best candidate algorithm though is that the sample points it makes are progressive. That means if you generate 1024 sample points, using the first N of them for any N will still be blue noise. With CCVD algorithms, the sample points aren’t blue unless you use them all. This notion of progressive samples is something that will come up again in this post later on.

# Blue Noise Textures

The second kind of blue noise is blue noise masks or textures.

One thing blue noise textures are good for is dithering data: adding random noise before quantization. Quantization lets you store data in fewer bits, and dithering before quantization gives a result that makes it harder (sometimes virtually impossible) to tell that it is even using fewer bits.

Here is a blue noise texture and it’s DFT

This is an example of white vs blue noise used for dithering, from http://johanneskopf.de/publications/blue_noise/

Using blue noise gives much better results than white noise, for the same amount of error. It does this by more evenly distributing the error, which gives results as if there was error diffusion done, without actually having to do the error diffusion step. The results are especially good when you animate the noise well. For more information on that, check out these posts of mine.
Part 1: https://blog.demofox.org/2017/10/31/animating-noise-for-integration-over-time/
Part 2: https://blog.demofox.org/2017/11/03/animating-noise-for-integration-over-time-2-uniform-over-time/

If that’s an area that you are interested in, you should give this a read too, making sure to check out the triangular distributed noise portion!
http://loopit.dk/banding_in_games.pdf

Interestingly though, blue noise textures can also be used for sampling as this paper talks about: https://www.arnoldrenderer.com/research/dither_abstract.pdf

Here is another link between the two types of blue noise, where it shows how to turn blue noise textures into blue noise sample points of the desired density (like for making sample density be based on importance / the PDF): http://drivenbynostalgia.com/#frs

For blue noise textures, my go to has been to get them from this website where the author used the void and cluster algorithm to generate blue noise textures: http://momentsingraphics.de/?p=127

It’s been on my todo list for a while to understand and implement the void and cluster algorithm myself so that if that website ever goes down, I wouldn’t lose my only source of “the good stuff”. This post is the result of me finally doing that!

It’s possible to make blue noise (and other colors of noise too) by repeatedly filtering white noise: https://blog.demofox.org/2017/10/25/transmuting-white-noise-to-blue-red-green-purple/

Doing it that way has some problems though depending on what it is you want to use the blue noise for: https://blog.demofox.org/2018/08/12/not-all-blue-noise-is-created-equal/

# Using Both Types of Blue Noise Together

I tend to use blue noise a lot in my rendering techniques, because it’s a nice way to get the look of higher sample counts (the appearance of low error), without spending the cost of higher sample counts.

It’s fairly common that I will use both types of blue noise together even!

For example, if I’m taking multiple PCF samples of a shadow map, I’ll make some blue noise sampling points to sample the shadow map per pixel. I’ll also use a 64×64 blue noise texture tiled across the screen as a per pixel random number for rotating those sample points per pixel.

As a bonus, i’ll use the golden ratio (conjugate) to make the noise be low discrepancy over time, as well as being blue over space by doing this after reading the value from the blue noise texture:

blueNoiseValue = fract(blueNoiseValue + 0.61803398875f * (frameNumber % 64));

The end result is really nice, and looks as though many more samples were taken than actually were.

(Beware: doing random sampling in cache friendly ways can be a challenge though)

# Void And Cluster

The void and cluster algorithm was introduced in 1993 by Robert Ulichney and was previously under patent, but it seems to have expired into the public domain now.

Here is the paper:
http://cv.ulichney.com/papers/1993-void-cluster.pdf

The algorithm has four steps:

1. Generate Initial Binary Pattern – Make blue noise distributed sample points.
2. Phase 1 – Make those points progressive.
3. Phase 2 – Make more progressive sample points, until half of the pixels are sample points.
4. Phase 3 – Make more progressive sample points, until all pixels are sample points.

Since the points are progressive, it means they have some ordering to them. This is called their rank. The final blue noise texture image shades pixels based on their rank: lower numbers are darker, while higher numbers are brighter. The end goal is that you should have (as much as possible) an even number of each pixel value; you want a flat histogram.

Something interesting is that Phase 1 turns non progressive blue noise sample points into progressive blue noise sample points. I don’t know of any other algorithms to do this, and it might be common knowledge/practice, but it seems like this would be useful for taking CCVD blue noise sample points which are not progressive and making them progressive.

Before we talk about the steps, let’s talk about how to find voids and clusters.

Finding Voids and Clusters

Each step of this algorithm needs to identify the largest void and/or the tightest cluster in the existing sample points.

The tightest cluster is defined as the sample point that has the most sample points around it.

The largest void is defined as the empty pixel location that has the most empty pixel locations around it.

How the paper finds voids and clusters is by calculating a type of “energy” score for each pixel, where the energy goes up as it has more 1’s around it, and it goes up as they are closer to it.

Let’s say we are calculating the total energy for a pixel P.

To do that, we’ll loop through every pixel and calculate how much energy they contribute to pixel P. Add these all up and that is the energy of the pixel P.

If we are looking at how much energy a specific pixel S contributes to pixel P, we use this formula:

Energy= e^(-distanceSquared/(2*sigma*sigma))

The void and cluster paper uses a “sigma” of 1.5 which means the divisor 4.5. The “free blue noise textures” implementation uses a sigma of 1.9 and i have to agree that I think that sigma gives better results. A sigma of 1.9 makes the divisor be 7.22.

Distance squared is the squared distance between S and P, but using a “torroidal” wrap around distance which makes the resulting texture tileable as a side effect of satisfying “frequency space needs” of the pattern needing to be tileable. (I wrote a post about how to calculate torroidal distance if you are curious! https://blog.demofox.org/2017/10/01/calculating-the-distance-between-points-in-wrap-around-toroidal-space/)

Now that you know how to calculate an energy for each pixel, the hard part is done!

The pixel that has the highest total energy is the tightest cluster.

The pixel that has the lowest total energy is the largest void.

Generate Initial Binary Pattern

To start out, you need to make a binary pattern that is the same size as your output image, and has no more than half of the values being ones. In the code that goes with this post, I set 1/10th of the values to ones.

Then, in a loop you:

1) Find the tightest cluster and remove the one.
2) Find the largest void and put the one there.

You repeat this until the tightest cluster found in step 1 is the same largest void found in step 2.

After this, the initial binary pattern will be blue noise distributed sample points but they aren’t progressive because they have no sense of ordering.

This current binary pattern is called the “prototype binary pattern” in the paper.

Phase 1 – Make The Points Progressive

Phase 1 makes the initial binary pattern progressive, by giving a rank (an ordering) to each point.

To give them ranks you do this in a loop:

1) Find the tightest cluster and remove it.
2) The rank for that pixel you just removed is the number of ones left in the binary pattern.

Continue this until you run out of points.

Phase 2 – Make Progressive Points Until Half Are Ones

Phase 2 resets back to the prototype binary pattern and then repeatedly adds new sample points to it, until half of the binary pattern is ones.

It does this in a loop:

1) Find the largest void and insert a one.
2) The rank for the pixel you just added is the number of ones in the binary pattern before you added it.

Continue until half of the binary pattern is ones.

Phase 3 – Make Progressive Points Until All Are Ones

Phase 3 continues adding progressive sample points until the entire binary pattern is made up of ones.

It does this by reversing the meaning of zeros and ones in the energy calculation and does this in a loop:

1) Find the largest CLUSTER OF ZEROS and insert a one.
2) The rank for the pixel you just added is the number of ones in the binary pattern before you added it.

Continue until the entire binary pattern is ones.

You now have a binary pattern that is full of ones, and you should have a rank for every pixel. The last step is to turn this into a blue noise texture.

Finalizing The Texture

To make a blue noise texture, you need to turn ranks into pixel values.

If your output texture is 8 bits, that means you need to turn your ranks that are 0 to N, into values that are 0 to 255. You want to do this in a way where as much as possible, you have an even number of every value from 0 to 255.

I do this by multiplying the rank by 256 and dividing by the number of pixels (which is also the number of ranks there are).

# Optimization Attempts

I tried a variety of optimizations on this algorithm. Some worked, some didn’t.

What I ultimately came to realize is that this algorithm is not fast, and that when you are using it, it’s because you want high quality results. Because of this, the optimizations really ought to be limited to things that don’t hurt the quality of the output.

For instance, if the algorithm takes 10 minutes to run, you aren’t going to shave off 30 seconds to get results that are any less good. You are running a ~10 minute algorithm because you want the best results you can get.

Success: Look Up Table

The best optimization I did by far was to use a look up table for pixel energy.

I made a float array the same size as the output, and whenever I put a 1 in the binary pattern, i also added that pixel’s energy into the energy look up table for all pixels. When removing a 1, i subtracted that pixel’s energy from the look up table for all pixels.

Using a look up table, finding the highest number is all it takes to find the tightest cluster, and finding the lowest number is all it takes to find the largest void.

I further sped this up by using threaded work to update the pixels.

I first used std::thread but didn’t get much of a win because there are a lot of calls to updating the LUT, and all the time moved from doing LUT calculation to thread creates and destroys. Switching to open mp (already built into visual studio’s compiler!) my perf got a lot better because open mp uses thread pools instead of creating/destroying threads each invocation.

Failure: Limiting to 3 Sigma

The equation to calculate pixel energy is actually a Gaussian function. Knowing that Gaussians have 99.7% of their value withing +/- 3 sigma, i decided to try only considering pixels within a 3*sigma radius when calculating the energy for a specific pixel.

While this tremendously sped up execution time, it was a failure for quality because pixels were blind to other pixels that were farther than 6 pixels away.

In the early and late stages of this blue noise generation algorithm, the existing sample points are pretty sparse (farther apart) and you are trying to add new points that are most far apart from existing points. If the farthest away you can see is 6 pixels, but your blue noise texture is 512×512, that means samples are not going to be very well distributed (blue noise distributed) at those points in time.

That had the result of making it so low and high threshold values used on the blue noise texture gave very poor results. (Thanks to Mikkel Gjoel for pointing out these edges to test for blue noise textures!)

Didn’t Try: DFT/IDFT for LUT

The “free blue noise textures” website implementation does a neat trick when it wants to turn a binary sample pattern into a look up table as I described it.

It does a DFT on the binary sample pattern, multiplies each pixel by the frequency space representation of a gaussian function of the desired size, and then does an inverse DFT to get the result as if convolution had been done. This uses the convolution theorem (a multiplication in frequency space is convolution in image space) to presumably get the LUT made faster.

I didn’t try this, but it doesn’t affect quality as far as I can tell (comparing my results to his), and i’m pretty sure it could be a perf win, but not sure by how much.

How I’ve written the code, there are two specific times that i want to make a LUT from scratch, but many, many more times i just want an incremental update to the LUT (at 1024×1024 that’s roughly 1 million LUT updates I would want to do) so i didn’t pursue this avenue.

Probable Success: Mitchell’s Best Candidate

Noticing the result of “initial binary pattern” and “phase 1” was to make a progressive blue noise sample, I thought it might be a good idea to replace those with an algorithm to directly make progressive blue noise samples.

I grabbed my goto algorithm and gave it a shot and it is noticeably faster, and seems to give decent results, but it also gives DIFFERENT results. The difference is probably harmless, but out of abundance of caution for correctness, I wouldn’t consider this a for sure success without more analysis.

# Results

The code that goes with this post can be found at: https://github.com/Atrix256/VoidAndCluster

Speed

At 256×256, on my machine with my implementation, it took 400.2 seconds to run the standard void and cluster algorithm. Using Mitchell’s best candidate instead of the initial binary pattern and phase 1, it took 348.2 seconds instead. That’s a 13% savings in time.

At 128×128, standard void and cluster took 24.3 seconds, while the Mitchell’s best candidate version took 21.2 seconds. That’s a 13% time savings again.

At 64×64, it was 1.3 seconds vs 1.1 seconds which is about 13% time savings again.

Quality

From top to bottom is void and cluster, void and cluster using mitchell’s algorithm, and the bottom is a blue noise texture taken from the “free blue noise textures” site.

Here the same textures are, going through a “threshold” test to make sure they make good quality blue noise of every possible density. Same order: void and cluster, void and cluster using mitchell’s algorithm, and the bottom one is taken from the other website. You can definitely see a difference in the middle animation vs the top and bottom, so using the best candidate algorithm changes things, but it seems to be better at low sample counts and maybe not as good at low to medium sample counts. Hard to say for sure…

# Red Noise

Lately I’ve been having a fascination with red noise as well.

Where blue noise is random values that have neighbors being very different from each other, red noise is random values that have neighbors that are close to each other.

Red noise is basically a “random walk”, but stateless. If you had a red noise texture, you could look some number of pixels forward and would get the position of the random walk that many steps away from your current position.

The simplest Markov Chain Monte Carlo algorithms use a random walk to do their work, but I’m not sure that a higher quality random walk would be very helpful there.

I’m sure there must be some usage case for red noise but I haven’t found it yet.

In any case, I do believe that the void and cluster algorithm could be adapted to generate red noise, but I’m not sure the specifics of how you’d do that.

If i figure it out, i’ll make a follow up post and link from here. If you figure it out, please share details!

Some more cool stuff regarding noise colors is coming to this blog, so be on the lookout 🙂

# Taking the Max of Uniform Random Numbers

There is 80 lines of simple standalone C++ code that generated the data for this post at:

Let’s say you generate 1,000,000 random numbers from 0 to 255 and count how many times each number came up in a histogram. I did that and here it is:

In a previous post I talked about how adding dice rolls together, counting bits in a random number, and similar, would approach a normal / Gaussian distribution:
https://blog.demofox.org/2017/07/25/counting-bits-the-normal-distribution/

The same thing happens if you average random numbers, but there’s a nice side effect of getting values in the same range. That is to say: if you roll N 6 sided dice and average them, you will still get values between 1 and 6, but the more dice there are, the more the distribution will approach Gaussian. It shapes up pretty quickly too. Here is the histogram:

Something interesting to note is that adding two uniform random numbers together – or averaging them – makes something called “triangle distributed noise”. Looking at the histogram for averaging two values, hopefully you can see why it’s called triangle distributed! Triangle noise some cool properties for noise in graphics and is orthogonal to eg white noise vs blue noise vs low discrepancy sequences.

http://www.loopit.dk/banding_in_games.pdf

I recently saw some code that was taking the maximum of two random numbers and that made me wonder what sort of distribution that might give.

Being a better programmer than a mathematician, i made a histogram, then took a guess as to what the formula behind the shape might be.

It turns out that taking the max of N uniform random numbers is the same as using y=x^(N-1) as a PDF. (You of course would need to normalize it to be a real PDF)

Here are the histograms:

These are apparently beta distributions:
https://en.wikipedia.org/wiki/Beta_distribution

If we take the min rather than the max, what happens then? Well, if for example the numbers are between 0 and 1, taking the min will give you the same count as if you took the max of 1-x. So, the graph of the histogram should look the same, just flipped on the x axis.

It turns out that it does:

# Generating Random Numbers From a Specific Distribution With The Metropolis Algorithm (MCMC)

There is ~400 lines of standalone C++ code that implements the main ideas in this post. You can find it at: https://github.com/Atrix256/MetropolisMCMC

In previous posts I showed how to generate random numbers from a specific distributing by using two techniques:

This post will show how to do it using a Markov Chain Monte Carlo method called “The Metropolis Algorithm”. This post also talks about using it for numerical integration.

If you want an intro or review to either Markov Chains or Monte Carlo, these two posts can help you out.

# Overview

The Metropolis algorithm lets you generate random numbers that follow a distribution given by any function y=f(x), where y is the probability of choosing x.

Rejection sampling does this as well, but you have to throw away an unknown number of bad samples before getting each good sample.

Inverting the CDF doesn’t throw out any samples, but it’s limited in the type of distributions it can do: It can be mathematically complex, or impossible, to find the inverted CDF for a given function analytically.

In these ways, the Metropolis algorithm is the best of both worlds. You can sample from a distribution defined from any function, and you don’t have to throw out any samples while doing it.

Another interesting thing about the Metropolis algorithm is that it can work blindly. It never actually has to KNOW what function describes the probability distribution. If there is a black box you can give an x to get a y, the Metropolis algorithm can generate random numbers from the distribution hidden in that black box.

The Metropolis algorithm also works in any dimension: you can use it with functions like z=f(x,y) or t = f(x,y,z,w,s).

In higher dimensions such as z=f(x,y), the z is the probability of a 2d random number (x,y) being chosen. It sounds weird but this situation is like if you rolled two dice, the value that one die came up as affected the probability of the other die. Maybe when the first die was a larger number, it made it be more probable for the other die to be a larger number too.

As weird as that is, if you can describe the relationship as a z=f(x,y) function, this can generate (x,y) random numbers from that distribution for you.

It’s worth noting that the Metropolis algorithm is a simpler special case of the Metropolis-Hastings algorithm, and these are just two of many Markov Chain Monte Carlo algorithms.

# The Metropolis Algorithm

The metropolis algorithm is pretty simple.

You start with an x value and calculate y which is just f(x). This is the initial sample and hopefully is a location where y is greater than 0.

To get the next sample, we first need to calculate a candidate sample, and then choose whether to take it or not.

To make a candidate sample, take a small random step from the current x point to get a new x, either in the negative or positive direction. Calculate y which is just f(x) using the new x.

Calculate a value A (for acceptance value) which is the candidate y divided by the last y value. This is the percentage chance you should take the new sample as the current sample, otherwise you take the old sample as the current sample.

To make the decision, you just generate a random number between 0 and 1 and accept the new sample if it’s below the A value.

Rinse and repeat for as many samples as you want.

Here’s a simple but fully featured implementation that you can find in the code that goes with this post(https://github.com/Atrix256/MetropolisMCMC)

The x axis of each sample is a random number drawn from the distribution described by y=f(x). If you keep track of the average x value seen, you’ll get the expected value of the PDF.

The y axis of each sample isn’t that useful directly. It’s the pdf(x) but scaled up by an unknown amount – the normalization constant for the function.

The convergence rate of Metropolis MCMC isn’t as well understood as monte carlo integration, since Metropolis has dependent samples (a random walk that knows where it was last step) vs independent samples (a stateless random sample).

In higher dimensions, you just take a random step on each axis, instead of only the x axis. The rest of the algorithm remains the same.

Regarding the random walk, it’s possible to use many different types of random numbers to take a step, but it’s most common to use a normal distribution.

# Metropolis Burn in and 0.234

Metropolis is stateless, and has no memory of the past. Despite this, it’s common to do a “burn in” with MCMC where you throw out some number of samples before you start.

This might sound weird, but the reason for it is that there are good places to sample from and bad places to sample from, and this is an attempt to have the random walk find a better place to sample from.

For instance, if you were sampling from y=sin(x)*sin(x)*x from 0 to 2 pi, you’d have a pdf that had a shape like the bimodal function below:

If you started the random walk at 2, you’d be doing a random walk in the left, less probable hump, and it would be hard to break out into the right side which was supposed to be more probable.

You should eventually get into the larger hump and stay there longer, but your initial guesses may get stuck in a local minimum and not do a very good job of following the distribution.

While burn in can help situations like these, this is also an example of how the Metropolis algorithm can fail.

It’s worth noting that using a normal distribution for the random walk can help it not get stuck in bad places. If you have a uniform random distribution that can generate numbers between -k and +k, you can get stuck in situations where you need to take a larger than k step to get to a better (more probable) location. If instead you use a normal distribution, the sigma allows you to have a good idea of how big most of the steps will be, but there will always be a greater than zero chance that you can take a step of ANY size, which could get you out of a local minimum.

There is a rule of thumb that the step size you use (the sigma in the case of a normal distribution) should make it so you are accepting a new step on your random walk 23.4% of the time. This supposedly is a good balance between exploration (finding better places elsewhere) and exploitation (staying in a good place) in many situations.

This isn’t fully settled though, as this is only true some of the time, and counter research has come up. (https://www.sciencedirect.com/science/article/pii/S0304414907002177)

I experimented at having a “Burn in” phase where it also adjusted the sigma to try and reach that 23.4% acceptance rate over the previous N samples. I didn’t play with it very long though, and was unable to get it to reliably reach that acceptance rate.

Even beyond the 0.234 acceptance rate goal, tuning the step size for your specific situation can help you get better or worse results. I didn’t play around with that much in my experimentation though, and found a sigma of 0.2 worked pretty well when working with functions in the 0-pi and 0-2pi range.

The initial starting point of your random walk can affect performance too obviously, since the burn in stage is supposed to find a better starting position.

# Limiting the Function’s Range

In one of the tests I did, I used the function y=sin(x) with x going from 0 to pi/2.

At first i tried clamping x to 0 to pi/2, to keep it in range but doing that made the technique fall apart. There were plenty of times the random walk would try to step out of bounds, but instead of taking that step, it would clamp to the border. This meant that the random walk had a significantly higher chance of reaching the boundary than anywhere else on the graph, and that broke the algorithm.

The correct thing to do was to just make the function return 0 when x was out of range. In this way, it would end up taking any out of range location with 0% probability, but there was no bias about more or less likely places visited by the random walk, other than it preferring higher (more probable) parts of the function.

Something else worth noting about the function is that if the function ever returns a negative number, it ends up being the same as if it returned zero probability.

If use Metropolis MCMC for integration, this can be an important fact, because it will basically ignore the negative values, and treat them as zero.

# Discrete Case

As described, this algorithm works with continuous random numbers, drawing them from a PDF.

The same concepts work for discrete states though too (a more traditional looking markov chain), drawing from a PMF instead.

When handling the discrete case, it needs to be possible to be in any state at any point in time. A usual way to avoid the edge case of probabilities being such that you can only be in some nodes on even steps and others on odd steps, is to have a “self loop” on at least one node, which has a greater than zero probability of staying at the same node.

Markov Chain Monte Carlo Without all the Bullshit

# Integration

Using the Metropolis algorithm for numerical integration is possible, but is not as straightforward as Monte Carlo integration.

In Monte Carlo integration, to get a single estimate of an integral you calculate f(x) / pdf(x). f(x) is the function value at the random location x, and pdf(x) is the probability of that x value being chosen. You take the average of N such estimates and as N approaches infinity, the error of the average of estimates approaches 0.

In Metropolis MCMC we do have N number of samples, and it almost seems like we have enough data to do this, but it turns out that we don’t.

For f(x) which is the function value at the random location, you literally do have f(x). It’s the y component of each sample generated.

The problem is that we don’t have pdf(x).

The probability of choosing x is in fact based on the function we are evaluating f(x), but the function is essentially an un-normalized pdf, but we are able to draw random numbers from the pdf without knowing the normalization constant.

So, pdf(x) is some scalar multiple of f(x), but we have no idea what the multiplier is. That multiplier, the normalization constant, turns out to be the integration value we want to search for.

So we’ve gone in a circle and are no closer to being able to integrate with Metropolis.

There are some ways to deal with this though.

One way is to do mathematical tricks to make it so things “cancel out” and leave the normalization constant.

There is something called the “Harmonic Mean Estimator” that does this, but has infinite variance so is called “The Worst Monte Carlo Method Ever”.

There is another way though, that I use in the code that goes with this post.

Imagine that while you are doing your Metropolis MCMC you have some interval [a,b] that whenever you get a random number drawn in that range, you increment a counter.

After N total samples, you’ll have M samples that fell in this interval. An estimate of the integral of the normalized pdf over this interval is M/N.

Now, you can do regular Monte Carlo integration of the function over this range to get the integral of the UN-normalized pdf over this interval.

When you divide the unnormallized pdf value by the normalized pdf value, you’ll get the normalization constant aka the integral of the function.

A smaller interval size is better for the Monte Carlo integration because it will converge faster (better results in fewer samples), but it’s worse for the Metropolis integration because a smaller interval is less likely to be accurate with the random walk.

My intuition tells me that if you keep a histogram of the x values you’ve seen be generated from the Metropolis algorithm, that the ones with higher counts are more likely to be accurate. So I just integrate over whatever histogram bucket has the highest count. I haven’t done any real analysis of whether or not this is true, or how good this integration estimate is in general.

For the Monte Carlo integration, I used white noise (regular old random numbers) to integrate, but in reality you’d get much better results from something like sobol. I used white noise because i made the code generic for N dimensions and white noise generalizes to any dimension.

# Experiment Results

I didn’t play around much with initial guesses, sigmas, trying to reach the 0.234 acceptance rate, or burn in, but here’s some results from the code that goes with the post.

In the below, the blue line – normalized function value – is the actual desired PDF . The red line – Percentage – is how many samples we actually got in that histogram bucket. When these lines match up, we are happy and everything worked like we wanted it to.

y=sin(x) x in [0,pi]

y=|sin(x)| x in [0, 2pi]

y=sin(x)*sin(x) x in [0, 2pi]

It’s interesting to see the last one be so far from the real PDF. That function must trap the random walk in one side or the other a bit too much.

There is also a 2d function z=f(x,y) that is tested in the c++ code that goes with this post. I don’t know of any easy ways to make a 2d histogram so don’t have any results to show.

The Metropolis algorithm is pretty neat but it’s just the beginning of MCMC methods. I’ve heard that Hamiltonian Monte Carlo can give much better results by using derivatives to make more intelligently sized step sizes.

Something I find interesting is that plain Monte Carlo uses white noise, quasi Monte Carlo uses low discrepancy sequences (and i think blue noise would fit in here), while Metropolis MCMC uses a random walk, which is red noise.

I’m not sure what to make of that, but my brief reading about Hamiltonian Monte Carlo was that it allows the samples to be less dependent, and that’s why it improves things. Maybe there are some secrets to red noise, like there are for blue noise? I’m not really sure but will keep looking 😛

A great write up on Metropolis MCMC
https://stephens999.github.io/fiveMinuteStats/MH_intro.html

Another small but useful write up
http://www.pmean.com/07/MetropolisAlgorithm.html

A 35 minute video about Metropolis MCMC

A mathier set of videos about Metropolis MCMC that is actually very easy to understand:

“A Zero-Math Introduction to Markov Chain Monte Carlo Methods”
https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50

A more mathy overview of Metropolis
https://ermongroup.github.io/cs323-notes/probabilistic/mh/

A series of posts aimed at being a gentle introduction to MCMC
https://theclevermachine.wordpress.com/tag/monte-carlo-integration/

“Introduction to MCMC”

Introduction to MCMC

“MCMC Burn In”

MCMC burn-in

# Markov Chain Text Generation

This post includes a standalone (only standard headers, no external libs) ~400 line C++ source file that can analyze text and use an order N Markov chain to randomly generate new text in the same style. The Markov code itself is fairly generic / re-usable and a template parameter to the class lets you specify the order of the chain as well as the type of state data to use. That code is on github at: https://github.com/Atrix256/TextMarkovChain

When I see material on Markov chains, it usually comes in two flavors:

1. Very Mathy
2. Pretty impressive results light on explanation

It turns out the reason for this is because they CAN be very mathy but they can also be extremely simple.

Without knowing this, I decided it was time to learn about Markov chains. I leveled up my linear algebra knowledge a bit, finally getting a solid grasp on eigen vectors, and learning things like how to put a matrix into an eigen basis form to be able to make matrix exponentiation a trivial operation. There are links at bottom of post if you want to learn this stuff too.

Then, I sat down to learn Markov chains and nearly flipped my table over! Yes, Markov chains can be mathy (and matrix exponentiation is one way to find a Markov chain steady state, but not the best), but that stuff isn’t really required for most uses.

# Markov Chains

A Markov chain is just any situation where you have some number of states, and each state has percentage chances to change to 0 or more other states.

You can get these percentages by looking at actual data, and then you can use these probabilities to GENERATE data of similar types / styles.

# Example

This post uses Markov chains to generate text in the style of provided source text.

The first step it does is analyze source text.

To analyze the source text, it goes through text, and for each word it finds, it keeps track of what words came next, and how many times those words came next.

When analyzing the story “The Tell-Tale Heart” by Edgar Allan Poe for instance (https://poestories.com/read/telltaleheart , also is data/telltale.txt in the code that goes with this post), here are the words that came after “when” and their counts.

• all – 1
• enveloped – 2
• he – 1
• i – 4
• my – 2
• overcharged – 1
• the – 1

Here are the counts for the words that appear after “is”:

• but – 1
• impossible – 1
• merely – 1
• nothing – 1
• only – 1
• the – 2

After all these counts have been gathered up, the next step is to convert them into probabilities. You do this by summing up the words that come after a specific word, and dividing the count of each word by that total sum.

The above examples then turn from counts to probabilities. Here is “when”:

• all – 8%
• enveloped – 16%
• he – 8%
• i – 33%
• my – 16%
• overcharged – 8%
• the – 8%

Here is “is”:

• but – 14%
• impossible – 14%
• merely – 14%
• nothing – 14%
• only – 14%
• the – 28%

Note: The code that goes with this post spits out these counts and percentages in the “out/stats.txt” file if you ever want to see the data.

Once the probabilities are known, you can start generating text. The first thing you do is pick a word purely at random, this is the first word in the text.

Next, you use the probabilities of what words come after that word to randomly choose the next word.

You then use the probabilities of what words come after that word to randomly choose the next word.

This repeats until you’ve generate as much text as you want.

The code with this post generates 1000 words into the “out/generated.txt” file.

That is literally all there is to it. You could do this same process with sheet music to generate more music in the same style, you could do it with weather forecasts to generate realistic weather forecasts (or even try to use it to predict what weather is next). You can do this with any data you can imagine.

# Example Generated Output

Here is 100 words of generated text from various sources.

First is text generated from “The Tell-Tale Heart” by Edgar Allan Poe (https://poestories.com/read/telltaleheart):

…About trifles, and with perfect distinctness — very slowly, my sagacity. I then took me, louder — you cannot imagine how stealthily — with what caution — cautiously — would have told you may think that no longer i knew that no blood – spot. He would not even his room, to do the hour had made up my whole week before him. I knew what dissimulation i showed them causeless, undisturbed. Now a hideous heart, no — wide open — all and the old man, and he would have…

Here is text generated from “The Last Question” by Isaac Asimov (http://hell.pl/szymon/Baen/The%20best%20of%20Jim%20Baens%20Universe/The%20World%20Turned%20Upside%20Down/0743498747__18.htm):

…Glory that. Man said, it into a meaningful answer. Granted, said, might be kept from the entire known to restore the universe for meaningful answer. Mq – talkie robot, ac learned how many stars are dying. The boys appreciated that not. Cosmic ac that, how may be able to reach the small station, said at half the same. He shrugged. We’ll have enough to be alone. And lose itself aloof. When any other kind of universal ac. He consisted of individuals were self – contact…

Here is text generated from a research paper “Projective Blue-Noise Sampling” (http://resources.mpi-inf.mpg.de/ProjectiveBlueNoise/ProjectiveBlueNoise.pdf):

…Numerical integration. Mj patterns to vector multiplication to achieve a way that the above question whether there exist distributions have addressed anisotropic classic lloyd relaxation green and rotated pattern significantly worse than the j 1, where each site: our projective blue – noise point distributions along both axes. Previous work sampling when undergoing one after a certain number of common blue noise patterns, but at the publisher s ., cohen – left constructs a quality of latinizing the non – sample counts however, as a set only in a theory 28, this shrinkage…

Here is text generated from an example (not real, but representative) psych report from my wife who is a school psychologist:

…Brother had to mildly impaired body movement, the school and placement after a 90 probability that student: adapting to struggle as video games. Student’s planning and he request, spelling subtest scores. This time. The student: this time and accurately with both, including morphology, 2013. Administrators should consider participation in the following are student as intellectually disabled specific auditory comprehension of reading: mr. Mrs. The two subtest is designed to use of or economic disadvantages, gestures, vitality or economic disadvantages, picking at approximately 5th grade prior…

Here we generate a markov chain using ALL the above source texts, to get a mash up of all of them.

…Restore the sphere packing radius is likely an adaptive skills. Please see inset in the conner s problems, we’ll just have well and visualization and he is computed on 1 2 was contacted by things, and restricted number of his abilities. We can simply like them, as well as a s difficulty interacting with a closer to cry, the process based on the standards – appropriate to spurious aliasing artefacts mit87, making a meaningful answer. Finally, 11 months through hyperspace to try his eye contact. Jerrodine’s eyes were going out if…

Lastly, here is only Poe and Asimov combined:

…Could not forever, and continually increased. And stood for a sudden springing to get back and the eighth night i to that man, 2061, but the original star and made trips. A very, and fell full youthfulness even to feel — i then stop someday in five words on a while i heard all the noise steadily for us, calling him to pluto and now a galaxy alone pours out, quick sound would think of individuals. He stirred his hideous veil over the ceiling. Twenty billion years ago, man, …

# Nth Order Markov Chains

Using one word to generate the next word works somewhat well – the generated Poe text definitely seemed like Poe for instance – but there are plenty of times when things don’t make much sense.

A markov chain can become higher order when you don’t just look at the current state to transition to the next state, but you look at the last N states to transition to the next state.

In the text generation case, it means that a 2nd order Markov chain would look at the previous 2 words to make the next word. An order 3 markov chain would look at the previous 3 words to make the next word.

Interestingly, an order 0 Markov chain looks at NO WORDS to generate the next word, so is purely random word generation, with similar word counts (by percentage) as the original text.

The code that goes along with this post lets you specify the order on the Markov chain.

Here is “The Tell-Tale Heart” with an order two markov chain.

…Dark as midnight. As the bell sounded the hour, there came to my ears: but he had been too wary for that. A tub had caught all — ha ha when i describe the wise precautions i took for the concealment of the old man sprang up in bed, crying out — no blood – spot whatever. I removed the bed and examined the corpse. Yes, he was stone, stone dead. I knew that he had been lodged at the police. A watch’s minute hand moves more quickly than did…

If you compare that to the actual story, you can find fairly large sections of that are taken verbatim from the source text, but the arrangement of those larger chunks are different.

The reason for this is that when you have two words mapping to the next word, the number of these go up, which makes it so on average, there are going to be fewer choices for “next words”, which make the results less random, and more deterministic.

If you gave it more text (like, maybe, all of Edgar Allan Poe’s work), there would be more options for the next word after specific 2 word pairs, but with a single short story, it doesn’t have very many choices. If you look at the out/stats.txt file and compare order 1 vs order 2, you can see that order 2 has a lot more situations where a current state maps to a single next state.

At order 3 there are even fewer choices, and it hits a pattern loop:

…Had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to fear there entered three men, who introduced themselves, with perfect suavity, as officers of the police. A shriek had been heard by a neighbor during the night; suspicion of foul play had been aroused; information had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to…

Here is an order 2 mashup of Poe and Asimov:

…Crossing the floor, and still chatted. The universal ac interrupted zee prime’s own. It had to be contrary, and jerrodette i. Ask multivac. As the passage through hyperspace was completed in its place, each cared for by perfect automatons, equally incorruptible, each with its dreadful echo, the real essence of men was to be contrary now, now, honeys. I’ll ask microvac. Don’t shout. When the sun, and their only concern at the visiplate change as the frightened technicians felt they could hold their breath no…

Lastly, here’s an order 2 mashup of all 4 source texts:

…Mathematics: student does not require special education and related services, the radius of each other, indistinguishable. Man said, ac organized the program. The purpose of this report provides information about the child s educational performance. Other pertinent future work includes the extension of our projective lloyd patterns against other patterns on a role not based on his scores on this scale is different for the sake of visual clarity, we specify all spaces via a set x. In a way, man, i undid it just so much that a single…

# Other Implementation Details

When combining the texts, it might make sense to “normalize” the percentages for each source text. How it works now with raw counts makes it so longer documents have more of their style preserved in the final output document.

You may also want to give weightings to different text so you can have a sliding scale between Poe and Asimov for instance, by basically scaling the counts from their files higher or lower to give more or less representation in the results.

When analyzing the text, I had to think about what to do with punctuation. I chose to treat punctuation as words in themselves, but ignored some punctuation that was giving weird results – like double quotes. I’ve only just now realized that I incorrectly ignore question marks. Oops.

When generating text, i made it so some words don’t put a space before themselves (like, a period!), and i also made it so words would have their first letter capitalized after a period or similar. There seems to need ad hoc, domain specific massaging to get reasonable results.

It’s possible (especially with higher order markov chains) that you can get into a situation where your current state has nothing to transition to. You’d have to figure out what to do in this case. One idea would be to choose a next word at random. Another idea would be to fall back to a lower order markov chain maybe?

I feel like once you understand the algorithm, it’s an art form to teach and tune the Markov chain to get good results. I bet there are some interesting techniques beyond the simple things I’ve done here.

Mathy Markov Chain Info

If you want to dive into the mathy side of markov chains, here are some great resources you can follow to get there…

A great linear algebra online “text book”, that is very easy to read and understand: http://immersivemath.com/ila/index.html

Some great videos on linear algebra: https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

A 9 part series on markov chains. It’s this long because it’s very explicit and works through the details by hand. I watched it at like 1.5x speed and was fine 😛

Some “mathy” notes about Markov chains, including higher order ones:
http://personal.psu.edu/jol2/course/stat416/notes/chap4.pdf

Q Learning

Related to markov chains, Q learning is essentially is a way to learn a Markov chain from data – for instance learning how to play tic tac toe, or how to traverse a maze.

I would like to learn Q learning better and make a post (and code!) at some point.

Q Learning Explained With HTML5
https://blockulator.github.io/Q-Learning-Explained-With-HTML5/

An introduction to Q-Learning: reinforcement learning
https://medium.freecodecamp.org/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc

Reinforcement Learning Tutorial Part 1: Q-Learning
https://blog.valohai.com/reinforcement-learning-tutorial-part-1-q-learning

Reinforcement Learning Tutorial Part 2: Cloud Q-learning
https://blog.valohai.com/reinforcement-learning-tutorial-cloud-q-learning

Reinforcement Learning Tutorial Part 3: Basic Deep Q-Learning
https://towardsdatascience.com/reinforcement-learning-tutorial-part-3-basic-deep-q-learning-186164c3bf4

Other

Here is a twitter conversation about some compelling uses of Markov chains

Here’s a video “Markov Chain Monte Carlo and the Metropolis Algorithm” which uses Markov chains to help calculate integrals numerically.

# Code

Again, the code for this post is up on github at https://github.com/Atrix256/TextMarkovChain

The code is written for readability and runs plenty fast for this demo (nearly instant in release, a couple seconds in debug) but There are lots of string copies etc that you would want to fix up if using this code seriously.

# Linear Fit Search

Binary search looks in the middle of a list to make a guess about where a search value is. If that guess is wrong, it can eliminate half of the list (based on whether the search value is less than or greater than the guess location) and try again. It repeats until it’s either found the search value, or runs out of list.

This algorithm works well but it is blind to the actual values it got when making guesses, beyond just checking if they were greater or less than the search value.

I recently wondered: If we knew the min and max value stored in the list, couldn’t we make a more intelligent guess as to where the search value might be? We could fit the data with a line, figure out where our guess would be on that line, and make that be our initial guess. As we iterate, we could use our incorrect guesses as new min or max values of the line as appropriate, updating our line fit as we went, and perhaps arrive at an answer more quickly.

Another way of looking at this: If the guess a binary search made is VERY far from the search value, maybe it should go farther than the midpoint when making the next guess? Or, if it was pretty close to the search value, maybe it shouldn’t go as far as the midpoint? Close vs far measurements depend on the overall magnitude of the numbers in the list, so you’d need to know what sort of values are stored. A min and a max value of the list can give you a rough idea of that, especially if you update those min / max values as you repeatedly cut the list with guesses.

This post explores that idea. The result is something that could be more attractive than binary search, depending on what kind of trade offs are being looked for. While I haven’t heard of this technique , I wouldn’t be surprised if it’s been tried before and written about. (Know of a source? let me know!).

UPDATE: @thouis from twitter mentioned the basic idea is called “interpolation search”. This post goes beyond that basic idea but you can read more about it here if you’d like 🙂 https://www.techiedelight.com/interpolation-search/. He has a paper about interpolation search that you can read here (it has some relation to discrepancy, as in low discrepancy sequences, oddly!) https://erikdemaine.org/papers/InterpolationSearch_SODA2004/

The post goes a step further to address a problem that is encountered when using this algorithm, and also talks about other ways this algorithm might be extended or generalized.

An implementation, and the code that generated all the data for this post, can be found here: https://github.com/Atrix256/LinearFitSearch

# Initial Problem / Other Possible Avenues

(Feel free to skip this section if you get lost. You won’t miss anything important about the algorithm itself)

If you are wise in the ways of numbers, you might be saying to yourself that this only works if you have roughly evenly distributed numbers – basically, a flat PDF, or a flat histogram. This is because by only knowing the min and max, you are doing a linear fit of the data, and making guesses as if your data is well represented by that line. The less like a line your data actually is, the less good this ought to work.

That is true, and I thought up this idea while trying to think of how to generate 1d blue noise more quickly, which is random but roughly evenly spaced values. For that usage case it does well, but there are many types of non linear data out there that you might want to search through.

Really what you want to do is learn the distribution of the values in the list, and use that knowledge to know where the value you are searching for is likely to be.

I didn’t go that direction in these experiments, but it seems like a data scientist would have plenty of tools in their tool box to attempt something like that. Markov chain Monte Carlo type algorithms come to mind.

There’s another way to look at the problem of searching for a value in a list, and that’s to look at it as strictly a function inversion problem.

If you look at your sorted list as a lookup table, where the index is the x value, and the value stored is the y value, a search tries to tell you the x value for a specific y value that you are searching for.

In this context you only care about integer values of x, and there might be duplicate values in the list, making it not a strictly monotonic function – not having each y value be larger than the last y value – but has a more relaxed version where each y value is >= the last y value.

Thinking about the search problem as a function inversion problem, ignoring the monotocity issue, there are far too many data points to do an analytic inverse, so you would be looking at numerical inverse solutions.

I also didn’t really explore that direction, so it’s another way to go that might yield some better fruit.

Lastly, you could see searching a sorted list as a root finding problem. If you are looking for where the function minus the search value equals zero, numerical root finding functions could maybe help you here. I also did not try anything in that direction.

If anyone ends up exploring any of the alternative avenues, I’d love to hear what kind of techniques you used and what your results were!

# Linear Fit Search

The algorithm works like this…

1. Start with a sorted list, and the minimum and maximum value stored in that list.
2. Calculate a line fitting the min and max. For an equation y=mx+b, you are calculating m and b.
3. Using the inverse of the function, which is x=(y-b)/m, make a guess for what index (x) the search value (y) is at by plugging the search value into that equation as y and getting an x. That x is the index you are guessing the value is at.
4. If your guess was correct, you are done so exit. Otherwise, if the guess was too high, this is your new max. If the guess was too low, this is your new min. If you’ve run out of list to search, the value isn’t there, so exit.
5. Goto 2

This algorithm assumes the sorted list looks like a line if you were to graph it, so it does better when the sorted list actually looks like a line.

Let’s see how it does for a linear list with values in it between 0 and 2000. (Click to see full size image)

The left image shows the items in the array.

In the middle image, x axis is the number of items in the list, and y axis is how many guesses it took to search for a random value. This shows the average of 100 runs.

In the right image, it shows the minimum and maximum guesses it took for each list size, for those same 100 runs.

The linear fit did pretty well didn’t it? At minimum it took zero guesses (the search value was less or equal to min or greater or equal to max), and at maximum it took 2 guesses to find the search value, regardless of list size.

Binary search took about the usual log2(N), as expected.

Let’s try a list made up of random numbers between 0 and 2000.

That looks pretty similar to the linear case, but the line fit search doesn’t beat binary search by quite as much. The randomness of the list makes it so the guesses are more often wrong, and so it takes a few extra guesses to find the right place.

Let’s try a quadratic function: y=2000x^2:

The average for line fit search still beats binary search, but if you look at the min/max graph, the line fit min and max entirely encompasses the binary search min and max. That means there is a ton of variance about whether it will be faster or slower than binary search, even though on average it will be faster.

Let’s try a cubic function: y=2000x^3:

While the average still (barely) beats binary search, the maximum for line fit search has gotten REALLY erratic.

Let’s try a log function:

Ouch, the line fit is actually doing worse now than the binary search.

Lastly, let’s go back to the linear list, but let’s make the last entry in the table be 200,000 instead of 2000:

Ouch! Linear fit search is super awful now. What happened?!

It turns out that this uneven histogram type of list is really a worst case scenario for the line fit search.

What is happening here is that it sees the min as 0 and the max as 200,000 so it thinks the line is very steep. On it’s first guess, everything it could search for (it searches for a random value between 0 and 2000), it will think the value is at index 0. It will very likely be wrong, and elminate index 0. The next round, it will choose index 1, be very likely wrong again, and repeat by picking 2 then 3 then 4 and so on. This data layout nearly forces this search to a more computationally expensive version of linear search. Binary search doesn’t have this problem because it doesn’t care what the values are, it just cuts the list in half repeatedly until it’s done.

Wouldn’t it be nice if we could know whether it’d be better to use binary search or linear fit search for a data set?

We’d have to analyze the data set to figure that out, and if we are going to go to all that trouble, we probably should just learn the shape of the data set in general and use that knowledge to make a better guess than either binary search or linear fit.

I think going that route could be fruitful, but I didn’t try it. Instead I came up with a Hybrid Search.

Here is my more readable, less optimized code for the linear fit search.

TestResults TestList_LineFit(const std::vector<size_t>& values, size_t searchValue)
{
// The idea of this test is that we keep a fit of a line y=mx+b
// of the left and right side known data points, and use that
// info to make a guess as to where the value will be.
//
// When a guess is wrong, it becomes the new left or right of the line
// depending on if it was too low (left) or too high (right).
//
// This function returns how many steps it took to find the value
// but doesn't include the min and max reads at the beginning because
// those could reasonably be done in advance.

// get the starting min and max value.
size_t minIndex = 0;
size_t maxIndex = values.size() - 1;
size_t min = values[minIndex];
size_t max = values[maxIndex];

TestResults ret;
ret.found = true;
ret.guesses = 0;

// if we've already found the value, we are done
if (searchValue < min)
{
ret.index = minIndex;
ret.found = false;
return ret;
}
if (searchValue > max)
{
ret.index = maxIndex;
ret.found = false;
return ret;
}
if (searchValue == min)
{
ret.index = minIndex;
return ret;
}
if (searchValue == max)
{
ret.index = maxIndex;
return ret;
}

// fit a line to the end points
// y = mx + b
// m = rise / run
// b = y - mx
float m = (float(max) - float(min)) / float(maxIndex - minIndex);
float b = float(min) - m * float(minIndex);

while (1)
{
// make a guess based on our line fit
ret.guesses++;
size_t guessIndex = size_t(0.5f + (float(searchValue) - b) / m);
guessIndex = Clamp(minIndex + 1, maxIndex - 1, guessIndex);
size_t guess = values[guessIndex];

// if we found it, return success
if (guess == searchValue)
{
ret.index = guessIndex;
return ret;
}

// if we were too low, this is our new minimum
if (guess < searchValue)
{
minIndex = guessIndex;
min = guess;
}
// else we were too high, this is our new maximum
else
{
maxIndex = guessIndex;
max = guess;
}

// if we run out of places to look, we didn't find it
if (minIndex + 1 >= maxIndex)
{
ret.index = minIndex;
ret.found = false;
return ret;
}

// fit a new line
m = (float(max) - float(min)) / float(maxIndex - minIndex);
b = float(min) - m * float(minIndex);
}

return ret;
}


# Hybrid Search

Since binary search and linear fit search both have situationally good properties, I decided to try a hybrid of the two where it switches between the two for each guess. The first guess is a linear fit, the next is a binary search guess, then back to linear fit, and so on.

Here’s where that puts things with the previous worst case scneario: the linear data with a single huge outlier. New graph on top, old on bottom for comparison. Apologies that the colors aren’t consistent between old and new! 😛

There’s quite a bit of variance, and the linear fit min and max contains the binary search min and max, but on average it does beat the binary search now, which is kind of neat.

Let’s analyze the line fit worst performers to best performers and see how the hybrid search compares.

Here’s the log function:

The variance has decreased compared to line fit. The average beats binary search too, where the non hybrid test didn’t.

Next is the cubic function:

With the non hybrid approach, cubic on average was barely beating binary search and had a huge amount of variance. The hybrid average is beating binary search by a larger margin and the variance has dropped a lot.

The line fit search beat binary search, like the hybrid search does. It even beats it by roughly the same amount. The hybrid search has a lot less variance though, which is a nice property. You’ll have more consistent timings as you search.

Here’s random:

The hybrid search does a little worse both for average, and variance, than the linear fit search did.

Last is linear:

it’s impossible to see where the hybrid max line is, but it went up to 3, from the 2 that line fit max was at, which also brings the average up just a little bit. In my opinion, that isn’t so bad that we slightly damaged the perfectly linear and random cases in favor of making it much more robust in the general case.

Here is my more readable, less optimized code for the hybrid search. The only meaningful difference is on line 48 where it chooses to do a linear fit or binary search step, and line 72 where it toggles which one it does next.

TestResults TestList_HybridSearch(const std::vector<size_t>& values, size_t searchValue)
{
// On even iterations, this does a line fit step.
// On odd iterations, this does a binary search step.
// Line fit can do better than binary search, but it can also get trapped in situations that it does poorly.
// The binary search step is there to help it break out of those situations.

// get the starting min and max value.
size_t minIndex = 0;
size_t maxIndex = values.size() - 1;
size_t min = values[minIndex];
size_t max = values[maxIndex];

TestResults ret;
ret.found = true;
ret.guesses = 0;

// if we've already found the value, we are done
if (searchValue < min)
{
ret.index = minIndex;
ret.found = false;
return ret;
}
if (searchValue > max)
{
ret.index = maxIndex;
ret.found = false;
return ret;
}
if (searchValue == min)
{
ret.index = minIndex;
return ret;
}
if (searchValue == max)
{
ret.index = maxIndex;
return ret;
}

// fit a line to the end points
// y = mx + b
// m = rise / run
// b = y - mx
float m = (float(max) - float(min)) / float(maxIndex - minIndex);
float b = float(min) - m * float(minIndex);

bool doBinaryStep = false;
while (1)
{
// make a guess based on our line fit, or by binary search, depending on the value of doBinaryStep
ret.guesses++;
size_t guessIndex = doBinaryStep ? (minIndex + maxIndex) / 2 : size_t(0.5f + (float(searchValue) - b) / m);
guessIndex = Clamp(minIndex + 1, maxIndex - 1, guessIndex);
size_t guess = values[guessIndex];

// if we found it, return success
if (guess == searchValue)
{
ret.index = guessIndex;
return ret;
}

// if we were too low, this is our new minimum
if (guess < searchValue)
{
minIndex = guessIndex;
min = guess;
}
// else we were too high, this is our new maximum
else
{
maxIndex = guessIndex;
max = guess;
}

// if we run out of places to look, we didn't find it
if (minIndex + 1 >= maxIndex)
{
ret.index = minIndex;
ret.found = false;
return ret;
}

// fit a new line
m = (float(max) - float(min)) / float(maxIndex - minIndex);
b = float(min) - m * float(minIndex);

// toggle what search mode we are using
doBinaryStep = !doBinaryStep;
}

return ret;
}


## Random Odds and Ends

Just like binary search, the linear fit and hybrid search algorithms can return you the index to insert your value into the list, if not present.

Some folks may balk at the idea of having the min and max value of the list before you do a search, from the point of view that it’s sort of like 2 guesses that aren’t being counted against the graph. If that’s your point of view, you can add 2 to the values graphed and you can see that the hybrid search is still compelling. I think it’s perfectly reasonable that you’d know the min and max of a sorted list though. After all, we store the length, why not also the min and max?

It may not be optimal to do 1 step of line fit search and 1 step of binary search in the hybrid search method. It might be that by doing something like 1 binary step then 3 line fit steps, and repeating that pattern, may give you better results. It may also be a better idea to just do line fit search, but if you aren’t making good enough progress, throw in a binary search step. I didn’t explore this at all due to the “nice enough” results i got switching off every time.

I had a thought that it might be good to try doing an “online linear squares fit” while making guesses so that you learned the shape of the list while searching it. If that sounds interesting to you, give this a read: https://blog.demofox.org/2016/12/22/incremental-least-squares-curve-fitting/. I suspect that having a more localized fit (like in this post) performs better, but I might be wrong. I could also see doing a least squares fit of the data offline in advance so you had that data available, like a min and a max, before you started the search. A problem with doing a fit in general though is that you have to be able to invert the function of whatever you fit the data with. Quadratic or cubic seem like they are probably the limit of what you’d want to try to avoid ringing and the complexity of higher order function inversion.

You can make binary searches more cache friendly by putting them into binary trees stored in arrays. This makes it so for instance, that when you test index 0, you are really testing the half way point. If the search value is less than index 0, you look at index 1, else you look at index 2. The left and right child of an index is just index*2 and index*2+1. I bring this up, because the “fixed guess points” of a binary search make this possible. A linear fit search doesn’t have fixed guess points, which makes it not possible to do the same thing. I’m betting with some creativity, some better cache friendliness could be figured out for a linear fit search.

Following in that idea, is the concept of a cache oblivious b-tree. Check it out here: https://github.com/lodborg/cache-oblivious-btree

Another nice property of binary searching is that you can make it branchless and very SIMD friendly, or very friendly for simple hardware implementations. A linear fit search doesn’t seem as well suited for that, but again, maybe some creativity could help it be so. Here’s more about binary search operating like I just described: https://blog.demofox.org/2017/06/20/simd-gpu-friendly-branchless-binary-search/

Lastly, you might have noticed that the graph for the linear data set showed that the line fit and hybrid searches were taking fewer guesses as the list got larger. It looks impossible, and lets me make this dank meme:

What the heck is going on there?

The x axis of those graphs shows how large the list is, and the y axis is how many guesses are taken, but in all those linear lists of each size, the list linearly breaks up the range [0,2000]. It’s also always searching for random numbers in [0,2000]

In smaller lists, the numbers are more sparse, while in larger lists the numbers are more dense.

If you have a linear data set, and are using a linear fit to look for a number in that list that may or may not be there, a denser list will have the values there more often, and the first guess is going to more often be the correct location of the search value.

That’s what is happening, and that’s why it’s showing an improvement in the linear case as the list gets larger, because it’s also getting more dense.

Here’s a graph for a version of the test where the density is kept the same for each list. The lists are between [0,5*count] and the search values are in the same range.

It’s interesting and kind of cool that both the average and min/max are flat, but this is a best case scenario for the line fit (and hybrid) search, with the data actually being linear.

## Performance

Ok finally we get to performance. Many of you fine folks were probably looking at the guess count graphs and thinking “So what? Where’s the perf measurements?” TL;DR I think this is a pareto frontier advancement but i’ll explain more.

here are the perf results but don’t be too quick to say “aha!”, because they need some explanation and context. These results are on my modern-ish gaming laptop.

Results:

• Linear search takes ~1.5 nanoseconds per guess. (eg, increment the index and read the next value from the array)
• Binary search takes ~5 nanoseconds per guess.
• Both linear fit and hybrid search takes ~12 nanoseconds per guess.

So, from my tests, binary search would need to take 2.5 times as many guesses as linear fit or hybrid searching to break even. The only case where that is true in my tests is the purely linear list.

Now that I’ve said that, I don’t think the tests I’ve done are really a good apples to apples comparison.

What I did as a test was generate lists of the various types described above, generated a list of random numbers to search for in them, then had each search algorithm do all the searches and i divided the total time by the total number of guesses done to get a time per guesses for each algorithm.

It is true that the linear fit is slightly more complicated logic than a binary search, or the linear search, so computationally I do expect it to take longer, and the 2.5x as long seems like a fair measurement.

HOWEVER, searching the same list over and over is an unrealistic pattern for most applications. More of the list would be likely to be in the cache when doing multiple searches back to back like this, so memory reading would be under-reported in the profiling.

Because the linear fit (and hybrid) searches are more computationally expensive, but end up doing fewer guesses, they use more cpu, but less memory bandwidth. That means that the wins they give would show up in times when memory reads (or wherever the list was stored) were slower. Having the list in the cache is not a time when the reads are going to be slower, so I think the testing is stacked against the linear fit and hybrid testing.

That said, I can’t think of a better “canned performance test” to compare apples to apples. You really would need to drop it in, in a realistic usage case for searching in an application, and see if it was better or worse for that specific usage case.

If you were memory bandwidth bound, and thus had some compute to spare, this search seems like it could possibly be a nice option. Or, in exotic situations where reading a list was VERY VERY slow (remote servers, homomorphic encryption, data stored on disk not in memory?) this could be a better algorithm. In those exotic situations where reads are way more expensive that computation, you’d probably want to go further though, and use more advanced algorithms to really make every guess count, using a lot more CPU to do so.

Lastly on perf: none of this code has been optimized. I wrote it for clarity, not speed. It’s possible that the comparison landscape could change (either for better or worse) with optimized code.

If anyone investigates perf more deeply, I’d love to hear results and in what context those results were found. Thanks!

## Quadratic Fit Search and Beyond?

An obvious questions is: can this search technique extend to quadratic and beyond?

I do think so. Let’s look at how that might work, and then i’ll point out some complications that make it more challenging.

Let’s think about the quadratic case. You’d need to start with a quadratic fit of the data, which would require 3 data samples from the list. Two data samples would be the first and last index just like the linear search, but where should the third data point be from?

One place it could be is in the middle of the list. If you can afford more processing time than that, you might consider picking whatever index gives the lowest error between the quadratic fit and the actual data stored in the array.

Now you have a quadratic fit of the data in the array and can begin searching. You have some y=f(x) function that is quadratic, and you invert it to get a x=f(y) function. All is well so far.

You make your first guess by pluggin your search value in for y and getting an x out which is your first guess for where the number is. When you read that number, if it is the search value, you are done. If it doesn’t match though, what do you do?

Your guess point is going to be between your min and max, but it might be to the left or the right of the third point you have in the quadratic fit. That is two possibilities.

Your guess may also be too low, or too high. That is two more possibilities, making for four possible outcomes to your guess.

Let’s say your guess was to the left of the “third point” and deal with these two outcomes first:

• If your guess was less than the search value, it means that your guess is the new minimum.
• If your guess was greater that the search value it means that your guess is the new maximum. A problem though is that your “third point” is now to the right of the search maximum. This isn’t so bad because it still fits real data on the curve but it seems a little weird.

If your guess was on the right of the “third point”, we have these two outcomes to deal with:

• If your guess was less than the search value, the guess is the new minimum, and the “third point” in the quadratic fit is to the left and is less than the minimum.
• If your guess was greater than the search value, the guess is the new maximum.

Are you with me so far? the “third point” seems oddly stationary at this point, but the next round of searching fixes that.

On the second step of searching (and beyond), we have some new possibilities to add to the previous four. The “third point” can either be less than the minimum or greater than the maximum. That is two possibilities.

And once again, we have two possibilities in regards to what our guess found: The guess value could be lower than the search value, or it could be higher.

Due to symmetry, let’s just consider the “third point” to be greater than our max, and then we can just consider the less than and greater than case:

• If our guess was too small, it’s the new minimum.
• If our guess was too large, it’s the new maximum, but the old maximum becomes the new “third point”. This moves the “third point” to be more local, giving us a more local quadratic fit of our data, which should help the search make better guesses.

So now, the “third point” moves around, and the quadratic fit is updated to be a localized fit, like we want it to be.

For the cubic case and above, I’ll leave that to you to sort out. It just is updating the minimum and maximums based on the guess value vs search value, and then doing a dance to make sure and keep the most local points around for the curve fit of the data, and throwing out the less local points to make room. I am pretty sure it’s extendable to any degree you want, and that one algorithm could be written to satisfy arbitrary degrees.

Now onto a complication!

Our very first step is to make an initial fit of data of whatever degree and then invert it. To invert the function, it needs to be monotonically increasing – aka there is no part on the graph where if you look at the point to the left, it’s higher. Each point on the graph should be higher than the point to the left.

The bad news is that if even looking at the quadratic case, making a quadratic curve pass through 3 data points A, B, C where A <= B <= C, the result is very often NOT going to be monotonic.

That means you are going to have a bad time trying to invert that function to make a guess for where a search value should be in the list.

I think a good plan of attack would be to fit it with a monotonic quadratic function that didn't necessarily pass through the 3 data points. That would affect the quality of your guess, but it might (probably should??) do better at guessing than a line fit, at the cost of being more computationally expensive. I'm not sure how to do that specifically, but I'd be surprised if there wasn't an algorithm for it.

For details on how even quadratic often isn't monotonic:

Some possibly good leads to dealing with this:

https://math.stackexchange.com/questions/3129051/how-to-restrict-coefficients-of-polynomial-so-the-function-is-strictly-monotoni

https://en.wikipedia.org/wiki/Monotone_cubic_interpolation

## Closing

Thanks for reading. Hopefully you found it enjoyable.

If you use this, or do any related experimentation, I’d love to hear about it.

# Blending an HDR color into a U8 Buffer

I stumbled on something that I found interesting, so wanted to share in case it was useful for other people too.

The c++ code that generated this images can be found on github at https://github.com/Atrix256/U8HDRPMA

I was implementing Inigo Quilez’ “Better Fog” which is REALLY REALLY cool. It looks way better than even the screenshots he has on his page, especially if you have multiple types of fog (distance fog, height fog, fog volumes):
http://www.iquilezles.org/www/articles/fog/fog.htm

I first had it implemented as a forward render, so was doing the fogging in the regular mesh rendering shader, with all calculations being done in 32 bit floats, writing out the final result to a RGBAU8 buffer. Things looked great and it was good.

I then decided I wanted to ray march the fog and get some light shafts in, so it now became a case where I had a RGBAU8 color render target, and I had the depth buffer that I could read to know pixel world position and apply fog etc.

The result was that I had a fog color that has an HDR fog color (it had color components greater than 1 from being “fake lit”) and I knew how opaque the fog was, so I just needed to lerp the existing pixel color to the HDR fog color by the opacity. The usual alpha blending equation (The “over” operator) is actually a lerp so I tried to use it as one.

Source Blend: Source Alpha
Dest Blend: 1 – Source Alpha

That becomes this, which is the same as a lerp from DestColor to SrcColor using a lerp amount of SourceAlpha.

$\text{DestColor} = \text{DestColor} * (1 - \text{SourceAlpha}) + \text{SrcColor} * \text{SrcAlpha}$

BAM, that’s when the problem hit. My image looked very wrong, but only where the fog was thickest and brightest. I was thinking maybe it how i was integrating my fog but it wasn’t. So maybe it was an sRGB thing, but it wasn’t. Maybe it was how i was reconstructing my world position or pixel ray direction due to numerical issues? It wasn’t.

This went on and on until i realized: You can’t say “alpha blend (1.4, 0.3, 2.4) against the color in the U8 buffer using an alpha value of 0.5”. The HDR color is clamped before the alpha blend and you get the wrong result.

You can’t alpha blend an HDR color into a U8 buffer!

… or can you?!

# Doing It

As it turns out, premultiplied alpha came to the rescue here, but let’s look at why. As we go, we are going to be modifying this image:

Mathematically speaking, alpha blending works like this:

$\text{DestColor} = \text{DestColor} * (1 - \text{SourceAlpha}) + \text{SrcColor} * \text{SrcAlpha}$

Using the X axis as alpha, and an overlaid solid color of (1.6, 1.4, 0.8), that gives us this:

However, if you output a float4 from your shader that is $\text{float4}(\text{SrcColor}, \text{SrcAlpha})$, alpha works like the below, where $\text{sat}()$ clamps values to be between 0 and 1:

$\text{DestColor} = \text{DestColor} * (1 - \text{SourceAlpha}) + \text{sat}(\text{SrcColor}) * \text{SrcAlpha}$

So what happens, is that SrcColor gets clamped to be between 0 or 1 before the lerp happens, which makes the result much different:

However, using pre-multiplied alpha changes things. The float4 we return from the shader is now $\text{float4}(\text{SrcColor*SrcAlpha}, \text{SrcAlpha})$.

Our blend operations are now:

Source Blend: One
Dest Blend: 1 – Source Alpha

That makes the blending equation become this:

$\text{DestColor} = \text{DestColor} * (1 - \text{SourceAlpha}) + \text{sat}(\text{SrcColor} * \text{SrcAlpha}) * 1$

The $\text{sat()}$ function changed to encompass the whole second term, instead of just SrcColor! That gives this result that matches the one we got when we did the lerp in shader code:

# Quick Math

So visually things look fine, but let’s look real quick at the math involved.

If you lerp from 0.5 to 10.0 with a lerp factor of 0.2, you’d get 2.4. The equation for that looks like this:

$0.5 * 0.8 + 10.0 * 0.2 = 0.4 + 2.0 = 2.4$

This is what happens when doing the math in the forward rendered shader. You then write it out to a U8 buffer, which clips it and writes out a 1.0.

If you use alpha blending, it clamps the 10 to 1.0 before doing the lerp, which means that it lerps from 0.5 to 1.0 with a lerp factor of 0.2. That gives you a result of 0.6 which is VERY incorrect. This is why the HDR color blending to the U8 buffer didn’t work.

If you use premultiplied alpha blending instead, it clamps the 10.0*0.2 to 1, which means that it was 2 but becomes 1, and the result becomes 1.4. That gets clipped to 1.0 so gives you the same result as when doing it during the forward rendering, but allowing you to do it during a second pass.

$0.5 * 0.8 + \text{sat}(10.0 * 0.2) = 0.4 + \text{sat}(2.0) = 0.4 + 1.0 = 1.4$

This doesn’t just work for these examples or some of the time, it actually works for all inputs, all of the time. The reason for that is, the second term of the lerp is clipped to 0 to 1 and is added to the first term which is always correct. Both terms are always positive. That means that the second term can add the full range of available values (0 to 1) to the first term, and it is correct within that range. That means this technique will either give you the right answer or clip, but will only clip when it is supposed to anyways.

# Closing

While I found this useful in a pinch, it’s worth noting that you may just want to use an HDR format buffer for doing this work instead of working in a U8 buffer. The reason why is even though this gives the same answer as doing the work in the shader code, BOTH implementations clip. That is… both implementations SHOULD be writing out values larger than 1.0 but the colors are clamped to being <= 1.0. This is important because if you are doing HDR lit fog (and similar), you probably want to do some sort of tone mapping to remap HDR colors to SDR colors, and once your colors clip, you've lost information that you need to do that remapping.

The red pixels below show where clipping happens:

# Monte Carlo Integration Explanation in 1D

Let’s say that you have a function $y=\sin(x)^2$ and you want to know what the area is under the curve between 0 and pi.

We could solve this specific problem by doing some algebra and calculus to get the exact answer analytically (which is $\frac{\pi}{2}$), but let’s pretend like we can’t, or don’t want to solve it that way.

Another way to solve this problem is to use Monte Carlo integration, which lets you solve it numerically and get an approximated answer.

How you would do that is like this:

1. Pick a random number between 0 and pi.
2. Plug that value into the function $y=\sin(x)^2$ as x to get a y value.
3. Do this multiple times and take the average to get the average y value of the function.
4. Pretending that the function is a rectangle, you can use the average y as the height of the rectangle, and use pi as the width because we are looking between 0 and pi.
5. Multiply that width and height to get the area of a rectangle, which is the estimated area under the curve.

That’s all you need to do!

Monte Carlo integration is pretty powerful in how simple it is, and how it works really well even in extremely high dimensions.

As you might imagine, the more samples you take to get your average y value, the better your estimate is going to be. Unfortunately though, you have to quadruple the number of samples you have to cut the error in half, so it can take a while to get the correct answer (converge) if you need a high level of accuracy. (https://en.wikipedia.org/wiki/Monte_Carlo_method#Integration)

Here’s a C++ code snippet doing this process with 10,000 samples. Each time you run the program you’ll get a different estimate. If you take more samples, you’ll more reliably get a better answer.

double SimpleMonteCarlo()
{
double rangeMin = 0;
double rangeMax = 3.14159265359;

size_t numSamples = 10000;

std::random_device rd;
std::mt19937 mt(rd());
std::uniform_real_distribution<double> dist(rangeMin, rangeMax);

double ySum = 0.0;
for (size_t i = 1; i <= numSamples; ++i)
{
double x = dist(mt);
double y = sin(x)*sin(x);
ySum += y;
}
double yAverage = ySum / double(numSamples);

double width = rangeMax - rangeMin;
double height = yAverage;

return width * height;
}


Below is the output of the code ran 5 times. Note that the real answer is $\frac{\pi}{2}$ which is 1.57079632679.

1. 1.548451
2. 1.554312
3. 1.576727
4. 1.578759
5. 1.598686

(I’m actually a bit disturbed that the 5 runs are actually sorted from low to high but whatever …)

A problem with this being based on regular old random numbers (white noise) is that sometimes the numbers will clump, giving too much weighting to one area of the function, and leave empty space where another part of the function wasn’t sampled at all.

There are many different ways to deal with this situation but two of my favorites are…

Both of those things give more even coverage over the sampling space which means that you won’t have as large gaps of missing information from your samples.

Another way to help this is stratified sampling, where you break the sampling space up into some number of sections, and choose random numbers within each section, making sure to have samples in each of the sections. That keeps the randomness, but gives more even coverage over the sampling space.

You might be tempted to just say “If I’m taking 100 samples, i’ll just sample every 1/100th of the space evenly”. That uniform / regular sampling has some problems including aliasing, but also loses some of the positive mathematical properties that random numbers can give you (like, being able to sample from non rational numbered locations!).

A variation on stratified sampling is a technique invented by Pixar called “jittered grid” where you do even sampling, but add a small random value to each sample.

There are lots and lots of other techniques which could make up a long list of blog posts, so we’ll stop there! 🙂

# More General Monte Carlo Integration

The last section was actually a simplified version of a Monte Carlo integration which was able to be simplified because it was using uniform random numbers.

Monte Carlo integration works with random numbers that have arbitrary distributions as well, not just uniform random numbers.

The process works mostly the same but there are a couple differences.

In the previous section, we got an average height and then multiplied by the width to get an estimate of the area under the curve, pretending that it was a rectangle.

The first change is to move the multiplication by the width into the loop. Instead of calculating an average height, we are instead calculating average rectangle areas.

Mathematically you get the same answer, so there’s nothing crazy there.

The second change is that instead of multiplying by the width, you divide by the probability of the number being chosen, that you plugged into the equation.

In the case of our function that we are taking samples of between 0 and pi, the probability of any single number being chosen in that range is $\frac{1}{\pi}$. When we divide by that, it means we end up just multiplying by pi, so it’s mathematically equivalent to what were were doing before!

Here’s the steps for the more generalized monte carlo integration:

1. Pick a random number between 0 and pi using any random number distribution you’d like to.
2. Plug that value into the function $y=\sin(x)^2$ as x to get a y value.
3. Divide that y value by the probability of having chosen that number (otherwise known as PDF(x)) to get an estimated area of the function.
4. Do this multiple times and take the average to get your result.

Here is some code to do the more general Monte Carlo integration, still using uniformly distributed random numbers.

double GeneralMonteCarlo()
{
size_t numSamples = 10000;

std::random_device rd;
std::mt19937 mt(rd());
std::uniform_real_distribution<double> dist(0.0f, 1.0f);

auto InverseCDF = [](double x) -> double
{
return x * c_pi;
};

auto PDF = [](double x) -> double
{
return 1.0f / c_pi;
};

double estimateSum = 0.0;
for (size_t i = 1; i <= numSamples; ++i)
{
double rnd = dist(mt);
double x = InverseCDF(rnd);
double y = sin(x)*sin(x);
double pdf = PDF(x);
double estimate = y / pdf;

estimateSum += estimate;
}
double estimateAverage = estimateSum / double(numSamples);

return estimateAverage;
}


Interestingly, dividing by the PDF is the same mathematically as multiplying by width in the last section – it literally ends up being a multiplication by pi (the width). The only difference is that we pulled the multiply into the loop, instead of leaving it until the end.

As an optimization, you could definitely move the divide out again (and turn it into a multiply), but I wanted to present the code as close to the core concepts as possible.

# Non Uniform Random Number Distributions

Let’s try sampling from a different random number distribution. Let’s generate random numbers which have a distribution of $y=\sin(x)$. You can see it compared to the function we are integrating $y=\sin(x)^2$ below. They are fairly similarly shaped!

To use $y=\sin(x)$ as a random number distribution for monte carlo integration, we’ll need to calculate the normalized PDF and we’ll also need to calculate the inverse CDF.

If you want to know more about PDFs and “whatever an inverse CDF may be”, give this a read: Generating Random Numbers From a Specific Distribution By Inverting the CDF

• The function $y=\sin(x)$ is normalized to this PDF: $\mathit{PDF}(x) = \frac{\sin(x)}{2}$
• To generate numbers from that PDF, you take a random number $x$ that is between 0 and 1 and plug it into this function, which is the inverse CDF: $\mathit{CDF}^{-1}(x) = 2 \cdot \sin^{-1}(\sqrt{x})$

Here is a code snippet doing monte carlo integration with this PDF and inverse CDF:

double ImportanceSampledMonteCarlo()
{
size_t numSamples = 10000;

std::random_device rd;
std::mt19937 mt(rd());
std::uniform_real_distribution<double> dist(0.0, 1.0);

auto InverseCDF = [](double x) -> double
{
return 2.0 * asin(sqrt(x));
};

auto PDF = [](double x) -> double
{
return sin(x) / 2.0f;
};

double estimateSum = 0.0;
for (size_t i = 1; i <= numSamples; ++i)
{
double rng = dist(mt);
double x = InverseCDF(rng);
double y = sin(x)*sin(x);
double pdf = PDF(x);
double estimate = y / pdf;

estimateSum += estimate;
}
double estimateAverage = estimateSum / double(numSamples);

return estimateAverage;
}


To compare this versus uniform random sampling, I'll show the progress it makes over 50,000,000 samples first using uniform random numbers, then using the $y=\sin(x)$ shaped PDF.

Uniform aka 1/pi:

sin(x):

You may notice that every 4x samples, the standard deviation (which is the square root of variance) drops in half, like we talked about before. This is why path tracing takes so long. If you don’t know what path tracing is, this is why modern animated movies take so long to render.

In the results, you can see that the variance of the estimates is a lot lower using this PDF that is shaped more like the function we are trying to integrate. We got a better, more reliable answer with fewer samples. Is that pretty cool? You bet it is! When you use a PDF shaped like the function you are integrating, to get better results faster, that is called importance sampling.

If you use a PDF which is shaped very differently from the function you are trying to integrate, you will get more variance and it will take longer to converge, which is a total bummer.

Let’s try $y=(\frac{x}{\pi})^5$, which doesn’t look much like the function we are trying to integrate at all:

Here is the PDF and inverse CDF:

• $\mathit{PDF}(x)=(\frac{x}{\pi})^5 \cdot \frac{6}{\pi}$
• $\mathit{CDF}^{-1}(x)= (x*\pi^6)^{\frac{1}{6}}$

Here it is with 50,000,000 samples:

And here is the uniform sampling again as a comparison:

As you can see, it is approaching the right answer, but is taking about 10 times as long to get the same results (amount of variance) compared to uniform sampling. Ouch!

# Perfect Random Number Distributions

Let’s say that we got really lucky and somehow got the PDF and inverse CDF for a function that perfectly matched the function we were trying to integrate. What would happen then?

Let’s check it out by integrating the function $y=\sin(x)$ by using a random number distribution which has the form $y=\sin(x)$.

We already calculated the PDF and inverse CDF of that function earlier:

• $\mathit{PDF}(x) = \frac{\sin(x)}{2}$
• $\mathit{CDF}^{-1}(x) = 2 \cdot sin^{-1}(\sqrt{x})$

Here we do that with 50,000,000 samples:

WOW! As you can see, it had the right answer from the first sample, with zero variance (randomness) and it kept steady at that answer for all 50,000,000 samples.

This is a pretty neat concept, and if you know about “cosine weighted hemisphere sampling”, that does this exact thing.

Cosine weighted hemisphere samples are weighted such that you can remove the $\cos(\theta)$ from the lighting calculations, because the random number distribution handles it for you.

It basically removes that part of randomness from the equations.

Unfortunately there are more variables and randomness in path tracing than just that term, but it helps.

Beyond this, you’d start look at other variance reduction techniques if you were interested, including multiple importance sampling.

# Closing

Going into this blog post I thought “hey no sweat, i’ll make a few simple functions, calculate their PDFs, inverse CDFs and be on my way”.

I can’t believe how almost all the simple functions I tried ended up being impossible to take through the process.

for instance, you can take $x=\sin(y)$ and solve for y to get $y=\sin^{-1}(x)$, but if you try to solve $x=\sin(y)+y$ for y, you are going to have a bad day!

I think in the future if I need to do something like this, I’d like to try fitting a curve to the (x,y) data points reordered as (y,x) data points, but there are many other methods for doing this sort of thing as well.

BTW if wondering how I was calculating std dev (aka square root of variance) while integrating, variance is “The average of the squared differences from the mean”. That means that if you know the correct answer of what you are trying to integrate, you can calculate the std dev like this:

        // Variance is "The average of the squared differences from the mean"
double difference = integration - actualAnswer;
double differenceSquared = difference * difference;
averageDifferenceSquared = Lerp(averageDifferenceSquared, differenceSquared, 1.0 / double(i));
double stdDev = sqrt(averageDifferenceSquared);

• integration is the current average estimate (if you have taken 100 samples, it’s the average of the 100 samples)
• averageDifferenceSquared is also the variance
• i is the number of samples you have taken, including the current one (aka start at 1, not 0)
• If you are confused about me doing a lerp to calculate an average, give this a read: Incremental Averaging

Hope you enjoyed this write up!

Anders Lindqvist (@anders_breakin) is writing up a blog post explaining monte carlo, importance sampling, and multiple importance sampling that you might be interested in if you enjoyed this. Give him a follow, and it’ll be coming out soon 🙂

# Taking a Stroll Between The Pixels

This post relates to a paper I wrote which talks about (ab)using linear texture interpolation to calculate points on Bezier curves. Extensions generalize it to Bezier surfaces and (multivariate) polynomials. All that can be found here: https://blog.demofox.org/2016/02/22/gpu-texture-sampler-bezier-curve-evaluation/

The original observation was that if you sample along the diagonal of a 2×2 texture, that as output you get points on a quadratic Bezier curve with the control points of the curve being the values of the pixels like in the image below. When I say you get a quadratic Bezier curve, I mean it literally, and exactly. One way of looking at what’s going on is that the texture interpolation is literally performing the De Casteljau algorithm. (Note: if the “B” values are not equal in the setup below, the 2nd control point will be the average of these two values, which an extension abuses to fit more curves into a smaller number of pixels.)

An item that’s been on my todo list for a while is to look and see what happens when you sample off of the 45 degree diagonal between the pixel values. I was curious about questions like:

• What if we sampled across a different line?
• What if we samples across a quadratic curve like by having $y=x^2$?
• What if we sampled on a circle or a sine wave?
• How does the changed sampling patterns work in higher dimensions – like trilinear or quadrilinear interpolation?

After accidentally coming across the answer to the first question, it was time to look into the other ones too!

PS – if wondering “what use can any of this possibly have?” the best answer I have there is data compression for data on the GPU. If you can fit your data with piecewise rational polynomials, the ideas of this technique could be useful for storing that data in a concise way (pixels in a texture) that are also quickly and easily decoded by the GPU. The ideas from this post allows for more curve types when fitting and storing your data, beyond piecewise rational polynomials. It’s also possible to store higher order curves and surfaces into smaller amounts of texture data.

# Quick Setup: Bilinear Interpolation Formula

Bilinear interpolation is available on modern GPUs as a way of getting sub-pixel detail. In the olden days, when zooming into a texture, the square pixels just got larger because nearest neighbor filtering was used. In modern times, when looking at the space between pixel values, bilinear interpolation is used to fill in the details better than nearest neighbor does.

You can describe bilinear interpolation as interpolating two values across the x axis and interpolating between the results across the y axis (reversing the order of axes also works). Mathematically, that can look like this:

$z = (A(1-x) + Bx)(1-y) + (C(1-x)+Dx)y$

Where x and y are values between 0 and 1 describing where the point is between the pixels, and A,B,C,D are the values of the 4 nearest pixels, which form a box around the point we are calculating. A = (0,0), B = (1,0), C = (0,1) and D = (1,1).

With some algebra, you can get that equation into a power series form which is going to be easier to work with in our experiments:

$z = (A-B-C+D)xy + (B-A)x + (C-A)y + A$

Now that we have our formula, we can begin! 🙂

# Sampling Along Other Lines

So, if we sample along the diagonal from A to D, we know that we get a quadratic equation out. What happens if we sample along other lines though?

My guess before I knew the answer to this was that since the 45 degree angle line is quadratic (degree 2), and that horizontal and vertical lines were linear (degree 1), that sampling along other lines must be a fractional degree polynomial between 1 and 2. It turns out that isn’t the answer, but I wonder if there’s a way to interpret the “real answer” as a fractional polynomial?

Anyways, wikipedia clued me in: https://en.wikipedia.org/wiki/Bilinear_interpolation#Nonlinear

The interpolant is linear along lines parallel to either the x or the y direction, equivalently if x or y is set constant. Along any other straight line, the interpolant is quadratic

What that means is that if you walk along a horizontal or vertical line, it’s going to be linear. Any other line will be quadratic.

Let’s try it out.

Remembering that the equation for a linear function is $y=mx+b$ let’s literally replace $y$ with $mx+b$ and see what we get out.

$z = (A-B-C+D)xy + (B-A)x + (C-A)y + A$

Which becomes this after substitution:

$z = (A-B-C+D)x(mx+b) + (B-A)x + (C-A)(mx+b) + A$

After some expansion and simplification we get this:

$z = (Am-Bm-Cm+Dm)x^2+(Ab-Bb-Cb+Db+Cm-Am+B-A)x+Cb-Ab+A$

This formula tells us the value we get if we have a bilinear interpolation of values A,B,C,D (aka a bilinear surface defined by those points), and we sample along the x,y line defined by $y=mx+b$.

It’s a very generalized function that’s hard to reason about much, but one thing is clear: it is a quadratic function! Whatever constant values you choose for A,B,C,D,m and b, you will get a quadratic polynomial (or lower degree, but never higher).

Here’s a shadertoy that shows curves generated by random sub pixel line segments on a random (white noise) RGB texture: https://www.shadertoy.com/view/XstBz7

(note that the rough edges of the curve are due to the fact that interpolation happens in X.8 fixed point format, so has pretty limited precision. Check the paper for more information and ways to address the issue.)

Let’s explore a bit by plugging in some values for $m$ and $b$ and see what happens for different types of lines.

m=0, b=0

Let’s see what happens when m is 0 and b is 0. In other words, lets see what happens when we sample along the line $y=0$.

Plugging those values in gives:

$z = (B-A)x + A$

interestingly, this is just a linear interpolation between A and B, which makes sense when looking at the graph of where we are sampling on the bilinear surface.

This goes along with what wikipedia told us: when one of the axes is constant (it’s a horizontal or vertical line) the result is linear.

m=1, b=0

Let’s try m = 1 and b = 0. That is the line: $y=x$. This graph shows where that is sampling from on the bilinear surface:

Plugging in the values gives us this equation:

$z = (A-B-C+D)x^2+(C+B-2A)x+A$

We get a quadratic out! This shouldn’t be too surprising. This is the original insight in the technique. This is also the formula for a quadratic Bezier curve with control points $A$, $(B+C)/2$, $D$.

m=2, b=1

Let’s try the line $y=2x+1$. Here’s the graph of where we are sampling on the bilinear surface:

Plugging in the values give us the equation:

$z = (2A-2B-2C+2D)x^2+(C+D-2A)x+C$

Once again we got a quadratic function when sampling along a line.

You might think it’s strange that the equation ends it “+C” instead of “+A”, but if you look at the graph it makes sense. The line literally starts at C when x is zero.

x=2u, y=3u

In the above examples we are only modifying the y variable, to be some function of x. What if we also want to modify the x variable?

One way to do this is to make a 3rd variable $u$ that goes from 0 to 1. Then we can make $x$ and $y$ be based on that variable.

Let’s see what happens when we use these two equations:

$y=2u$

$x=3u$

That makes us sample this line on the bilinear surface.

Plugging the functions of u in for x and y we get:

$z = (6A-6B-6C+6D)u^2+(2B+3C-5A)u+A$

So we now know that when moving along a straight line on a bilinear surface, that you will get a quadratic function as output, except in the case of the line being horizontal or vertical. Note: if the bilinear surface is a plane, all lines on that surface will be linear functions, so this is another way to get a linear result. It could also be degenerate and give you a point result. You will never get a cubic result (or higher) when going along a straight line though.

What would happen though if instead of sampling along straight lines, we sampled on other shapes, like quadratic curves?

y=x*x

Let’s start with the function $y=x^2$. The path that is sampled is:

Going back to the power series form of bilinear interpolation, let’s plug $x^2$ in for y and see what we get out.

The starting equation:

$z = (A-B-C+D)xy + (B-A)x + (C-A)y + A$

becomes:

$z = (A-B-C+D)x(x^2) + (B-A)x + (C-A)(x^2) + A$

Which becomes:

$z = (A-B-C+D)x^3 + (C-A)x^2 + (B-A)x + A$

It’s a cubic equation!

Here is a shadertoy which follows this sampling path on random pixels: https://www.shadertoy.com/view/4sdBz7

Something neat about this example specifically is that a cubic equation has 4 coefficients, which are basically 4 control points. This example makes use of the values of the 4 pixels involved to come up with the 4 coefficients, so “doesn’t leave anything on the table” so to speak.

This is unlike sampling along line segments where you have 3 control points stored in 4 pixel values. One is a bit redundant in that case.

You can make use of that fact (I have for instance!), but sampling along a quadratic path to get a cubic curve feels like a natural fit.

x=u*u, y=u*u

Let’s see what happens when we move along both x and y quadratically.

Just like in the linear case, we have our 3rd variable u that goes from 0 to 1 and we have x and y be based on that variable. We will use these equations:

$x=u^2$

$y=u^2$

The sampling path looks like this:

When we plug those in we get this quartic function:

$z = (A-B-C+D)u^4 + (B+C-2A)u^2 + A$

You might be surprised to see what looks like a linear path. It’s just because at all times, x is the same value as y, even though they travel down the line non linearly.

## Higher Order Curves: x=3u^2, y=2u^4

Let’s get a little more wild, using these equations:

$x=3u^2$

$y=u^4$

Which makes a sampling path of this:

Plugging in the equations, the bilinear interpolation equation:

$z = (A-B-C+D)xy + (B-A)x + (C-A)y + A$

becomes a hexic equation:

$z = (3A-3B-3C+3D)u^6 + (C-A)u^4 + (3B-3A)u^2 + A$

The shadertoy visualizes it on random pixels as per usual, but with u going from 0 to 1, it means that x goes from 0 to 3 (y is still 0-1), which makes some obvious discontinuities at the boundaries of pixels. In our pure math formulation, we wouldn’t have any of those, but since we are sampling a real texture, when we leave the safety of our (0,1) box, we enter a new box with different control points. https://www.shadertoy.com/view/4dtfz7

## Trigonometric Function: y = sin(2*pi*x)

Let’s try $y=sin(2\pi x)$, which takes this path on the bilinear surface:

The bilinear interpolation equation becomes a trigonometric polynomial:

$z = (A-B-C+D)x*sin(2\pi x) + (B-A)x + (C-A)*sin(2\pi x) + A$

That has disconuities in it when texture sampling again, due to leaving the original pixel region, so here’s a better looking shadertoy, which is for $y=sin(2\pi x)*0.5+0.5$. It scales and shifts the y values to be between 0 and 1. https://www.shadertoy.com/view/4stfz7

## Circle

Lastly, here’s sampling on a circle.

$x=sin(2 \pi u)*0.5+0.5$

$y=cos(2 \pi u)*0.5+0.5$

It follows this path:

Plugging the functions into the power series bilinear equation gives:

$z = (A-B-C+D)*(sin(2 \pi u)*0.5+0.5)*(cos(2 \pi u)*0.5+0.5) + (B-A)*(sin(2 \pi u)*0.5+0.5) + (C-A)*(cos(2 \pi u)*0.5+0.5) + A$

Something neat about sampling in a circle is that it’s continuous – note how the left side of the curves line up with the right side seamlessly. That seems like a pretty useful property.

## Moving On

We went off into the weeds a bit, but hopefully you can see how there are a ton of possibilities for encoding and decoding data in a very small number of pixels by carefully crafting the path you sample along.

Compared to the simple “sample along the diagonal” technique, there is some added complexity and shader instructions though. Namely, any work you do to modify x or y before passing them to the linear texture interpolator needs to happen in shader code. That means this technique takes more ALU, but can mean it takes even less texture memory than the other method.

The last question from the top of the post is “What does this all mean in higher dimensional interpolation, like trilinear or quadrilinear?”

Well, it works pretty much the same was as bilinear but there are more dimensions to work with.

We saw that in 2 dimensional bilinear interpolation that when we made x and y be functions (either of each other, or of a 3rd variable u), that the resulting polynomial had a degree that was the degree of x plus the degree of y.

In 3 dimensions with trilinear interpolation, the resulting polynomial would have a degree that is the degree of x, plus the degree of y, plus the degree of z.

In 4 dimensions with quadrilinear, add to that the degree of w.

Let’s consider the case when we don’t want a single curve though, but want a surface or (hyper) volume.

As we’ve seen in the extension dealing with surfaces and volumes, if you have a degree N polynomial, you can break it apart into a multivariate polynomial (aka a surface or hyper volume) so long as the sum of the degrees of each axis adds up to N.

It’s basically what we were just talking about but in reverse.

One thing I think would be interesting to explore further would be to see what the limitations are when you take this “too far”.

For instance, a 2×2 texture can give you a quadratic if you sample along any straight line in the uv coordinates. If you first put the u coordinate through a cubic function, and put the v coordinate through a different cubic function, I think you should be able to make a bicubic surface.

The surface will be constrained to a subset of what a general bicubic surface is able to be shaped like, but you will get a bicubic surface. (basically there will be implicit control points that you don’t have control over unless you add more pixels, and do more sampling, or higher dimensional linear interpolation)

I’d like to see what the constraints there are and see if there’s any chance of getting any real use out of something like that.

Anyhow, thanks for reading! Any ideas, corrections, usage cases you have, whatever, hit me up!

@Atrix256

# Prefix Sums and Summed Area Tables

Prefix sums and summed area tables let you sum up regions of arrays or grids in constant time.

If that sounds like it might not have many uses, that is another way of saying that it does discrete integration in constant time, and can also be made to do some kinds of convolution.

These things come up quite a bit in game development and graphics so is pretty interesting for things like depth of field, glossy reflections, and maybe image based lighting. Check the links at the end of the post to see these things in action in some pretty interesting ways.

# One Dimension – Prefix Sums

Say that you have 10 numbers:

$\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{index} & \textbf{0} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} & \textbf{6} & \textbf{7} & \textbf{8} & \textbf{9} \\ \hline \textbf{value} & 8 & 3 & 7 & 4 & 12 & 6 & 4 & 10 & 1 & 2 \\ \hline \end{array}$

To sum up numbers in a given range you have to manually add up the numbers in that range.

Summing the numbers at index 2 through 5 inclusively takes 3 adds and gives you the answer 29. (index 2 + index 3 + index 4 + index 5)

Summing the numbers at index 0 through index 9 inclusively (the whole table) takes 9 adds to get the answer 57.

Interestingly there is a way to preprocess this data such that summing any range takes only a single subtraction. The technique is called a prefix sum table and you make the table by having the number at each index be the sum from index 0 to that index inclusively.

Here is the prefix sum table for the numbers above:

$\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{index} & \textbf{0} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} & \textbf{6} & \textbf{7} & \textbf{8} & \textbf{9} \\ \hline \textbf{value} & 8 & 11 & 18 & 22 & 34 & 40 & 44 & 54 & 55 & 57 \\ \hline \end{array}$

Now, to find the sum of range a to b inclusively, you start with the value at index b, and subtract the value at index (a-1).

So, to sum the numbers at index 2 through 5 like we did before, we’d start with the value at index 5 which is 40, and we subtract the value at index (2-1) aka index 1, which is 11. That gives us a result of 29 like our manual summing did before.

To sum the numbers at index 0 through index 9, we’d start with the value at index 9, which is 57, and subtract the value at index -1. Since we don’t have anything before index 0, the sum for anything before index 0 is 0. That makes our result be 57-0 or 57, which we calculated before.

Let’s move on to 2D!

## Two Dimensions – Making a Summed Area Table

In two dimensions, the same technique is called a summed area table, and things get only a little more complicated.

$i = \begin{array}{|c|c|c|c|} \hline 3 & 2 & 1 & 8 \\ \hline 9 & 11 & 15 & 0 \\ \hline 8 & 4 & 7 & 6 \\ \hline 12 & 7 & 8 & 3 \\ \hline \end{array}$

Then you make a grid of the same size, where the value at a location is the sum of all the values in the rectangle going from (0,0) to (x,y) inclusive. Assuming that (0,0) is in the top left, that would give us this summed area table:

$I = \begin{array}{|c|c|c|c|} \hline 3 & 5 & 6 & 14 \\ \hline 12 & 25 & 41 & 49 \\ \hline 20 & 37 & 60 & 74 \\ \hline 32 & 56 & 87 & 104 \\ \hline \end{array}$

You can literally sum up all the values for each index to make the table if you want to, but you can also use this formula which lets you iteratively create the table by starting at (0,0) and expand outwards from there. As before, when reading out of bounds values, just use zero.

$I(x,y)=i(x,y)+I(x,y-1)+I(x-1,y)-I(x-1,y-1)$

## Two Dimensions – Using a Summed Area Table

So we know that $I(x,y)$ is the sum of all the values in the rectangle from $(0,0)$ to $(x,y)$ inclusively, but what if we want to find the sum of a different rectangle? What if we have 4 points A,B,C,D and we want to know the sum of the numbers within that sub-rectangle?

With some cleverness we can calculate the sum inside this exact region.

First we get the value at point D, which gives us the sum of this rectangle:

Next, we subtract the value at point B, which gives us the sum of this rectangle:

The next step is to subtract the value at point C. The red area is a problem though as it has been subtracted out twice.

This is a problem that’s easily solved by adding the value at point A in, to give us our final result:

So, to summarize, using a summed area table to get the sum of all values in the rectangle defined by the points A,B,C,D is done by reading the values at points A,B,C,D and calculating: A+D-B-C

## Storage Costs

When you want to store numbers added together, you are going to need storage larger than what you are storing the numbers in.

For instance, if you have the table below using 3 bits per value:
$I = \begin{array}{|c|c|} \hline 7 & 7 \\ \hline 7 & 7 \\ \hline \end{array}$

Turning that into a summed area table, you are going to hit overflow problems:
$I = \begin{array}{|c|c|} \hline 7 & 6 \\ \hline 6 & 4 \\ \hline \end{array}$

For summing up N items, you need $log_2{(N)}$ more bits of storage which means we would need 2 more bits of storage in this case for the 2×2 grid (4 samples), making it be 5 bits total per value (3 bits of storage + 2 extra bits to hold the sum of 4 values). That would let us store the proper table:

$I = \begin{array}{|c|c|} \hline 7 & 14 \\ \hline 14 & 28 \\ \hline \end{array}$

Using the previously shown 2×2 table of 3bit 7’s as an example, what this means is that if you are only ever going to want to ask about 1×1 ranges (which is pointless to use summed area tables for, but makes a nice simple example), you don’t need 2 extra bits, and in fact don’t need any extra bits in this case since a 1×1 range is just 1 sample, and $log_2{(1)}$ is 0.

Looking back at the summed area table that had roll over problems:
$I = \begin{array}{|c|c|} \hline 7 & 6 \\ \hline 6 & 4 \\ \hline \end{array}$

Let’s ask about the range (1,1) to (1,1). So we start with the value at index (1,1) which is 4. Next we add in the value at index (0,0) which is 7 and get 11. Keeping that in 3 bits (eg mod 8), that gives us a value of 3. Next we subtract the value at index (0,1) aka 6, which keeping it in 3 bits gives us 5. Subtracting index (1,0) from that (6 again) and keeping it in 3 bits gives us 7.

So, the sum of the numbers from (1,1) to (1,1) – aka the VALUE in the original table at (1,1) – is 7. Since we made the table, we know this is correct.

It works interestingly!

If we did a 2×2 lookup instead, it would fall apart. we’d need those 2 extra bits since we’d be summing 4 samples, and $log_2{(4)}$ is 2.

So, just to re-iterate… summed area tables do need increased storage per data item to store the sums. However, while most descriptions base that increased storage on the size of the image being made into a summed area table, it is actually based on the largest range you want to sum from that table, which may be smaller than the total size.

I have an idea I’d like to try (next blog post?) where instead of storing the sum of the rectangle at each position, you store the sum divided by the area. In other words, you store the average value for the rectangle.

Calculating the sum for a specific rectangle then becomes getting the 4 values, multiplying by their area, and then doing the usual math.

Apparently this is similar to an idea of using floating point numbers in SAT, which also sounds interesting! Thread from Bart Wronski (https://twitter.com/BartWronsk):

While my idea is similar to using floating point, a handful of people (especially Tom Forsyth! https://twitter.com/tom_forsyth) have made sure I know that using floating point with large textures (~screen sized and above) is not a good idea.

Tom says:
“The entries in the bottom-right of the table start having very similar magnitudes, so the difference between them is very noisy. This is super obvious with float16s where you only have 10 bits of precision, which is less than most current screen widths.”

## Other Stuff

Bilinear Interpolation
If you are wondering whether you should use bilinear interpolation when using this technique (sample between pixels) or not, the answer is that you should. Bilinear interpolation is compatible with this technique and gives you the correct values for sub pixel sample points.

Higher Dimensions
This technique extends to 3 dimensions and beyond. The table still contains the sum of the numbers for the (hyper)rectangle from the origin to that specific index. The way you calculate the sum of a specific range is different in each dimension, but it’s similar, and you should be able to figure it out using the logic described in the 2d case!

Integrating / Summing Over Other Shapes

I had a thought on this that might not be so bad.

My thought was that if you had some shape you wanted to sum values over (aka integrate values over), that you could sum over the bounding box of the shape, divide by the area of the bounding box to get an average sum per unit for that area, and then multiply by the area of the shape you want to sum over.

This makes the assumption that the bounding box is representative of the data inside of the shape, so that makes this an approximation, but it might be good enough depending on your needs.

You might even try having a couple different summed area tables made from rotated versions of the image. That would allow you to get a tighter fitting bounding box in some situations.

I’m definitely not the first to think about how to do this though, and this is not the only way to do it. There is a link in the next section that talks about a different way to do it “Fast and Exact Convolution With Polygonal Filters” that also references a few other ways to do it.

Uses in Graphics / Other Links

Here is the paper from Franklin Crow in 1984 that introduces summed area tables as a way to get box filtered mipmapping on the fly without having to generate mipmaps in advance:
http://www.florian-oeser.de/wordpress/wp-content/2012/10/crow-1984.pdf

Here is a neat paper that talks about how to generate summed area tables efficiently on the GPU, and some interesting ways to use them for things like depth of field, glossy reflections, and refraction through frosted glass:

Here are some great reads from Fabien Giesen (https://twitter.com/rygorous) on doing fast blurs when the radius is very large. The second post also shows you how to do repeated box blurs to get tent filters, quadratic filters, cubic, etc and how they tend towards Gaussian. I’m sure there is some way to mix this concept with summed area tables to get higher order filters, but I haven’t found or worked out the details yet.
https://fgiesen.wordpress.com/2012/07/30/fast-blurs-1/
https://fgiesen.wordpress.com/2012/08/01/fast-blurs-2/

Here are some blog posts I made up explaining and demonstrating box blurs and Gaussian blurs:
https://blog.demofox.org/2015/08/18/box-blur/
https://blog.demofox.org/2015/08/19/gaussian-blur/

Bart also shared these really interesting links

“Fast and Exact Convolution With Polygonal Filters”
https://www.researchgate.net/publication/269699690_Fast_and_Exact_Convolution_with_Polygonal_Filters

“Fast Filter Spreading and its Applications”
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-54.pdf

“Filtering by repeated integration”
https://www.researchgate.net/publication/220721661_Filtering_by_repeated_integration

“Cinematic Depth Of Field: How to make big filters cheap”