# A Fun 2d Rotation Matrix Derivation

A few weeks ago I joined NVIDIA as part of the graphics dev tech team.

The dev tech team’s scope is pretty wide, but it encompasses nearly all real time graphics programming needs that NVIDIA may have: making code samples for new tech, helping developers optimize their games and make them look prettier, working with graphics researchers, and contributing to graphics research as well.

Sometimes it’s like being a meta engine programmer (making stuff for engine programmers who make stuff for people making games), and other times it’s like working on an early prototype dev kit, where the dev kit is the PC itself.

I’m enjoying it quite a bit so far. It’s a bit surreal to be working along side really respectable folks I’ve seen present at siggraph, or who have written papers I’ve read, but at the same time it feels appropriate because I’m interested in the same areas and am looking forward to mixing in my own contributions and getting to collaborate on cool stuff.

Speaking of which, yes, my first contribution involved blue noise, and I’ll point it out when the thing it’s part of is public!

My new boss is a guy named Rahul, and I could tell things were going to be great, because when we were talking about math, he became visibly dettached from the real world while explaining something. Alternately, he was interested in learning about some obscure topics I had, like dual numbers, or abusing texture sampling to calculate polynomials.

He is the one that showed this logic chain to me.

# Act 1 – The 2D Cross Product

So there is this thing that people call the “2d cross product” which is not really a cross product but if you have a 2d vector, it will give you a vector perpendicular to that vector. This is a really valuable tool in the toolbox for game developers, as you are often working in 2d coordinates, even though the world may be 3d, or you can sometimes simplify problems to 2d and bring them back into 3d after solving them.

You calculate the 2d cross product by flipping the x and y components of a vector and flipping the sign of the new x component:
 newVec.x = -oldVec.y; newVec.y = oldVec.x; 

You could negate the new y instead of the new x to get the other perpendicular vector (there are 2!) but for this derivation, the way I have it above is important to the result. (I’ll revisit this at the end of the post)

As a simple example, if we take the vector (1,0), flip x and y, and negate the new x, we get (0,1), which is indeed perpendicular.

This works with arbitrary 2d vectors though. Doing it to (1,2) gives you (-2,1) which you can see in the image below is definitely perpendicular.

You can actually express this operation as a matrix:

$\begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix}$

which you can see results in the same:

$\begin{bmatrix}x & y\end{bmatrix} \cdot \begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix} = \begin{bmatrix}-y & x\end{bmatrix}$

# Act 2 – Complex Numbers

Way back in 2014 I wrote a blog post on how to use imaginary (complex) numbers to do vector rotation: https://blog.demofox.org/2014/12/27/using-imaginary-numbers-to-rotate-2d-vectors/

The key take away relevant here is that if you multiply a vector by the imaginary number i, it’s a 90 degree rotation counter clockwise.

Strangely, that’s exactly what our matrix does too… could this matrix be i? Well, i*i is -1, so let’s see if that happens when we multiply this matrix by itself:

$\begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix} \cdot \begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix} = \begin{bmatrix}-1 & 0\\0 & -1\end{bmatrix} = -1 * I$

So yeah, that checks out… this matrix, which represents the “2d cross product” is also the imaginary number i, the square root of -1. That’s fun. (the I above is just the identity matrix)

That means that multiplying by this matrix is the same as if we were multiplying a vector by the complex number (0+1i).

We can see this is true by re-doing the example (1,2), and multiplying the values together using FOIL (first, outer, inner, last).

$(1 + 2i)*(0+i) = \\ (1*0)+(1*i)+(2i*0)+(2i*i) = \\ 0 + i + 0 -2 = \\ -2 + i = \\ (-2, 1)$

# Act 3 – Complex Exponentials

Euler’s formula is a way of calculating points on a circle on the complex plane and is given as:

$e^{i\theta} = \cos{\theta} + i\sin{\theta}$

The value (0+1i) is the just the above formula when theta is 90 degrees, which is the amount of rotation we got when multiplying. We can easily verify that this is 90 degrees by remembering that cosine of 90 is 0, and sine of 90 is 1.

So, let’s replace our multiplication of (0+1i) with the right side of Euler’s formula. This way we can rotate by arbitrary angles, not just 90 degrees.

$(x+iy) * (\cos{\theta}+i\sin{\theta}) = \\ (x \cos{\theta}) + (x i\sin{\theta}) + (iy \cos{\theta}) + (iy i\sin{\theta}) = \\ (x \cos{\theta} - y \sin{\theta}) + i(x\sin{\theta}+y\cos{\theta})$

From here you can extract it back into matrix form to get:

$(x \cos{\theta} - y \sin{\theta}) + i(x\sin{\theta}+y\cos{\theta}) = \\ \begin{bmatrix}x & y\end{bmatrix} \cdot \begin{bmatrix}\cos{\theta} & sin{\theta}\\-\sin{\theta} & \cos{\theta}\end{bmatrix}$

Bam, there’s the 2d rotation matrix.

# Other Fun Matrices

This post showed the matrix form of the imaginary number i, where i*i=-1.

$\begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix}$

Dual numbers are another fun type of number where there is an $\epsilon$ that is not zero, but $\epsilon \cdot \epsilon$ is zero.

Dual numbers let you do forward mode automatic differentiation, and you can read some more about them at https://blog.demofox.org/2014/12/30/dual-numbers-automatic-differentiation/

A matrix form for them is:

$\begin{bmatrix}0 & 1\\0 & 0\end{bmatrix}$

Lastly are hyperbolic numbers (or split complex numbers), where you have an i that is not 1, but i*i is 1. I’m not sure what they are useful for but you can read more here: https://en.wikipedia.org/wiki/Split-complex_number

A matrix form for them is:

$\begin{bmatrix}-1 & 0\\0 & 1\end{bmatrix}$

By the way, when I introduced the 2d cross product i said flip x and y and negate the new x, but said the other way is possible as well. If you do it the other way, that operation is multiplying by -i, instead of multiplying by i. It isn’t a special case that breaks any of the things in this post, it’s just a different value. It’s a rotation by -90 degrees.

# FIR Audio & Data Filters

There is an interactive demo that goes with this post where you can play with filter parameters, hear the resulting filter applied to audio samples, and copy/paste the simple formula to use the filter in your own applications:

There is also some simple standalone C++ code (~170 lines) that shows how to apply filters, as well as calculate phase/frequency response and zero locations:

https://github.com/Atrix256/DSPFIR

The math behind frequency filters can look pretty daunting, but it’s actually not that bad!

This post is going to talk about a simple, real time friendly filter called a “Finite Impulse Response” or FIR filter. They are also known as feed forward filters.

An order 1 FIR filter looks like this:
$y_n = a_0x_n + a_1x_{n-1}$

$y_n$ is the filtered sample, $x_n$ is the current unfiltered sample, $x_{n-1}$ is the previous unfiltered sample, and $a_0$ and $a_1$ are values to multiply the samples by to control the filtering.

For instance, the below averages the samples, which is also called a box filter. This is in fact a low pass filter, meaning that it filters away higher frequencies while letting low frequencies pass through.
$y_n = 0.5*x_n + 0.5*x_{n-1}$

If you change the addition to a subtraction, you get a high pass filter, which filters out low frequencies, and lets high frequencies pass through.
$y_n = 0.5*x_n - 0.5*x_{n-1}$

An order 2 FIR filter looks like the below and the pattern continues for higher order filters.
$y_n = a_0x_n + a_1x_{n-1} + a_2x_{n-2}$

These equations are called difference equations and this post is going to show you how to analyze them to understand what they do, and how to craft your own that behave in specific ways.

# Gain Parameter

If you made a machine that applied an order 1 FIR filter, where you gave a knob to control $a_0$ and $a_1$, you’d find that the knobs were pretty unintuitive.

$y_n = a_0x_n + a_1x_{n-1}$

Specifically, if you had the knobs set to a specific setting that you liked, but you wanted the overall filtered sound to be louder or quieter, it would be difficult to make that happen. (spoiler: the ratio between them would need to stay the same but if you made them larger magnitude numbers the result would get louder)

This problem only gets worse with higher order filters as you add more knobs.

To deal with this, you can factor out the $a_0$ parameter which makes the order 1 equation into the below:

$y_n = a_0(x_n + \frac{a_1}{a_0}x_{n-1})$

We can replace $\frac{a_1}{a_0}$ with $\alpha_1$ to get this equation:

$y_n = a_0(x_n + \alpha_1x_{n-1})$

Now, you can have two knobs for those two parameters and things are much easier to deal with. You can adjust $a_0$ to adjust the volume, and you can adjust $\alpha_1$ to adjust how frequencies are affected by the filter.

If you wanted to get $a_1$ out again, like for use in the difference equation, you would just calculate it like this:

$a_1 = \alpha_1 * a_0$

You can do the same process to the order 2 FIR filter to get this equation:

$y_n = a_0(x_n + \alpha_1x_{n-1} + \alpha_2x_{n-2})$

where

$\alpha_1 = \frac{a_1}{a_0}$

and

$\alpha_2 = \frac{a_2}{a_0}$

# From Difference Equation to Transfer Function

The difference equation is how you filter data, but it doesn’t really help you understand what the filter is actually doing.

To understand what the filter does, and how it affects various frequencies, we need something called a transfer function. We can transform a difference equation into a transfer function with a handful of steps.

First, instead of putting a signal $x_n$ through our filter, we need to put through a complex sinusoid $e^{i \omega t}$, which will let us plug in different frequency values for omega ($\omega$) to see how the filter affects those frequencies.

While we can replace $x_n$ with $e^{i \omega t}$, what do we replace $x_{n-1}$ with? We replace it with $e^{i \omega (t-1)}$. Another way to write that is $e^{i \omega t}e^{-i \omega}$. That shows that we can multiply any point in our complex sinusoid signal by $e^{-i \omega}$ to get the previous sample, which is super handy, as it delays the signal by one sample.

So, let’s go back to our difference equation and plug in the complex sinusoid.

$y_n = a_0(x_n + \alpha_1x_{n-1}) \\ y(t) = a_0(e^{i \omega t} + \alpha_1e^{i \omega t}e^{-i \omega})$

Then let’s factor out the complex sinusoid.

$y(t) = a_0e^{i \omega t}(1 + \alpha_1e^{-i \omega})$

That sinusoid is what we input into the equation so we can replace it with $x(t)$.

$y(t) = a_0x(t)(1 + \alpha_1e^{-i \omega})$

The transfer function is $y(t)$ divided by $x(t)$ so let’s put it in that transform function form:

$H(\omega) = \frac{y(t)}{x(t)} = a_0 (1 + \alpha_1e^{-i \omega})$

The next step is to do a replacement in the function. We are going to define $z=e^{i \omega}$, which makes our transfer function look like the below.

$H(z) = a_0 (1 + \alpha_1z^{-1})$

We are done! That is the transfer function of our order 1 FIR filter which is controlled by two parameters:

• $a_0$ – This is the gain (volume) of the filter
• $\alpha_1$ – This controls how the frequencies are filtered. Multiply this by $a_0$ to get the $a_1$ parameter if you ever want to (like to use in the difference equation), but let the user control this value directly.

For order 2, if we do the same, we end up with this equation:

$H(z) = a_0 (1 + \alpha_1z^{-1} + \alpha_2z^{-2})$

The parameters are the same but order 2 adds in a $\alpha_2$ which also controls how the frequencies are filtered. You can also multiply $\alpha_2$ by $a_0$ if you want to calculate the $a_2$ value for use in the difference equation.

# Complex Sinusoids, Complex Unit Circle, Phase and Frequency Response

There’s a bizarre looking equation called Euler’s identity which looks like this:

$e^{i\pi}=-1$

This is actually just a specific value (pi) plugged into the more general Euler’s formula:

$e^{ix}=\cos x+i\sin x$

In those formulas, you can see that x is an angle, so when you plug in pi, you are plugging in 180 degrees. The result you get out is a complex number where the real part is the cosine of the angle, and the imaginary part is the sine of the angle.

The cosine of pi is -1, and the sine of pi is 0, so that’s why Euler’s identity gives you -1 as a result. It’s really -1 + 0i.

Something interesting you may remember is that if you want to draw a point on a unit circle, x is cosine theta, and y is sine theta. (Theta being the angle from the origin to the point)

So, what this tells us is that Euler’s formula is a way of drawing a circle where the x axis is the real number axis, and the y axis is the imaginary number axis.

In the last section, we plugged a similar looking – but different – complex sinusoid into the equation that looked like this:

$e^{i \omega t}$

The difference is that we have omega times t as our x value.

t is the time variable, so is what we increase to move around the circle.

Omega says how many radians around the circle the point travels every time t increases by 1.

If omega is zero, that means the point is stationary and doesn’t move. It also means it has a frequency of zero. This is 0hz or DC.

If omega is pi/2, that means the point moves pi/2 radians (90 degrees) around the circle every time t increase by 1. If this were a cosine wave, and t was an integer index for samples, it would repeat: 1, 0, -1, 0. This is also 1/2 nyquist frequency.

if omega is pi, that means the point moves pi radians (180 degrees) around the circle every time t increase by 1. As a cosine wave it would repeat: 1, -1. This is nyquist frequency.

Anything higher frequency than nyquist has aliasing problems and can’t be reconstructed from the samples, so it makes sense to stop there.

Similarly, if at 0 radians, you go negative, such as -pi/2, that is the same as 3*pi/2 which is above nyquist, so we stay away from the negative frequencies as well.

In any case, we are plugging a circle into the transfer function where the angle on the circle just means “how many radians does this wave form advance each sample”, and as output, we get that circle but distorted. A filter where output = input without any filtering happening would leave the circle unaffected.

The output point for a given input point is another point on the complex plane (x = real, y = imaginary).

The distance of this point from the origin (or, the magnitude of this vector) tells us how much the amplitude (volume) would be altered for a signal with that frequency. A value less than 1 means that it gets smaller/quieter, while a value greater than 1 means that it gets larger/louder.

The angle that the output point is on the circle (use atan2 to get this value) tells you how much this signal will be “phase shifted” or how much it will be delayed. For instance, if the signal is advancing at 90 degrees per sample, and it has a phase shift of -90 degrees, that means the signal is effectively delayed by 1 sample. A positive phase shift means that the signal is moved forward in time, and is an anti-delay (it’s from the future kinda).

This stuff is all very interesting, because real signals / sounds are commonly made up of several different frequencies. The filter treats all those frequencies independently and does the things it says it will – as far as wave amplitude and phase shift – to each frequency that may be present, independent of what it does to the other frequencies.

Frankly, it feels a bit magic that it’s able to do that, but that’s what it does.

You can actually plug angles between 0 and pi into the transfer function, get complex numbers out, calculate the amplitude and phase response, and graph them. This gives you something where you can see how a filter actually behaves. We do 0 to pi, because that is 0hz to nyquist.

Below is the before mentioned “box filter” aka low pass filter, that you get when you average the current sample with the last. You can see that the low frequencies are left alone (0db) but that higher frequencies drop down in amplitude a lot as they approach nyquist. The x axis labels are % of nyquist, so at the very right is the nyquist frequency – the highest frequency that is able to be expressed.

Note: this and the other screenshots came from the interactive demo that goes with this post. You can find it at http://demofox.org/DSPFIR/FIR.html

# Pole Zero Plot of Order 1 FIR Filter

Lets take a look at the transfer function of our order 1 FIR filter again:

$H(z) = a_0 (1 + \alpha_1z^{-1})$

Now i have a question for you… what value(s) of z make this function equal to zero? You can work through some simple algebra to find that it’s where $z = -\alpha_1$:

$a_0 (1 + \alpha_1z^{-1}) = 0\\ 1 + \alpha_1z^{-1} = 0\\ \alpha_1z^{-1} = -1\\ z^{-1} = -\frac{1}{\alpha_1}\\ z = -\alpha_1$

This just means that on the complex plane, the function equals zero at that location.

As points in the function get closer to there, they will approach zero as well.

What this means is that frequencies on the complex sinusoid unit circle that are nearer to this zero will be attenuated (shrunken, made smaller / quieter), while frequencies farther away won’t be, or won’t be attenuated as much.

So, you control what frequencies get filtered just by where the zero is!

If the zero is on the left, at angle pi (180 degrees), where the highest frequencies live (nyquist!), the high frequencies will get attentuated, making it a low pass filter. Below is the box filter again, which does this.

If the zero is on the right, at angle 0, where the low frequencies live, the low frequencies will get attenuated, making it a high pass filter.

If you put the zero in the middle, you get a filter that doesn’t do anything, other than possibly a0 adjusting the volume, like in this filter.

You don’t have to keep the zero inside the unit circle though. Here it is going outside of the unit circle to the right. You can see that low frequencies are still attenuated, so it’s a high pass filter, but they are not attenuated as sharply.

It’s also worth noting that the last filter was the first filter where the phase response was not a line. The other filters had “linear phase response”. Linear phase filters have some nice properties that are sometimes desired, such as not distorting the shape of a wave form, other than adjusting frequency amplitude.

You’ve now essentially seen everything that an order 1 FIR filter can do. There’s a zero you can move around on the real axis (the x axis) and any frequencies near the zero get quieter when passed through the filter.

It’s important to note that the zero can only make things quieter. This means that FIR filters are stable. They never suffer from feedback issues where they make something louder and louder and louder into infinity or start spitting out NaNs. That makes them pretty attractive if you really want a stable filter, but that stability is also tameness.

# Pole Zero Plot of Order 2 FIR Filter

Let’s take a look at the transfer function of 2nd order filters.

$H(z) = a_0 (1 + \alpha_1z^{-1} + \alpha_2z^{-2})$

Just like with the order 1 filters, we need to find out where the zeroes of the function are. With some algebra we can get it to something more manageable:

$a_0 (1 + \alpha_1z^{-1} + \alpha_2z^{-2}) = 0 \\ 1 + \alpha_1z^{-1} + \alpha_2z^{-2} = 0 \\ \frac{z^2 + \alpha_1z^1 + \alpha_2}{z^2} = 0$

Looking at that, when z is 0 (the origin), it shoots to infinity. That’s called a “pole” which is the opposite of a zero. We’ll talk about them lots more in the next post, but for now, since it’s at the origin we can ignore it as a “trivial pole”.

So, we need to figure out when the top part of the fraction is equal to zero:

$z^2 + \alpha_1z^1 + \alpha_2 = 0$

Luckily this is a standard quadratic polynomial, just like if we had $Ax^2+Bx+C$. In our case, x is z, A is 1, B is $\alpha_1$ and C is $\alpha_2$.

To solve those equations we normally would use the quadratic equation:

$\frac{-B \pm \sqrt{B^2-4AC}}{2A}$

For our purposes, it is more convenient to re-arrange that a bit to be like the below:

$\frac{-B}{2A} \pm \frac{\sqrt{B^2-4AC}}{2A}$

The reason for this is because we are going to have two zeroes (roots) when we solve this thing. The left side is the x axis (real axis) middle point between the two roots. The right side tells us how much to move on each side of that middle point to get to the zeroes. If the number under the square root is positive, it will give us a real number answer, that we will use to move left and right on the x axis. If the number under the square root is negative, it will give us an imaginary number answer, that we will use to move up and down on the y axis.

We can plug in our values for A, B and C and get our roots:

$z = \frac{-\alpha_1}{2} \pm \frac{\sqrt{\alpha_1^2-4\alpha_2}}{2}$

It isn’t as clear where the roots are as it was for an order 1 filter, but it will tell us where they are if we plug values in.

Firstly, the center between the zeroes will be at $\frac{-\alpha_1}{2}$. So, if $\alpha_1$ is 1.0, the zeroes will be evenly spaced away from -0.5 on the real axis. If it’s -2.0, the zeroes will be evenly spaced away from 1.0 on the real axis.

From there, you just modify $\alpha_2$ to change if the zeroes move on the real axis (x axis) or the imaginary axis (y axis) and by how much. Pro Tip: If you give $\alpha_2$ a value of 1.0, the zeroes will be on the unit circle.

Here is an order 2 high pass filter:

Here is an order 2 low pass filter:

Lastly, here’s an order 2 notch filter, which means it cuts a “notch” out of the frequency spectrum. It can sound pretty cool to play a sound (in the acompanying demo) and slide the notch up and down the frequency range.

# Estimating Frequency Response From a Pole Zero Plot

If you have a pole zero plot of an FIR, here is how you estimate how much a specific frequency will be attenuated by that filter.

Draw a line from that frequency on the unit circle to every zero on the plot and get the lengths of those lines.

Multiply all the length of the lines together.

That final result is an estimate of how much the amplitude of a sinusoid wave in that frequency will be attenuated.

A value smaller than 1 means it will get smaller (quieter) while a value greater than 1 means it will get bigger (louder).

I’m not sure why it’s only called an estimate though. When looking at order 1 and order 2 filters, I’ve been unable to find a situation where they don’t match visually. The demo has a checkbox at the bottom to draw the estimate in teal if you want to see for yourself.

# Why is it Called Finite Impulse Response?

There is a concept of an “impulse” where you have a signal this is a bunch of zeroes, then a single 1, then a bunch of zeroes again. Putting an impulse through a filter is a way to analyze how a filter behaves, and in fact is a unique finger print for the filter which you can use to reconstruct the coefficients of the filter itself.

If it’s a finite impulse response filter, the impulse response is actually the coefficients of the filter itself, which is pretty nifty.

Let’s check out the difference function for an order 1 FIR:

$y_n = a_0x_n + a_1x_{n-1}$

In this situation we have only a couple cases to think about.

When we start in the “a bunch of zeroes” section of the signal, it’s going to be a0 * 0 + a1 * 0, which is always going to be zero.

So, for that part of the signal, we always have zeroes.

Then, we hit a point where $x_n$ is 1 and $x_{n-1}$ is 0. In this case, the output of the filter is going to be a0.

Next, we hit the point where $x_n$ is 0 again and $x_{n-1}$ is 1. In this case, the output of the filter is going to be a1.

Then, we have a bunch of zeroes again and the filter will keep putting out zeroes until the sun goes dark.

The fact that this filter had a “finite response” to the impulse, meaning it settled back down to zero and stayed zero, is why it’s named a finite response.

In contrast to this, if the filter had some kind of internal state that was updated as each sample went through, it could very well never go back to zero. In that case, we’d have an infinite impulse response or IIR. The next blog post is going to talk about that kind of a filter.

# Crafting a Filter By Placing Zeroes

If you have a high order FIR – say an 8th order one – it’s very difficult to find the zeroes because you ultimately need to find the zeroes of an 8th order polynomial, which is not really possible analytically in the general case.

However, if you want to go the reverse way, where you have placed your zeroes in specific locations and want to make a filter from that, that is a lot easier.

First up, you need to decide where your zeroes are.

• You can have as many zeroes as you want.
• Duplicating zeroes make them have steeper attenuation, but also the effect is more localized and more distant frequencies are affected less than closer ones which are affected more.
• You can add as many real valued zeroes as you want.
• For every complex zero you add, you also need to add it’s complex conjugate. For instance, if you add 1+3i, you also need to add 1-3i. If you don’t do this, your filter can spit out complex numbers instead of only real numbers!

To work through an example, let’s say that we have 4 zeroes: 3, -2, 3i, -3i.

First we need to make a polynomial with those zeroes.

$0 = (x-3)(x+2)(x+3i)(x-3i) \\ 0 = (x^2-x-6)(x^2+9) \\ 0 = x^4-x^3+3x^2-9x-54$

Next, we could work through some simple algebra (the reverse of what we did in previous sections), but to jump straight to the punch line, our filter coefficients are the coefficients of this polynomial: 1, -1, 3, -9, -54. That gives us this difference equation:

$y(n) = 1*x_n - 1*x_{n-1} + 3*x_{n-2} - 9*x_{n-3} - 54*x_{n-4}$

Or simplified:

$y(n) = x_n - x_{n-1} + 3x_{n-2} - 9x_{n-3} - 54x_{n-4}$

If you want a $a_0$ gain parameter in your filter function, just multiply all the coefficients by that value.

You can also use those coefficients to go directly to the transfer function very similarly. The coefficients are inside the parens.

$H(z) = a_0(1*z^0 - 1*z^{-1} + 3*z^{-2} - 9*z^{-3} - 54*z^{-4})$

Or simplified:

$H(z) = a_0(1 - z^{-1} + 3z^{-2} - 9z^{-3} - 54z^{-4})$

Before moving on, it’s important to note that some coefficients may be zero and you need to watch out for that. For instance, let’s say you have zeroes at +/- i:

$0 = (x-i)(x+i) \\ 0 = x^2 + 1$

You would probably be tempted to say that the filter coefficients are “1, 1” (I made that mistake!) but in fact, they are “1, 0, 1”. The equation above is really this one below:

$0 = x^2 + 0x + 1$

# Making Cool Sounds & Music

A really fun thing to do in the demo (http://demofox.org/DSPFIR/FIR.html) is to make it play white noise and then filter the white noise. You can make it sound like wind or the ocean, or other things. White noise contains all frequencies so filtering like this can make some really interesting sounds. Some other types of sounds (like a sine wave tone) have only a single or very few frequencies, so putting them through filters doesn’t make them sound that interesting.

Filtering sounds to create new sounds is called subtractive synthesis (since you are attenuating or removing frequencies). It’s not something I’m very experienced at musically, but I’m finding that I really like the sound of making some frequencies become very loud and clip (causing distortion), while other frequencies do not. I anticipate the next post on IIRs will have more of that. IIRs place infinities (poles) on the pole zero plot so can boost frequencies quite high, instead of how FIRs can only place zeroes that make some frequencies quieter.

Here’s something to try in the demo…

1. Have it play the harp sound clip
2. Set it to an order 2 filter
3. Turn on overdrive and set gain to 33
4. A0 = 33, alpha1 = -1.34, alpha2 = 1.
5. Bonus: slide alpha1 back and forth slowly. You could imagine putting that parameter on low frequency sine wave so that it did this part automatically.

You are on your way to making music! If that sounds awesome to you, you might like my other audio blog posts: https://blog.demofox.org/#Audio

# Closing

Now that you understand how FIR’s work, you may be interested in a post i wrote about generating different distributions and colors of noise using dice. It turns out they are just FIRs in disguise. The dice are just a source of white noise and the operations with adding and subtracting is just having positive and negative coefficients in the filter 🙂

https://blog.demofox.org/2019/07/30/dice-distributions-noise-colors/

Also, something nice about FIR filters is that the filter is stateless. You just multiply the last N samples by N specific coefficients to get your result. This means you can apply a filter to a signal via convolution. Convolution isn’t a cheap operation, but it does a lot of simple branchless operations in parallel, which is something that things like GPUs and SIMD excel at. You can also use the convolution theorem by taking the signal and filter coefficients into frequency space (DFT via the FFT), multiplying them together, and taking them back into time space (IDFT via the FFT again). This can be more efficient for a high order filter. In fact, really nice reverb is made up of a very long impulse response, which makes it a very high order filter. It’s not uncommon to implement reverb via convolution (but doing it as a multiplication in frequency space).

I’d like to thanks these 3 peeps who helped me understand some DSP concepts. Thanks for your kindness!
@BartWronsk
@nickappleton
@rygorous

Also, this book is exceptional as an introduction to DSP, specifically meant for audio purposes. It has dsp theory but also information about how to program things like compressors and limiters, flange, chorus and reverb. It’s very cool and a lot of fun.

# Bezier Triangles

There’s an interactive Bezier Triangles webGL demo I made that goes with this post, and what i used to make the 3d rendered images with. You can find here:
http://demofox.org/BezierTriangle.html

Above is a cubic bezier triangle which has 10 control points. Each edge of the triangle is a cubic Bezier. That makes for 9 control points, and the 10th control point is in the center of the triangle.

I’ve been thinking more about using raytracing for data lookups (https://blog.demofox.org/2018/11/16/how-to-data-lookups-via-raytracing/) and have been trying to think of how you might get higher order interpolation than linear barycentric interpolation. The hope is to fit the data better, and with fewer data points, resulting in a sparser data mesh.

The higher order interpolation may take a little more compute and memory access, but having fewer triangles to raytrace against is a good thing (smaller BVH) besides the potentially better data fit. There’s a trade off here, I’m sure – and I’m sure that trade off is situational.

In the past, I found a way to exploit the linear interpolation of texture samplers to get higher order interpolation (https://blog.demofox.org/2016/02/22/gpu-texture-sampler-bezier-curve-evaluation/) but when thinking about how to extend that to triangles in a vertex shader, I never found a good way to do it.

One of the challenges in the vertex shader triangle case is not easily being able to get barycentric coordinates on the triangle, but that’s drop dead simple in the raytracing case, since barycentric coordinates are one of the few pieces of data you actually have out of the box.

These ideas came together and I realized it was time to finally dig into the guts of Bezier triangles.

Luckily, it turns out they are not that scary!

Here is some info on Bezier curves that are good background for the rest of this post:

# Bezier Simplex and Barycentric Interpolation

Bezier triangles are very close conceptually to Bezier curves, and in fact, both of them are examples of a Bezier Simplex.

A simplex is the shape with the fewest number of points to create an object in a given dimension.

A 0 dimensional simplex is a point, a 1 dimensional simplex is a line, a 2 dimension simplex is a triangle, a 3 dimensional simplex is a tetrahedron, and so on.

Barycentric coordinates are coordinates on a simplex, and sum up to 1.0.

Bezier curves are parameterized by a value $t$, like in the linear Bezier curve formula below (which is also just a lerp):

$y=A*(1-t)+Bt$

Another way to think of this though is that there are two parameters $s$ and $t$, where they sum up to 1. That makes $s=1-t$. That lets us write the same equation like the below:

$y=As+Bt$

That is 1 dimensional linear barycentric interpolation between the values A and B on a line.

We can bring that up to 2 dimensions to get a linear barycentric interpolation between 3 points on a triangle.

$y=As+Bt+Cu$

That extends to any dimension by just adding more terms.

Much like we can bring this concept to higher order interpolation on a line with Bezier curves, we can do higher order Barycentric interpolation on simplices of higher dimensions. If you do this with a triangle, you get a Bezier triangle. If you do this with a tetrahedron, you get a Bezier tetrahedron, and so on.

# Calculating Barycentric Coordinates

Calculating the barycentric coordinates is conceptually pretty simple.

2. When you place a point in it, that divides the simplex into several simplices of the same dimension. Each of these other simplices will be opposite a point on the larger simplex.
3. Calculate the area of a smaller simplex and divide it by the area of the large simplex. This is the barycentric coordinate of the point that is opposite this simplex.

That explanation might have been a little thick. Lets look at lines and triangles as an example.

First is lines:

The point A divides the line (1-simplex) into two parts: P10 to A, and A to P01. P10 to A is 1 unit long. A to P01 is 4 units long. The entire line’s length is 5 units long.

The simplex opposite P10 is the A to P01 line segment, which has a length of 4. Dividing that by 5, we get a value of 0.8. So, the value of s is 0.8.

The value of t can be found the same way. The opposite line segment of P01 is P10 to A and has a length of 1, which we can divide by 5 to get 0.2. We could have also just calculated 1 – 0.8 since they have to add up to 1.0!

The Barycentric coordinates of point A on that line is (0.8, 0.2).

Calculating the Barycentric coordinates of a triangle is very similar. In the image below, point A has Barycentric coordinates of (0.5, 0.333, 0.166)

A point is known to be inside a simplex if all the components of it’s Barycentric coordinates are between 0 and 1.

# The De Casteljau Algorithm For Bezier Triangles

Starting with linear Bezier triangles, you have a control point at each corner of the triangle, and you use barycentric interpolation of those values to get the value at a point on the triangle.

This is a linear interpolation (the result is on a plane), and is what we’ve been doing in vertex shaders since vertex shaders existed. Simple barycentric interpolation.

Much like how a linear Bezier curve is a linear interpolation (lerp) between two values, a linear Bezier triangle is a linear interpolation between three values. Each edge of the triangle is also a linear Bezier curve.

Next up is a quadratic Bezier triangle.

A quadratic Bezier curve is made by linearly interpolating between two linear Bezier curves. A quadratic Bezier triangle is made by linearly interpolating between three linear Bezier triangles.

Here is a refresher on how a quadratic Bezier curve actually works. There are 3 control points A, B, C. One linear interpolation is from A to B. The other linear interpolation is from B to C. These values are linearly interpolated for the final point on the curve.

How does this work on a Bezier triangle? Each edge of the triangle in a quadratic Bezier curve which gives us control points like the below:

What we do first is use point P’s Barycentric coordinates to interpolate the values of the red triangle: 002, 011, 101 to get point A. You treat those values as if they are at the corners of the larger triangle (if you think about it, this is how quadratic Bezier curves work too with the time parameter. Each line segment has t=0 at the start and t=1 at the end).

We next interpolate the values of the green triangle: 011, 020, 110 to get point B.

We then interpolate the values of the blue triangle: 101, 110, 200 to get the point C.

Now we can do a regular linear barycentric interpolation between those three values, using the same barycentric coordinates, and get a quadratic Bezier triangle. Each edge of the triangle is also a quadratic Bezier Curve.

For a cubic Bezier triangle, it’s much the same. You evaluate a quadratic Bezier for each corner, and then linearly interpolate between them to get a cubic. Below is an image showing the control points used to create the three quadratic Bezier triangles.

Which results in a cubic Bezier triangle. The edges of this triangle are cubic Bezier curves.

This works the same way with any Bezier simplex by the way.

# The Formula For Bezier Curves

Just like Bezier curves, you can come up with a multivariable polynomial to plug values into for Bezier Triangles, instead of using the De Casteljau algorithm. The De casteljau algorithm is more numerically stable, but the polynomial form is a lot more direct to calculate.

Let’s consider Bezier curves again first.

Lines (a 1-simplex) have two points, so we’ll describe control points as $P_{xy}$. If we want to make a degree N curve, our control points will be all the x,y pairs that add up to N. Here are the control points for the first few degrees of Bezier curves.

1. Linear: $P_{10}$, $P_{01}$
2. Quadratic: $P_{20}$, $P_{11}$, $P_{02}$
3. Cubic: $P_{30}$, $P_{21}$, $P_{12}$, $P_{03}$

You might notice that a degree N curve has N+1 control points. Linear curves have 2 control points, quadratic curves have 3 control points, cubic curves have 4 control points and so on.

The index values on these control points tell you where they are on the line, and they also tell you the power of each barycentric coordinate they have, which we will get to in a minute.

The last part of making the formula is that the control point terms are also multiplied by the Nth row of Pascal’s triangle where row 0 is at the top row. (Pascal’s triangle image from https://brilliant.org/wiki/pascals-triangle/)

Doing that, you have everything you need to create the familiar Bernstein basis polynomial form of Bezier curves.

The image below puts it all together. Orange is the name of the control point, which you can see also describes where it is on a line. The index values of the control point also describe the power of the Barycentric coordinates s and t which are in green. Lastly, the row of pascal’s triangle is in blue. To get the formula, you multiply each of those three things for each control point, and sum them up.

# The Formula For Bezier Triangles

You can follow the same steps as the above for making the formula for Bezier triangles, but you need to use Pascal’s pyramid (aka trinomial coefficients) instead of Pascal’s triangle for the control points. (image from https://en.wikipedia.org/wiki/Pascal%27s_pyramid)

This time, we have 3 index numbers on control points instead of the 2 that curves had, but we still find all permutations which add up to N.

Here are the control points for the first few degrees of Bezier triangles.

1. Linear: $P_{100}$, $P_{010}$, $P_{001}$
2. Quadratic: $P_{200}$, $P_{110}$, $P_{101}$, $P_{020}$, $P_{011}$, $P_{002}$
3. Cubic: $P_{300}$, $P_{201}$, $P_{210}$, $P_{120}$, $P_{111}$, $P_{102}$, $P_{030}$, $P_{021}$, $P_{012}$, $P_{003}$

Below are diagrams in the same style as the last section, which show the equation for a Bezier triangle and how to come up with it.

This pattern works for any Bezier simplex.

# Use For Data Lookup On the GPU

If you wanted to use this for data lookup, you’d first have to fit your data with a Bezier triangle mesh, using whatever degree you wanted to use (quadratic or cubic are the seemingly likely choices!). The fit would involve not only finding the right values to give at control points, but also the right location to put the vertices of the triangles, so that you could make the mesh sparser where you could get away with it, and denser where you needed it.

Doing that fit sounds challenging, but I bet doing gradient descent of vertex positions and control points could be a decent place to start. Simulated annealing probably would help too.

Once you have your data fit, you need to figure out how you are storing it. Linear Bezier triangles only need data stored per vertex, but quadratic ones store data per edges too, and cubic ones store data per triangle. Beyond cubic, no special class of storage is needed.

Storing data per vertex and per triangle is easy, because that’s the paradigm we work with already in graphics (and raytracing specifically). The per edge data is harder. The reason that it’s harder is that to be fast, you want it to be a simple array, where you index it by the two neighboring triangles of the edge. A 16 bit index for triangle A and 16 bit index for triangle B means a 32 bit index into the edge data array. The problem is, that data is extremely sparse (each triangle index will connect to just 3 other triangle indices. The rest of the entries will be empty / useless!), so making a simple array is a lot of wasted space, and frankly is just a lot of space in general.

One idea is to store edge information in the per triangle data, even though it’s redundant (neighboring triangles would have duplicate info for the shared edge). It still would be a simple lookup.

Another idea is to store 3 edge indices in each triangle and use that to look up into an edge array for the per edge data. The memory usage here is minimized to only what is needed, but it’s a double dereference, so is pointer chasing on the GPU which doesn’t sound very nice.

Another idea would be to use minimal perfect hashing (https://blog.demofox.org/2015/12/14/o1-data-lookups-with-minimal-perfect-hashing/). What that would allow you to do is get a hash function where you could put in the triangle index and which edge you were querying for (0, 1 or 2) it would give you as output the index into an edge array to get the data from. This is different from the last one because you aren’t reading an index for the edge, so is one less dependent data lookup.

Once you have your mesh fit, and you have a way to look up the data needed to evaluate a point on your Bezier mesh, assuming your mesh is oriented on the X,Z plane, with a Y value of 0 (the “height” data being stored in a different data channel), you can shoot a ray from (x,y,1) to (x,y,-1) for the (x,y) point you want to do a data lookup for. The ray will hit a triangle, give you barycentric coordinates, the triangle id, and enough info for you to read the control points for that triangle.

From there, you do your calculations and get your result. All done!

# Is This a “1D” Bezier Triangle?

You might have noticed that with things as I’ve described them, control points can only be moved up and down on the vertical axis, but not on the horizontal plane.

That is true and this is a simpler form of a more general Bezier triangle.

If you want control points that you can move in any direction in 3d, all you need to do is make one Bezier triangle that you evaluate to get an X axis value, another to get a Y axis value, and another to get a X axis value.

Using it for data lookup seems to go out the window, but for other usage cases, this is probably useful knowledge.

# How Did You Render The Surfaces?

TL;DR I use ray marching, but keep reading for more details.

How the 3d surfaces were rendered (the red/black checkerboard 3d images) is nearly out of scope of this post, but someone asked so I thought I’d share. Feel free to “view source” on that demo. It’s written in webgl so should be pretty easy to follow the shader etc.

First, i get a ray starting position and direction for each pixel of the image.

I next intersect that ray against a triangular prism shape (the green background you see behind the object). I do that by intersecting the ray by each plane that makes up that shape, and keeping track of the min and max hit time of the ray. If the min > the max, i know the ray missed the bounding shape and i can just draw black.

Next up, I calculate if the min time of the intersection is above or below the triangle bezier surface. I do this by getting the (X,Y,Z) position of the ray at that time, using (X,Z) to calculate Barycentric coordinates, using those to evaluate the triangle and see if the triangle’s height at that point is above or below the ray’s height (Y axis value).

Next, i take up to 20 steps from the start time to the end time, stopping when either the “aboveness” of the ray changes (meaning it intersected the surface) or if i get to the end of the 20 steps and nothing was hit.

If the ray intersected the surface, i check how much above the surface the ray was on the previous step vs where it is in the current step, treat it as a line (make a y=mx+b function out of these data points!) and solve for zero. I take that as my intersection point.

I next calculate the gradient of the surface using finite differences (https://blog.demofox.org/2015/08/02/finite-differences/), forward differences specifically, and use that to get a surface normal for shading.

Some improvements that could be done here…

1. It’s possible to get the gradient of a Barycentric triangle analytically, and isn’t that difficult since it’s just a polynomial. Doing that instead of finite differences would make a more accurate and better gradient.
2. Instead of taking 20 fixed sized steps, I could use the gradient to get a distance estimate to be able to take larger steps down the ray when it’s farther from the surface (more info here: https://iquilezles.org/www/articles/distance/distance.htm)
3. If this needed to run faster, the step count could be dropped down to a smaller number. If you did that though, you’d start to notice “stairs” and other artifacts. A way to combat these visual problems would be to use a screen space blue noise texture as a “random number” from 0 to 1 to push each ray down that percentage of a step away from it’s starting point. That would turn the visual artifacts into a more pleasant blue noise pattern – the error would be evenly distributed and be much harder to notice. If you animated that noise well over time, it would look even better. If you did this under TAA, and possibly switched to interleaved gradient noise (something to try, not a slam dunk for sure always better thing IMO), it would look very nice. (more info on animating noise over time https://blog.demofox.org/2017/10/31/animating-noise-for-integration-over-time/ and also the follow up post)
4. Another thing to try would be to not do evenly spaced sampling down the ray from min to max time. Uniform sampling is not great. For best results, you’d want to distribute the sampling locations in a 1d blue noise pattern so that it evenly distributed the error and had low aliasing to give you the most pleasant result. This would be hard to be compatible with some of the other options, but is a nice thing to try if not doing them.

# Extensions

There are lots of extensions that you could do to these Bezier triangles.

One extension would be to bring this to volumetric data with Bezier tetrahedrons. The result when using it for data meshes would be a tetrahedral mesh. To do a look up for a specific (x,y,z), you would shoot a ray from that point in any direction. The triangle you hit would contain 3 of the 4 points of the tetrahedron. You could shoot a ray in the opposite direction to get another triangle which had the 4th, or you could have data stored for that triangle index which says “if you hit this triangle from the front, the fourth vertex is M, if you hit this triangle from the back, the fourth vertex is N”, which means you just need to do a dot product of the ray and the triangle normal to figure out the fourth vertex, instead of doing another ray intersection against the BVH.

An easy but powerful extension would be that you could calculate the value of a point in two different Bezier triangles and divide the values. This would give you a rational surface which can represent a LOT more shapes than an integral surface can. For example, a quadratic Bezier curve can’t make conic sections (circles, etc), but a rational quadratic Bezier can make them, exactly.

Making it rational is just the R in NURBS, so other extension ideas are… NU = non uniform control points. R = rational instead of integral. BS = B-Spline.

Another extension that you could do is not limit the shape to being inside the simplex. For instance, here’s a cubic Bezier triangle where i let the equations go outside of the triangle, but am still limiting it to a cube. The math is totally fine with this. No division by zeros or any other badness.

Lastly, it turns out you can extend the concept of Barycentric coordinates to convex shapes that aren’t a simplex, such as a pentagon, or a dodecahedron. In the case of a quadratic Bezier pentagon, you would evaluate 5 linear pentagons – one for each corner of the pentagon – and then combine those with another linear Bezier pentagon interpolation. I’m not sure when this would be useful, but it’s definitely possible.

https://en.wikipedia.org/wiki/Barycentric_coordinate_system#Generalized_barycentric_coordinates

There are a lot of links in this post, but I found this page to be really helpful for understanding some of the details of Bezier triangles.
https://plot.ly/python/v3/ipython-notebooks/bezier-triangular-patches/

# Calculating Information Entropy

This post has an overview of information entropy then moves onto some technical details and experiments.

The code that goes with this post is at https://github.com/Atrix256/Entropy/

Thanks to all the people on twitter who answered questions & helped me learn this stuff. Links to threads and resources are at the bottom of the post!

# Information Entropy Overview

Information entropy is a measure of how much information there is in some specific data. It isn’t the length of the data, but the actual amount of information it contains.

For example, one text file could contain “Apples are red.” and another text file could contain “Apples are red. Apples are red. Apples are red.”. The second text file is 3 times longer than the first but doesn’t really give any extra information. These two text files would have nearly the same amount of information entropy, despite them having different lengths.

It isn’t really possible to calculate the actual amount of information entropy a specific piece of data has in it because it’s related to Kolmogorov complexity: the length of the shortest program that can generate the data (https://en.wikipedia.org/wiki/Kolmogorov_complexity), which turns out to be an incalculable value. That’s why i said that those two text files would have nearly the same amount of information entropy instead of saying they had the same amount. A program to generate the second text file is probably going to be very slightly more complex than the text file to generate the first one: it’ll have a for loop on the outside!

You can get a pretty good upper bound on information entropy though by making a histogram of how often each symbol appears, treating that histogram like a weighted N sided dice (a probability mass function or PMF), and calculating the entropy of that dice.

Interestingly, Huffman compression is related to this (https://en.wikipedia.org/wiki/Huffman_coding). In Huffman compression, you take a histogram of the data, and give shorter length bit patterns to the symbols that occur more. This makes the overall data smaller, but doesn’t change the amount of information there is in the data. In fact, it’s part of a family of compressors called “Entropy encoders” (https://en.wikipedia.org/wiki/Entropy_encoding).

Because of this, you can get a pretty good idea of how much information is in some data by how much you can compress it. If it shrinks a lot, the information density was low. If it doesn’t shrink much, or even gets larger (which happens when the compressed data plus the compression header is larger than the original data!), then the information density is pretty high.

However, there are counter examples to that. Encrypting data hides the patterns of the data, making it look more like uniformly distributed white noise (independent random numbers) while keeping the data the same size. Encrypting data doesn’t actually change the amount of entropy in the data, but it changes the APPARENT amount of entropy. It makes the “shortest program that can generate the data” harder to find, and in fact, encrypted data will not compress!

This actually feels opposite to some things i’ve read about machine learning type algorithms. In machine learning, finding the solution to a problem involves finding the global minimum, and you want a global minimum to be wide and flat so that it’s easy to find and hard to fall out of. There are techniques to make this be more true, such as the SVM kernel trick, or neural networks which make the problem space be higher dimensional and be saddle shaped to help not get stuck in local minima. Encryption seems to do the opposite, making the minimums (global OR local!) of the Kolmogorov complexity be very hard to find, like a dirac delta, where there is no hint where they are so that you have to know where they are in advance (aka you have to know the encryption key!) to be able to find the minimum.

Before moving onto details and experiments, here is a really cool read “Entropy and Redundancy in English” which talks about how Claude Shannon calculated letters in the english language to have about 2.62 bits of information per letter.
https://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm

Interestingly, that last link also talks about something I wrote up previously, which is using markov chains to generate text.
https://blog.demofox.org/2019/05/11/markov-chain-text-generation/

There is also a really interesting tweet from Donald Mitchell (who is famous for research which includes Mitchell’s best candidate algorithm for generating blue noise sample points!), which talks about encryption and compression and generating valid looking language. It’s a very cool tangle of ideas I’m still sorting out in my head 🙂

That “transforms English sentences to other English sentences” part also sounds a lot like format preserving encryption.

## The Formula

To calculate information entropy, you need to calculate the entropy for each possible event or symbol and then sum them all up.

To calculate the entropy of a specific event X with probability P(X) you calculate this: $-P(X)*log_2(P(X))$

As an example, let’s calculate the entropy of a fair coin.

The probability of heads is 50%. Here’s the entropy we get when plugging that 0.5 into the equation:
$E_{\text{heads}} = -0.5*log_2(0.5) \\ E_{\text{heads}} = -0.5*-1 \\ E_{\text{heads}} = 0.5$
Since tails has the same probability it has the same entropy as heads (which is 0.5) so we add the entropy of heads and tails to get 1 bit of entropy for a fair coin flip. You can alternately say it’s 1 Shannon of entropy, and if you use different log bases, there are other unit of measurements. The important take away though is that a fair coin gives 1 bit of entropy.

What if the coin isn’t fair though? Here’s the entropy if there’s a 75% chance of heads and a 25% chance of tails.
$E_{\text{heads}} = -0.75 * log_2(0.75) \\ E_{\text{heads}} = 0.311 \\ E_{\text{tails}} = -0.25 * log_2(0.25) \\ E_{\text{tails}} = 0.5 \\ E = 0.311 + 0.5 = 0.811$
So, in this case, each flip of this unfair coin only gives you 0.811 bits of information. It kind of makes sense because each flip you are pretty sure it’s going to be heads, so when it is heads, it doesn’t give you a whole lot of information.

What if the coin somehow is incapable of landing tails. Maybe it is just a 2 headed coin.
$E_{\text{heads}} = -1 * log_2(1) \\ E_{\text{heads}} = 0 \\ E_{\text{tails}} = 0 * log_2(0) \\ E_{\text{tails}} = 0 \\ E = 0 + 0 = 0$
Flipping this maximally unfair coin gives us zero bits of information which makes sense. We know that it will always land heads so the coin flip gives us absolutely no new information.

This equation works the same way no matter how many events or what their probabilities are. It works with 6 sided dice. It works with byte values in a text file, it also works with bits in a text file (going back to heads or tails).

When you have N discrete symbols or events, and a probability of encountering each, it’s called a probability mass function or PMF, which again, can be made from a histogram.

Interestingly, entropy calculations have a full suite of calculations similar to probability calculations. There is joint entropy, conditional entropy, the chain rule of entropy and even a Bayes rule of entropy!

## Text Experiments

The C++ code and source data for these experiments can be found on github at https://github.com/Atrix256/Entropy

We are going to start with a 26KB text file of “The Last Question” by Isaac Asimov. (from https://hell.pl//szymon/Baen/The%20best%20of%20Jim%20Baens%20Universe/The%20World%20Turned%20Upside%20Down/0743498747__18.htm)

We can make a histogram of how many times each 8 bit letter occurred, then turn that into a PMF by calculating the percentage occurrence of each character. From there we can calculate the entropy and divide by 8 to get the entropy per bit. Doing that we get the value 0.582. So, there are 0.582 bits of information entropy per bit of data in that text file. (Technically: or less)

If we compress it with the standard zip file compressor in windows, making an 11KB zip file, then do the same to that file, we get a value of 0.962 bits of information per bit of data in that text file. The file shrank to 42.3% of the size (the old file is 2.36 times larger) and we got 1.65 times as much information entropy.

If instead of compressing the text file, we encrypt it with openssl (command: openssl enc -aes-256-cbc -salt -in lastquestion.txt -out lastquestion.enc -pass pass:moreentropyplease), then calculate the entropy of that file, we get a value of 0.926 bits of information entropy per bit of data.

It’s weird to me that the encrypted data reports a lower entropy than the compressed data, and i’d expect the opposite. The issue might be that our calculation of entropy is naive, and that a more robust entropy calculation would show the encrypted data to have more entropy.

An example of being less naive would like making a histogram of 4 bit, 6 bit, 12 bit values etc instead of only 8. Another thing to do would be to try calculating entropy of re-arrangements of the data. Ultimately, you would report the lowest entropy seen from different “views” of the data.

I’m not sure if that’s what’s at play in the entropy between encrypted and compressed data though.

Interestingly, if you try to compress the encrypted data, it doesn’t really compress and stays at 26KB. The encryption has hidden the structure of the source data pretty well! It now reports an information entropy per bit of 0.923 which is very slightly down from 0.926 bits. It might be the added useless compression header causing the information entropy to go down.

What if we take the zip file of the plain text file and base 64 encode it? We are taking the same data and encoding it in fewer 8 bit symbols. That makes the file larger without changing the amount of information in the data. It’s like the reverse of compression so we’d expect the amount of information entropy per bit to go down and that is what happens! The file increases from 11KB to 14KB and we get an information entropy per bit value of 0.749, which is down from 0.962.

## Noise Distributions

What if we look at the entropy of uniform white noise (plain vanilla random numbers)? Taking 100,000 white noise bytes, building a histogram of each 8 bit value, making a PMF and calculating entropy, we get 0.999733. That makes sense because white noise gives maximal entropy so we’d expect a value of 1 and got something very close.

In a recent post i talked about using dice to generate different colors and distributions of noises (https://blog.demofox.org/2019/07/30/dice-distributions-noise-colors/). Do different distributions of noise have different amounts of entropy?

First, let’s look at a triangle distribution vs a uniform distribution.

Rolling a 6 sided die gives us a uniform distribution with possible values 1 through 6. The entropy of a 6 sided die is 2.585 bits.

If we wanted a triangle distribution with possible values 1 through 6, we could roll a 4 sided die and a 3 sided die (strangely, they exist! https://www.amazon.com/Bescon-Polyhedral-3-sided-Gaming-Assorted/dp/B06XHF4Y6J) and add them to get values 2 through 7, then we could subtract 1 to get values 1 through 6.

We can make a histogram by looking at how many ways each possible number can come up, and turn them into percentages:

• one= 1 = 8%
• two = 2 = 17%
• three = 3 = 25%
• four = 3 = 25%
• five = 2 = 17%
• six = 1 = 8%

Summing the entropy of each of those percentages gives about 2.454 vs a uniform 6 sided die giving 2.585 bits. So, triangular distributed values have less entropy than uniform distributed values. That makes a bit of sense because we already knew that a fair coin gave more entropy than an unfair coin, and a triangle distribution is like an extension of an unfair coin to dice.

## Noise Colors

Next, let’s look at the color of noise (frequency composition / correlation). Let’s use blue noise because that’s my favorite.

I made some floating point 1d blue noise data (with values between 0 and 1) using Mitchell’s best candidate algorithm (https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/), and converted the floating point values to uint8.

If you look at a histogram of those uint8 values, it will show a flat histogram, meaning a uniform distribution and will give maximal entropy.

That isn’t quite right though, because we know that the values have randomized high frequency components and no low frequency components.

That means that each roll is based on the previous rolls, and so the rolls aren’t uncorrelated, and they aren’t independent. That means they should have less entropy.

One idea to capture this relationship is to take each pair of uint8s in the data, make them into a uint16, and make a histogram of those uint16s (Thanks Nathan! https://twitter.com/Reedbeta). If the uint8s were uncorrelated and evenly distributed, then the uint16s should be evenly distributed. That would mean that a value didn’t care about what it’s previous value was.

Since we know there is correlation in the blue noise, the result in our case shouldn’t be evenly distributed, and should result in less entropy than white noise.

That’s the result we get in fact, and so for 100,000 uint8 blue noise values, we get an entropy of 0.92608, where white noise gives 0.965912.

So interestingly, blue noise is more predictable and less random than white noise, which shouldn’t come as a huge surprise, since we know the PMF isn’t a uniform distribution at each new value.

Making a histogram of 16 bit values of the 8 bit stream like the above is equivalent to making an order 1 Markov chain and only has a “memory” of the previous value. You could make higher order Markov chains by making 24 or 32 bit values and beyond, to find correlation that went beyond immediate neighbors, but keeping a histogram of 2^24 or 2^32 values and larger is a pretty large amount of memory. You’d also need a lot more source data, to give the histogram a chance to stabilize / converge.

There might also be a delay of correlation, like each value could be correlated with the value that is 100 values ago. At one byte per value, you’d need an order 100 Markov chain to capture that relationship, and so would need a 2^800 bytes of memory which is astronomically large.

## Other Info

If you look at the csv in the output folder of the repository for this post (https://github.com/Atrix256/Entropy), you’ll see some oddities.

One is how when looking at 11 bits at a time, the plain text has more entropy than when looking at it with 8 or 12 bits. The reason for that specific thing is that the data inherently is 8 bit, which is coprime to the 11 bits we are looking at it with, so makes the entropy look a lot higher by making it seem like there are more unique bit patterns than there are. Ultimately, when calculating entropy in a variety of ways, you should take the minimum value you find as the final answer, because it’s related to the shortest program you can reproduce it with. There are plenty of longer programs, but you only care about the shortest. You should also be careful not to get too crazy in whatever code you use to make your “view” of the data (re-arranging values / looking forward & backward a far amount) because that stuff is part of the program needed to reproduce the data, but isn’t represented in the entropy calculation you are doing.

Something else is that if you don’t have enough data, the histogram / PMF won’t converge to the actual values. You can see this manifesting in the csv data in the difference between the “small white noise” which is 8 bytes of white noise, and the “white noise” which is 100,000 bytes of white noise. They really do have the same characteristics and entropy amount per bit, but the small white noise doesn’t have enough data available to show the correct values.

A very related, but more formalized, write up on information theory.
https://www.blackhc.net/blog/2019/better-intuition-for-information-theory/

A stack exchange question about how to calculate entropy of correlated samples.
https://math.stackexchange.com/questions/3428693/how-to-calculate-entropy-from-a-set-of-correlated-samples

Here’s a neat post that shows how to visualize entropy in a binary file, to find where encryption keys are since they are high entropy. Thanks Ashley! (https://twitter.com/ACEfanatic02)
https://corte.si/posts/visualisation/entropy/index.html

The “Breaking Math” podcast episode 24 is really good and related to this stuff:

“The Complexity Barrier” blog post by John Baez
https://johncarlosbaez.wordpress.com/2011/10/28/the-complexity-barrier/

A thread showing how Oskar Stålberg is bitbacking data very efficiently, so the data is not compressing very well at all.

The difference between uncorrelated and independent:

# Bayes’ Theorem Intuition

Bayes’ theorem is a way to calculate the probability of something happening based on evidence that may affect that probability. It’s a formula for combining probabilities together when they might affect each other. It’s also good for updating probabilities when you get new data.

The formula is:

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

The formula reads: “The probability of A happening, when B has happened is equal to the probability of B happening when A has happened, multiplied by the probability of A happening, divided by the probability of B happening.

Bayes’ theorem is a mouthful and looks formidable, especially if you aren’t used to the notation. Believe it or not, Bayes thought this was such a trivial formula that he himself didn’t even share it with anyone. After his death, the formula was found in his notes and made public. (https://en.wikipedia.org/wiki/Bayes%27_theorem#History)

The goal of this post is to give you the intuition for this formula such that you feel as Bayes did, that this is so trivial it isn’t even worth having a name. Despite that, it is very cool and very useful, so thanks for that, ghost of Bayes, wherever you may be.

I learned the intuition in backwards order, because I didn’t know what to look for in advance. Lucky for you, you don’t have to learn it in backwards order!

# Joint Probability

The probability of event A happening is written as $P(A)$.

The probability of event B happening is written as $P(B)$.

The probability of both events happening, is $P(A \cap B)$ and this is called the joint probability.

If the two events are independent of each other – where one event happening has no bearing on the other event happening – you can calculate the joint probability by multiplying the probabilities together:

$P(A \cap B) = P(A) \cdot P(B)$

An example of independent (unrelated) events would be if A was “getting struck by lightning” and B was “enjoying the song wooly bully”

# Conditional Probability

If the events are not independent, we have to turn to conditional probabilities.

A conditional probability is written like as $P(A|B)$ and it means “What is the probability of A happening, if B has happened?”

If we are working with a conditional probability, we can still calculate a joint probability, but it is calculated differently:

$P(A \cap B) = P(A|B) \cdot P(B)$

Which reads “The probability of A and B happening is the probability of A happening if B has happened, multiplied by the probability of B happening”.

You can see how this calculates the joint probability because it’s still calculating the probability of both events happening. It just so happens that in this case, the two events are related.

An example of events that are not independent would be if A was “getting struck by lightning” and B was “climbing a power line in a storm”.

# Putting It Together

Are you ready for the intuition?

Firstly, as we said above, the joint probability for conditional probabilities is calculated like this:

$P(A \cap B) = P(A|B) \cdot P(B)$

Secondly, you can re-order A and B in the joint probability. You are calculating the probability of both things happening so it doesn’t matter which event is first and which one is second in the equation.

We can write that like this:

$P(A \cap B) = P(B \cap A)$

And if you expand it out, it looks like this:

$P(A|B) \cdot P(B) = P(B|A) \cdot P(A)$

Thirdly and lastly, let’s divide both sides by the probability of B happening. This is just a simple algebra operation.

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

Oh my goodness, we derived Bayes’ theorem. What?!!!

Joint Probability: https://www.investopedia.com/terms/j/jointprobability.asp

An interactive visualization of Bayes’ things: https://seeing-theory.brown.edu/bayesian-inference/index.html

Bayesian Updating: http://www.statisticalengineering.com/bayesian.htm

Joint, Marginal and conditional probabilities: https://machinelearningmastery.com/joint-marginal-and-conditional-probability-for-machine-learning/

# Dice, Distributions & Noise Colors

Thank you Uncle Chris and Aunt Lily for sending me this gigantic 120 sided dice!

A while back I came across an interesting page that talks about some fundamental concepts relating to noise. It isn’t very long or hard to read, but has some great info.

Dither noise probability density explained
https://www.digido.com/ufaqs/dither-noise-probability-density-explained/

The main points it makes are:

1. White noise just means the noise values are drawn independently and have no correlation to each other. Blue and red noise (and other noise colors) are correlated samples.
2. The distribution (histogram) the values are generated / drawn from is completely unrelated to the “color”.

I wanted to go through the experiments it mentioned to understand what it was talking about a bit better. This post does that and explores a few other related experiments and algorithms.

You can find the code that goes with this at https://github.com/Atrix256/NoiseDims

# Intro Image Explanation

Thanks to my Uncle Chris and Aunt Lily for sending me that gigantic 120 sided dice. That thing is epic.

You’d need to roll those 20 dice next to it to equal the massive THAC0 power than it has. But as you’ll see in this post, doing so would give you a Gaussian distribution, and not all numbers would come up with equal probability or even be possible. Try rolling a 1 with 20 six sided dice!

The black dice in the background are some cool diablo themed dice I got one year from Blizzcon.

Seeing as the dice are not magnetic, or quantumly entangled, rolling these dice should give you uncorrelated values (white noise), but I can’t vouch for their histograms (are they fair dice or not?).

BTW i am taking this picture from inside my hotel room in LA during SIGGRAPH, instead of going out and being social. This is what real programmers do right? Ditch social engagements to take pictures of dice and compare histograms and DFTs? This post has been sitting in my drafts for far too long and just really needed to get finished 😛

Onto the post!

# Testing Methods

There are two main tools used in this post to understand noise patterns better.

The first is the histogram:

You do your process N times to get N random numbers and keep track of how many times each number came up. For a small number of rolls, the histogram will likely be erratic, but for a larger number of rolls, it will take the shape of the distribution of the numbers.

The second is the magnitude component of the Fourier transform of the rolls:

You do your process N times to get N random numbers. Do a discrete Fourier transform of that sequence of numbers to see the frequencies involved and extract the magnitude of each frequency (ignoring the phase). Just like the histogram, A small number of rolls will likely be erratic, but a larger number of rolls will show more clearly the actual frequencies present in the number sequence.

So in short: The histogram shows the distribution, and the Fourier transform shows the noise color.

# White Rectangular Noise

White rectangular noise is “regular old random numbers”. Roll a die, write down the value. Rinse and repeat.

It has the properties that…

1. The samples have no correlation to each other.
2. Every possible value has the same probability of being chosen.

The example code rolls a 6 sided dice a number of different times. Here is the result it gets for rolling the dice 16 times:
2, 0, 2, 2, 0, 3, 2, 4, 2, 2, 0, 5, 0, 3, 1, 3

There sure are a lot of 2’s aren’t there? You might not look at this sequence and think it’s very random, but it is.

This is just how white noise works; it has clumps and voids and is in general pretty uneven.

At the limit (large sample counts) it evens out. Check out the image below to see how the histogram gets flatter as the number of rolls gets higher. This shows that it is a rectangular distribution. (Note: The 8192 x 100 line is the average of 100 different 8192 runs)

From a frequency point of view, since it’s white noise, it should have all frequencies present in equal amounts.

At 16 samples, the frequency magnitudes are a bit wild:

At 1024, they are flattening a bit:

At 8192, even more so:

When averaging 100 runs of 8192, it cleans up a lot and shows roughly the same magnitude for all frequencies, which shows that the noise color is white and the rolls are uncorrelated.

# White Triangular and Gaussian Noise

If you roll two dice and add them, you get white triangular noise. You can also subtract one from the other and get white triangular noise.

If you involve more than two dice and do additions and subtractions of various amounts, the triangle rounds out and starts to look more like a normal aka Gaussian distribution. If you rolled an infinite number of dice, it would be Gaussian. This is due to the central limit theorem (https://en.wikipedia.org/wiki/Central_limit_theorem)

Doing these things, the dice rolls are not correlated with each other so they are still white noise.

Adding two dice (0-5) together, gives a result 0-10. Here’s the histogram of 1024 rolls:

Averaging 100 different 8192 sets of rolls gives this much cleaner histogram:

And when looking at the frequencies (averaging 100 different 8192 rolls), it still looks white, despite having a different histogram / distribution.

Adding three dice together gives results 0-15. The histogram for 1024 rolls still looks pretty lumpy:

But it cleans up when averaging 100 different 8192 rolls and looks pretty gaussian, since there are 3 dice involved instead of only 2.

The frequencies look the same (white) so I won’t even bother showing them.

Here’s the histogram of 100 different 8192 rolls, using 20 dice each roll and summing them up. The frequencies look white still, so i won’t bother showing them.

Rolling two dice and subtracting one from the other looks the exact same as adding them together, except the histogram goes from -5 to +5, instead of from 0 to 10.

You can generalize this subtraction to a larger number of dice by summing up all the dice, but multiplying half of them by -1 before doing the summing.

If doing an odd number of dice in this setup, it looks different in that one side of the Gaussian tail is chopped off, like below, but still looks white in frequency space. 3 dice were used here, where two dice were multiplied by -1 before summing.

Sure you can roll a lot of dice and add them together to get white Gaussian noise, but you can also generate Gaussian numbers directly (check this out to see one way: https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform). If you do this and write down the numbers, the histogram looks Gaussian as you’d expect, and amazingly (or not amazingly), it still looks perfectly white in frequency space. I won’t bother showing the images, but I did write the code for it and verify, so you can check out the output of the code that goes with this post for yourself if you’d like!

# Red Triangular and Gaussian Noise

Let’s say you roll two dice and add them up. This is triangular white noise like we saw above.

If you name your dice A and B, you can now reroll A and add them again to get the second number. You then reroll B and add them again to get the third number. You then reroll A and add them again to get the fourth number, and so on. This is triangular red noise!

By only rerolling one of the dice, you are low pass filtering the white noise and getting red noise.

In more plain terms, by only re-rolling one of the dice each time, the sum is going to tend to be more similar to the last sum. That makes for random numbers that have stronger low frequency components.

Here are the results from the program of doing this 16 times:
2, 4, 2, 3, 5, 6, 6, 4, 2, 5, 5, 3, 4, 4, 8, 7

As you can see, the numbers are usually pretty similar to the numbers around them, which is different than the white noise case. This is red noise.

Doing this 8192 times, the histogram looks like a wiggly triangle as the image below shows, but averaging 100 runs of 8192 samples, it looks like a perfect triangle (not shown).

Looking at it in frequency space, here’s 16 samples:

1024:

8192:

8192 x100 averaged:

So, the histogram shows a triangular distribution, and the frequency space shows low frequency content – the part in the middle is the low frequencies and that’s where all the amplitude is. We have triangular red noise!

What if we use a different number of dice?

Here is the histogram and DFT for having 3 dice, and rerolling one each time (cycling through which die is rerolled). These are the averaged 8192 x100 rolls. The histogram shows a more gaussian distribution, and something odd is going on with the frequencies.

Here is the same for 4 dice.

Here it is for 10 dice.

And here it is for 20 dice.

So, as we use more dice, the histogram gets more Gaussian, but for every dice added, we also get “half a hump” in the frequency space. The noise is undeniably red though, as the largest peaks are in the middle, where the low frequencies are.

If you are curious, here’s 16 rolls of 20 dice being used to make red noise, which means only re-rolling one dice each roll. It is very “random walk”-ish and the numbers don’t stray too far from 50, which is the expected value.
49, 49, 51, 51, 53, 55, 56, 55, 54, 54, 57, 57, 62, 61, 61, 62

You might think that those numbers are on their way up, but here are the next 16 values it gives to show you that isn’t the case.
59, 59, 57, 52, 51, 55, 55, 54, 55, 51, 52, 53, 57, 58, 59, 58

All this time, we’ve only been rerolling one die. What if we re-roll a different number of dice?

Here we roll 20 dice, but reroll 5 each time (cycling through the dice). We did this 8192 times, and averaged the results of doing that 100 times. In frequency space it looks a lot like our 4 dice case. I guess that isn’t too surprising if you think about us re-rolling 1/4th of the dice in both cases.

Here we reroll 10 dice each time. In frequency space it looks like when we rolled 2 dice. Again, maybe not surprising, because we are rerolling 1/2 of the dice now just like before. We do get the nicer Gaussian distribution though of using 20 dice though.

Rerolling 15 dice each time, the histogram is unaffected, but here are the frequencies:

Here we reroll 19 dice:

So, taking some empirical observations it looks like when summing dice…

Using more dice makes the distribution more Gaussian (thanks central limit theorem!)

Rerolling less of the dice each time makes for stronger low frequency content, while rerolling more dice makes the result look more like white noise. That makes some intuitive sense i think.

The color of the noise here seems to be directly related to what PERCENTAGE of the dice you re-roll, without caring how many actual dice you used.

Using these 2 things together, you can craft the distribution (histogram) separately from how red you want the noise to be, which is pretty neat.

It’s strange too though, that as you re-roll more dice, it looks sort of like you are zooming into the center of the graph, in frequency space. In this way, you only get a flat line (white noise) at the limit (infinity). That makes sense to me because if you are only rerolling one dice out of N dice total, it’s only at the limit of N approaching infinity, that your noise will be purely white. Anything short of that will make slightly red noise.

Before moving on, I wanted to mention that this works very similarly when using Gaussian distributions. Here is 20 Gaussian numbers summed, with 1 “rerolled” each time. 8192 rolls, results are the average of 100 runs. Mean of 0 and Sigma of 1.

By the way, if you ever wanted to generate red rectangular noise, you should look into Brownian motion (aka Brown noise. Named after Robert Brown. Total coincidence that his name is a color too. Brown noise is red), which is just a random walk. You’d use a rectangular distribution random number generator to make small perturbations on a path. In effect, you would be low pass filtering the white noise, which gives you red noise.

# Blue Triangular and Gaussian Noise

Let’s say you roll two dice and subtract one from the other. This is triangular white noise like we saw above.

If you name your dice A and B, you can now reroll A and and take your second number as A-B. You can then reroll B and take B-A as your third number. You then reroll A and take A-B as your fourth number, and so on. This is triangular blue noise!

By only rerolling one of the dice, and flipping which dice is subtracted from the other, you are high pass filtering the white noise and getting blue noise.

In more plain terms, by only re-rolling one of the dice each time, but flipping the sign of the dice, the sum is going to tend to be more different to the last value. That makes for random numbers that have stronger high frequency components.

Here are the results from the program of doing this 16 times:
2, 0, -2, 3, -1, 2, -2, 0, -2, 5, -5, 3, -2, 2, 2, -3

As you can see, the numbers are usually pretty different to the numbers around them, which is different than the white noise case. This is blue noise.

Just like red noise, where you add the dice instead of subtract, this also gives you a triangle distribution, but the triangle is centered on zero. At 8192 rolls, the histogram looks like a wobbly triangle like below, but averaging 100 different runs of 8192 rolls it looks like a perfect triangle (not shown).

Here it is in frequency space for 16 samples:

Here is 1024:

Here is 8192:

Here is the average of 100 different 8192 runs:

So, the histogram shows a triangular distribution, and the frequency space shows high frequency content. We have triangular blue noise!

What if we use a different number of dice, like we did for red?

Here’s what we are going to do… we’ll roll N dice to start out, multiply every even numbered die by -1 and sum them up to get our first number. We will then reroll die 0, multiply every odd numbered die by -1 and sum them up to get our second number. We will then reroll die 1, multiply every even numbered die by -1 and sum them up to get our third number. Repeat this process to get as many numbers as you want.

The histogram gets more Gaussian for higher numbers of dice, just like it did for red noise, so I will only show the frequency information for different numbers of dice. All results are the average of 100 different 8192 sequences.

Here it is with 3 dice:

Here it is with 4 dice:

Here it is with 10 dice:

Here it is with 20 dice:

Just like with red noise, for every dice added, we also get “half a hump” in the frequency space. The noise is undeniably blue though, as the largest peaks are on the sides, where the high frequencies are.

If you are curious, here’s 16 rolls of 20 dice being used to make blue noise, which means only re-rolling one dice each roll. Every other number flips sign just about, except near the end where a 3 and a 2 are next to each other.
-9, 9, -7, 7, -5, 7, -6, 5, -6, 6, -3, 3, 2, -3, 3, -2

If you think the numbers are trending toward zero, you are right, but it’s just a short term anomaly. Here are the next 16:
-1, 1, -3, -2, 1, 3, -3, 2, -1, -3, 4, -3, 7, -6, 7, -8

If you wanted the numbers to be strictly positive, or to be within specific values, you could shift (add) and scale them to be in the range you wanted.

All this time, we’ve only been rerolling one die. What if we re-roll a different number of dice, like we looked at with red noise?

Here we roll 20 dice, but reroll 5 each time (cycling through the dice). We did this 8192 times, and averaged the results of doing that 100 times. In frequency space it looks a lot like our 4 dice case. We had that same result with red noise.

Here it is rerolling 10 dice each time, which looks a lot like our 2 dice case.

Here we reroll 15 dice each time.

Here we reroll 19 out of the 20 dice each time, which looks nearly like white noise.

The histogram was unaffected by all of this – at 20 dice it is very Gaussian –
but the frequency make up changed quite a bit. As we rerolled more dice, it started to look more like white noise, which is the same as we saw with the red noise.

I won’t show it, because it’s redundant and there are a lot of images on this post already, but this method of making blue noise also works when sampling from a Gaussian distribution, just like it did with red noise.

If you ever wanted to generate blue rectangular noise, you should look into Mitchell’s best candidate algorithm, using it in 1D. (https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/)

You could also look into low discrepancy sequences, which have some similar properties to blue noise (depending on what you are looking for) but are deterministic instead of randomized.

# Purple Triangular and Gaussian Noise

What happens if we make blue noise and add red noise to it? Would we get purple noise? Yes we would 😛

Purple noise is band stop filtered white noise. It has low and high frequency components but doesn’t have much in the way of middle frequency components.

We’ll be doing 20 dice red noise with 1 reroll, and 20 dice blue noise with 1 reroll, and adding them together to get purple noise.

Here are 32 purple noise values made with the process above:
52, 47, 50, 52, 53, 59, 52, 59, 50, 55, 55, 59, 61, 58, 66, 57, 59, 59, 57, 51, 51, 54, 59, 52, 54, 50, 53, 50, 61, 53, 63, 56

Now we’ll do this 8192 times to get that many numbers, and average the result of 100 such runs, to show you the histograms and frequency amplitudes.

Here is the histogram and frequency amplitudes of 16 rolls:

Here is 1024 rolls:

Here is 8192 rolls:

Here is the average of 100 different 8192 rolls:

The histogram is very rough, but does settle into a Gaussian curve like it should. The frequency components show strong low and high frequencies, but not much in the middle.

We could play around with re-rolling different amounts of dice, and different amounts of dice for red vs blue, but I’ll leave that to you if you are curious about it 😛

# Purplish Noise

So interestingly, if you generate N blue noise numbers and N red noise numbers, you can lerp between the number sets. Like the first number could be 25% blue and 75% red.

Doing this lets you control how red or how blue the numbers are.

Following are the histograms and DFT magnitude data for red noise made with 20 dice and 1 reroll, lerped towards blue noise made with 20 dice and 1 reroll. These are the average of 100 different 8192 runs.

First, 1% red, 99% blue. There is a strange spike near zero in the histogram I can’t explain, but I don’t think is a bug in the program. If you know what is causing it, let me know!

Next is 25% red, 75% blue.

Next is 50/50, which looks like regular old purple noise.

Here is 75% red, 25% blue.

Lastly here is 99% red, 1% blue.

Before moving on, what happens if we take the max of blue and red noise? This is kind of a throw back to the post that talks about what happens when you take the max of two (uniform white) random numbers. https://blog.demofox.org/2019/06/01/taking-the-max-of-uniform-random-numbers/

Sadly, it just makes red noise with a Gaussian distribution. Nothing special there are far as I can see.

# Green Triangular and Gaussian Noise

Green noise is the opposite of purple noise and is band pass filtered white noise. Green noise doesn’t have much high or low frequency to it, but it has middle frequencies.

I wasn’t sure how to generate green noise so asked online to see if anyone knew how or had any ideas.

https://twitter.com/eigenbom (Ben Porter) had an idea of basically mixing blue and red techniques together.

Rolling 20 dice (0 to 5), rerolling one die each roll, and flipping the signs of dice every TWO rolls instead of every roll, you do get a nice green noise as you can see in this DFT below. The histogram was Gaussian.

Trying to generalize this idea, I tried changing the “sign flip period” to other values. That didn’t generalize it very well and I got some strange results. If you are curious, check out the output of the program!

https://twitter.com/thingcreator (@Ray) also had some thoughts on how to generate blue noise, which was more based on some deep DSP ideas.

The first idea is to interleave zeros into blue noise, to decrease the high frequency amplitudes of blue noise to lower frequencies.

Doing that screws with the histogram, since half of the values are zeros, and also screws with the randomness since every other value is zero.

But, it does make some nice green noise!

Ray had another idea. Instead of zeros, you could just interleave two independent blue noise streams. This doesn’t screw with the histogram or the randomness like adding zeros does.

I do that by generating blue noise, chopping it in half, and then interleaving the halves. Doing that gives the expected Gaussian histrogram, and still gives a very nice green noise DFT.

What if you do that “chop in half and interleave” multiple times? The histogram will stay the same because the numbers used are all the same, but the frequencies should change.

Here it is being done twice instead of once:

Here it is 3 times:

Here it is 10 times:

Here are 32 green noise numbers using Ray’s technique of interleaving two blue noise streams. You can see how it has both rapid change (blue noise qualities) and similar numbers (red noise qualities).
-9, 1, 9, 4, -7, -5, 7, 1, -5, 0, 7, -2, -6, 1, 5, -5, -6, 5, 6, -6, -3, 2, 3, -3, 2, 7, -3, -9, 3, 8, -2, -8

Green noise apparently has some usage in dithering physical objects (like shirt designs). The idea seems to be that green noise is more stable in the physical world without tiny high frequency blue noise features that are more likely to be damaged. https://www.semanticscholar.org/paper/Green-noise-digital-halftoning-Lau-Arce/3e9b673900f86546db94b0922932ce13db84bf75

# Pink Noise

I’ve known pink noise as “red noise mixed with white noise” for a long while but didn’t think much of it as a noise color. I didn’t know of a usage case for it.

It turns out that pink noise is a pretty important noise color and that many natural and biological processes have this situation where there is some low frequency (red noise) trend over time, but also a small scale randomness (white noise) added on top.

There’s a famous paper by Voss that describes an algorithm to generate pink noise.

You start by rolling some number of dice and adding them up.

From there, you re-roll dice 0 every roll, dice 1 every 2 rolls, dice 2 every 4 rolls, dice 3 every 8 rolls and so on. If you know how to count in binary you should see a pattern emerging.

Doing that, the histogram is just Gaussian, but the DFT looks like a bit like a fuzzy red noise, due to the white noise added into the red noise to make pink noise:

Here are 32 pink noise values made with that algorithm:
6, 4, 9, 10, 6, 9, 5, 6, 15, 17, 10, 11, 14, 15, 14, 13, 15, 13, 13, 11, 7, 8, 9, 8, 14, 16, 17, 18, 16, 11, 11, 14

McCartney has a variation on this algorithm where you only reroll one die each roll. The number of leading zeros in your roll number in binary is the number of the dice to reroll. That makes it a more even workload for how many random numbers you generate each time.

It doesn’t affect the histogram, but here is the DFT, which is a little bit sharper in the middle:

Here are 32 pink noise values made with this algorithm:
6, 4, 7, 9, 11, 11, 10, 8, 11, 11, 12, 13, 12, 16, 15, 15, 15, 13, 11, 12, 11, 9, 14, 15, 13, 11, 8, 10, 13, 15, 15, 11

The last variation of the voss algorithm for generating pink noise I’ll show is a stochastic one. It uses a geometric distribution random number each roll to figure out which die to reroll.

This makes for a very sharp frequency spike:

Here are 32 values made with that algorithm:
6, 8, 8, 6, 7, 9, 4, 6, 9, 10, 9, 9, 7, 9, 6, 10, 9, 9, 8, 9, 9, 14, 10, 9, 11, 9, 8, 8, 5, 4, 4, 3

https://www.dsprelated.com/showarticle/908.php

# Closing

This stuff may seem a bit strange, ad hoc, and magical, but thanks to Ray for pointing this out, this is all just “Finite Impulse Response” DSP stuff. Rerolling dice is like numbers falling out of the delay line, and changing the signs of numbers and summing them (etc) is just filter coefficients for the N order filter used for the N dice.

There’s definitely some deeper DSP stuff here that unifies it all and makes it make a lot more sense that I don’t understand. (yet. Hopefully will one day.)

Here is some interesting info linking this stuff to convolution
https://stats.stackexchange.com/questions/331973/why-is-the-sum-of-two-random-variables-a-convolution/331983#331983

# Generating Blue Noise Textures With Void And Cluster

Blue noise comes in two main flavors: sample points and masks. I’ll be calling blue noise masks “blue noise textures” in this post.

The code that accompanies this post, implementing the void and cluster algorithm, and running tests on it, can be found at:
https://github.com/Atrix256/VoidAndCluster

Update: Jonathan Dupuy has a GPU implementation that runs much more quickly than this CPU implementation. https://github.com/jdupuy/BlueNoiseDitherMaskTiles

# Blue Noise Sample Points

Blue noise sample points can be used when you are taking samples for integration, like when sampling shadow maps, taking AO samples, or when doing path tracing. Blue noise doesn’t converge as fast as other sampling patterns, but it is really good at hiding error by evenly distributing it, as if error diffusion had been done.

Here are some blue noise samples and their frequency magnitudes:

Here’s an example result of blue noise vs white noise sampling, from https://www.arnoldrenderer.com/research/dither_abstract.pdf

If you can afford enough samples to converge to the right result, you should do that, and you should use a sequence that converges quickly, such as Martin Robert’s R2 sequence (a generalization of the golden ratio, which is competitive with sobol!): http://extremelearning.com.au/unreasonable-effectiveness-of-quasirandom-sequences/

If you can’t afford enough samples to converge, that’s when you should reach for blue noise, to minimize the appearance of the error that will be in the resulting image.

To get blue noise samples, my go to method is “Mitchell’s Best Candidate Algorithm”:
https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/

That algorithm is easy to understand and quick to implement, and while it makes decent results, it isn’t the best algorithm. The best blue noise sample point algorithms all seem to be based on capacity constrained Voronoi diagrams (CCVD). One benefit of Mitchell’s best candidate algorithm though is that the sample points it makes are progressive. That means if you generate 1024 sample points, using the first N of them for any N will still be blue noise. With CCVD algorithms, the sample points aren’t blue unless you use them all. This notion of progressive samples is something that will come up again in this post later on.

# Blue Noise Textures

The second kind of blue noise is blue noise masks or textures.

One thing blue noise textures are good for is dithering data: adding random noise before quantization. Quantization lets you store data in fewer bits, and dithering before quantization gives a result that makes it harder (sometimes virtually impossible) to tell that it is even using fewer bits.

Here is a blue noise texture and it’s DFT

This is an example of white vs blue noise used for dithering, from http://johanneskopf.de/publications/blue_noise/

Using blue noise gives much better results than white noise, for the same amount of error. It does this by more evenly distributing the error, which gives results as if there was error diffusion done, without actually having to do the error diffusion step. The results are especially good when you animate the noise well. For more information on that, check out these posts of mine.
Part 1: https://blog.demofox.org/2017/10/31/animating-noise-for-integration-over-time/
Part 2: https://blog.demofox.org/2017/11/03/animating-noise-for-integration-over-time-2-uniform-over-time/

If that’s an area that you are interested in, you should give this a read too, making sure to check out the triangular distributed noise portion!

Interestingly though, blue noise textures can also be used for sampling as this paper talks about: https://www.arnoldrenderer.com/research/dither_abstract.pdf

Here is another link between the two types of blue noise, where it shows how to turn blue noise textures into blue noise sample points of the desired density (like for making sample density be based on importance / the PDF): http://drivenbynostalgia.com/#frs

For blue noise textures, my go to has been to get them from this website where the author used the void and cluster algorithm to generate blue noise textures: http://momentsingraphics.de/?p=127

It’s been on my todo list for a while to understand and implement the void and cluster algorithm myself so that if that website ever goes down, I wouldn’t lose my only source of “the good stuff”. This post is the result of me finally doing that!

It’s possible to make blue noise (and other colors of noise too) by repeatedly filtering white noise: https://blog.demofox.org/2017/10/25/transmuting-white-noise-to-blue-red-green-purple/

Doing it that way has some problems though depending on what it is you want to use the blue noise for: https://blog.demofox.org/2018/08/12/not-all-blue-noise-is-created-equal/

# Using Both Types of Blue Noise Together

I tend to use blue noise a lot in my rendering techniques, because it’s a nice way to get the look of higher sample counts (the appearance of low error), without spending the cost of higher sample counts.

It’s fairly common that I will use both types of blue noise together even!

For example, if I’m taking multiple PCF samples of a shadow map, I’ll make some blue noise sampling points to sample the shadow map per pixel. I’ll also use a 64×64 blue noise texture tiled across the screen as a per pixel random number for rotating those sample points per pixel.

As a bonus, i’ll use the golden ratio (conjugate) to make the noise be low discrepancy over time, as well as being blue over space by doing this after reading the value from the blue noise texture:

blueNoiseValue = fract(blueNoiseValue + 0.61803398875f * (frameNumber % 64));

The end result is really nice, and looks as though many more samples were taken than actually were.

(Beware: doing random sampling in cache friendly ways can be a challenge though)

# Void And Cluster

The void and cluster algorithm was introduced in 1993 by Robert Ulichney and was previously under patent, but it seems to have expired into the public domain now.

Here is the paper:

The algorithm has four steps:

1. Generate Initial Binary Pattern – Make blue noise distributed sample points.
2. Phase 1 – Make those points progressive.
3. Phase 2 – Make more progressive sample points, until half of the pixels are sample points.
4. Phase 3 – Make more progressive sample points, until all pixels are sample points.

Since the points are progressive, it means they have some ordering to them. This is called their rank. The final blue noise texture image shades pixels based on their rank: lower numbers are darker, while higher numbers are brighter. The end goal is that you should have (as much as possible) an even number of each pixel value; you want a flat histogram.

Something interesting is that Phase 1 turns non progressive blue noise sample points into progressive blue noise sample points. I don’t know of any other algorithms to do this, and it might be common knowledge/practice, but it seems like this would be useful for taking CCVD blue noise sample points which are not progressive and making them progressive.

Before we talk about the steps, let’s talk about how to find voids and clusters.

Finding Voids and Clusters

Each step of this algorithm needs to identify the largest void and/or the tightest cluster in the existing sample points.

The tightest cluster is defined as the sample point that has the most sample points around it.

The largest void is defined as the empty pixel location that has the most empty pixel locations around it.

How the paper finds voids and clusters is by calculating a type of “energy” score for each pixel, where the energy goes up as it has more 1’s around it, and it goes up as they are closer to it.

Let’s say we are calculating the total energy for a pixel P.

To do that, we’ll loop through every pixel and calculate how much energy they contribute to pixel P. Add these all up and that is the energy of the pixel P.

If we are looking at how much energy a specific pixel S contributes to pixel P, we use this formula:

Energy= e^(-distanceSquared/(2*sigma*sigma))

The void and cluster paper uses a “sigma” of 1.5 which means the divisor 4.5. The “free blue noise textures” implementation uses a sigma of 1.9 and i have to agree that I think that sigma gives better results. A sigma of 1.9 makes the divisor be 7.22.

Distance squared is the squared distance between S and P, but using a “torroidal” wrap around distance which makes the resulting texture tileable as a side effect of satisfying “frequency space needs” of the pattern needing to be tileable. (I wrote a post about how to calculate torroidal distance if you are curious! https://blog.demofox.org/2017/10/01/calculating-the-distance-between-points-in-wrap-around-toroidal-space/)

Now that you know how to calculate an energy for each pixel, the hard part is done!

The pixel that has the highest total energy is the tightest cluster.

The pixel that has the lowest total energy is the largest void.

Generate Initial Binary Pattern

To start out, you need to make a binary pattern that is the same size as your output image, and has no more than half of the values being ones. In the code that goes with this post, I set 1/10th of the values to ones.

Then, in a loop you:

1) Find the tightest cluster and remove the one.
2) Find the largest void and put the one there.

You repeat this until the tightest cluster found in step 1 is the same largest void found in step 2.

After this, the initial binary pattern will be blue noise distributed sample points but they aren’t progressive because they have no sense of ordering.

This current binary pattern is called the “prototype binary pattern” in the paper.

Phase 1 – Make The Points Progressive

Phase 1 makes the initial binary pattern progressive, by giving a rank (an ordering) to each point.

To give them ranks you do this in a loop:

1) Find the tightest cluster and remove it.
2) The rank for that pixel you just removed is the number of ones left in the binary pattern.

Continue this until you run out of points.

Phase 2 – Make Progressive Points Until Half Are Ones

Phase 2 resets back to the prototype binary pattern and then repeatedly adds new sample points to it, until half of the binary pattern is ones.

It does this in a loop:

1) Find the largest void and insert a one.
2) The rank for the pixel you just added is the number of ones in the binary pattern before you added it.

Continue until half of the binary pattern is ones.

Phase 3 – Make Progressive Points Until All Are Ones

Phase 3 continues adding progressive sample points until the entire binary pattern is made up of ones.

It does this by reversing the meaning of zeros and ones in the energy calculation and does this in a loop:

1) Find the largest CLUSTER OF ZEROS and insert a one.
2) The rank for the pixel you just added is the number of ones in the binary pattern before you added it.

Continue until the entire binary pattern is ones.

You now have a binary pattern that is full of ones, and you should have a rank for every pixel. The last step is to turn this into a blue noise texture.

Finalizing The Texture

To make a blue noise texture, you need to turn ranks into pixel values.

If your output texture is 8 bits, that means you need to turn your ranks that are 0 to N, into values that are 0 to 255. You want to do this in a way where as much as possible, you have an even number of every value from 0 to 255.

I do this by multiplying the rank by 256 and dividing by the number of pixels (which is also the number of ranks there are).

# Optimization Attempts

I tried a variety of optimizations on this algorithm. Some worked, some didn’t.

What I ultimately came to realize is that this algorithm is not fast, and that when you are using it, it’s because you want high quality results. Because of this, the optimizations really ought to be limited to things that don’t hurt the quality of the output.

For instance, if the algorithm takes 10 minutes to run, you aren’t going to shave off 30 seconds to get results that are any less good. You are running a ~10 minute algorithm because you want the best results you can get.

Success: Look Up Table

The best optimization I did by far was to use a look up table for pixel energy.

I made a float array the same size as the output, and whenever I put a 1 in the binary pattern, i also added that pixel’s energy into the energy look up table for all pixels. When removing a 1, i subtracted that pixel’s energy from the look up table for all pixels.

Using a look up table, finding the highest number is all it takes to find the tightest cluster, and finding the lowest number is all it takes to find the largest void.

I further sped this up by using threaded work to update the pixels.

I first used std::thread but didn’t get much of a win because there are a lot of calls to updating the LUT, and all the time moved from doing LUT calculation to thread creates and destroys. Switching to open mp (already built into visual studio’s compiler!) my perf got a lot better because open mp uses thread pools instead of creating/destroying threads each invocation.

Failure: Limiting to 3 Sigma

The equation to calculate pixel energy is actually a Gaussian function. Knowing that Gaussians have 99.7% of their value withing +/- 3 sigma, i decided to try only considering pixels within a 3*sigma radius when calculating the energy for a specific pixel.

While this tremendously sped up execution time, it was a failure for quality because pixels were blind to other pixels that were farther than 6 pixels away.

In the early and late stages of this blue noise generation algorithm, the existing sample points are pretty sparse (farther apart) and you are trying to add new points that are most far apart from existing points. If the farthest away you can see is 6 pixels, but your blue noise texture is 512×512, that means samples are not going to be very well distributed (blue noise distributed) at those points in time.

That had the result of making it so low and high threshold values used on the blue noise texture gave very poor results. (Thanks to Mikkel Gjoel for pointing out these edges to test for blue noise textures!)

Didn’t Try: DFT/IDFT for LUT

The “free blue noise textures” website implementation does a neat trick when it wants to turn a binary sample pattern into a look up table as I described it.

It does a DFT on the binary sample pattern, multiplies each pixel by the frequency space representation of a gaussian function of the desired size, and then does an inverse DFT to get the result as if convolution had been done. This uses the convolution theorem (a multiplication in frequency space is convolution in image space) to presumably get the LUT made faster.

I didn’t try this, but it doesn’t affect quality as far as I can tell (comparing my results to his), and i’m pretty sure it could be a perf win, but not sure by how much.

How I’ve written the code, there are two specific times that i want to make a LUT from scratch, but many, many more times i just want an incremental update to the LUT (at 1024×1024 that’s roughly 1 million LUT updates I would want to do) so i didn’t pursue this avenue.

Probable Success: Mitchell’s Best Candidate

Noticing the result of “initial binary pattern” and “phase 1” was to make a progressive blue noise sample, I thought it might be a good idea to replace those with an algorithm to directly make progressive blue noise samples.

I grabbed my goto algorithm and gave it a shot and it is noticeably faster, and seems to give decent results, but it also gives DIFFERENT results. The difference is probably harmless, but out of abundance of caution for correctness, I wouldn’t consider this a for sure success without more analysis.

# Results

The code that goes with this post can be found at: https://github.com/Atrix256/VoidAndCluster

Speed

At 256×256, on my machine with my implementation, it took 400.2 seconds to run the standard void and cluster algorithm. Using Mitchell’s best candidate instead of the initial binary pattern and phase 1, it took 348.2 seconds instead. That’s a 13% savings in time.

At 128×128, standard void and cluster took 24.3 seconds, while the Mitchell’s best candidate version took 21.2 seconds. That’s a 13% time savings again.

At 64×64, it was 1.3 seconds vs 1.1 seconds which is about 13% time savings again.

Quality

From top to bottom is void and cluster, void and cluster using mitchell’s algorithm, and the bottom is a blue noise texture taken from the “free blue noise textures” site.

Here the same textures are, going through a “threshold” test to make sure they make good quality blue noise of every possible density. Same order: void and cluster, void and cluster using mitchell’s algorithm, and the bottom one is taken from the other website. You can definitely see a difference in the middle animation vs the top and bottom, so using the best candidate algorithm changes things, but it seems to be better at low sample counts and maybe not as good at low to medium sample counts. Hard to say for sure…

# Red Noise

Lately I’ve been having a fascination with red noise as well.

Where blue noise is random values that have neighbors being very different from each other, red noise is random values that have neighbors that are close to each other.

Red noise is basically a “random walk”, but stateless. If you had a red noise texture, you could look some number of pixels forward and would get the position of the random walk that many steps away from your current position.

The simplest Markov Chain Monte Carlo algorithms use a random walk to do their work, but I’m not sure that a higher quality random walk would be very helpful there.

I’m sure there must be some usage case for red noise but I haven’t found it yet.

In any case, I do believe that the void and cluster algorithm could be adapted to generate red noise, but I’m not sure the specifics of how you’d do that.

If i figure it out, i’ll make a follow up post and link from here. If you figure it out, please share details!

Some more cool stuff regarding noise colors is coming to this blog, so be on the lookout 🙂

# Taking the Max of Uniform Random Numbers

There is 80 lines of simple standalone C++ code that generated the data for this post at:

Let’s say you generate 1,000,000 random numbers from 0 to 255 and count how many times each number came up in a histogram. I did that and here it is:

In a previous post I talked about how adding dice rolls together, counting bits in a random number, and similar, would approach a normal / Gaussian distribution:
https://blog.demofox.org/2017/07/25/counting-bits-the-normal-distribution/

The same thing happens if you average random numbers, but there’s a nice side effect of getting values in the same range. That is to say: if you roll N 6 sided dice and average them, you will still get values between 1 and 6, but the more dice there are, the more the distribution will approach Gaussian. It shapes up pretty quickly too. Here is the histogram:

Something interesting to note is that adding two uniform random numbers together – or averaging them – makes something called “triangle distributed noise”. Looking at the histogram for averaging two values, hopefully you can see why it’s called triangle distributed! Triangle noise some cool properties for noise in graphics and is orthogonal to eg white noise vs blue noise vs low discrepancy sequences.

I recently saw some code that was taking the maximum of two random numbers and that made me wonder what sort of distribution that might give.

Being a better programmer than a mathematician, i made a histogram, then took a guess as to what the formula behind the shape might be.

It turns out that taking the max of N uniform random numbers is the same as using y=x^(N-1) as a PDF. (You of course would need to normalize it to be a real PDF)

Here are the histograms:

These are apparently beta distributions:
https://en.wikipedia.org/wiki/Beta_distribution

If we take the min rather than the max, what happens then? Well, if for example the numbers are between 0 and 1, taking the min will give you the same count as if you took the max of 1-x. So, the graph of the histogram should look the same, just flipped on the x axis.

It turns out that it does:

# Generating Random Numbers From a Specific Distribution With The Metropolis Algorithm (MCMC)

There is ~400 lines of standalone C++ code that implements the main ideas in this post. You can find it at: https://github.com/Atrix256/MetropolisMCMC

In previous posts I showed how to generate random numbers from a specific distributing by using two techniques:

This post will show how to do it using a Markov Chain Monte Carlo method called “The Metropolis Algorithm”. This post also talks about using it for numerical integration.

If you want an intro or review to either Markov Chains or Monte Carlo, these two posts can help you out.

# Overview

The Metropolis algorithm lets you generate random numbers that follow a distribution given by any function y=f(x), where y is the probability of choosing x.

Rejection sampling does this as well, but you have to throw away an unknown number of bad samples before getting each good sample.

Inverting the CDF doesn’t throw out any samples, but it’s limited in the type of distributions it can do: It can be mathematically complex, or impossible, to find the inverted CDF for a given function analytically.

In these ways, the Metropolis algorithm is the best of both worlds. You can sample from a distribution defined from any function, and you don’t have to throw out any samples while doing it.

Another interesting thing about the Metropolis algorithm is that it can work blindly. It never actually has to KNOW what function describes the probability distribution. If there is a black box you can give an x to get a y, the Metropolis algorithm can generate random numbers from the distribution hidden in that black box.

The Metropolis algorithm also works in any dimension: you can use it with functions like z=f(x,y) or t = f(x,y,z,w,s).

In higher dimensions such as z=f(x,y), the z is the probability of a 2d random number (x,y) being chosen. It sounds weird but this situation is like if you rolled two dice, the value that one die came up as affected the probability of the other die. Maybe when the first die was a larger number, it made it be more probable for the other die to be a larger number too.

As weird as that is, if you can describe the relationship as a z=f(x,y) function, this can generate (x,y) random numbers from that distribution for you.

It’s worth noting that the Metropolis algorithm is a simpler special case of the Metropolis-Hastings algorithm, and these are just two of many Markov Chain Monte Carlo algorithms.

# The Metropolis Algorithm

The metropolis algorithm is pretty simple.

You start with an x value and calculate y which is just f(x). This is the initial sample and hopefully is a location where y is greater than 0.

To get the next sample, we first need to calculate a candidate sample, and then choose whether to take it or not.

To make a candidate sample, take a small random step from the current x point to get a new x, either in the negative or positive direction. Calculate y which is just f(x) using the new x.

Calculate a value A (for acceptance value) which is the candidate y divided by the last y value. This is the percentage chance you should take the new sample as the current sample, otherwise you take the old sample as the current sample.

To make the decision, you just generate a random number between 0 and 1 and accept the new sample if it’s below the A value.

Rinse and repeat for as many samples as you want.

Here’s a simple but fully featured implementation that you can find in the code that goes with this post(https://github.com/Atrix256/MetropolisMCMC)

The x axis of each sample is a random number drawn from the distribution described by y=f(x). If you keep track of the average x value seen, you’ll get the expected value of the PDF.

The y axis of each sample isn’t that useful directly. It’s the pdf(x) but scaled up by an unknown amount – the normalization constant for the function.

The convergence rate of Metropolis MCMC isn’t as well understood as monte carlo integration, since Metropolis has dependent samples (a random walk that knows where it was last step) vs independent samples (a stateless random sample).

In higher dimensions, you just take a random step on each axis, instead of only the x axis. The rest of the algorithm remains the same.

Regarding the random walk, it’s possible to use many different types of random numbers to take a step, but it’s most common to use a normal distribution.

# Metropolis Burn in and 0.234

Metropolis is stateless, and has no memory of the past. Despite this, it’s common to do a “burn in” with MCMC where you throw out some number of samples before you start.

This might sound weird, but the reason for it is that there are good places to sample from and bad places to sample from, and this is an attempt to have the random walk find a better place to sample from.

For instance, if you were sampling from y=sin(x)*sin(x)*x from 0 to 2 pi, you’d have a pdf that had a shape like the bimodal function below:

If you started the random walk at 2, you’d be doing a random walk in the left, less probable hump, and it would be hard to break out into the right side which was supposed to be more probable.

You should eventually get into the larger hump and stay there longer, but your initial guesses may get stuck in a local minimum and not do a very good job of following the distribution.

While burn in can help situations like these, this is also an example of how the Metropolis algorithm can fail.

It’s worth noting that using a normal distribution for the random walk can help it not get stuck in bad places. If you have a uniform random distribution that can generate numbers between -k and +k, you can get stuck in situations where you need to take a larger than k step to get to a better (more probable) location. If instead you use a normal distribution, the sigma allows you to have a good idea of how big most of the steps will be, but there will always be a greater than zero chance that you can take a step of ANY size, which could get you out of a local minimum.

There is a rule of thumb that the step size you use (the sigma in the case of a normal distribution) should make it so you are accepting a new step on your random walk 23.4% of the time. This supposedly is a good balance between exploration (finding better places elsewhere) and exploitation (staying in a good place) in many situations.

This isn’t fully settled though, as this is only true some of the time, and counter research has come up. (https://www.sciencedirect.com/science/article/pii/S0304414907002177)

I experimented at having a “Burn in” phase where it also adjusted the sigma to try and reach that 23.4% acceptance rate over the previous N samples. I didn’t play with it very long though, and was unable to get it to reliably reach that acceptance rate.

Even beyond the 0.234 acceptance rate goal, tuning the step size for your specific situation can help you get better or worse results. I didn’t play around with that much in my experimentation though, and found a sigma of 0.2 worked pretty well when working with functions in the 0-pi and 0-2pi range.

The initial starting point of your random walk can affect performance too obviously, since the burn in stage is supposed to find a better starting position.

# Limiting the Function’s Range

In one of the tests I did, I used the function y=sin(x) with x going from 0 to pi/2.

At first i tried clamping x to 0 to pi/2, to keep it in range but doing that made the technique fall apart. There were plenty of times the random walk would try to step out of bounds, but instead of taking that step, it would clamp to the border. This meant that the random walk had a significantly higher chance of reaching the boundary than anywhere else on the graph, and that broke the algorithm.

The correct thing to do was to just make the function return 0 when x was out of range. In this way, it would end up taking any out of range location with 0% probability, but there was no bias about more or less likely places visited by the random walk, other than it preferring higher (more probable) parts of the function.

Something else worth noting about the function is that if the function ever returns a negative number, it ends up being the same as if it returned zero probability.

If use Metropolis MCMC for integration, this can be an important fact, because it will basically ignore the negative values, and treat them as zero.

# Discrete Case

As described, this algorithm works with continuous random numbers, drawing them from a PDF.

The same concepts work for discrete states though too (a more traditional looking markov chain), drawing from a PMF instead.

When handling the discrete case, it needs to be possible to be in any state at any point in time. A usual way to avoid the edge case of probabilities being such that you can only be in some nodes on even steps and others on odd steps, is to have a “self loop” on at least one node, which has a greater than zero probability of staying at the same node.

Markov Chain Monte Carlo Without all the Bullshit

# Integration

Using the Metropolis algorithm for numerical integration is possible, but is not as straightforward as Monte Carlo integration.

In Monte Carlo integration, to get a single estimate of an integral you calculate f(x) / pdf(x). f(x) is the function value at the random location x, and pdf(x) is the probability of that x value being chosen. You take the average of N such estimates and as N approaches infinity, the error of the average of estimates approaches 0.

In Metropolis MCMC we do have N number of samples, and it almost seems like we have enough data to do this, but it turns out that we don’t.

For f(x) which is the function value at the random location, you literally do have f(x). It’s the y component of each sample generated.

The problem is that we don’t have pdf(x).

The probability of choosing x is in fact based on the function we are evaluating f(x), but the function is essentially an un-normalized pdf, but we are able to draw random numbers from the pdf without knowing the normalization constant.

So, pdf(x) is some scalar multiple of f(x), but we have no idea what the multiplier is. That multiplier, the normalization constant, turns out to be the integration value we want to search for.

So we’ve gone in a circle and are no closer to being able to integrate with Metropolis.

There are some ways to deal with this though.

One way is to do mathematical tricks to make it so things “cancel out” and leave the normalization constant.

There is something called the “Harmonic Mean Estimator” that does this, but has infinite variance so is called “The Worst Monte Carlo Method Ever”.

There is another way though, that I use in the code that goes with this post.

Imagine that while you are doing your Metropolis MCMC you have some interval [a,b] that whenever you get a random number drawn in that range, you increment a counter.

After N total samples, you’ll have M samples that fell in this interval. An estimate of the integral of the normalized pdf over this interval is M/N.

Now, you can do regular Monte Carlo integration of the function over this range to get the integral of the UN-normalized pdf over this interval.

When you divide the unnormallized pdf value by the normalized pdf value, you’ll get the normalization constant aka the integral of the function.

A smaller interval size is better for the Monte Carlo integration because it will converge faster (better results in fewer samples), but it’s worse for the Metropolis integration because a smaller interval is less likely to be accurate with the random walk.

My intuition tells me that if you keep a histogram of the x values you’ve seen be generated from the Metropolis algorithm, that the ones with higher counts are more likely to be accurate. So I just integrate over whatever histogram bucket has the highest count. I haven’t done any real analysis of whether or not this is true, or how good this integration estimate is in general.

For the Monte Carlo integration, I used white noise (regular old random numbers) to integrate, but in reality you’d get much better results from something like sobol. I used white noise because i made the code generic for N dimensions and white noise generalizes to any dimension.

# Experiment Results

I didn’t play around much with initial guesses, sigmas, trying to reach the 0.234 acceptance rate, or burn in, but here’s some results from the code that goes with the post.

In the below, the blue line – normalized function value – is the actual desired PDF . The red line – Percentage – is how many samples we actually got in that histogram bucket. When these lines match up, we are happy and everything worked like we wanted it to.

y=sin(x) x in [0,pi]

y=|sin(x)| x in [0, 2pi]

y=sin(x)*sin(x) x in [0, 2pi]

It’s interesting to see the last one be so far from the real PDF. That function must trap the random walk in one side or the other a bit too much.

There is also a 2d function z=f(x,y) that is tested in the c++ code that goes with this post. I don’t know of any easy ways to make a 2d histogram so don’t have any results to show.

The Metropolis algorithm is pretty neat but it’s just the beginning of MCMC methods. I’ve heard that Hamiltonian Monte Carlo can give much better results by using derivatives to make more intelligently sized step sizes.

Something I find interesting is that plain Monte Carlo uses white noise, quasi Monte Carlo uses low discrepancy sequences (and i think blue noise would fit in here), while Metropolis MCMC uses a random walk, which is red noise.

I’m not sure what to make of that, but my brief reading about Hamiltonian Monte Carlo was that it allows the samples to be less dependent, and that’s why it improves things. Maybe there are some secrets to red noise, like there are for blue noise? I’m not really sure but will keep looking 😛

A great write up on Metropolis MCMC
https://stephens999.github.io/fiveMinuteStats/MH_intro.html

Another small but useful write up
http://www.pmean.com/07/MetropolisAlgorithm.html

A 35 minute video about Metropolis MCMC

A mathier set of videos about Metropolis MCMC that is actually very easy to understand:

“A Zero-Math Introduction to Markov Chain Monte Carlo Methods”
https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50

A more mathy overview of Metropolis
https://ermongroup.github.io/cs323-notes/probabilistic/mh/

A series of posts aimed at being a gentle introduction to MCMC
https://theclevermachine.wordpress.com/tag/monte-carlo-integration/

“Introduction to MCMC”

Introduction to MCMC

“MCMC Burn In”

MCMC burn-in

# Markov Chain Text Generation

This post includes a standalone (only standard headers, no external libs) ~400 line C++ source file that can analyze text and use an order N Markov chain to randomly generate new text in the same style. The Markov code itself is fairly generic / re-usable and a template parameter to the class lets you specify the order of the chain as well as the type of state data to use. That code is on github at: https://github.com/Atrix256/TextMarkovChain

When I see material on Markov chains, it usually comes in two flavors:

1. Very Mathy
2. Pretty impressive results light on explanation

It turns out the reason for this is because they CAN be very mathy but they can also be extremely simple.

Without knowing this, I decided it was time to learn about Markov chains. I leveled up my linear algebra knowledge a bit, finally getting a solid grasp on eigen vectors, and learning things like how to put a matrix into an eigen basis form to be able to make matrix exponentiation a trivial operation. There are links at bottom of post if you want to learn this stuff too.

Then, I sat down to learn Markov chains and nearly flipped my table over! Yes, Markov chains can be mathy (and matrix exponentiation is one way to find a Markov chain steady state, but not the best), but that stuff isn’t really required for most uses.

# Markov Chains

A Markov chain is just any situation where you have some number of states, and each state has percentage chances to change to 0 or more other states.

You can get these percentages by looking at actual data, and then you can use these probabilities to GENERATE data of similar types / styles.

# Example

This post uses Markov chains to generate text in the style of provided source text.

The first step it does is analyze source text.

To analyze the source text, it goes through text, and for each word it finds, it keeps track of what words came next, and how many times those words came next.

When analyzing the story “The Tell-Tale Heart” by Edgar Allan Poe for instance (https://poestories.com/read/telltaleheart , also is data/telltale.txt in the code that goes with this post), here are the words that came after “when” and their counts.

• all – 1
• enveloped – 2
• he – 1
• i – 4
• my – 2
• overcharged – 1
• the – 1

Here are the counts for the words that appear after “is”:

• but – 1
• impossible – 1
• merely – 1
• nothing – 1
• only – 1
• the – 2

After all these counts have been gathered up, the next step is to convert them into probabilities. You do this by summing up the words that come after a specific word, and dividing the count of each word by that total sum.

The above examples then turn from counts to probabilities. Here is “when”:

• all – 8%
• enveloped – 16%
• he – 8%
• i – 33%
• my – 16%
• overcharged – 8%
• the – 8%

Here is “is”:

• but – 14%
• impossible – 14%
• merely – 14%
• nothing – 14%
• only – 14%
• the – 28%

Note: The code that goes with this post spits out these counts and percentages in the “out/stats.txt” file if you ever want to see the data.

Once the probabilities are known, you can start generating text. The first thing you do is pick a word purely at random, this is the first word in the text.

Next, you use the probabilities of what words come after that word to randomly choose the next word.

You then use the probabilities of what words come after that word to randomly choose the next word.

This repeats until you’ve generate as much text as you want.

The code with this post generates 1000 words into the “out/generated.txt” file.

That is literally all there is to it. You could do this same process with sheet music to generate more music in the same style, you could do it with weather forecasts to generate realistic weather forecasts (or even try to use it to predict what weather is next). You can do this with any data you can imagine.

# Example Generated Output

Here is 100 words of generated text from various sources.

First is text generated from “The Tell-Tale Heart” by Edgar Allan Poe (https://poestories.com/read/telltaleheart):

…About trifles, and with perfect distinctness — very slowly, my sagacity. I then took me, louder — you cannot imagine how stealthily — with what caution — cautiously — would have told you may think that no longer i knew that no blood – spot. He would not even his room, to do the hour had made up my whole week before him. I knew what dissimulation i showed them causeless, undisturbed. Now a hideous heart, no — wide open — all and the old man, and he would have…

Here is text generated from “The Last Question” by Isaac Asimov (http://hell.pl/szymon/Baen/The%20best%20of%20Jim%20Baens%20Universe/The%20World%20Turned%20Upside%20Down/0743498747__18.htm):

…Glory that. Man said, it into a meaningful answer. Granted, said, might be kept from the entire known to restore the universe for meaningful answer. Mq – talkie robot, ac learned how many stars are dying. The boys appreciated that not. Cosmic ac that, how may be able to reach the small station, said at half the same. He shrugged. We’ll have enough to be alone. And lose itself aloof. When any other kind of universal ac. He consisted of individuals were self – contact…

Here is text generated from a research paper “Projective Blue-Noise Sampling” (http://resources.mpi-inf.mpg.de/ProjectiveBlueNoise/ProjectiveBlueNoise.pdf):

…Numerical integration. Mj patterns to vector multiplication to achieve a way that the above question whether there exist distributions have addressed anisotropic classic lloyd relaxation green and rotated pattern significantly worse than the j 1, where each site: our projective blue – noise point distributions along both axes. Previous work sampling when undergoing one after a certain number of common blue noise patterns, but at the publisher s ., cohen – left constructs a quality of latinizing the non – sample counts however, as a set only in a theory 28, this shrinkage…

Here is text generated from an example (not real, but representative) psych report from my wife who is a school psychologist:

…Brother had to mildly impaired body movement, the school and placement after a 90 probability that student: adapting to struggle as video games. Student’s planning and he request, spelling subtest scores. This time. The student: this time and accurately with both, including morphology, 2013. Administrators should consider participation in the following are student as intellectually disabled specific auditory comprehension of reading: mr. Mrs. The two subtest is designed to use of or economic disadvantages, gestures, vitality or economic disadvantages, picking at approximately 5th grade prior…

Here we generate a markov chain using ALL the above source texts, to get a mash up of all of them.

…Restore the sphere packing radius is likely an adaptive skills. Please see inset in the conner s problems, we’ll just have well and visualization and he is computed on 1 2 was contacted by things, and restricted number of his abilities. We can simply like them, as well as a s difficulty interacting with a closer to cry, the process based on the standards – appropriate to spurious aliasing artefacts mit87, making a meaningful answer. Finally, 11 months through hyperspace to try his eye contact. Jerrodine’s eyes were going out if…

Lastly, here is only Poe and Asimov combined:

…Could not forever, and continually increased. And stood for a sudden springing to get back and the eighth night i to that man, 2061, but the original star and made trips. A very, and fell full youthfulness even to feel — i then stop someday in five words on a while i heard all the noise steadily for us, calling him to pluto and now a galaxy alone pours out, quick sound would think of individuals. He stirred his hideous veil over the ceiling. Twenty billion years ago, man, …

# Nth Order Markov Chains

Using one word to generate the next word works somewhat well – the generated Poe text definitely seemed like Poe for instance – but there are plenty of times when things don’t make much sense.

A markov chain can become higher order when you don’t just look at the current state to transition to the next state, but you look at the last N states to transition to the next state.

In the text generation case, it means that a 2nd order Markov chain would look at the previous 2 words to make the next word. An order 3 markov chain would look at the previous 3 words to make the next word.

Interestingly, an order 0 Markov chain looks at NO WORDS to generate the next word, so is purely random word generation, with similar word counts (by percentage) as the original text.

The code that goes along with this post lets you specify the order on the Markov chain.

Here is “The Tell-Tale Heart” with an order two markov chain.

…Dark as midnight. As the bell sounded the hour, there came to my ears: but he had been too wary for that. A tub had caught all — ha ha when i describe the wise precautions i took for the concealment of the old man sprang up in bed, crying out — no blood – spot whatever. I removed the bed and examined the corpse. Yes, he was stone, stone dead. I knew that he had been lodged at the police. A watch’s minute hand moves more quickly than did…

If you compare that to the actual story, you can find fairly large sections of that are taken verbatim from the source text, but the arrangement of those larger chunks are different.

The reason for this is that when you have two words mapping to the next word, the number of these go up, which makes it so on average, there are going to be fewer choices for “next words”, which make the results less random, and more deterministic.

If you gave it more text (like, maybe, all of Edgar Allan Poe’s work), there would be more options for the next word after specific 2 word pairs, but with a single short story, it doesn’t have very many choices. If you look at the out/stats.txt file and compare order 1 vs order 2, you can see that order 2 has a lot more situations where a current state maps to a single next state.

At order 3 there are even fewer choices, and it hits a pattern loop:

…Had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to fear there entered three men, who introduced themselves, with perfect suavity, as officers of the police. A shriek had been heard by a neighbor during the night; suspicion of foul play had been aroused; information had been lodged at the police office, and they the officers had been deputed to search the premises. I smiled, — for what had i now to…

Here is an order 2 mashup of Poe and Asimov:

…Crossing the floor, and still chatted. The universal ac interrupted zee prime’s own. It had to be contrary, and jerrodette i. Ask multivac. As the passage through hyperspace was completed in its place, each cared for by perfect automatons, equally incorruptible, each with its dreadful echo, the real essence of men was to be contrary now, now, honeys. I’ll ask microvac. Don’t shout. When the sun, and their only concern at the visiplate change as the frightened technicians felt they could hold their breath no…

Lastly, here’s an order 2 mashup of all 4 source texts:

…Mathematics: student does not require special education and related services, the radius of each other, indistinguishable. Man said, ac organized the program. The purpose of this report provides information about the child s educational performance. Other pertinent future work includes the extension of our projective lloyd patterns against other patterns on a role not based on his scores on this scale is different for the sake of visual clarity, we specify all spaces via a set x. In a way, man, i undid it just so much that a single…

# Other Implementation Details

When combining the texts, it might make sense to “normalize” the percentages for each source text. How it works now with raw counts makes it so longer documents have more of their style preserved in the final output document.

You may also want to give weightings to different text so you can have a sliding scale between Poe and Asimov for instance, by basically scaling the counts from their files higher or lower to give more or less representation in the results.

When analyzing the text, I had to think about what to do with punctuation. I chose to treat punctuation as words in themselves, but ignored some punctuation that was giving weird results – like double quotes. I’ve only just now realized that I incorrectly ignore question marks. Oops.

When generating text, i made it so some words don’t put a space before themselves (like, a period!), and i also made it so words would have their first letter capitalized after a period or similar. There seems to need ad hoc, domain specific massaging to get reasonable results.

It’s possible (especially with higher order markov chains) that you can get into a situation where your current state has nothing to transition to. You’d have to figure out what to do in this case. One idea would be to choose a next word at random. Another idea would be to fall back to a lower order markov chain maybe?

I feel like once you understand the algorithm, it’s an art form to teach and tune the Markov chain to get good results. I bet there are some interesting techniques beyond the simple things I’ve done here.

Mathy Markov Chain Info

If you want to dive into the mathy side of markov chains, here are some great resources you can follow to get there…

A great linear algebra online “text book”, that is very easy to read and understand: http://immersivemath.com/ila/index.html

Some great videos on linear algebra: https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

A 9 part series on markov chains. It’s this long because it’s very explicit and works through the details by hand. I watched it at like 1.5x speed and was fine 😛

Some “mathy” notes about Markov chains, including higher order ones:

Q Learning

Related to markov chains, Q learning is essentially is a way to learn a Markov chain from data – for instance learning how to play tic tac toe, or how to traverse a maze.

I would like to learn Q learning better and make a post (and code!) at some point.

Q Learning Explained With HTML5
https://blockulator.github.io/Q-Learning-Explained-With-HTML5/

An introduction to Q-Learning: reinforcement learning
https://medium.freecodecamp.org/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc

Reinforcement Learning Tutorial Part 1: Q-Learning
https://blog.valohai.com/reinforcement-learning-tutorial-part-1-q-learning

Reinforcement Learning Tutorial Part 2: Cloud Q-learning
https://blog.valohai.com/reinforcement-learning-tutorial-cloud-q-learning

Reinforcement Learning Tutorial Part 3: Basic Deep Q-Learning
https://towardsdatascience.com/reinforcement-learning-tutorial-part-3-basic-deep-q-learning-186164c3bf4

Other

Here is a twitter conversation about some compelling uses of Markov chains

Here’s a video “Markov Chain Monte Carlo and the Metropolis Algorithm” which uses Markov chains to help calculate integrals numerically.

# Code

Again, the code for this post is up on github at https://github.com/Atrix256/TextMarkovChain

The code is written for readability and runs plenty fast for this demo (nearly instant in release, a couple seconds in debug) but There are lots of string copies etc that you would want to fix up if using this code seriously.