Basic Methods For Finding Zeroes and Mins / Maxes of Functions

There is ~500 lines of simple standalone C++ code that goes with this post, that generated all data used and implements the things talked about. You can find it at: https://github.com/Atrix256/BasicRootFinding

Simple equations can be solved using algebra without too much fuss:

3x-6=0 \\ 3x=6 \\ x=2

You can even solve more complicated equations by cleverly using identities, the quadratic equation, and other mathematical tools you might have in your tool belt.

Some equations are beyond our ability to solve them using simple algebra though, like high degree polynomials:

3x^5+2x^2-20x+2=0

We still can find where functions like that equal zero, but we can’t do it analytically, we have to do it numerically.

That is, we do things like roll a ball down hill and hope we find a zero.

(Apparently there are some analytical solutions to be had from that function. I don’t know how to get them though, so would have to resort to numerical methods!)

This post talks about a few methods for finding zeroes of equations we can’t solve analytically, and talks about how to use those methods to find minimum and maximum values of functions as well.

The Basic Idea

Let’s say I tell you that the value of a function at a specific point is “3” and that for every step to the right you take, it subtracts 1 from that value.

Then I ask: “Where does this function equal zero?”

If you answered “three steps to the right” you are correct, and guess what – you’ve just used Newton’s method for root finding.

There is an asterisk here though. The slope (how the function value changes as you take a step to the right. Also called the derivative.) was constant in the above example, but is not going to be constant in the functions we want to solve. That slope is going to change depending where you are on the graph.

To deal with this, Newton’s method takes the step it thinks it should take, and then asks again for the value and slope (derivative) and takes another step.

The hope is that with a good initial guess, Newton’s method will find a zero without too much effort.

We’re going to get a little more formal, while also showing some results, but all the rest of what we are going to talk about is based on Newton’s method (or is even simpler).

Newton’s Method

We’ve seen this informally, but formally, a single step of Newton’s method looks like this:

x = x - \frac{f(x)}{f'(x)}

That is, we divide the y value at our current point by the derivative at our current point, and subtract that from our current x to get our new x.

Bullet Points:

  • Requires f(x) and f'(x)
  • Requires a “good guess” for an initial x value
  • Converges Quadratically

More info: https://en.wikipedia.org/wiki/Newton’s_method

Secant Method

What if you don’t have the analytical first derivative to your function and still want to use Newton’s Method?

Well, you can use finite differences to get the first derivative: move the x over a little bit, see how the y value changes over distance, and use that as your numerical derivative, instead of the analytical one. (A blog post of mine on finite differences: https://blog.demofox.org/2015/08/02/finite-differences/)

If you do that, you are now using the secant Method.

m = \frac{f(x+e)-f(x-e)}{2e}

x = x - \frac{f(x)}{m}

The above uses “central differences” to calculate the first derivative numerically. e is a tuneable parameter, but should be small (without triggering numerical issues) – like perhaps maybe 0.01.

Bullet Points:

  • Requires f(x)
  • Requires a “good guess” for an initial x value
  • Converges at a rate of 1.618… (the golden ratio! ??!) so is slower than Newton’s method, but can be handy if you don’t have the first derivative.

More info: https://en.wikipedia.org/wiki/Secant_method

Halley’s Method (& Householder’s method)

If the first derivative helped us find a zero, surely the second derivative can help us find it even faster, right?

Yep. If you generalize Newton’s method to using the second derivative, you get Halley’s method.

x = x - \frac{2f(x)f'(x)}{2f'(x)^2-f(x)f''(x)}

Bullet Points:

  • Requires f(x), f'(x) and f”(x). You can get the derivatives numerically but it will slow down convergence.
  • Requires a “good guess” for an initial x value
  • Converges cubically (fast!!)

Both Newton and Halley are part of a larger family of methods called “Householder’s Method” which generalizes Newton’s method to any order derivative.

More info:
https://en.wikipedia.org/wiki/Halley%27s_method
https://en.wikipedia.org/wiki/Householder%27s_method

Bisection

Bisection is simpler than the other methods. You start with a left x and a right x, with the requirement that the signs of the y values for these x’s have to be different (one positive, one negative) which means that a zero must be between the two x’s.

The algorithm itself is just a binary search, where you look at the value in the middle of the left and right x, and update the left or right x to be the middle, based on the sign of the y value at the middle.

Bullet Points:

  • Requires f(x)
  • Requires a min x and a max x that contains a zero between them, and the y values at those x’s have to have opposite signs
  • Converges Linearly (slow!)

More info:
https://en.wikipedia.org/wiki/Bisection_method

Experimental Results: y = x^2 – 1

Here’s the first equation: y=x^2-1

You can see visually that there’s a zero at x=-1 and at x=1.

Here’s how the various methods did at finding the roots. the x axis is how many steps were taken and the y axis is the absolute value of y at that step, which also happens to be the error too, since we are finding zeros.

Halley is the winner which is unsurprising with it’s cubic convergence rate, then newton with it’s quadratic convergence rate, then secant with it’s golden ratio convergence rate, and bisect comes in last after a strong start, with it’s linear convergence rate. Our experimental results match the theory, neat!

The values in the legend next to the result name specify what parameters were used. For newton and Halley, the value shown is the initial guess. For secant, x is the initial guess and px is the previous initial guess. For bisect it shows the range given to the bisection algorithm.

Choosing different parameters for those can drastically affect performance of the algorithms. Here are some less well chosen values shown along with the previous results. The previous results are the blue shades.

Bisection is now so bad that it dwarfs everything else. Let’s remove that.

Removing bisection, the first worst data point dropped to 100 from 2500 for all the “more poorly chosen parameters” which are orange, yellow and green. The previous values are the blues and the grey line, which basically look flat.

In these larger values with worse parameters, we are still seeing Halley winning, Newton coming in second, secant coming next, and then bisection, but we are seeing much worse performance. These parameters are important to getting good performance!

Experimental Results: y = sin(x)

Here’s the next equation: y=sin(x)

Here is the convergence for each technique:

You can’t see it, but the orange, grey and blue are all basically overlapping. Looking at the actual data below though, you can see that the winners are in the same order again.

Experimental Results: y = x^2-x-1

This is a fun equation, because a zero is at the golden ratio (and the other is at -1/goldenRatio): y=x^2-x-1

Here is the convergence graph:

An interesting thing happens here if we choose to start at the value “0.5” for Newton and Halley. The dark blue line for Newton is under the grey line for Halley which is flat and doesn’t go up and down. The reason this happens is that the first derivative is 0 at 0.5 so the algorithm doesn’t know which direction to go and gets stuck.

Using other initial guess values causes it not to be stuck, but you can see that newton and secant starting at 0.75 has a pretty big jump in error in the beginning.

Error Case: x^2+1

For the next equation we have: y=x^2+1

This graph has no zeroes. Our bisect bounds also do not have their needs met since one side is supposed to be negative and the other positive, but this graph is positive everywhere. That means we don’t have any results for bisect on this one.

Here’s the convergence graph:

Newton goes absolutely off the rails for a few samples. Lets remove it to see what the others are doing:

Secant goes real bad at the end, so let’s remove it to see that Halley is pretty well behaved despite there not actually being a root to find:

Ray vs Sphere

Let’s do something more interesting… let’s use these methods to find where a ray intersects a sphere. We can do this analytically (there is a formula to calculate this!) but we can use this as a simpler example of a more complex usage case.

We need a formula that is zero where a ray hits a sphere, and we need to be able to calculate it’s first and second derivative analytically ideally.

I fought with this for a bit but came up with a solution.

Here are the variables involved:

  • rayPos and rayDir, both are vec3’s
  • spherePos which is a vec3 and a sphereRadius which is a scalar
  • t – how far down the ray we are, and is a scalar

We are going to make the sphere center be the origin by subtracting it out of the ray position (then we don’t need spherePos anymore), so then all we have to find is where the magnitude of the vector from the origin to the ray position at time t is the sphere radius. To simplify calculating the derivatives, we are going to square both sides of the equation.

f(t) = MagnitudeSquared(rayPos + rayDir * t) - sphereRadius^2

If we can find where that equation is zero, that will tell us the time t that the ray hits the sphere, and we can get the actual intersected position (then normal, etc) from that.

For a 3d vector, Magnitude squared is just the below which is the distance equation without the square root.

MagnitudeSquared(P) = P.x^2+P.y^2+P.z^2

If you expand f(t) out for just the x axis to get the pattern per axis, you get:

rayPos.x^2+2*rayPos.x*rayDir.x*t+rayDir.x^2*t^2 - sphereRadius^2

You would add a similar thing for y and z – but not do the sphere radius part because it’s already handled above.

We can take the derivative of the above with respect to t, to get:

2*rayPos.x*rayDir.x+2*rayDir.x^2*t

So the first derivative is that equation, but also adding in the y and z axes:

f'(t) = 2*rayPos.x*rayDir.x+2*rayDir.x^2*t + 2*rayPos.y*rayDir.y+2*rayDir.y^2*t + 2*rayPos.z*rayDir.z+2*rayDir.z^2*t

That looks a bit like spaghetti but is way cleaner in code with a loop iteration for each axis 😛

We can then take the derivative of that first derivative to get the second derivative (still, with respect to t). That gives us this:

f''(t) = 2*rayDir.x^2 + 2*rayDir.y^2 + 2*rayDir.z^2

Ok cool, now that we have our equations, I used (0,0,0) for the ray position, (0,0,1) for the ray direction, (0.2, 0.3, 5.0) for the sphere position, and a sphere radius of 2.0.

Let’s see how our methods did at finding the intersection time of the ray vs the sphere.

Once again, Halley wins, newton comes in second, then secant, then bisect. Neat!

Finding Function Mins & Maxes

For being one of the headlines of the blog post title, this is actually a really simple thing to explain now that you know the rest.

The minimum and maximum of a function – by definition! – has a derivative (slope) of zero, because it’s right where the function goes flat. The function has either climbed to the top of a mountain, gone flat, and is heading back down, or it has gone to the bottom of a valley, gone flat, and is heading back up. There is actually also a third option called a saddlepoint where it is pausing it’s ascent or descent momentarily and then continues.

In any case, to find the minimum or maximum of a function, we need to find where it’s derivative is zero. We do this by doing root finding on it’s first derivative.

From there, we can plug that x value found into the original f(x) function to get our extrema value.

The code that goes with this blog post uses this technique to find the maximum value for the function y=-x^2+x+1.

It finds that the derivative (y=-2x+1) is zero at x=0.5. Plugging 0.5 into the original function for x gives us a value of 1.25. That does look like the correct x and y value for the maximum of the function, doesn’t it?

Links

The code that goes with this blog post can be found at: https://github.com/Atrix256/BasicRootFinding

A good 10 minute video on Newton’s Method

Here’s a way to get an analytic derivative of a function without having to do it symbolically. Dual numbers are a form of automatic differentiation, just like backpropagation is.
https://blog.demofox.org/2014/12/30/dual-numbers-automatic-differentiation/

You can extend this stuff to higher dimensions using the gradient instead of the derivative, and maybe a Jacobian matrix or Hesian matrix for higher derivatives. That might make a good future blog post.

We actually do something similar to Newton’s method when doing sphere tracing (ray marching with a distance estimate). We divide the distance value at a point in space (the y value, or f(x)) by the length of the gradient (which is basically the slope aka generalized derivative) to know that a surface can’t be any closer than that value, so we can step that length in whatever direction our ray wants to move. This also has applications to drawing 2d graphics. You can read about that here:
https://www.iquilezles.org/www/articles/distance/distance.htm

Using White Noise to Choose Between Red Noise and Blue Noise

There is source code that goes with this post, which generated all the images and did all the tests. You can find it at https://github.com/Atrix256/DFTRandomFibonacci

I recently saw a really cool video from @Numberphile which mixed some of my very favorite things: Fibonacci numbers (aka the golden ratio), red noise and blue noise.

The link to the video is below and is very much worth watching.
“Random Fibonacci Numbers”

It had me thinking though… they are using a coin flip (white noise) to determine if they should add the last two numbers (red noise / low pass filter) or subtract them (blue noise / high pass filter).

I was curious what the DFT of that would look like, which would show what sort of frequency content that number sequence had.

BTW I have a post talking about dice, probability distributions and noise color here that is a bit related: https://blog.demofox.org/2019/07/30/dice-distributions-noise-colors/

White Noise

Just to prime things, here is 100 uniform white noise values on the number line and the DFT of those points. The points clump together and leave holes, and the frequency spectrum shows all frequencies.

Here is the average DFT of 100,000 such point sets. The average flattens out and the grey lines showing 1 standard deviation are flat as well.

Regular Fibonacci Numbers

The first 90 Fibonacci numbers look like the below on the number line:

And here’s their DFT, where 0hz (DC) is in the middle:

Nothing super interesting really.

Randomized Fibonacci Numbers

Here is 90 randomized Fibonacci numbers on the numberline, the dft, and the average DFT of 100,000 such number sets.

It’s interesting to see that the individual randomized Fibonacci has strong low frequency content, but other frequencies too, while the average DFT of the number sets shows only low frequency content.

I think what’s going on here is that since the numbers start out small and grow larger over time, that they will always start out clumped together (red noise), but then depending on the coin flip (white noise) which have different frequency content in each set of numbers, as the numbers grow larger. This means that the only thing common among them is low frequency content, but the rest is just white noise and averages out to be flat.

Maybe not that interesting of a result, but it’s an answer at least 😛

Prime Numbers

I got a tweet from Tariq wondering what DFTing the prime numbers would look like.

Here are the first 25 on the numberline, and the DFT:

Here are the first 100:

Here are the first 200:

Here are the first 1000:

I don’t really see much of a pattern, but I guess if someone did, they would have gotten some prize by now? 🙂

“The next coin flip has to be tails!”

If you saw a fair coin flipped and 10 heads came up in a row, you’d probably either think that the coin was a 2 headed coin, or that the next flip HAD to be tails.

In the code that goes with this post (https://github.com/Atrix256/DFTRandomFibonacci check out DoCoinTossTest()), I have a random number generator generate bits until there are 10 ones in a row, and then count how many times the next random bit is a one again.

I do that test 10000 times, and ran that overall test 5 times. Here are the results!

At an infinite number of coin flips, you can expect an even numbers of heads and tails, but until you reach infinity, all bets are off. When a coin has come up heads 10 times in a row, it still has a 50/50 chance of heads in the next coin flip.

That’s the definition of white noise – independent, random events.

Red noise would tend to have similar values – so, heads would clump together, and tails would clump together. This makes more sense with dice, where similar values would be rolled.

Blue noise would tend to have dissimilar values – so heads and tails would specifically NOT clump. And with dice, you’d rarely get the same, or similar values, in 2 dice rolls.

White noise doesn’t care about what came before, and just gives you a new number based on the probability distribution.

Keep this in mind if you ever play roulette!

How Do I Calculate Variance in 1 Pass?

If you google “how do i calculate variance?” you’ll get some nice explanations that say:

  1. Calculate the mean (average) of your numbers
  2. Calculate the average of: each number minus the mean, squared

That’s fine for people trying to just understand how the math works, but if you are calculating variance in a computer program, you might not realize there is a way to do it in a single pass over the data.

That can be significant to the performance and even feasibility of your algorithm!

Here is how you calculate variance in one pass:

  1. Calculate the mean (average) of your numbers
  2. In the same loop, calculate the mean (average) of your numbers squared
  3. After the loop, variance is the absolute value of #2, minus #1 squared

That might look like word salad, so here’s a code snippet.

float Lerp(float a, float b, float t)
{
    return a * (1.0f - t) + b * t;
}

float Variance_1Pass(const std::vector & data)
{
    // get the average (value) and average (value*value)
    float average_value = {};
    float average_valueSquared = {};
    for (size_t index = 0; index < data.size(); ++index)
    {
        float value = data[index];
        average_value = Lerp(average_value, value, 1.0f / float(index + 1));
        average_valueSquared = Lerp(average_valueSquared, value * value, 1.0f / float(index + 1));
    }

    // variance is absolute value of average(value*value) - (average_value*average_value)
    return abs(average_valueSquared - (average_value * average_value));
}

There is code that goes with this post, that implements it both ways and shows you that they are equivalent. You can find it at: https://github.com/Atrix256/CalculateVariance1Pass/blob/master/main.cpp

If you are wondering why I'm using "lerp" to average numbers, check out this post: https://blog.demofox.org/2016/08/23/incremental-averaging/

It turns out this one pass method can have numerical problems though, so no free lunch. Here is a more numerically robust way to do it, which also allows you to incrementally calculate variance, as numbers come in (Thanks Bart!): https://www.johndcook.com/blog/standard_deviation/

Why might you want to calculate variance?

One reason is if you are analyzing or reporting data, the average value is important to know, but it's also important to know if the numbers were usually pretty close to the average, or if they had lots of spikes above and below the average. You can square root variance to get the standard deviation, which is in the same units as the data you are reporting.

Assuming your data is a Gaussian distribution (due to something called the central limit theorem, a surprising number of things are actually gaussian distributed – like even rolling a bunch of dice and adding them up), 68% of the data points are + or – 1 standard deviation from the mean.

As an example, if the average temperature at your house over a month was 75 degrees Farenheit with a standard deviation of 5 degrees, that means that 68% of the days had a temperature between 70 and 80 degrees.

If the average temperature was still 75 but had a variance of 25 degrees, that means that 68% of the days had a temperature between 50 and 100 degrees. That is quite a difference! Just reporting the average temperature doesn't convey this information the same way as reporting average and standard deviation (or variance) does.

For more info on that, check this out: https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule

Lastly, I mentioned that doing 2 passes to calculate variance could be a deal breaker for an algorithm.

An example of this is a graphics technique called "Variance Shadow Maps" (paper: http://www.punkuser.net/vsm/vsm_paper.pdf) which ultimately calculates a mean and a variance for how far a group if pixels is away from a light source. When rendering the variance shadow map from the point of the view of the light, each pixel stores the depth, as well as the depth squared. A fun property is that you can blur these values with neighboring pixels without harming the mathematical integrity of the values. The result is soft shadows. (more info on soft shadows: https://blog.demofox.org/2017/07/01/why-are-some-shadows-soft-and-other-shadows-hard/)

When lighting a pixel later on and using the shadow map, it knows the pixel's distance from the light source, and can read the two values from the filtered (blurred) shadow map, which allow it to get the mean and variance of the objects in the shadow map (the things that are casting shadows). It then uses something called Chebyshev's inequality to get an estimate for how much in shadow the pixel is.

That is a lot of setup explanation, but if having to do 2 pass variance calculations, instead of the 1 pass that it does do, you'd have to run over the full screen of pixels and do logic for each one (subtract the average pixel value), to calculate the variance. In real time graphics, having to do an extra "full screen pass" can be pretty costly, and can easily put a game over budget, meaning the technique would have to be cut so the rest of the game could render fast enough.

This blog post is here in the hopes that the next time someone googles "how do i calculate variance" for use in a computer program, that they see this post, and implement it as a single pass. Fingers crossed! 😛

Using Low Discrepancy Sequences & Blue Noise in Loot Drop Tables for Games

I never thought I’d be much of a stats person but here we are. This post is low on formalism though, so may the gods of formalism have mercy on my soul!

This post includes the result of experiments showing the value of what is talked about, and includes some simple C++ that you can find at: https://github.com/Atrix256/LootLDS

If you’ve ever played a game that involved grinding for loot, you might have looked online and found the drop rate for a specific item, only to find that if it says it drops one out of 100 times, that it takes you 200-300 runs to get it, while your friends get the drop in the first 10 runs.

What the heck is that about?!

That, my friends, is the nature of regular old random numbers – aka white noise – the kind of random numbers you get from rolling fair dice, flipping fair coins, hashing values using good hash algorithms, or using well made random number generators.

The core issue is that white noise random numbers can take on any valid value with equal probability at any time, regardless of whatever has happened before.

If you were betting on whether a fair coin would come up heads or tails after seeing 10 heads, if you say the next will be tails (because of course it will!) you will still only be right 50% of the time. If you flip the coin an infinite number of times, you will get an even number of heads or tails, but before reaching infinity, all bets are off.

This can be a problem for game designers too. They can collect statistics about how their randomized systems are behaving, analyze the results and come to the conclusion that everything is balanced and even. While that may be true when looking at global averages, the individual player experience may vary wildly and not be balanced at all.

Tangent: This is called variance and is the source of noise in raytraced rendering.

Tangent: There’s a fun related story here about the U.S. air force realizing there is no such thing as an average pilot:
https://www.thestar.com/news/insight/2016/01/16/when-us-air-force-discovered-the-flaw-of-averages.html

In any case, is this “globally balanced but individually unbalanced” something we have to live with, or is there something we can do about it?

Luckily there is something we can do about it, and we can ensure that individual players have a more balanced, more pleasant, and more controlled experience, without sacrificing randomization.

Enter Low Discrepancy Sequences

A low discrepancy sequence is a sequence of numbers which are neither too close together nor too far apart.

If we put marks evenly spaced on a number line, the sequence of numbers at those marks would have zero discrepancy because they were evenly spaced. Low discrepancy numbers have a low discrepancy value that is greater than zero.

Examples of low discrepancy sequences that you may have heard of are: Sobol, Halton, Van Der Corput.

Some nice links of mine for learning more about low discrepancy sequences are:
https://blog.demofox.org/2017/05/29/when-random-numbers-are-too-random-low-discrepancy-sequences/
https://github.com/Atrix256/SampleZoo

Tangent: Going back to the raytraced noise example, regular sampling makes aliasing, and white noise sampling makes noise. Low discrepancy sequences sort of lay somewhere in the middle, gaining the benefits of both worlds, and actually having mystical powers of making numerical integration converge very quickly.

So what do low discrepancy sequences have to do with our problem?

If you use a low discrepancy sequence to generate 5 “random numbers” between 0 and 1, those 5 numbers will be roughly evenly spaced, which means that if you use those numbers on a loot table, the player is going to get a wider spread on the full possibilities of what the loot table has to offer.

If something has a very small percentage to drop, the player still has a low probability to get that drop, but if it’s a 1 in 100 drop, it’s more likely to happen at the 100th drop mark.

This is in constrast to white noise where the values may be clumped together and leave big holes in the 0 to 1 range, like: 0.114, 0.081, 0.093, 0.2, 0.95. There is a huge gap empty between 0.2 and 0.95, which is 75% of the possibilities!

There’s a problem with low discrepancy sequences though: They are deterministic – that is, they are predictable and not random. You get the same values from low discrepancy sequences every time.

Before admitting defeat though, there is another way to get randomization from this even though the sequences themselves are not random: You can shuffle the loot table!

Now, if you have thousands of players on an MMO server rolling numbers against a loot table, you probably just barfed in your mouth a little at my suggestion. There is a way to make a “shuffle iterator” though, so that you can get the benefits of shuffling a loot table, without actually having to keep copies of shuffled loot tables around. You’d use some unique player id (and more) as a random seed for the shuffle iterator, then could use a low discrepancy sequence to select loot. This way, each player would see different (randomized) loot drop sequences, but the loot rolls would still be low discrepancy.

Tangent: you can read more about a way to make shuffle iterators using low quality encryption in “Format Preserving Encryption” here: https://blog.demofox.org/2013/07/06/fast-lightweight-random-shuffle-functionality-fixed/

But we aren’t done yet…

Enter Randomized Low Discrepancy Sequences

The low discrepancy sequences talked about in the last section were deterministic, but what if we were able to make number sequences that had low discrepancy but were randomized?

That exists, and it’s known as “blue noise” because blue noise is random numbers which have only high frequency components (like how blue light is high frequency).

The property of both being low discrepancy, but also randomized, is great for many reasons way outside the scope of this article. For our loot drop purposes, it means that the loot will be both unpredictable, but also a player will get a balanced personalized experience, instead of only the global averages being balanced.

Tangent: Here’s a link about how to generate a blue noise sequence: https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/

The other shoe dropping is that blue noise can take a long time to generate, so is computationally expensive. In graphics, it’s common to generate blue noise in advance and just use the pre-made sequence. In loot drops, that is a less viable option because it makes your sequence deterministic and then you are back to shuffling the loot table.

Not real sure the answer here, but it may involve just keeping track of the last N loot drops, and using Mitchell’s best candidate algorithm to generate the N+1th value, adding that to the loot drop RNG list and removing the oldest one. If you get creative you might find a solution that fits your needs.

Prove it

Before we show experimental results, I wanted to defined a couple terms in regards to low discrepancy sequences.

  1. Progressive Sequence – A progressive sequence is a sequence of N values, where if you use less than N of the values, they still have the desired properties. For instance, if you make 100 blue noise distributed numbers, but only use the first 10, if it’s a progressive sequence, those first 10 will also be blue. If it isn’t a progressive sequence, you have to use all 100 before they are blue. This is also a property of deterministic low discrepancy sequences. For our loot drop purposes we NEED to use progressive sequences because other wise, the loot drops won’t be balanced until the very end, which kind of defeats the point.
  2. Open Sequence – An open sequence is one that you can always add more items to. If you regularly space 4 samples from 0 to 1 you are going to get 0, 0.25, 0.5, 0.75. If you want to add a 5th number you can’t! That means that this sequence is not open. Many low discrepancy sequences are open, and using Mitchell’s best candidate to generate blue noise does make an open sequence. For loot drops, we generally do want open sequences, because we usually don’t know how many times the player is going to loot something in advance.

The numbers below are from experiments done using the code that goes with this blog post. It’s ~380 lines of C++ in a single file using only standard includes. You can find it at: https://github.com/Atrix256/LootLDS

I used the following sequences:

  • White Noise – Straight up vanilla RNG.
  • Blue Noise – Using Mitchell’s Best Candidate to generate a progressive, open, uniform blue noise distributed number sequence.
  • Golden Ratio – Starting at 0, i add the golden ratio to the previous loot drop value to get the next loot drop value. I use modulus to keep the value between 0 and 1. The golden ratio has some interesting & good properties as a low discrepancy sequence.
  • Sobol – This is the low discrepancy sequence that leads in numerical integration convergence speeds.

For each sequence type, I generated 10 random loot tables which had 2 to 6 items, each item having a random roll weighting between 1 and 20. I then rolled as many loot drops as i needed to make it so the actual loot drop percentages were within 1% of what the loot drop table said they should be.

Higher numbers mean it took more loot drops for the actual loot dropped to reach the probabilities the loot table said they should be. Lower numbers mean it took fewer loot drops to reach the desired probabilities. I did 10 runs so that you can see if things are consistently good or consistently bad. Just doing a single test isn’t enough to show if something just got lucky one time, or if it’s a good choice.

Here are the results….

  • White Noise: 50513, 7834, 1859, 516, 8824, 3650, 1380, 24461, 35, 12455
  • Blue Noise: 72, 77, 143, 9, 129, 308, 353, 236, 176, 205
  • Golden Ratio: 47, 34, 50, 76, 55, 51, 114, 77, 21, 105
  • Sobol: 216, 155, 161, 77, 13, 71, 56, 75, 127, 51

It’s important to note that the loot tables themselves are generated with white noise, which is a source of variance in the results above. 10 samples isn’t a whole lot, so in a real analysis you might want to do more (100,000 or more runs?) but hopefully you should see that white noise really is not great. I was also surprised to see that Sobol didn’t do that well compared to blue noise and golden ratio. It must just do better at higher dimensions.

One last thing I wanted to mention is that this isn’t limited to loot drop tables. You can use these concepts for randomized events, procedural content generation, and basically anything else you use random numbers for in games.

The important takeaway here is that even if things look right when looking at global averages, using white noise makes the individual experience very different from that global average. Having better control over crafting a player’s individual experience is possible though, and has the possibility of giving a game a more hand crafted feel, even though you are still using RNG.

FRIENDS DON’T LET FRIENDS USE WHITE NOISE!

A Fun 2d Rotation Matrix Derivation

A few weeks ago I joined NVIDIA as part of the graphics dev tech team.

The dev tech team’s scope is pretty wide, but it encompasses nearly all real time graphics programming needs that NVIDIA may have: making code samples for new tech, helping developers optimize their games and make them look prettier, working with graphics researchers, and contributing to graphics research as well.

Sometimes it’s like being a meta engine programmer (making stuff for engine programmers who make stuff for people making games), and other times it’s like working on an early prototype dev kit, where the dev kit is the PC itself.

I’m enjoying it quite a bit so far. It’s a bit surreal to be working along side really respectable folks I’ve seen present at siggraph, or who have written papers I’ve read, but at the same time it feels appropriate because I’m interested in the same areas and am looking forward to mixing in my own contributions and getting to collaborate on cool stuff.

Speaking of which, yes, my first contribution involved blue noise, and I’ll point it out when the thing it’s part of is public!

My new boss is a guy named Rahul, and I could tell things were going to be great, because when we were talking about math, he became visibly dettached from the real world while explaining something. Alternately, he was interested in learning about some obscure topics I had, like dual numbers, or abusing texture sampling to calculate polynomials.

He is the one that showed this logic chain to me.

Act 1 – The 2D Cross Product

So there is this thing that people call the “2d cross product” which is not really a cross product but if you have a 2d vector, it will give you a vector perpendicular to that vector. This is a really valuable tool in the toolbox for game developers, as you are often working in 2d coordinates, even though the world may be 3d, or you can sometimes simplify problems to 2d and bring them back into 3d after solving them.

You calculate the 2d cross product by flipping the x and y components of a vector and flipping the sign of the new x component:

newVec.x = -oldVec.y;
newVec.y = oldVec.x;

You could negate the new y instead of the new x to get the other perpendicular vector (there are 2!) but for this derivation, the way I have it above is important to the result. (I’ll revisit this at the end of the post)

As a simple example, if we take the vector (1,0), flip x and y, and negate the new x, we get (0,1), which is indeed perpendicular.

This works with arbitrary 2d vectors though. Doing it to (1,2) gives you (-2,1) which you can see in the image below is definitely perpendicular.

You can actually express this operation as a matrix:

\begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix}

which you can see results in the same:

\begin{bmatrix}x & y\end{bmatrix} \cdot \begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix} = \begin{bmatrix}-y & x\end{bmatrix}

Act 2 – Complex Numbers

Way back in 2014 I wrote a blog post on how to use imaginary (complex) numbers to do vector rotation: https://blog.demofox.org/2014/12/27/using-imaginary-numbers-to-rotate-2d-vectors/

The key take away relevant here is that if you multiply a vector by the imaginary number i, it’s a 90 degree rotation counter clockwise.

Strangely, that’s exactly what our matrix does too… could this matrix be i? Well, i*i is -1, so let’s see if that happens when we multiply this matrix by itself:

\begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix} \cdot \begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix} = \begin{bmatrix}-1 & 0\\0 & -1\end{bmatrix} = -1 * I

So yeah, that checks out… this matrix, which represents the “2d cross product” is also the imaginary number i, the square root of -1. That’s fun. (the I above is just the identity matrix)

That means that multiplying by this matrix is the same as if we were multiplying a vector by the complex number (0+1i).

We can see this is true by re-doing the example (1,2), and multiplying the values together using FOIL (first, outer, inner, last).

(1 + 2i)*(0+i) = \\ (1*0)+(1*i)+(2i*0)+(2i*i) = \\ 0 + i + 0 -2 = \\ -2 + i = \\ (-2, 1)

Act 3 – Complex Exponentials

Euler’s formula is a way of calculating points on a circle on the complex plane and is given as:

e^{i\theta} = \cos{\theta} + i\sin{\theta}

The value (0+1i) is the just the above formula when theta is 90 degrees, which is the amount of rotation we got when multiplying. We can easily verify that this is 90 degrees by remembering that cosine of 90 is 0, and sine of 90 is 1.

So, let’s replace our multiplication of (0+1i) with the right side of Euler’s formula. This way we can rotate by arbitrary angles, not just 90 degrees.

(x+iy) * (\cos{\theta}+i\sin{\theta}) = \\ (x \cos{\theta}) + (x i\sin{\theta}) + (iy \cos{\theta}) + (iy i\sin{\theta}) = \\ (x \cos{\theta} - y \sin{\theta}) + i(x\sin{\theta}+y\cos{\theta})

From here you can extract it back into matrix form to get:

(x \cos{\theta} - y \sin{\theta}) + i(x\sin{\theta}+y\cos{\theta}) = \\ \begin{bmatrix}x & y\end{bmatrix} \cdot \begin{bmatrix}\cos{\theta} & sin{\theta}\\-\sin{\theta} & \cos{\theta}\end{bmatrix}

Bam, there’s the 2d rotation matrix.

Other Fun Matrices

This post showed the matrix form of the imaginary number i, where i*i=-1.

\begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix}

Dual numbers are another fun type of number where there is an \epsilon that is not zero, but \epsilon \cdot \epsilon is zero.

Dual numbers let you do forward mode automatic differentiation, and you can read some more about them at https://blog.demofox.org/2014/12/30/dual-numbers-automatic-differentiation/

A matrix form for them is:

\begin{bmatrix}0 & 1\\0 & 0\end{bmatrix}

Lastly are hyperbolic numbers (or split complex numbers), where you have an i that is not 1, but i*i is 1. I’m not sure what they are useful for but you can read more here: https://en.wikipedia.org/wiki/Split-complex_number

A matrix form for them is:

\begin{bmatrix}-1 & 0\\0 & 1\end{bmatrix}

By the way, when I introduced the 2d cross product i said flip x and y and negate the new x, but said the other way is possible as well. If you do it the other way, that operation is multiplying by -i, instead of multiplying by i. It isn’t a special case that breaks any of the things in this post, it’s just a different value. It’s a rotation by -90 degrees.

IIR Audio & Data Filters – Featuring Biquads

The last post talked about finite impulse response filters (FIRs) where the output of the filter was based on the current and previous input samples. (https://blog.demofox.org/2020/01/14/fir-audio-data-filters/)

In this post we are going to talk about infinite impulse response filters (IIRs) where the output of the filter is based not only on current and previous input samples, but also previous output samples.

This seemingly minor change makes for much more powerful and interesting filtering possibilities, but as it isn’t stateless, that means it must be evaluated serially (no SIMD!), and so is more computationally expensive.

The simple standalone C++ code that goes with this post is at: https://github.com/Atrix256/DSPIIR

The interactive web demo that goes with this post is at: http://demofox.org/DSPIIR/IIR.html

IIR Difference Equation

So let’s start with an order 2 FIR difference equation:

y_n = a_0x_n + a_1x_{n-1} + a_2x_{n-2}

Now let’s say we want the difference equation to include the previous two output samples too. We can just complete the pattern, by including some previous y terms with coefficient multipliers on the left side of the equation.

y_n + b_1y_{n-1} + b_2y_{n-2} = a_0x_n + a_1x_{n-1} + a_2x_{n-2}

We can then move everything but y_n to the right side of the equation to get a function that gives us our current filtered output:

y_n = a_0x_n + a_1x_{n-1} + a_2x_{n-2} - b_1y_{n-1} - b_2y_{n-2}

You might be wondering why there is no b_0 term and the reason for that is because it would be a multiplier for y_n which is weird. Do we really need to scale the output like that? Sometimes people will include the b_0 term, and will divide both sides of the equation by b_0 to get an equation like this:

y_n = \frac{1}{b_0}(a_0x_n + a_1x_{n-1} + a_2x_{n-2} - b_1y_{n-1} - b_2y_{n-2})

Let’s just pretend that if b_0 exists, it’s value is always 1, and then we can move on without it actually being there, complicating our equations needlessly.

So, to repeat it, here is a difference equation for an order 2 IIR filter, which is built on an order 2 FIR filter.

y_n = a_0x_n + a_1x_{n-1} + a_2x_{n-2} - b_1y_{n-1} - b_2y_{n-2}

You can pull out the a_0 parameter as a gain parameter again if you want to, but the b parameters don’t get the same sort of benefit, so you can leave them in their raw form.

y_n = a_0(x_n + \alpha_1x_{n-1} + \alpha_2x_{n-2}) - b_1y_{n-1} - b_2y_{n-2}

IIR Transfer Function

To calculate the transfer function, lets start back from where we added the previous outputs on the left side of the equation:

y_n + b_1y_{n-1} + b_2y_{n-2} = a_0x_n + a_1x_{n-1} + a_2x_{n-2}

Next, let’s take the z transform:

y(z) + b_1y(z)z^{-1} + b_2y(z)z^{-2} = a_0x(z) + a_1x(z)z^{-1} + a_2x(z)z^{-2}

We then factor out y(z) and x(z) to get:

y(z)(1 + b_1z^{-1} + b_2z^{-2}) = x(z)(a_0 + a_1z^{-1} + a_2z^{-2})

Since the transfer function is just y(z) divided by x(z) we can do simple algebra to do that now!

\frac{y(z)}{x(z)}(1 + b_1z^{-1} + b_2z^{-2}) = a_0 + a_1z^{-1} + a_2z^{-2} \\ H(z) = \frac{y(z)}{x(z)} = \frac{a_0 + a_1z^{-1} + a_2z^{-2}}{1 + b_1z^{-1} + b_2z^{-2}}

You can factor out the a0 term to be gain if you want, to get a more familiar looking top of the equation:

H(z) = \frac{a_0(1 + \alpha_1z^{-1} + \alpha_2z^{-2})}{1 + b_1z^{-1} + b_2z^{-2}}

From there you can plug in frequencies and see what sort of frequency and phase response you get, just like in the last blog post for FIRs.

You might notice that the transfer function is quadratic in the numerator, and the denominator. This is in fact called a “biquadratic filter” or a “biquad” for short.

Often times higher order filters (like EQs) are made by chaining multiple biquads together. Biquads are a pretty important staple of DSP.

Pole Zero Plot

You might wonder why in the last post we called it a “Pole Zero Plot” when there were zeros but no poles.

IIRs add poles to the pole zero plot and they are where the function shoots to infinity. This happens in a fraction when you divide by zero, so a pole is a place where the denominator of the transfer function is zero.

To make it explicit:

  1. You solve the quadratic equation in the numerator to get the zeros of that quadratic equation, which are also the zeros of the filter.
  2. You solve the quadratic equation in the denominator to get the zeros of that quadratic equation, which are the also the POLES of the filter.

That makes things pretty easy for making a pole zero plot. Calculating the zeros of the filter is the same, and you use the same technique to find the poles.

In the last post, we saw that the zeros of an order 2 FIR filter were at:

z = \frac{-\alpha_1}{2} \pm \frac{\sqrt{\alpha_1^2-4\alpha_2}}{2}

That still works for IIRS too. For poles, all you need to do is replace the alphas with bs:

z = \frac{-b_1}{2} \pm \frac{\sqrt{b_1^2-4b_2}}{2}

Example Filters

Unlike FIRs which are relatively tame and can only make certain frequencies lower, IIRs can cause frequency amplitudes to get much higher and even shoot off to infinity. That makes for some cool sounds by adding distortion to certain frequency notes of an instrument but not others.

Check the links section for the “RBJ cookbook” if you want to get deeper into the filters, but here are a couple interesting filters I found.

This one boosts low frequencies and attenuates high frequencies.

This does the reverse and boosts high frequencies while attenuating low frequencies.

This makes use of both poles and zeros to make a really interesting sounding filter.

It’s also still fun to slowly move the controls back and forth while some simple instrumental loop plays. Filtering white noise is still really interesting too because white noise has all frequency content, which means filtering frequencies out or boosting them up will always affect white noise. That isn’t true of things with simpler frequency components. The extreme of this is the sine wave which only has a single frequency so is unaffected by other frequencies being boosted up or attenuated.

Crafting a Filter By Placing Zeros

Creating a biquad filter from zero and pole locations is pretty simple if you already can make an FIR filter by placing zeros. In fact, that is exactly how you place the zeros.

To place the poles, you do the exact same steps as placing the zeros, but the coefficients you get out are for b0, b1 and b2 instead of a0, a1 and a2.

Estimating Frequency Response From a Pole Zero Plot

Doing this from an FIR involved getting the distance from a point on the unit circle to every zero, multiplying all those distances together as a product and that is the amount the amplitude would be multiplied by.

Adding poles into the mix extends this.

As a second step, you get the distance from the point on the unit circle to every pole. You multiply all those pole distances together as a product and divide the previous product by this amount.

That’s all there is to it, and you should hopefully be able to see that this is why a frequency at the pole would shoot to infinity. The distance is zero so the frequency response shoots to infinity.

Oscillators

If you want to generate sinusoid waves of specific frequencies, you can use IIRs to do that by putting the poles on the unit circle, and leaving the zeros at the origin.

For a filter to be stable (not shoot off to infinity), the poles need to be inside of the unit circle, so putting them on the edge of the unit circle is playing with fire a bit, but the instability is also what makes the oscillator function.

We could talk about the algebra for calculating polynomials in the numerator and denominator of the transfer function to do this, but lets jump to the end and look at the simplified result of what to set the parameters to in the difference equation below:

y_n = a_0x_n + a_1x_{n-1} + a_2x_{n-2} - b_1y_{n-1} - b_2y_{n-2}

The parameters should be:

  • a_0 = 1
  • a_1 = 0
  • a_2 = 0
  • b_1 = 2 cos(\omega)
  • b_2 = 1

Where omega (\omega) is the amount of radians to advance the wave form per sample.

If you plug these into the difference equation, you can simplify it to this:

y_n = x_n - 2 cos(\omega)y_{n-1} - y_{n-2}

These oscillators don’t need input and just want a steady stream of zeros. Because of this, we can remove the input from the equation.

y_n = -2 cos(\omega)y_{n-1} - y_{n-2}

The last thing you need to do however is to give it some initial starting state. If you make it so y[-1] and y[-2] are zero, to calculate y[0] and y[1], the oscillator won’t work correctly.

This is because we need to initialize the state (prior output) to be as if it’s been running all along.

So, you can set y[-1] to be cos(-\omega*1) and y[-2] to be cos(-\omega*2). That will make it so the next sample will be cos(0) which means the wave will start at zero degrees and continue.

You could initialize the state to whatever phase you wanted to, by initializing y[-1] and y[-2] to the prior two values to the desired phase.

As a quick and dumb example, let’s look at a sine wave that advances 180 degrees (pi radians) per sample. That means b1 will be 2, which makes our difference equation:

y_n = -2y_{n-1} - y_{n-2}

We’ll initialize y[-2] to be cos(-2pi) or 1, and we’ll initialize y[-1] to be cos(-pi) or 0.

Following the difference equation starting at y[0] we get…

\begin{array}{|c|c|} \hline \text{y index} & \text{value} \\ \hline -2 & 1 \\ -1 & -1 \\ 0 & 1 \\ 1 & -1 \\ 2 & 1 \\ 3 & -1 \\ 4 & 1 \\ 5 & -1 \\ \hline \end{array}

180 degrees is nyquist, and we can see that it’s doing the right thing of flipping between 1 and -1. It works with less clean numbers too, and the simple c++ code that goes with this post shows that working (https://github.com/Atrix256/DSPIIR).

Unfortunately, with less clean numbers, this method will start to drift from reality over time due to floating point imprecision. One way to deal with this is to reset the oscillator after every 360 degrees of advancement.

Nick Appleton (@nickappleton) has an alternate method if you are looking for a cheap oscillator.

First you make two complex numbers:

  • y = 1 + 0i
  • a = e^(i*omega)

Where omega is still the number of degrees the wave advances per sample. Another way to calculate a is: std::polar(1.0f, radiansPerSample)

Then, for each sample you do this:

y = y * a

the resulting complex number will have the cosine value in the real portion, and the sine value in the imaginary portion.

This has better numerical stability, which you can see in the c++ code output too.

Links

Here are some good links where i got info about oscillators.

http://dspfirst.gatech.edu/chapters/08feedbac/demos/recur/index.html

http://www.eas.uccs.edu/~mwickert/ece5655/lecture_notes/ece5655_chap8.pdf

There is a famous document called the “RBJ cookbook” which gives recipes for biquads that act in specific ways. RBJ stands for the author’s name Robert Bristow-Johnson. You can find it attached to the message at the link below!

https://www.musicdsp.org/en/latest/Filters/197-rbj-audio-eq-cookbook.html

Marc B Reynolds (@marc_b_reynolds) recently put out an interesting blog post talking about how it’s better to use integers to repeatedly add things (irrational numbers in his case) instead of using floats. There are some good lessons there that apply here as well I bet, probably especially for oscillators.

http://marc-b-reynolds.github.io/distribution/2020/01/24/Rank1Pre.html

FIR Audio & Data Filters

There is an interactive demo that goes with this post where you can play with filter parameters, hear the resulting filter applied to audio samples, and copy/paste the simple formula to use the filter in your own applications:

Finite Impulse Response Filtering.

There is also some simple standalone C++ code (~170 lines) that shows how to apply filters, as well as calculate phase/frequency response and zero locations:

https://github.com/Atrix256/DSPFIR

The math behind frequency filters can look pretty daunting, but it’s actually not that bad!

This post is going to talk about a simple, real time friendly filter called a “Finite Impulse Response” or FIR filter. They are also known as feed forward filters.

An order 1 FIR filter looks like this:
y_n = a_0x_n + a_1x_{n-1}

y_n is the filtered sample, x_n is the current unfiltered sample, x_{n-1} is the previous unfiltered sample, and a_0 and a_1 are values to multiply the samples by to control the filtering.

For instance, the below averages the samples, which is also called a box filter. This is in fact a low pass filter, meaning that it filters away higher frequencies while letting low frequencies pass through.
y_n = 0.5*x_n + 0.5*x_{n-1}

If you change the addition to a subtraction, you get a high pass filter, which filters out low frequencies, and lets high frequencies pass through.
y_n = 0.5*x_n - 0.5*x_{n-1}

An order 2 FIR filter looks like the below and the pattern continues for higher order filters.
y_n = a_0x_n + a_1x_{n-1} + a_2x_{n-2}

These equations are called difference equations and this post is going to show you how to analyze them to understand what they do, and how to craft your own that behave in specific ways.

Gain Parameter

If you made a machine that applied an order 1 FIR filter, where you gave a knob to control a_0 and a_1, you’d find that the knobs were pretty unintuitive.

y_n = a_0x_n + a_1x_{n-1}

Specifically, if you had the knobs set to a specific setting that you liked, but you wanted the overall filtered sound to be louder or quieter, it would be difficult to make that happen. (spoiler: the ratio between them would need to stay the same but if you made them larger magnitude numbers the result would get louder)

This problem only gets worse with higher order filters as you add more knobs.

To deal with this, you can factor out the a_0 parameter which makes the order 1 equation into the below:

y_n = a_0(x_n + \frac{a_1}{a_0}x_{n-1})

We can replace \frac{a_1}{a_0} with \alpha_1 to get this equation:

y_n = a_0(x_n + \alpha_1x_{n-1})

Now, you can have two knobs for those two parameters and things are much easier to deal with. You can adjust a_0 to adjust the volume, and you can adjust \alpha_1 to adjust how frequencies are affected by the filter.

If you wanted to get a_1 out again, like for use in the difference equation, you would just calculate it like this:

a_1 = \alpha_1 * a_0

You can do the same process to the order 2 FIR filter to get this equation:

y_n = a_0(x_n + \alpha_1x_{n-1} + \alpha_2x_{n-2})

where

\alpha_1 = \frac{a_1}{a_0}

and

\alpha_2 = \frac{a_2}{a_0}

From Difference Equation to Transfer Function

The difference equation is how you filter data, but it doesn’t really help you understand what the filter is actually doing.

To understand what the filter does, and how it affects various frequencies, we need something called a transfer function. We can transform a difference equation into a transfer function with a handful of steps.

First, instead of putting a signal x_n through our filter, we need to put through a complex sinusoid e^{i \omega t}, which will let us plug in different frequency values for omega (\omega) to see how the filter affects those frequencies.

While we can replace x_n with e^{i \omega t}, what do we replace x_{n-1} with? We replace it with e^{i \omega (t-1)}. Another way to write that is e^{i \omega t}e^{-i \omega}. That shows that we can multiply any point in our complex sinusoid signal by e^{-i \omega} to get the previous sample, which is super handy, as it delays the signal by one sample.

So, let’s go back to our difference equation and plug in the complex sinusoid.

y_n = a_0(x_n + \alpha_1x_{n-1}) \\ y(t) = a_0(e^{i \omega t} + \alpha_1e^{i \omega t}e^{-i \omega})

Then let’s factor out the complex sinusoid.

y(t) = a_0e^{i \omega t}(1 + \alpha_1e^{-i \omega})

That sinusoid is what we input into the equation so we can replace it with x(t).

y(t) = a_0x(t)(1 + \alpha_1e^{-i \omega})

The transfer function is y(t) divided by x(t) so let’s put it in that transform function form:

H(\omega) = \frac{y(t)}{x(t)} = a_0 (1 + \alpha_1e^{-i \omega})

The next step is to do a replacement in the function. We are going to define z=e^{i \omega}, which makes our transfer function look like the below.

H(z) = a_0 (1 + \alpha_1z^{-1})

We are done! That is the transfer function of our order 1 FIR filter which is controlled by two parameters:

  • a_0 – This is the gain (volume) of the filter
  • \alpha_1 – This controls how the frequencies are filtered. Multiply this by a_0 to get the a_1 parameter if you ever want to (like to use in the difference equation), but let the user control this value directly.

For order 2, if we do the same, we end up with this equation:

H(z) = a_0 (1 + \alpha_1z^{-1} + \alpha_2z^{-2})

The parameters are the same but order 2 adds in a \alpha_2 which also controls how the frequencies are filtered. You can also multiply \alpha_2 by a_0 if you want to calculate the a_2 value for use in the difference equation.

Complex Sinusoids, Complex Unit Circle, Phase and Frequency Response

There’s a bizarre looking equation called Euler’s identity which looks like this:

e^{i\pi}=-1

This is actually just a specific value (pi) plugged into the more general Euler’s formula:

e^{ix}=\cos x+i\sin x

In those formulas, you can see that x is an angle, so when you plug in pi, you are plugging in 180 degrees. The result you get out is a complex number where the real part is the cosine of the angle, and the imaginary part is the sine of the angle.

The cosine of pi is -1, and the sine of pi is 0, so that’s why Euler’s identity gives you -1 as a result. It’s really -1 + 0i.

Something interesting you may remember is that if you want to draw a point on a unit circle, x is cosine theta, and y is sine theta. (Theta being the angle from the origin to the point)

So, what this tells us is that Euler’s formula is a way of drawing a circle where the x axis is the real number axis, and the y axis is the imaginary number axis.

In the last section, we plugged a similar looking – but different – complex sinusoid into the equation that looked like this:

e^{i \omega t}

The difference is that we have omega times t as our x value.

t is the time variable, so is what we increase to move around the circle.

Omega says how many radians around the circle the point travels every time t increases by 1.

If omega is zero, that means the point is stationary and doesn’t move. It also means it has a frequency of zero. This is 0hz or DC.

If omega is pi/2, that means the point moves pi/2 radians (90 degrees) around the circle every time t increase by 1. If this were a cosine wave, and t was an integer index for samples, it would repeat: 1, 0, -1, 0. This is also 1/2 nyquist frequency.

if omega is pi, that means the point moves pi radians (180 degrees) around the circle every time t increase by 1. As a cosine wave it would repeat: 1, -1. This is nyquist frequency.

Anything higher frequency than nyquist has aliasing problems and can’t be reconstructed from the samples, so it makes sense to stop there.

Similarly, if at 0 radians, you go negative, such as -pi/2, that is the same as 3*pi/2 which is above nyquist, so we stay away from the negative frequencies as well.

In any case, we are plugging a circle into the transfer function where the angle on the circle just means “how many radians does this wave form advance each sample”, and as output, we get that circle but distorted. A filter where output = input without any filtering happening would leave the circle unaffected.

The output point for a given input point is another point on the complex plane (x = real, y = imaginary).

The distance of this point from the origin (or, the magnitude of this vector) tells us how much the amplitude (volume) would be altered for a signal with that frequency. A value less than 1 means that it gets smaller/quieter, while a value greater than 1 means that it gets larger/louder.

The angle that the output point is on the circle (use atan2 to get this value) tells you how much this signal will be “phase shifted” or how much it will be delayed. For instance, if the signal is advancing at 90 degrees per sample, and it has a phase shift of -90 degrees, that means the signal is effectively delayed by 1 sample. A positive phase shift means that the signal is moved forward in time, and is an anti-delay (it’s from the future kinda).

This stuff is all very interesting, because real signals / sounds are commonly made up of several different frequencies. The filter treats all those frequencies independently and does the things it says it will – as far as wave amplitude and phase shift – to each frequency that may be present, independent of what it does to the other frequencies.

Frankly, it feels a bit magic that it’s able to do that, but that’s what it does.

You can actually plug angles between 0 and pi into the transfer function, get complex numbers out, calculate the amplitude and phase response, and graph them. This gives you something where you can see how a filter actually behaves. We do 0 to pi, because that is 0hz to nyquist.

Below is the before mentioned “box filter” aka low pass filter, that you get when you average the current sample with the last. You can see that the low frequencies are left alone (0db) but that higher frequencies drop down in amplitude a lot as they approach nyquist. The x axis labels are % of nyquist, so at the very right is the nyquist frequency – the highest frequency that is able to be expressed.

Note: this and the other screenshots came from the interactive demo that goes with this post. You can find it at http://demofox.org/DSPFIR/FIR.html

Pole Zero Plot of Order 1 FIR Filter

Lets take a look at the transfer function of our order 1 FIR filter again:

H(z) = a_0 (1 + \alpha_1z^{-1})

Now i have a question for you… what value(s) of z make this function equal to zero? You can work through some simple algebra to find that it’s where z = -\alpha_1:

a_0 (1 + \alpha_1z^{-1}) = 0\\ 1 + \alpha_1z^{-1} = 0\\ \alpha_1z^{-1} = -1\\ z^{-1} = -\frac{1}{\alpha_1}\\ z = -\alpha_1

This just means that on the complex plane, the function equals zero at that location.

As points in the function get closer to there, they will approach zero as well.

What this means is that frequencies on the complex sinusoid unit circle that are nearer to this zero will be attenuated (shrunken, made smaller / quieter), while frequencies farther away won’t be, or won’t be attenuated as much.

So, you control what frequencies get filtered just by where the zero is!

If the zero is on the left, at angle pi (180 degrees), where the highest frequencies live (nyquist!), the high frequencies will get attentuated, making it a low pass filter. Below is the box filter again, which does this.

If the zero is on the right, at angle 0, where the low frequencies live, the low frequencies will get attenuated, making it a high pass filter.

If you put the zero in the middle, you get a filter that doesn’t do anything, other than possibly a0 adjusting the volume, like in this filter.

You don’t have to keep the zero inside the unit circle though. Here it is going outside of the unit circle to the right. You can see that low frequencies are still attenuated, so it’s a high pass filter, but they are not attenuated as sharply.

It’s also worth noting that the last filter was the first filter where the phase response was not a line. The other filters had “linear phase response”. Linear phase filters have some nice properties that are sometimes desired, such as not distorting the shape of a wave form, other than adjusting frequency amplitude.

You’ve now essentially seen everything that an order 1 FIR filter can do. There’s a zero you can move around on the real axis (the x axis) and any frequencies near the zero get quieter when passed through the filter.

It’s important to note that the zero can only make things quieter. This means that FIR filters are stable. They never suffer from feedback issues where they make something louder and louder and louder into infinity or start spitting out NaNs. That makes them pretty attractive if you really want a stable filter, but that stability is also tameness.

Pole Zero Plot of Order 2 FIR Filter

Let’s take a look at the transfer function of 2nd order filters.

H(z) = a_0 (1 + \alpha_1z^{-1} + \alpha_2z^{-2})

Just like with the order 1 filters, we need to find out where the zeroes of the function are. With some algebra we can get it to something more manageable:

a_0 (1 + \alpha_1z^{-1} + \alpha_2z^{-2}) = 0 \\ 1 + \alpha_1z^{-1} + \alpha_2z^{-2} = 0 \\ \frac{z^2 + \alpha_1z^1 + \alpha_2}{z^2} = 0

Looking at that, when z is 0 (the origin), it shoots to infinity. That’s called a “pole” which is the opposite of a zero. We’ll talk about them lots more in the next post, but for now, since it’s at the origin we can ignore it as a “trivial pole”.

So, we need to figure out when the top part of the fraction is equal to zero:

z^2 + \alpha_1z^1 + \alpha_2 = 0

Luckily this is a standard quadratic polynomial, just like if we had Ax^2+Bx+C. In our case, x is z, A is 1, B is \alpha_1 and C is \alpha_2.

To solve those equations we normally would use the quadratic equation:

\frac{-B \pm \sqrt{B^2-4AC}}{2A}

For our purposes, it is more convenient to re-arrange that a bit to be like the below:

\frac{-B}{2A} \pm \frac{\sqrt{B^2-4AC}}{2A}

The reason for this is because we are going to have two zeroes (roots) when we solve this thing. The left side is the x axis (real axis) middle point between the two roots. The right side tells us how much to move on each side of that middle point to get to the zeroes. If the number under the square root is positive, it will give us a real number answer, that we will use to move left and right on the x axis. If the number under the square root is negative, it will give us an imaginary number answer, that we will use to move up and down on the y axis.

We can plug in our values for A, B and C and get our roots:

z = \frac{-\alpha_1}{2} \pm \frac{\sqrt{\alpha_1^2-4\alpha_2}}{2}

It isn’t as clear where the roots are as it was for an order 1 filter, but it will tell us where they are if we plug values in.

Firstly, the center between the zeroes will be at \frac{-\alpha_1}{2}. So, if \alpha_1 is 1.0, the zeroes will be evenly spaced away from -0.5 on the real axis. If it’s -2.0, the zeroes will be evenly spaced away from 1.0 on the real axis.

From there, you just modify \alpha_2 to change if the zeroes move on the real axis (x axis) or the imaginary axis (y axis) and by how much. Pro Tip: If you give \alpha_2 a value of 1.0, the zeroes will be on the unit circle.

Here is an order 2 high pass filter:

Here is an order 2 low pass filter:

Lastly, here’s an order 2 notch filter, which means it cuts a “notch” out of the frequency spectrum. It can sound pretty cool to play a sound (in the acompanying demo) and slide the notch up and down the frequency range.

Estimating Frequency Response From a Pole Zero Plot

If you have a pole zero plot of an FIR, here is how you estimate how much a specific frequency will be attenuated by that filter.

Draw a line from that frequency on the unit circle to every zero on the plot and get the lengths of those lines.

Multiply all the length of the lines together.

That final result is an estimate of how much the amplitude of a sinusoid wave in that frequency will be attenuated.

A value smaller than 1 means it will get smaller (quieter) while a value greater than 1 means it will get bigger (louder).

I’m not sure why it’s only called an estimate though. When looking at order 1 and order 2 filters, I’ve been unable to find a situation where they don’t match visually. The demo has a checkbox at the bottom to draw the estimate in teal if you want to see for yourself.

Why is it Called Finite Impulse Response?

There is a concept of an “impulse” where you have a signal this is a bunch of zeroes, then a single 1, then a bunch of zeroes again. Putting an impulse through a filter is a way to analyze how a filter behaves, and in fact is a unique finger print for the filter which you can use to reconstruct the coefficients of the filter itself.

If it’s a finite impulse response filter, the impulse response is actually the coefficients of the filter itself, which is pretty nifty.

Let’s check out the difference function for an order 1 FIR:

y_n = a_0x_n + a_1x_{n-1}

In this situation we have only a couple cases to think about.

When we start in the “a bunch of zeroes” section of the signal, it’s going to be a0 * 0 + a1 * 0, which is always going to be zero.

So, for that part of the signal, we always have zeroes.

Then, we hit a point where x_n is 1 and x_{n-1} is 0. In this case, the output of the filter is going to be a0.

Next, we hit the point where x_n is 0 again and x_{n-1} is 1. In this case, the output of the filter is going to be a1.

Then, we have a bunch of zeroes again and the filter will keep putting out zeroes until the sun goes dark.

The fact that this filter had a “finite response” to the impulse, meaning it settled back down to zero and stayed zero, is why it’s named a finite response.

In contrast to this, if the filter had some kind of internal state that was updated as each sample went through, it could very well never go back to zero. In that case, we’d have an infinite impulse response or IIR. The next blog post is going to talk about that kind of a filter.

Crafting a Filter By Placing Zeroes

If you have a high order FIR – say an 8th order one – it’s very difficult to find the zeroes because you ultimately need to find the zeroes of an 8th order polynomial, which is not really possible analytically in the general case.

However, if you want to go the reverse way, where you have placed your zeroes in specific locations and want to make a filter from that, that is a lot easier.

First up, you need to decide where your zeroes are.

  • You can have as many zeroes as you want.
  • Duplicating zeroes make them have steeper attenuation, but also the effect is more localized and more distant frequencies are affected less than closer ones which are affected more.
  • You can add as many real valued zeroes as you want.
  • For every complex zero you add, you also need to add it’s complex conjugate. For instance, if you add 1+3i, you also need to add 1-3i. If you don’t do this, your filter can spit out complex numbers instead of only real numbers!

To work through an example, let’s say that we have 4 zeroes: 3, -2, 3i, -3i.

First we need to make a polynomial with those zeroes.

0 = (x-3)(x+2)(x+3i)(x-3i) \\ 0 = (x^2-x-6)(x^2+9) \\ 0 = x^4-x^3+3x^2-9x-54

Next, we could work through some simple algebra (the reverse of what we did in previous sections), but to jump straight to the punch line, our filter coefficients are the coefficients of this polynomial: 1, -1, 3, -9, -54. That gives us this difference equation:

y(n) = 1*x_n - 1*x_{n-1} + 3*x_{n-2} - 9*x_{n-3} - 54*x_{n-4}

Or simplified:

y(n) = x_n - x_{n-1} + 3x_{n-2} - 9x_{n-3} - 54x_{n-4}

If you want a a_0 gain parameter in your filter function, just multiply all the coefficients by that value.

You can also use those coefficients to go directly to the transfer function very similarly. The coefficients are inside the parens.

H(z) = a_0(1*z^0 - 1*z^{-1} + 3*z^{-2} - 9*z^{-3} - 54*z^{-4})

Or simplified:

H(z) = a_0(1 - z^{-1} + 3z^{-2} - 9z^{-3} - 54z^{-4})

Before moving on, it’s important to note that some coefficients may be zero and you need to watch out for that. For instance, let’s say you have zeroes at +/- i:

0 = (x-i)(x+i) \\ 0 = x^2 + 1

You would probably be tempted to say that the filter coefficients are “1, 1” (I made that mistake!) but in fact, they are “1, 0, 1”. The equation above is really this one below:

0 = x^2 + 0x + 1

Making Cool Sounds & Music

A really fun thing to do in the demo (http://demofox.org/DSPFIR/FIR.html) is to make it play white noise and then filter the white noise. You can make it sound like wind or the ocean, or other things. White noise contains all frequencies so filtering like this can make some really interesting sounds. Some other types of sounds (like a sine wave tone) have only a single or very few frequencies, so putting them through filters doesn’t make them sound that interesting.

Filtering sounds to create new sounds is called subtractive synthesis (since you are attenuating or removing frequencies). It’s not something I’m very experienced at musically, but I’m finding that I really like the sound of making some frequencies become very loud and clip (causing distortion), while other frequencies do not. I anticipate the next post on IIRs will have more of that. IIRs place infinities (poles) on the pole zero plot so can boost frequencies quite high, instead of how FIRs can only place zeroes that make some frequencies quieter.

Here’s something to try in the demo…

  1. Have it play the harp sound clip
  2. Set it to an order 2 filter
  3. Turn on overdrive and set gain to 33
  4. A0 = 33, alpha1 = -1.34, alpha2 = 1.
  5. Bonus: slide alpha1 back and forth slowly. You could imagine putting that parameter on low frequency sine wave so that it did this part automatically.

You are on your way to making music! If that sounds awesome to you, you might like my other audio blog posts: https://blog.demofox.org/#Audio

Closing

Now that you understand how FIR’s work, you may be interested in a post i wrote about generating different distributions and colors of noise using dice. It turns out they are just FIRs in disguise. The dice are just a source of white noise and the operations with adding and subtracting is just having positive and negative coefficients in the filter 🙂

https://blog.demofox.org/2019/07/30/dice-distributions-noise-colors/

Also, something nice about FIR filters is that the filter is stateless. You just multiply the last N samples by N specific coefficients to get your result. This means you can apply a filter to a signal via convolution. Convolution isn’t a cheap operation, but it does a lot of simple branchless operations in parallel, which is something that things like GPUs and SIMD excel at. You can also use the convolution theorem by taking the signal and filter coefficients into frequency space (DFT via the FFT), multiplying them together, and taking them back into time space (IDFT via the FFT again). This can be more efficient for a high order filter. In fact, really nice reverb is made up of a very long impulse response, which makes it a very high order filter. It’s not uncommon to implement reverb via convolution (but doing it as a multiplication in frequency space).

I’d like to thanks these 3 peeps who helped me understand some DSP concepts. Thanks for your kindness!
@BartWronsk
@nickappleton
@rygorous

Also, this book is exceptional as an introduction to DSP, specifically meant for audio purposes. It has dsp theory but also information about how to program things like compressors and limiters, flange, chorus and reverb. It’s very cool and a lot of fun.

Bezier Triangles

There’s an interactive Bezier Triangles webGL demo I made that goes with this post, and what i used to make the 3d rendered images with. You can find here:
http://demofox.org/BezierTriangle.html

Above is a cubic bezier triangle which has 10 control points. Each edge of the triangle is a cubic Bezier. That makes for 9 control points, and the 10th control point is in the center of the triangle.

I’ve been thinking more about using raytracing for data lookups (https://blog.demofox.org/2018/11/16/how-to-data-lookups-via-raytracing/) and have been trying to think of how you might get higher order interpolation than linear barycentric interpolation. The hope is to fit the data better, and with fewer data points, resulting in a sparser data mesh.

The higher order interpolation may take a little more compute and memory access, but having fewer triangles to raytrace against is a good thing (smaller BVH) besides the potentially better data fit. There’s a trade off here, I’m sure – and I’m sure that trade off is situational.

In the past, I found a way to exploit the linear interpolation of texture samplers to get higher order interpolation (https://blog.demofox.org/2016/02/22/gpu-texture-sampler-bezier-curve-evaluation/) but when thinking about how to extend that to triangles in a vertex shader, I never found a good way to do it.

One of the challenges in the vertex shader triangle case is not easily being able to get barycentric coordinates on the triangle, but that’s drop dead simple in the raytracing case, since barycentric coordinates are one of the few pieces of data you actually have out of the box.

These ideas came together and I realized it was time to finally dig into the guts of Bezier triangles.

Luckily, it turns out they are not that scary!

Here is some info on Bezier curves that are good background for the rest of this post:

Bezier Simplex and Barycentric Interpolation

Bezier triangles are very close conceptually to Bezier curves, and in fact, both of them are examples of a Bezier Simplex.

A simplex is the shape with the fewest number of points to create an object in a given dimension.

A 0 dimensional simplex is a point, a 1 dimensional simplex is a line, a 2 dimension simplex is a triangle, a 3 dimensional simplex is a tetrahedron, and so on.

Barycentric coordinates are coordinates on a simplex, and sum up to 1.0.

Bezier curves are parameterized by a value t, like in the linear Bezier curve formula below (which is also just a lerp):

y=A*(1-t)+Bt

Another way to think of this though is that there are two parameters s and t, where they sum up to 1. That makes s=1-t. That lets us write the same equation like the below:

y=As+Bt

That is 1 dimensional linear barycentric interpolation between the values A and B on a line.

We can bring that up to 2 dimensions to get a linear barycentric interpolation between 3 points on a triangle.

y=As+Bt+Cu

That extends to any dimension by just adding more terms.

Much like we can bring this concept to higher order interpolation on a line with Bezier curves, we can do higher order Barycentric interpolation on simplices of higher dimensions. If you do this with a triangle, you get a Bezier triangle. If you do this with a tetrahedron, you get a Bezier tetrahedron, and so on.

Calculating Barycentric Coordinates

Calculating the barycentric coordinates is conceptually pretty simple.

  1. Start with an N dimensional simplex.
  2. When you place a point in it, that divides the simplex into several simplices of the same dimension. Each of these other simplices will be opposite a point on the larger simplex.
  3. Calculate the area of a smaller simplex and divide it by the area of the large simplex. This is the barycentric coordinate of the point that is opposite this simplex.

That explanation might have been a little thick. Lets look at lines and triangles as an example.

First is lines:

The point A divides the line (1-simplex) into two parts: P10 to A, and A to P01. P10 to A is 1 unit long. A to P01 is 4 units long. The entire line’s length is 5 units long.

To calculate the barycentric coordinates (s,t) we first start with s.

The simplex opposite P10 is the A to P01 line segment, which has a length of 4. Dividing that by 5, we get a value of 0.8. So, the value of s is 0.8.

The value of t can be found the same way. The opposite line segment of P01 is P10 to A and has a length of 1, which we can divide by 5 to get 0.2. We could have also just calculated 1 – 0.8 since they have to add up to 1.0!

The Barycentric coordinates of point A on that line is (0.8, 0.2).

Calculating the Barycentric coordinates of a triangle is very similar. In the image below, point A has Barycentric coordinates of (0.5, 0.333, 0.166)

A point is known to be inside a simplex if all the components of it’s Barycentric coordinates are between 0 and 1.

The De Casteljau Algorithm For Bezier Triangles

Starting with linear Bezier triangles, you have a control point at each corner of the triangle, and you use barycentric interpolation of those values to get the value at a point on the triangle.

This is a linear interpolation (the result is on a plane), and is what we’ve been doing in vertex shaders since vertex shaders existed. Simple barycentric interpolation.

Much like how a linear Bezier curve is a linear interpolation (lerp) between two values, a linear Bezier triangle is a linear interpolation between three values. Each edge of the triangle is also a linear Bezier curve.

Next up is a quadratic Bezier triangle.

A quadratic Bezier curve is made by linearly interpolating between two linear Bezier curves. A quadratic Bezier triangle is made by linearly interpolating between three linear Bezier triangles.

Here is a refresher on how a quadratic Bezier curve actually works. There are 3 control points A, B, C. One linear interpolation is from A to B. The other linear interpolation is from B to C. These values are linearly interpolated for the final point on the curve.

How does this work on a Bezier triangle? Each edge of the triangle in a quadratic Bezier curve which gives us control points like the below:

What we do first is use point P’s Barycentric coordinates to interpolate the values of the red triangle: 002, 011, 101 to get point A. You treat those values as if they are at the corners of the larger triangle (if you think about it, this is how quadratic Bezier curves work too with the time parameter. Each line segment has t=0 at the start and t=1 at the end).

We next interpolate the values of the green triangle: 011, 020, 110 to get point B.

We then interpolate the values of the blue triangle: 101, 110, 200 to get the point C.

Now we can do a regular linear barycentric interpolation between those three values, using the same barycentric coordinates, and get a quadratic Bezier triangle. Each edge of the triangle is also a quadratic Bezier Curve.

For a cubic Bezier triangle, it’s much the same. You evaluate a quadratic Bezier for each corner, and then linearly interpolate between them to get a cubic. Below is an image showing the control points used to create the three quadratic Bezier triangles.



Which results in a cubic Bezier triangle. The edges of this triangle are cubic Bezier curves.

This works the same way with any Bezier simplex by the way.

The Formula For Bezier Curves

Just like Bezier curves, you can come up with a multivariable polynomial to plug values into for Bezier Triangles, instead of using the De Casteljau algorithm. The De casteljau algorithm is more numerically stable, but the polynomial form is a lot more direct to calculate.

Let’s consider Bezier curves again first.

Lines (a 1-simplex) have two points, so we’ll describe control points as P_{xy}. If we want to make a degree N curve, our control points will be all the x,y pairs that add up to N. Here are the control points for the first few degrees of Bezier curves.

  1. Linear: P_{10}, P_{01}
  2. Quadratic: P_{20}, P_{11}, P_{02}
  3. Cubic: P_{30}, P_{21}, P_{12}, P_{03}

You might notice that a degree N curve has N+1 control points. Linear curves have 2 control points, quadratic curves have 3 control points, cubic curves have 4 control points and so on.

The index values on these control points tell you where they are on the line, and they also tell you the power of each barycentric coordinate they have, which we will get to in a minute.

The last part of making the formula is that the control point terms are also multiplied by the Nth row of Pascal’s triangle where row 0 is at the top row. (Pascal’s triangle image from https://brilliant.org/wiki/pascals-triangle/)

Doing that, you have everything you need to create the familiar Bernstein basis polynomial form of Bezier curves.

The image below puts it all together. Orange is the name of the control point, which you can see also describes where it is on a line. The index values of the control point also describe the power of the Barycentric coordinates s and t which are in green. Lastly, the row of pascal’s triangle is in blue. To get the formula, you multiply each of those three things for each control point, and sum them up.

The Formula For Bezier Triangles

You can follow the same steps as the above for making the formula for Bezier triangles, but you need to use Pascal’s pyramid (aka trinomial coefficients) instead of Pascal’s triangle for the control points. (image from https://en.wikipedia.org/wiki/Pascal%27s_pyramid)

This time, we have 3 index numbers on control points instead of the 2 that curves had, but we still find all permutations which add up to N.

Here are the control points for the first few degrees of Bezier triangles.

  1. Linear: P_{100}, P_{010}, P_{001}
  2. Quadratic: P_{200}, P_{110}, P_{101}, P_{020}, P_{011}, P_{002}
  3. Cubic: P_{300}, P_{201}, P_{210}, P_{120}, P_{111}, P_{102}, P_{030}, P_{021}, P_{012}, P_{003}

Below are diagrams in the same style as the last section, which show the equation for a Bezier triangle and how to come up with it.

This pattern works for any Bezier simplex.

Use For Data Lookup On the GPU

If you wanted to use this for data lookup, you’d first have to fit your data with a Bezier triangle mesh, using whatever degree you wanted to use (quadratic or cubic are the seemingly likely choices!). The fit would involve not only finding the right values to give at control points, but also the right location to put the vertices of the triangles, so that you could make the mesh sparser where you could get away with it, and denser where you needed it.

Doing that fit sounds challenging, but I bet doing gradient descent of vertex positions and control points could be a decent place to start. Simulated annealing probably would help too.

Once you have your data fit, you need to figure out how you are storing it. Linear Bezier triangles only need data stored per vertex, but quadratic ones store data per edges too, and cubic ones store data per triangle. Beyond cubic, no special class of storage is needed.

Storing data per vertex and per triangle is easy, because that’s the paradigm we work with already in graphics (and raytracing specifically). The per edge data is harder. The reason that it’s harder is that to be fast, you want it to be a simple array, where you index it by the two neighboring triangles of the edge. A 16 bit index for triangle A and 16 bit index for triangle B means a 32 bit index into the edge data array. The problem is, that data is extremely sparse (each triangle index will connect to just 3 other triangle indices. The rest of the entries will be empty / useless!), so making a simple array is a lot of wasted space, and frankly is just a lot of space in general.

One idea is to store edge information in the per triangle data, even though it’s redundant (neighboring triangles would have duplicate info for the shared edge). It still would be a simple lookup.

Another idea is to store 3 edge indices in each triangle and use that to look up into an edge array for the per edge data. The memory usage here is minimized to only what is needed, but it’s a double dereference, so is pointer chasing on the GPU which doesn’t sound very nice.

Another idea would be to use minimal perfect hashing (https://blog.demofox.org/2015/12/14/o1-data-lookups-with-minimal-perfect-hashing/). What that would allow you to do is get a hash function where you could put in the triangle index and which edge you were querying for (0, 1 or 2) it would give you as output the index into an edge array to get the data from. This is different from the last one because you aren’t reading an index for the edge, so is one less dependent data lookup.

Once you have your mesh fit, and you have a way to look up the data needed to evaluate a point on your Bezier mesh, assuming your mesh is oriented on the X,Z plane, with a Y value of 0 (the “height” data being stored in a different data channel), you can shoot a ray from (x,y,1) to (x,y,-1) for the (x,y) point you want to do a data lookup for. The ray will hit a triangle, give you barycentric coordinates, the triangle id, and enough info for you to read the control points for that triangle.

From there, you do your calculations and get your result. All done!

Is This a “1D” Bezier Triangle?

You might have noticed that with things as I’ve described them, control points can only be moved up and down on the vertical axis, but not on the horizontal plane.

That is true and this is a simpler form of a more general Bezier triangle.

If you want control points that you can move in any direction in 3d, all you need to do is make one Bezier triangle that you evaluate to get an X axis value, another to get a Y axis value, and another to get a X axis value.

Using it for data lookup seems to go out the window, but for other usage cases, this is probably useful knowledge.

How Did You Render The Surfaces?

TL;DR I use ray marching, but keep reading for more details.

How the 3d surfaces were rendered (the red/black checkerboard 3d images) is nearly out of scope of this post, but someone asked so I thought I’d share. Feel free to “view source” on that demo. It’s written in webgl so should be pretty easy to follow the shader etc.

First, i get a ray starting position and direction for each pixel of the image.

I next intersect that ray against a triangular prism shape (the green background you see behind the object). I do that by intersecting the ray by each plane that makes up that shape, and keeping track of the min and max hit time of the ray. If the min > the max, i know the ray missed the bounding shape and i can just draw black.

Next up, I calculate if the min time of the intersection is above or below the triangle bezier surface. I do this by getting the (X,Y,Z) position of the ray at that time, using (X,Z) to calculate Barycentric coordinates, using those to evaluate the triangle and see if the triangle’s height at that point is above or below the ray’s height (Y axis value).

Next, i take up to 20 steps from the start time to the end time, stopping when either the “aboveness” of the ray changes (meaning it intersected the surface) or if i get to the end of the 20 steps and nothing was hit.

If the ray intersected the surface, i check how much above the surface the ray was on the previous step vs where it is in the current step, treat it as a line (make a y=mx+b function out of these data points!) and solve for zero. I take that as my intersection point.

I next calculate the gradient of the surface using finite differences (https://blog.demofox.org/2015/08/02/finite-differences/), forward differences specifically, and use that to get a surface normal for shading.

Some improvements that could be done here…

  1. It’s possible to get the gradient of a Barycentric triangle analytically, and isn’t that difficult since it’s just a polynomial. Doing that instead of finite differences would make a more accurate and better gradient.
  2. Instead of taking 20 fixed sized steps, I could use the gradient to get a distance estimate to be able to take larger steps down the ray when it’s farther from the surface (more info here: https://iquilezles.org/www/articles/distance/distance.htm)
  3. If this needed to run faster, the step count could be dropped down to a smaller number. If you did that though, you’d start to notice “stairs” and other artifacts. A way to combat these visual problems would be to use a screen space blue noise texture as a “random number” from 0 to 1 to push each ray down that percentage of a step away from it’s starting point. That would turn the visual artifacts into a more pleasant blue noise pattern – the error would be evenly distributed and be much harder to notice. If you animated that noise well over time, it would look even better. If you did this under TAA, and possibly switched to interleaved gradient noise (something to try, not a slam dunk for sure always better thing IMO), it would look very nice. (more info on animating noise over time https://blog.demofox.org/2017/10/31/animating-noise-for-integration-over-time/ and also the follow up post)
  4. Another thing to try would be to not do evenly spaced sampling down the ray from min to max time. Uniform sampling is not great. For best results, you’d want to distribute the sampling locations in a 1d blue noise pattern so that it evenly distributed the error and had low aliasing to give you the most pleasant result. This would be hard to be compatible with some of the other options, but is a nice thing to try if not doing them.

Extensions

There are lots of extensions that you could do to these Bezier triangles.

One extension would be to bring this to volumetric data with Bezier tetrahedrons. The result when using it for data meshes would be a tetrahedral mesh. To do a look up for a specific (x,y,z), you would shoot a ray from that point in any direction. The triangle you hit would contain 3 of the 4 points of the tetrahedron. You could shoot a ray in the opposite direction to get another triangle which had the 4th, or you could have data stored for that triangle index which says “if you hit this triangle from the front, the fourth vertex is M, if you hit this triangle from the back, the fourth vertex is N”, which means you just need to do a dot product of the ray and the triangle normal to figure out the fourth vertex, instead of doing another ray intersection against the BVH.

An easy but powerful extension would be that you could calculate the value of a point in two different Bezier triangles and divide the values. This would give you a rational surface which can represent a LOT more shapes than an integral surface can. For example, a quadratic Bezier curve can’t make conic sections (circles, etc), but a rational quadratic Bezier can make them, exactly.

Making it rational is just the R in NURBS, so other extension ideas are… NU = non uniform control points. R = rational instead of integral. BS = B-Spline.

Another extension that you could do is not limit the shape to being inside the simplex. For instance, here’s a cubic Bezier triangle where i let the equations go outside of the triangle, but am still limiting it to a cube. The math is totally fine with this. No division by zeros or any other badness.

Lastly, it turns out you can extend the concept of Barycentric coordinates to convex shapes that aren’t a simplex, such as a pentagon, or a dodecahedron. In the case of a quadratic Bezier pentagon, you would evaluate 5 linear pentagons – one for each corner of the pentagon – and then combine those with another linear Bezier pentagon interpolation. I’m not sure when this would be useful, but it’s definitely possible.

Check out this link for more info about generalized Barycentric coordinates:
https://en.wikipedia.org/wiki/Barycentric_coordinate_system#Generalized_barycentric_coordinates

Links

There are a lot of links in this post, but I found this page to be really helpful for understanding some of the details of Bezier triangles.
https://plot.ly/python/v3/ipython-notebooks/bezier-triangular-patches/

Calculating Information Entropy

This post has an overview of information entropy then moves onto some technical details and experiments.

If you have any comments, questions, important additions, etc, please hit me up by commenting on this post, or on twitter at https://twitter.com/Atrix256.

The code that goes with this post is at https://github.com/Atrix256/Entropy/

Thanks to all the people on twitter who answered questions & helped me learn this stuff. Links to threads and resources are at the bottom of the post!

Information Entropy Overview

Information entropy is a measure of how much information there is in some specific data. It isn’t the length of the data, but the actual amount of information it contains.

For example, one text file could contain “Apples are red.” and another text file could contain “Apples are red. Apples are red. Apples are red.”. The second text file is 3 times longer than the first but doesn’t really give any extra information. These two text files would have nearly the same amount of information entropy, despite them having different lengths.

It isn’t really possible to calculate the actual amount of information entropy a specific piece of data has in it because it’s related to Kolmogorov complexity: the length of the shortest program that can generate the data (https://en.wikipedia.org/wiki/Kolmogorov_complexity), which turns out to be an incalculable value. That’s why i said that those two text files would have nearly the same amount of information entropy instead of saying they had the same amount. A program to generate the second text file is probably going to be very slightly more complex than the text file to generate the first one: it’ll have a for loop on the outside!

You can get a pretty good upper bound on information entropy though by making a histogram of how often each symbol appears, treating that histogram like a weighted N sided dice (a probability mass function or PMF), and calculating the entropy of that dice.

Interestingly, Huffman compression is related to this (https://en.wikipedia.org/wiki/Huffman_coding). In Huffman compression, you take a histogram of the data, and give shorter length bit patterns to the symbols that occur more. This makes the overall data smaller, but doesn’t change the amount of information there is in the data. In fact, it’s part of a family of compressors called “Entropy encoders” (https://en.wikipedia.org/wiki/Entropy_encoding).

Because of this, you can get a pretty good idea of how much information is in some data by how much you can compress it. If it shrinks a lot, the information density was low. If it doesn’t shrink much, or even gets larger (which happens when the compressed data plus the compression header is larger than the original data!), then the information density is pretty high.

However, there are counter examples to that. Encrypting data hides the patterns of the data, making it look more like uniformly distributed white noise (independent random numbers) while keeping the data the same size. Encrypting data doesn’t actually change the amount of entropy in the data, but it changes the APPARENT amount of entropy. It makes the “shortest program that can generate the data” harder to find, and in fact, encrypted data will not compress!

This actually feels opposite to some things i’ve read about machine learning type algorithms. In machine learning, finding the solution to a problem involves finding the global minimum, and you want a global minimum to be wide and flat so that it’s easy to find and hard to fall out of. There are techniques to make this be more true, such as the SVM kernel trick, or neural networks which make the problem space be higher dimensional and be saddle shaped to help not get stuck in local minima. Encryption seems to do the opposite, making the minimums (global OR local!) of the Kolmogorov complexity be very hard to find, like a dirac delta, where there is no hint where they are so that you have to know where they are in advance (aka you have to know the encryption key!) to be able to find the minimum.

Before moving onto details and experiments, here is a really cool read “Entropy and Redundancy in English” which talks about how Claude Shannon calculated letters in the english language to have about 2.62 bits of information per letter.
https://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm

Interestingly, that last link also talks about something I wrote up previously, which is using markov chains to generate text.
https://blog.demofox.org/2019/05/11/markov-chain-text-generation/

There is also a really interesting tweet from Donald Mitchell (who is famous for research which includes Mitchell’s best candidate algorithm for generating blue noise sample points!), which talks about encryption and compression and generating valid looking language. It’s a very cool tangle of ideas I’m still sorting out in my head 🙂

That “transforms English sentences to other English sentences” part also sounds a lot like format preserving encryption.

The Formula

To calculate information entropy, you need to calculate the entropy for each possible event or symbol and then sum them all up.

To calculate the entropy of a specific event X with probability P(X) you calculate this: -P(X)*log_2(P(X))

As an example, let’s calculate the entropy of a fair coin.

The probability of heads is 50%. Here’s the entropy we get when plugging that 0.5 into the equation:
E_{\text{heads}} = -0.5*log_2(0.5) \\ E_{\text{heads}} = -0.5*-1 \\ E_{\text{heads}} = 0.5
Since tails has the same probability it has the same entropy as heads (which is 0.5) so we add the entropy of heads and tails to get 1 bit of entropy for a fair coin flip. You can alternately say it’s 1 Shannon of entropy, and if you use different log bases, there are other unit of measurements. The important take away though is that a fair coin gives 1 bit of entropy.

What if the coin isn’t fair though? Here’s the entropy if there’s a 75% chance of heads and a 25% chance of tails.
E_{\text{heads}} = -0.75 * log_2(0.75) \\ E_{\text{heads}} = 0.311 \\ E_{\text{tails}} = -0.25 * log_2(0.25) \\ E_{\text{tails}} = 0.5 \\ E = 0.311 + 0.5 = 0.811
So, in this case, each flip of this unfair coin only gives you 0.811 bits of information. It kind of makes sense because each flip you are pretty sure it’s going to be heads, so when it is heads, it doesn’t give you a whole lot of information.

What if the coin somehow is incapable of landing tails. Maybe it is just a 2 headed coin.
E_{\text{heads}} = -1 * log_2(1) \\ E_{\text{heads}} = 0 \\ E_{\text{tails}} = 0 * log_2(0) \\ E_{\text{tails}} = 0 \\ E = 0 + 0 = 0
Flipping this maximally unfair coin gives us zero bits of information which makes sense. We know that it will always land heads so the coin flip gives us absolutely no new information.

This equation works the same way no matter how many events or what their probabilities are. It works with 6 sided dice. It works with byte values in a text file, it also works with bits in a text file (going back to heads or tails).

When you have N discrete symbols or events, and a probability of encountering each, it’s called a probability mass function or PMF, which again, can be made from a histogram.

Interestingly, entropy calculations have a full suite of calculations similar to probability calculations. There is joint entropy, conditional entropy, the chain rule of entropy and even a Bayes rule of entropy!

https://en.wikipedia.org/wiki/Joint_entropy
https://en.wikipedia.org/wiki/Conditional_entropy

Text Experiments

The C++ code and source data for these experiments can be found on github at https://github.com/Atrix256/Entropy

We are going to start with a 26KB text file of “The Last Question” by Isaac Asimov. (from https://hell.pl//szymon/Baen/The%20best%20of%20Jim%20Baens%20Universe/The%20World%20Turned%20Upside%20Down/0743498747__18.htm)

We can make a histogram of how many times each 8 bit letter occurred, then turn that into a PMF by calculating the percentage occurrence of each character. From there we can calculate the entropy and divide by 8 to get the entropy per bit. Doing that we get the value 0.582. So, there are 0.582 bits of information entropy per bit of data in that text file. (Technically: or less)

If we compress it with the standard zip file compressor in windows, making an 11KB zip file, then do the same to that file, we get a value of 0.962 bits of information per bit of data in that text file. The file shrank to 42.3% of the size (the old file is 2.36 times larger) and we got 1.65 times as much information entropy.

If instead of compressing the text file, we encrypt it with openssl (command: openssl enc -aes-256-cbc -salt -in lastquestion.txt -out lastquestion.enc -pass pass:moreentropyplease), then calculate the entropy of that file, we get a value of 0.926 bits of information entropy per bit of data.

It’s weird to me that the encrypted data reports a lower entropy than the compressed data, and i’d expect the opposite. The issue might be that our calculation of entropy is naive, and that a more robust entropy calculation would show the encrypted data to have more entropy.

An example of being less naive would like making a histogram of 4 bit, 6 bit, 12 bit values etc instead of only 8. Another thing to do would be to try calculating entropy of re-arrangements of the data. Ultimately, you would report the lowest entropy seen from different “views” of the data.

I’m not sure if that’s what’s at play in the entropy between encrypted and compressed data though.

Interestingly, if you try to compress the encrypted data, it doesn’t really compress and stays at 26KB. The encryption has hidden the structure of the source data pretty well! It now reports an information entropy per bit of 0.923 which is very slightly down from 0.926 bits. It might be the added useless compression header causing the information entropy to go down.

What if we take the zip file of the plain text file and base 64 encode it? We are taking the same data and encoding it in fewer 8 bit symbols. That makes the file larger without changing the amount of information in the data. It’s like the reverse of compression so we’d expect the amount of information entropy per bit to go down and that is what happens! The file increases from 11KB to 14KB and we get an information entropy per bit value of 0.749, which is down from 0.962.

Noise Distributions

What if we look at the entropy of uniform white noise (plain vanilla random numbers)? Taking 100,000 white noise bytes, building a histogram of each 8 bit value, making a PMF and calculating entropy, we get 0.999733. That makes sense because white noise gives maximal entropy so we’d expect a value of 1 and got something very close.

In a recent post i talked about using dice to generate different colors and distributions of noises (https://blog.demofox.org/2019/07/30/dice-distributions-noise-colors/). Do different distributions of noise have different amounts of entropy?

First, let’s look at a triangle distribution vs a uniform distribution.

Rolling a 6 sided die gives us a uniform distribution with possible values 1 through 6. The entropy of a 6 sided die is 2.585 bits.

If we wanted a triangle distribution with possible values 1 through 6, we could roll a 4 sided die and a 3 sided die (strangely, they exist! https://www.amazon.com/Bescon-Polyhedral-3-sided-Gaming-Assorted/dp/B06XHF4Y6J) and add them to get values 2 through 7, then we could subtract 1 to get values 1 through 6.

We can make a histogram by looking at how many ways each possible number can come up, and turn them into percentages:

  • one= 1 = 8%
  • two = 2 = 17%
  • three = 3 = 25%
  • four = 3 = 25%
  • five = 2 = 17%
  • six = 1 = 8%

Summing the entropy of each of those percentages gives about 2.454 vs a uniform 6 sided die giving 2.585 bits. So, triangular distributed values have less entropy than uniform distributed values. That makes a bit of sense because we already knew that a fair coin gave more entropy than an unfair coin, and a triangle distribution is like an extension of an unfair coin to dice.

Noise Colors

Next, let’s look at the color of noise (frequency composition / correlation). Let’s use blue noise because that’s my favorite.

I made some floating point 1d blue noise data (with values between 0 and 1) using Mitchell’s best candidate algorithm (https://blog.demofox.org/2017/10/20/generating-blue-noise-sample-points-with-mitchells-best-candidate-algorithm/), and converted the floating point values to uint8.

If you look at a histogram of those uint8 values, it will show a flat histogram, meaning a uniform distribution and will give maximal entropy.

That isn’t quite right though, because we know that the values have randomized high frequency components and no low frequency components.

That means that each roll is based on the previous rolls, and so the rolls aren’t uncorrelated, and they aren’t independent. That means they should have less entropy.

One idea to capture this relationship is to take each pair of uint8s in the data, make them into a uint16, and make a histogram of those uint16s (Thanks Nathan! https://twitter.com/Reedbeta). If the uint8s were uncorrelated and evenly distributed, then the uint16s should be evenly distributed. That would mean that a value didn’t care about what it’s previous value was.

Since we know there is correlation in the blue noise, the result in our case shouldn’t be evenly distributed, and should result in less entropy than white noise.

That’s the result we get in fact, and so for 100,000 uint8 blue noise values, we get an entropy of 0.92608, where white noise gives 0.965912.

So interestingly, blue noise is more predictable and less random than white noise, which shouldn’t come as a huge surprise, since we know the PMF isn’t a uniform distribution at each new value.

Making a histogram of 16 bit values of the 8 bit stream like the above is equivalent to making an order 1 Markov chain and only has a “memory” of the previous value. You could make higher order Markov chains by making 24 or 32 bit values and beyond, to find correlation that went beyond immediate neighbors, but keeping a histogram of 2^24 or 2^32 values and larger is a pretty large amount of memory. You’d also need a lot more source data, to give the histogram a chance to stabilize / converge.

There might also be a delay of correlation, like each value could be correlated with the value that is 100 values ago. At one byte per value, you’d need an order 100 Markov chain to capture that relationship, and so would need a 2^800 bytes of memory which is astronomically large.

Other Info

If you look at the csv in the output folder of the repository for this post (https://github.com/Atrix256/Entropy), you’ll see some oddities.

One is how when looking at 11 bits at a time, the plain text has more entropy than when looking at it with 8 or 12 bits. The reason for that specific thing is that the data inherently is 8 bit, which is coprime to the 11 bits we are looking at it with, so makes the entropy look a lot higher by making it seem like there are more unique bit patterns than there are. Ultimately, when calculating entropy in a variety of ways, you should take the minimum value you find as the final answer, because it’s related to the shortest program you can reproduce it with. There are plenty of longer programs, but you only care about the shortest. You should also be careful not to get too crazy in whatever code you use to make your “view” of the data (re-arranging values / looking forward & backward a far amount) because that stuff is part of the program needed to reproduce the data, but isn’t represented in the entropy calculation you are doing.

Something else is that if you don’t have enough data, the histogram / PMF won’t converge to the actual values. You can see this manifesting in the csv data in the difference between the “small white noise” which is 8 bytes of white noise, and the “white noise” which is 100,000 bytes of white noise. They really do have the same characteristics and entropy amount per bit, but the small white noise doesn’t have enough data available to show the correct values.

Links

A very related, but more formalized, write up on information theory.
https://www.blackhc.net/blog/2019/better-intuition-for-information-theory/

A stack exchange question about how to calculate entropy of correlated samples.
https://math.stackexchange.com/questions/3428693/how-to-calculate-entropy-from-a-set-of-correlated-samples

Here’s a neat post that shows how to visualize entropy in a binary file, to find where encryption keys are since they are high entropy. Thanks Ashley! (https://twitter.com/ACEfanatic02)
https://corte.si/posts/visualisation/entropy/index.html

The “Breaking Math” podcast episode 24 is really good and related to this stuff:

“The Complexity Barrier” blog post by John Baez
https://johncarlosbaez.wordpress.com/2011/10/28/the-complexity-barrier/

A thread showing how Oskar Stålberg is bitbacking data very efficiently, so the data is not compressing very well at all.

Some related twitter megathreads:

The difference between uncorrelated and independent:

Bayes’ Theorem Intuition

Bayes’ theorem is a way to calculate the probability of something happening based on evidence that may affect that probability. It’s a formula for combining probabilities together when they might affect each other. It’s also good for updating probabilities when you get new data.

The formula is:

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

The formula reads: “The probability of A happening, when B has happened is equal to the probability of B happening when A has happened, multiplied by the probability of A happening, divided by the probability of B happening.

Bayes’ theorem is a mouthful and looks formidable, especially if you aren’t used to the notation. Believe it or not, Bayes thought this was such a trivial formula that he himself didn’t even share it with anyone. After his death, the formula was found in his notes and made public. (https://en.wikipedia.org/wiki/Bayes%27_theorem#History)

The goal of this post is to give you the intuition for this formula such that you feel as Bayes did, that this is so trivial it isn’t even worth having a name. Despite that, it is very cool and very useful, so thanks for that, ghost of Bayes, wherever you may be.

I learned the intuition in backwards order, because I didn’t know what to look for in advance. Lucky for you, you don’t have to learn it in backwards order!

Joint Probability

The probability of event A happening is written as P(A).

The probability of event B happening is written as P(B).

The probability of both events happening, is P(A \cap B) and this is called the joint probability.

If the two events are independent of each other – where one event happening has no bearing on the other event happening – you can calculate the joint probability by multiplying the probabilities together:

P(A \cap B) = P(A) \cdot P(B)

An example of independent (unrelated) events would be if A was “getting struck by lightning” and B was “enjoying the song wooly bully”

Conditional Probability

If the events are not independent, we have to turn to conditional probabilities.

A conditional probability is written like as P(A|B) and it means “What is the probability of A happening, if B has happened?”

If we are working with a conditional probability, we can still calculate a joint probability, but it is calculated differently:

P(A \cap B) = P(A|B) \cdot P(B)

Which reads “The probability of A and B happening is the probability of A happening if B has happened, multiplied by the probability of B happening”.

You can see how this calculates the joint probability because it’s still calculating the probability of both events happening. It just so happens that in this case, the two events are related.

An example of events that are not independent would be if A was “getting struck by lightning” and B was “climbing a power line in a storm”.

Putting It Together

Are you ready for the intuition?

Firstly, as we said above, the joint probability for conditional probabilities is calculated like this:

P(A \cap B) = P(A|B) \cdot P(B)

Secondly, you can re-order A and B in the joint probability. You are calculating the probability of both things happening so it doesn’t matter which event is first and which one is second in the equation.

We can write that like this:

P(A \cap B) = P(B \cap A)

And if you expand it out, it looks like this:

P(A|B) \cdot P(B) = P(B|A) \cdot P(A)

Thirdly and lastly, let’s divide both sides by the probability of B happening. This is just a simple algebra operation.

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

Oh my goodness, we derived Bayes’ theorem. What?!!!

Links

Joint Probability: https://www.investopedia.com/terms/j/jointprobability.asp

Bayes’ Theorem: https://www.mathsisfun.com/data/bayes-theorem.html?fbclid=IwAR2yh8rADGQHmjGISdN07souU9_h6lclvIH1eYYTwpHivF2Gb1zxXxINo50

An interactive visualization of Bayes’ things: https://seeing-theory.brown.edu/bayesian-inference/index.html

Bayesian Updating: http://www.statisticalengineering.com/bayesian.htm

Joint, Marginal and conditional probabilities: https://machinelearningmastery.com/joint-marginal-and-conditional-probability-for-machine-learning/

Bayesian Thinking: https://www.youtube.com/watch?v=BrK7X_XlGB8&feature=youtu.be&fbclid=IwAR2ViY4gKpkDHOKq7M1sc5kzZmgx77NRy7qqQ3fYOuV4nggQPDbiQ0xwxok

The Bayesian Trap: https://www.youtube.com/watch?v=R13BD8qKeTg&feature=youtu.be&fbclid=IwAR0YUwfpmPNdNvS2uwqmTnFBd_oQ2xOhqPxDHCiq00pnvBD8hP7Owj0JK9U