Calculating Discrete Sums With Umbral Calculus

Before we get into umbral calculus I want to talk about discrete calculus.

Discrete calculus lets you do calculus operations on discrete functions. These are functions like y=f(x) where x has to be an integer. Limits can no longer get arbitrarily close to 0 in discrete calculus, since the smallest difference between integers is 1. One consequence of that is that derivatives (aka deltas aka \Delta) are calculated using finite differences, commonly forward differences. Since the smallest possible difference is 1, the discrete derivative falls out naturally from the definition of the continuous derivative by setting h to 1.

\frac{d}{dx} f(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}

\Delta f = f(x+1) - f(x)

A nice thing is that there are lots of similarities between discrete calculus and regular calculus, such as the derivative of a constant being zero:

\frac{d}{dx} c = 0

\Delta c = 0

But there are plenty of differences. For example, the product rule in regular calculus is different from the one in discrete calculus. (Note: this is because the extra term has a multiplier of h that goes to zero in the limit in regular calculus, but is 1 in discrete calculus.)

\frac{d}{dx}(f(x)g(x)) = \frac{d}{dx}(f(x)) g(x) + f(x) \frac{d}{dx}(g(x))

\Delta(f(x)g(x)) = \Delta(f(x) )g(x) + f(x) \Delta(g(x)) + \Delta(f(x)) \Delta(g(x))

Here’s where umbral calculus comes in. It lets you turn a discrete calculus problem into a continuous calculus problem, and it lets you turn a continuous calculus problem into a discrete one.

In this post we are going to leverage umbral calculus to craft functions that sum discrete functions. For example, if you have a discrete quadratic polynomial f, you’ll get a cubic polynomial g that gives you the sum of f(n) for n ranging from 0 to x-1.

How To Do It

Firstly, we need a function that spits out the numbers we want to sum. The function input needs to be integers, but the function output doesn’t need to be.

Once we have our function, the process to get the function g(x) that calculates sums of f(x) is this:

  1. Turn the discrete function into a continuous function.
  2. Integrate the continuous function.
  3. Convert the result from a continuous function back into a discrete function.

If we use \Phi (phi) to denote converting a function from continuous to discrete, and \Phi^{-1} to denote converting a function from discrete to continuous, it looks like this mathematically:

g(x) = \Phi \int \Phi^{-1} f(x) dx

When you plug an integer x into g, you get out the sum of f(n) for all values of n from 0 to x-1.

If you want to get the sum of all values of f(x) from a to b, you can calculate that as g(b+1)-g(a). This works because if you wanted the sum of the numbers from 10 to 20, you could sum up all the numbers from 0 to 20, and subtract out the sum of the numbers from 0 to 9. This is also a discrete calculus definite integral.
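
Written as a formula, with g defined as above:

\sum_{n=a}^{b} f(n) = g(b+1) - g(a)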

That’s the overview, let’s talk about how to calculate \Phi and \Phi^{-1}.

Calculating Φ

Phi is the operator for converting continuous functions to discrete functions. Phi of a power of x is really easy to compute: it turns the power of x into a falling factorial of x, denoted x_n. So, x^n becomes x_n.

The oddity of this mysterious conversion is what gave umbral (shadow) calculus its name. It took a while for people to understand why this works and to formalize it. Until then they just knew it worked (although with an imperfect understanding there were counterexamples, so it didn’t always work), but they didn’t know why, so it was a bit mysterious.

x^n | \Phi x^n (aka x_n) | Expanded
x | x | x
x^2 | x(x-1) | x^2 - x
x^3 | x(x-1)(x-2) | x^3 - 3x^2 + 2x

The above makes it very easy to work with polynomials. A couple of sections from now there is a cheat sheet with a few other transformations I scrounged up.

Some other rules of the phi operator include:

\Phi(a f(x)) = a \Phi f(x)

\Phi(f(x) + g(x)) = \Phi f(x) + \Phi g(x)

\Phi f(x) \neq f(x) \Phi

The last one is meant to show that left and right multiplication are not the same thing, so ordering needs to be carefully considered and conserved.

Here are a couple of other useful properties, scraped from other sources 🙂

\Delta x_n = n x_{n-1}

\Delta \Phi \sin(x) = \Phi \cos(x)

\Phi e^{-x} = 0

(\Phi \sin(x))^2 + (\Phi \cos(x))^2 = \Phi e^x

\Delta 2^x = 2^x

The last one implies that in discrete calculus, e is 2. That is strange, isn’t it? If you work it out with a few values, you can see that it’s true!

This section is far from exhaustive! Check out the links at the bottom of this post for some more information.

Let’s apply phi to a polynomial as an example of how it works…

f(x) = 3x^2 + \frac{1}{2}x

\Phi f(x) = \Phi (3x^2 + \frac{1}{2}x)

\Phi f(x) = \Phi 3x^2 + \Phi \frac{1}{2}x

\Phi f(x) = 3 \Phi x^2 + \frac{1}{2} \Phi x

\Phi f(x) = 3 (x^2-x) + \frac{1}{2} x

\Phi f(x) = 3x^2 - 3x + \frac{1}{2} x

\Phi f(x) = 3x^2 - 2.5x

Calculating Inverse Φ

Calculating inverse phi is less straightforward – at least as far as I know. There may be a more formulaic way to do it that I don’t know of. I’ll show the inverse phis I know and how they are derived, which should hopefully help if you need others.

x

Firstly, the inverse phi of x is just x. That’s an easy one.

x squared

Next up is the inverse phi of x squared. We know what phi of x squared is from the last section, so let’s start with that.

\Phi x^2 = x^2 - x

Since x is the same as the phi of x, we can replace the last term with phi of x.

\Phi x^2 = x^2 - \Phi x

Now let’s get the naked x squared term on the left side, and all the phi terms on the right side.

x^2 = \Phi x^2 + \Phi x

Lastly, we’ll left multiply both sides by inverse phi, and we are done!

\Phi^{-1} x^2 = x^2 + x

x cubed

Let’s start with the phi of x cubed.

\Phi x^3 = x^3 - 3x^2 + 2x

Let’s turn 2x into 3x-x.

\Phi x^3 = x^3 - 3x^2 + 3x - x

Next we factor -3 out of the middle two terms.

\Phi x^3 = x^3 - 3(x^2 - x) - x

The middle term is 3 times phi of x squared, so we can replace it with that. The -x on the end outside of the parens is also just negative phi of x so we’ll replace that too.

\Phi x^3 = x^3 - 3 \Phi x^2 - \Phi x

Like last time, let’s get the phi terms on the right and the non phi term on the left.

x^3 = \Phi x^3 + 3 \Phi x^2 + \Phi x

And again like last time, we can now left multiply by inverse phi to finish!

\Phi^{-1} x^3 = x^3 + 3 x^2 + x

Cheat Sheet

Here is a table of phi and inverse phi transforms.

f(x) | \Phi f(x) | \Phi^{-1} f(x)
x | x | x
x^2 | x^2 - x | x^2 + x
x^3 | x^3 - 3x^2 + 2x | x^3 + 3x^2 + x
e^{ax} | (a+1)^x | ?
\sin(x) | \sqrt{2^x} \sin{\frac{x \pi}{4}} | ?
\cos(x) | \sqrt{2^x} \cos{\frac{x \pi}{4}} | ?
\ln(x) | H_{x-1} \text{ (Harmonic Number)} | ?

The coefficients of the polynomials made from powers of x are Stirling numbers. Phi relates to the signed Stirling numbers of the first kind, and inverse phi relates to the Stirling numbers of the second kind. That fact helps if you need to work with higher powers of x.
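
To make that concrete, the falling factorials and the powers of x are related through the two kinds of Stirling numbers, where s(n,k) are the signed Stirling numbers of the first kind and S(n,k) are the Stirling numbers of the second kind:

x_n = \sum_{k=0}^{n} s(n,k) x^k

x^n = \sum_{k=0}^{n} S(n,k) x_k

The first identity gives the expanded form of \Phi x^n, and the second gives \Phi^{-1} x^n if you replace each x_k with x^k. For example, S(3,1)=1, S(3,2)=3 and S(3,3)=1, which matches \Phi^{-1} x^3 = x^3 + 3x^2 + x in the table above.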

Example 1: f(x) = x

Let’s use what we know to come up with a function g(x) which sums all integers from 0 to x-1.

Our recipe for making g from f starts out like this:

g(x) = \Phi \int \Phi^{-1} f(x) dx

Our function is just f(x) = x, so we start at:

g(x) = \Phi \int (\Phi^{-1} x) dx

The inverse phi of x is just x.

g(x) = \Phi \int (x) dx

Integrating x gives us one half x squared. We can move the half outside of the phi.

g(x) = \frac{1}{2} \Phi x^2

We know the phi of x squared so we can plug it in and be done!

g(x) = \frac{1}{2}(x^2 - x)

If we plug in 1 to get the sum of all integers from 0 to 0, we get 0.

If we plug in 2 to get the sum of all integers from 0 to 1, we get 1. That is 0 + 1.

Plugging in 3 for summing 0 to 2, we get 3. That is 0 + 1 + 2.

Plugging in 4 gives us 6. That is 0 + 1 + 2 + 3.

Plugging in 5 gives us 10. Plugging in 10 gives us 45, 100 gives us 4950, and 200 gives us 19900. Plugging in a million gives us 499,999,500,000!

If we wanted to sum the integers between 5 and 9, we can do that too! We subtract g(5) from g(10), which is 45-10 or 35. That is 5+6+7+8+9.

So, we know that summing the integers between 100 and 199 is 14,950, since we have g(100) and g(200) above, which are 4950 and 19900 respectively.
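
Here is a tiny program (not from the post, just a sanity check) that compares g(x) = (x^2-x)/2 against a brute force loop summing f(n)=n for n from 0 to x-1:

#include <cstdio>

int main()
{
    for (int x = 1; x <= 10; ++x)
    {
        // brute force sum of f(n) = n for n in [0, x-1]
        int bruteForce = 0;
        for (int n = 0; n < x; ++n)
            bruteForce += n;

        // the closed form we derived with umbral calculus
        int g = (x * x - x) / 2;

        printf("x = %i: brute force = %i, g(x) = %i\n", x, bruteForce, g);
    }
    return 0;
}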

Example 2: f(x) = x^2

Let’s find the function to sum squares of integers.

g(x) = \Phi \int \Phi^{-1} x^2 dx

Applying inverse phi…

g(x) = \Phi \int (x^2 + x) dx

integrating…

g(x) = \Phi (\frac{1}{3}x^3 + \frac{1}{2}x^2)

Splitting it into two phi terms, and moving the fractions out of phi…

g(x) = \frac{1}{3} \Phi x^3 + \frac{1}{2} \Phi x^2

Applying phi…

g(x) = \frac{1}{3} (x^3-3x^2+2x) + \frac{1}{2} (x^2 - x)

g(x) = \frac{1}{6}(2x^3-6x^2+4x) + \frac{1}{6}(3x^2-3x)

g(x) = \frac{1}{6}(2x^3-3x^2+x)

Plugging in 2 gives 1, which is 0 + 1.

3 gives 5, which is 0 + 1 + 4.

4 gives 14, which is 0 + 1 + 4 + 9.

5 gives 30, which is 0 + 1 + 4 + 9 + 16.

if we want to get the sum of values for x between 2 and 4, we calculate g(5)-g(2), which is 30-1 = 29. That is 4+9+16.

It works!

Example 3: f(x) = x^2 - x/2

This last example is to show that while the function f has to take in integers, it doesn’t have to give out integers.

Let’s do the dance…

g(x) = \Phi \int \Phi^{-1} (x^2 - \frac{1}{2}x) dx

g(x) = \Phi \int (\Phi^{-1} x^2 - \frac{1}{2} \Phi^{-1} x) dx

g(x) = \Phi \int (x^2 + x - \frac{1}{2} x) dx

g(x) = \Phi \int (x^2 + \frac{1}{2} x) dx

g(x) = \Phi (\frac{1}{3} x^3 + \frac{1}{4} x^2)

g(x) = \frac{1}{3} \Phi x^3 +\frac{1}{4} \Phi x^2

g(x) = \frac{1}{3} (x^3-3x^2+2x) +\frac{1}{4}(x^2 - x)

g(x) = \frac{1}{12} (4x^3-12x^2+8x + 3x^2 - 3x)

g(x) = \frac{1}{12} (4x^3-9x^2+5x)

Plugging in 1 gives 0.

2 gives 1/2, which is 1 – 1/2.

3 gives 3.5, which is (1-1/2) + (4-1).

4 gives 11, which is (1-1/2) + (4-1) + (9-1.5).

5 gives 25, which is (1-1/2) + (4-1) + (9-1.5) + (16-2).

6 gives 47.5, which is (1-1/2) + (4-1) + (9-1.5) + (16-2) + (25-2.5).

It works!

Closing and Links

Finding formulas that sum ranges of discrete functions is pretty cool, but is it very useful?

Well, if you divide a sum by the count of values in the sum, that gives you an average.

That average is also equivalent to doing Monte Carlo integration using regularly spaced samples. I’m not a fan of regularly spaced samples, but for constant time sampling of arbitrary sample counts, I think I might make an exception!

How about converting problems between the continuous and discrete domain, is that very useful for other things? I’m not really sure.

Can this be extended to multi dimensional integrals? I bet so!

If you have more info about any of the above, I’d love to hear it!

This video is what first turned me onto this topic and it is quite good. I’ve watched it about 10 times, trying to scrape all the info out of it. https://www.youtube.com/watch?v=D0EUFP7-P1M

This video is also very good. https://www.youtube.com/watch?v=ytZkmV-7EbM

Thanks for reading!

Calculating the Similarity of Histograms or PDFs & Interpolating Them Using the p-Wasserstein Distance

The code that goes with this post is at https://github.com/Atrix256/OT1D.

The title of this post is a mouthful but the topic is pretty useful.

Let’s say that we have two histograms:

  • 1,2,3,1,0,0,0,0,0,0
  • 0,0,0,0,0,0,1,2,3,1

When you want to see how similar two histograms are, some common ways are:

  1. Treat the histograms as high dimensional vectors and dot product them. Higher values mean more similar.
  2. Treat the histograms as high dimensional points and calculate the distance between them, using the regular distance formula (aka L2 norm or Euclidean norm), taxi cab distance (aka L1 norm) or others. Lower values mean more similar.

Those aren’t bad methods, but they aren’t always the right choice depending on your needs.

  1. If you dot product these histograms as 10 dimensional vectors, you get a value of zero because there’s no overlap. If there are more or fewer zeros between them, you still get zero. This gives you no idea how similar they are other than they don’t overlap.
  2. If you calculate the distance between these 10 dimensional points, you will get a non zero value which is nice, but if you add or remove zeros between them, the value doesn’t change. You can see this in how the length of the vector (1,0,1) is the square root of 2, which is also the length of the vectors (1,0,0,1) and (1,0,0,0,1). It doesn’t really tell you how close they are on the x axis.

In this case, let’s say we want a similarity metric where if we separate these histograms with more zeros, it tells us that they are less similar, and if we remove zeros or make them overlap, it tells us that they are more similar.

This is what the p-Wasserstein distance gives us. When we interpolate along this distance metric from the first histogram to the second, the first histogram slides to the right, and this is actually 1D optimal transport. This is in contrast to us just linearly interpolating the values of these histograms, where the first histogram would shrink as the second histogram grew.

The title of this post says that we will be working with both histograms and PDFs (probability density functions). A PDF is sort of like a continuous version of a histogram, but it integrates to 1 and is limited to non negative values. In fact, if a histogram has only non negative values, and sums up to 1, it can be regarded as a PMF, or probability mass function, which is the discrete version of a PDF and is used for things like dice rolls, coin flips, and other discrete valued random events.

You can do similar operations with PDFs as you can with histograms to calculate similarity.

  1. Instead of calculating a dot product between vectors, you take the integral of the product of the two functions. Higher values mean more similar, like with dot product. \int_0^1 f(x)g(x) \, dx. This is a kind of continuous dot product.
  2. Instead of calculating a distance between points, you take the integral of the absolute value of the difference of the two functions raised to some power, with the result then taken to one over the same power: (\int_0^1 \left| f(x) - g(x) \right| ^n\, dx)^{1/n}. Lower values mean more similar, like with distance. This is a kind of continuous distance calculation.

Also, any function that doesn’t have negative y values can be normalized to integrate to 1 over a range of x values and become a PDF. So, when we talk about PDFs in this post, we are really talking about any function that doesn’t have negative values over a specific range.

p-Wasserstein Distance Formula

“Optimal Transport: A Crash Course” by Kolouri et al. (https://www.imagedatascience.com/transport/OTCrashCourse.pdf) page 46 says that you can calculate p-Wasserstein distance like this (they use different symbols):

(\int_0^1 \left| F^{-1}(x) - G^{-1}(x) \right| ^p \, dx)^{1/p}

That looks a bit like the continuous distance integral we saw in the last section, but there are some important differences.

Let me explain the symbols in this function. First is p which is where the p in p-Wasserstein distance comes from. It can be 1, 2, 1.5, or any other number. If using the value 2, you’d call it 2-Wasserstein distance, and it would be dealing with the squares of differences between these functions F^{-1}(x) and G^{-1}(x).

Let’s say that we have a PDF called f. We can integrate that to get a CDF or cumulative distribution function that we’ll call F. We can then switch x and y and solve for y to get the inverse of that function, or F^{-1}. If we have another PDF called g, we can follow the same process to get G^{-1}.

When we calculate the integral above, either analytically / symbolically if we are able, or numerically otherwise, we get the p-Wasserstein distance which has the nice properties we described before.

A fun tangent: if you have the inverse CDF of a PDF, you can plug in a uniform random number in [0,1] to the inverse CDF and get a random number drawn from the original PDF. It isn’t always possible or easy to get the inverse CDF of a PDF, but this also works numerically – you can make a table out of the PDF, make a table for the CDF, and make a table for the inverse CDF from that, and do the same thing. More info on all of that at https://blog.demofox.org/2017/08/05/generating-random-numbers-from-a-specific-distribution-by-inverting-the-cdf/.

Calculating 2-Wasserstein Distance of Simple PDFs

Let’s start off by calculating some inverse CDFs of simple PDFs, then we’ll use them in the formula for calculating 2-Wasserstein distances.

Uniform PDF

First is the PDF y=1, which is for uniform random values. It integrates to 1 for x between 0 and 1 and is positive in that range as well.

If we integrate that to get the CDF, we get y=x.

If we flip x and y, and solve again for y to get the inverse CDF, we get y=x.

Linear PDF

The next PDF is y=2x. It integrates to 1 for x between 0 and 1 and is positive in that range as well.

If we integrate that, the CDF is y=x^2.

Flipping x and y and solving for y gives us the inverse CDF y=\sqrt{x}.

Quadratic PDF

The last PDF is y=3x^2. It integrates to 1 for x between 0 and 1 and is positive in that range as well.

If we integrate that, the CDF is y=x^3

Flipping x and y and solving for y gives us the inverse CDF y=\sqrt[3]{x}.

2-Wasserstein Distance From Uniform to Linear

The 2-Wasserstein distance formula looks like this:

(\int_0^1 \left| F^{-1}(x) - G^{-1}(x) \right| ^2 \, dx)^{1/2}

Let’s drop the outer square root for now and we’ll square root the answer later. Let’s also replace F^{-1} with the uniform PDF’s inverse CDF, and replace G^{-1} with the linear PDF’s inverse CDF. We can drop the absolute value since the square does that for us.

\int_0^1 (x - \sqrt{x})^2 \, dx

Expanding the equation, we get:

\int_0^1 x^2 -2x \sqrt{x} + x  \, dx

Luckily we can break that integral up into 3 integrals:

\int_0^1 x^2 \, dx - \int_0^1 2x \sqrt{x}\, dx + \int_0^1 x  \, dx

https://www.wolframalpha.com/ comes to the rescue here where you can ask it “integrate y=x^2 from 0 to 1” etc to get the answer to each integral.

1/3 - 4/5 +1/2 = 1/30

Remembering that we removed the square root from the formula at the beginning, we have to square root this to get our real answer.

\sqrt{1/30} = \sqrt{30}/30 = 0.18257418583

2-Wasserstein Distance From Uniform to Quadratic

We can do this again measuring the distance between uniform and quadratic.

\int_0^1 (x - \sqrt[3]{x})^2 \, dx

\int_0^1 x^2-2 x \sqrt[3]{x} + \sqrt[3]{x}^2 \, dx

1/3 - 6/7 +3/5 = 8/105

Square rooting that gives:

\sqrt{8/105} = (2 * \sqrt{210})/105 = 0.27602622373

2-Wasserstein Distance From Linear to Quadratic

We can do this again between linear and quadratic.

\int_0^1 (\sqrt{x} - \sqrt[3]{x})^2 \, dx

\int_0^1 x - 2 \sqrt{x} \sqrt[3]{x} + \sqrt[3]{x}^2 \, dx

1/2 - 12/11 + 3/5 = 1/110

Square rooting that gives:

\sqrt{110}/110 = 0.09534625892

These results tell us that uniform and quadratic are most different, and that linear and quadratic are most similar. Uniform and linear are in the middle. That matches what you’d expect when looking at the graphs.

Calculating p-Wasserstein Distance of More Complicated PDFs

If you have a more complex PDF that you can’t make an inverted CDF of, you can still calculate the distance metric numerically.

You start off by making a table of PDF values at evenly spaced intervals, then divide all entries by the sum of all entries to make it a tabular form of the PDF (it sums to 1). You then make a CDF table where each entry is the sum of PDF values at or before that location.

Now to evaluate the inverted CDF for a value x, you can do a binary search of the CDF table to find where that value x is. You also calculate how far, percentage wise, the value is between the two entries it falls between, so that you have a better approximation of the inverted CDF value. As an alternative to a binary search, you could make an ICDF table instead.

If that sounds confusing, just imagine you have an array where the x axis is the array index and the y axis is the value in the array. This is like a function where if you have some index you can plug in as an x value, it gives you a y axis result. That is what the CDF table is. If you want to evaluate the inverted CDF or ICDF, you have a y value, and you are trying to find what x index it is at. Since the y value may be between two index values, you can give a fractional index answer to make the ICDF calculation more accurate.
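
Here is a sketch of that idea (hypothetical helper code, not the code from the post’s repo): it builds a normalized CDF table from evenly spaced samples of a PDF defined on [0,1], then evaluates the inverse CDF with a binary search plus a linear interpolation between neighboring entries.

#include <vector>
#include <algorithm>

// Build a CDF table by sampling pdf(x) at evenly spaced points on [0,1] and normalizing.
std::vector<double> MakeCDFTable(double (*pdf)(double), int numSamples)
{
    std::vector<double> cdf(numSamples, 0.0);
    double sum = 0.0;
    for (int i = 0; i < numSamples; ++i)
    {
        double x = (double(i) + 0.5) / double(numSamples);
        sum += pdf(x);
        cdf[i] = sum;
    }
    for (double& value : cdf)
        value /= sum; // normalize so the last entry is 1
    return cdf;
}

// Evaluate the inverse CDF: binary search for where y lands in the table, and return a
// fractional index (mapped back to [0,1]) for a better approximation.
double EvaluateICDF(const std::vector<double>& cdf, double y)
{
    auto it = std::lower_bound(cdf.begin(), cdf.end(), y);
    if (it == cdf.end())
        return 1.0;
    int index = int(it - cdf.begin());
    double y0 = (index > 0) ? cdf[index - 1] : 0.0;
    double y1 = cdf[index];
    double t = (y1 > y0) ? (y - y0) / (y1 - y0) : 0.0;
    return (double(index) + t) / double(cdf.size());
}

Feeding in the linear PDF y=2x with a few thousand samples, for example, gives an ICDF that closely matches the \sqrt{x} we derived symbolically earlier.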

Now that you can make an inverted CDF table of any PDF, we can use Monte Carlo integration to calculate our distance metric.

Let’s look at the formula again:

(\int_0^1 \left| F^{-1}(x) - G^{-1}(x) \right| ^p \, dx)^{1/p}

To calculate the inner integral, we pick a bunch of x values between 0 and 1, plug them into the equation and average the results that come out. After that, we take the answer to the 1/p power to get our final answer. The more samples you take of x, the more accurate the answer is.

\text{result}^p = \int_0^1 \left| F^{-1}(x) - G^{-1}(x) \right| ^p \, dx

Here is some C++ that implements that, with the ICDF functions being template parameters so you can pass any callable thing in (lambda, function pointer, etc) for evaluating the ICDF.

template <typename ICDF1, typename ICDF2>
float PWassersteinDistance(float p, const ICDF1& icdf1fn, const ICDF2& icdf2fn, int numSamples = 10000000)
{
    // https://www.imagedatascience.com/transport/OTCrashCourse.pdf page 45
    // Integral from 0 to 1 of abs(ICDF1(x) - ICDF2(x))^p
    // Then take the result to ^(1/p)
    double ret = 0.0;
    for (int i = 0; i < numSamples; ++i)
    {
        float x = RandomFloat01();
        float icdf1 = icdf1fn(x);
        float icdf2 = icdf2fn(x);
        double y = std::pow(std::abs((double)icdf1 - (double)icdf2), p);
        ret = Lerp(ret, y, 1.0 / double(i + 1));
    }
    return (float)std::pow(ret, 1.0f / p);
}

If both ICDFs are tables, that means they are both piecewise linear functions. You can get a noise free, deterministic calculation of the integral by integrating the individual sections of the tables, but the above is a generic way that will work for any (and mixed) methods for evaluating ICDFs.
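
As an example of calling it, here is the uniform vs linear case from earlier, passing the closed form inverse CDFs in as lambdas. This assumes the RandomFloat01() and Lerp() helpers used by the function above are defined, like they are in the post’s repo, and that <cmath> is included for std::sqrt.

float dist = PWassersteinDistance(2.0f,
    [](float x) { return x; },              // inverse CDF of the uniform PDF y = 1
    [](float x) { return std::sqrt(x); });  // inverse CDF of the linear PDF y = 2x
// dist should come out near the analytic answer of 0.18257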

Below is a table of the 2-Wasserstein distances calculated different ways. The analytic truth column is what we calculated in the last section and is the correct answer. The numeric function column is us running the above numerical integration code, using the inverted CDF we made symbolically in the last section. The last column “numeric table” is where we made a tabular inverted CDF by taking 10,000 PDF samples and making a 100 entry CDF, then using binary search to do inverse CDF evaluations.

 | Analytic Truth | Numeric Function | Numeric Table
Uniform To Linear | 0.18257418583 | 0.182568 | 0.182596
Uniform To Quadratic | 0.27602622373 | 0.276012 | 0.275963
Linear To Quadratic | 0.09534625892 | 0.095327 | 0.095353

For kicks, here are p values of 1 and 3 to compare against 2. Calculated using the “numeric table” method.

 | p=1 | p=2 | p=3
Uniform To Linear | 0.166609 | 0.182596 | 0.192582
Uniform To Quadratic | 0.249933 | 0.275963 | 0.292368
Linear To Quadratic | 0.083308 | 0.095353 | 0.103187

Calculating p-Wasserstein Distance of PMFs

If you have a random variable with discrete random values – like a coin toss, or the roll of a 6 sided die – you start with a PMF (probability mass function) instead of a PDF. A PMF says how likely each discrete event is, and all the events add up to 1.0. The PMF is probably already a table. You can then turn that into a CDF and continue on like we did in the last section. Let’s see how to make a CDF for a PMF.

When you evaluate y=CDF(x), the y value you get is the probability of getting the value x or lower. When you turn a PMF into a CDF, since it’s discrete, you end up with a piecewise function.

For instance, here is the PMF 0.1, 0.3, 0.2, 0.4.

The CDF is 0.1, 0.4, 0.6, 1.0.
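
In code, making the CDF is just a running sum of the PMF. Here’s a small sketch (not from the post’s code):

#include <vector>
#include <numeric>

// Turn a PMF into a CDF with a running sum.
std::vector<float> MakeCDF(const std::vector<float>& pmf)
{
    std::vector<float> cdf(pmf.size());
    std::partial_sum(pmf.begin(), pmf.end(), cdf.begin());
    return cdf;
}

// MakeCDF({ 0.1f, 0.3f, 0.2f, 0.4f }) gives { 0.1f, 0.4f, 0.6f, 1.0f }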

With a CDF you can now do the same steps as described in the last section.

Interpolating PDFs & Other Functions

Interpolating over p-Wasserstein distance is challenging in the general case. I’m going to dodge that for now, and just show the nice and easy way to interpolate over 1-Wasserstein distance.

First let’s see what happens if we interpolate two gaussian PDFs directly… by linearly interpolating PDF(x) for each x to make a new PDF. As you can see, the gaussian on the left shrinks as the gaussian on the right grows.

If we instead lerp ICDF(x) for each x to make a new ICDF, invert it to make a CDF, then differentiate it to get a PDF, we get this very interesting optimal transport interpolation:

For a second example here is a uniform PDF interpolating into Gaussian, by lerping the PDFs and then the ICDFs.

As a last set of examples, here is uniform to quadratic. First by lerping the PDFs, then by lerping the ICDFs.

It sounds like a lot of work interpolating functions by integrating the PDFs or PMFs to CDFs, inverting them to ICDFs, lerping, then inverting and differentiating, and it really is a lot of work.

However, like I mentioned before, if you plug a uniform random value into an ICDF function, it will give you an unbiased random value from the PDF. This means that in practice, if you are generating non uniform random numbers, you often have an ICDF handy for the distribution already; either symbolically or numerically in a table. That makes this whole process a lot simpler: plug your uniform random x value into the two ICDFs, and return the result of lerping them. That’s all there is to it to generate random numbers that are from a PDF interpolated over 1-Wasserstein distance. That is extremely simple and very runtime friendly for graphics and game dev people. Again, more simply: use one uniform random number to generate a value from each distribution, and interpolate between those two values.
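
As a concrete sketch, drawing a sample from the distribution interpolated by t over 1-Wasserstein distance looks like this, assuming RandomFloat01() exists and icdf1 / icdf2 are the inverse CDFs of your two distributions (hypothetical names):

// t = 0 gives samples from the first distribution, t = 1 gives samples from the second,
// and values in between give samples from the optimal transport interpolation of the two.
float SampleInterpolatedDistribution(float t)
{
    float x = RandomFloat01();              // one uniform random number...
    float value1 = icdf1(x);                // ...plugged into both inverse CDFs...
    float value2 = icdf2(x);
    return value1 + (value2 - value1) * t;  // ...and lerped
}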

If you are looking for some intuition as to why interpolating the ICDF does what it does, here are some bullet points.

  • PDF(x) or PMF(x) tells you how probable x is.
  • CDF(x) tells you how probable it is to get the value x or lower.
  • ICDF(x) tells you the value you should get, for a uniform random number x.
  • Interpolating PDF1(x) to PDF2(x) makes the values in PDF1 less likely and the values in PDF2 more likely. It’s shrinking PDF1 on the graph, and growing PDF2.
  • Interpolating ICDF1(x) to ICDF2(x) gets the values that the two PDFs would give you for probability x, and interpolates between them. If a random roll would get you a 3 from the first PDF, and a 5 from the second PDF, this interpolates between the 3 and the 5.

In this section we’ve talked about linear interpolation between two PDFs, but this can actually be generalized.

  1. A linear interpolation is just a degree 1 (linear) Bezier curve, which has 2 control points that are interpolated between. The formula for a linear Bezier curve is just A(1-t)+Bt which looks exactly like linear interpolation, because it is! With a quadratic Bezier curve, you have 3 control points, which means you could have three probability distributions you were interpolating between, but it would only pass through the two end points – the middle PDF would never be fully used, it would just influence the interpolation. You can do cubic and higher as well for more PDFs to interpolate between. Check this out for more info about Bezier curves. https://blog.demofox.org/2015/05/25/easy-binomial-expansion-bezier-curve-formulas/.
  2. A linear interpolation is a Bezier curve on a one dimensional 1-simplex aka a line with barycentric coordinates of (s,t) where s=(1-t). The formula for turning barycentric coordinates into a value is As+Bt which is again just linear interpolation. You can take this to higher dimensions though. In two dimensions, you have a 2-simplex aka a triangle with barycentric coordinates of (u,v,w). The formula for getting a value on a (linear) Bezier triangle is Au+Bv+Cw where A, B and C would be 3 different PDFs in this case. Higher order Bezier triangles exist too. Have a look here for more information: https://blog.demofox.org/2019/12/07/bezier-triangles/
  3. Lastly, Bezier surfaces and (hyper) volumes exist, if you want to interpolate in rectangular domains aka “tensor product surfaces”. More info on that here: https://blog.demofox.org/2014/03/22/bezier-curves-part-2-and-bezier-surfaces/.

To see how to linearly interpolate in other p-Wasserstein distance metrics, give this link below a read. I’ll be doing a post on optimal transport in higher dimensions before too long and will revisit it then.

https://pythonot.github.io/auto_examples/plot_barycenter_1D.html

Thanks for reading!

Rounding Modes For Integer Division

This is a super quick blog post on how to get different rounding modes with integer division: floor to get an integer smaller than or equal to the real value, round to get the integer that is closest to the real value, and ceiling to get an integer greater than or equal to the real value.

These operations are useful when you need to align memory allocations, calculate dispatch sizes for compute shaders, work with objects on a grid, and many other things.

Unsigned Integers

  • Floor: A / B
  • Round: (A+B/2) / B
  • Ceiling: (A+B-1) / B alternately: (A-1) / B + 1 (thanks @clift_m!)
void RoundTest(unsigned int a, unsigned int b)
{
	printf("%i / %i = %0.2f\n", a, b, float(a) / float(b));
	printf("floor = %i\n", a / b);
	printf("round = %i\n", (a + b / 2) / b);
	printf("ceiling = %i\n\n", (a - 1) / b + 1);
}

Running the above with a couple different values:
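
For example, RoundTest(7, 3) would print:

7 / 3 = 2.33
floor = 2
round = 2
ceiling = 3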

Signed Integers

Here are routines that handle signed integers, and a testing program for them, courtesy of @insouris.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 30

static int div_floor(int a, int b) { return (a ^ b) < 0 && a ? (1 - abs(a)) / abs(b) - 1 : a / b; }
static int div_round(int a, int b) { return (a ^ b) < 0 ? (a - b / 2) / b : (a + b / 2) / b; }
static int div_ceil(int a, int b) { return (a ^ b) < 0 || !a ? a / b : (abs(a) - 1) / abs(b) + 1; }

int main()
{
    for (int a = -N; a <= N; a++) {
        for (int b = -N; b <= N; b++) {
            if (!b)
                continue;

            const float f = a / (float)b;

            const int ef = (int)floorf(f);
            const int er = (int)roundf(f); // X.5 to X+1, or -X.5 to -X
            const int ec = (int)ceilf(f);

            const int of = div_floor(a, b);
            const int orn = div_round(a, b);
            const int oc = div_ceil(a, b);

            const int df = ef != of;
            const int dr = er != orn;
            const int dc = ec != oc;

            if (df || dr || dc)
                fprintf(stderr, "%d/%d=%g%s\n", a, b, f, (a ^ b) < 0 ? " (diff sign)" : "");

            if (df) fprintf(stderr, "floor: %d != %d\n", of, ef);
            if (dr) fprintf(stderr, "round: %d != %d\n", orn, er);
            if (dc) fprintf(stderr, "ceil: %d != %d\n", oc, ec);
        }
    }
    return 0;
}

Calculating SVD and PCA in C++

The last post (https://blog.demofox.org/2022/07/10/programming-pca-from-scratch-in-c/) talked about calculating PCA from scratch, and gave some C++ code which did just that.

The post mentioned that the code was somewhat naïve and shouldn’t be used in situations that matter, because the operations done for PCA were notoriously complex numerically and difficult to make robust. It said that a library called Eigen should be used if doing things like that in C++. This post does just that – it uses the Eigen library to compute SVD (Singular Value Decomposition), and also shows how to calculate PCA from SVD, which is a more robust way of doing it than I showed in the last post. Specifically, PCA from SVD doesn’t need to calculate the covariance matrix, which can be a source of numerical problems.

SVD is pretty closely related to PCA, with the main differences being that the formula is different, and data doesn’t need to be centered for SVD, while it does for PCA.

The 190 lines of C++ code that goes along with this post is at https://github.com/Atrix256/SVD. The code uses the Eigen library, which is a header only library.

Calculating SVD

This video does a great job of explaining how to calculate SVD. Feel free to give it a watch, but I’ll also explain below. https://www.youtube.com/watch?v=cOUTpqlX-Xs.

SVD takes as input a matrix (C) sized MxN (it doesn’t have to be square) and breaks it apart into a rotation (U), a scale (\Sigma), and another rotation (V^T). Any matrix can be decomposed this way:

C = U \Sigma V^T

The scaling matrix \Sigma has zeros everywhere except the diagonal. The values on the diagonal are called the “Singular Values” and they also describe how “important” each row of U and column of V^T is. Just like PCA, you can get rid of the ones with the lowest singular values and thus reduce dimensionality of the data. This is called “Truncated SVD”.

So how do you calculate SVD?

First you multiply your (possibly rectangular) matrix C^T with C to make the square matrix C^TC.

The eigenvectors of C^TC get put together as column vectors to make the matrix V.

The matrix \Sigma has the square roots of the eigenvalues of C^TC along the diagonal.

The matrix U is made by multiplying C * V and normalizing the columns.

That’s all there is to it!

C++ Code

Using Eigen to calculate SVD as described above is pretty straightforward:

Matrix2d C;
C <<
    5.0f, 5.0f,
    -1.0f, 7.0f;

// Calculate SVD as described in https://www.youtube.com/watch?v=cOUTpqlX-Xs
{
    Matrix2d CTC = C.transpose() * C;

    EigenSolver<Matrix2d> es(CTC);
    EigenSolver<Matrix2d>::EigenvalueType eigenValues = es.eigenvalues();
    EigenSolver<Matrix2d>::EigenvectorsType eigenVectors = es.eigenvectors();

    auto V = eigenVectors;

    MatrixXcd sigma;
    {
        sigma = MatrixXcd::Zero(eigenValues.rows(), eigenValues.rows());
        for (int i = 0; i < eigenValues.rows(); ++i)
            sigma(i, i) = sqrt(eigenValues[i].real());
    }

    MatrixXcd U = C * V;
    for (int i = 0; i < U.cols(); ++i)
        U.col(i).normalize();

    cout << "C = \n" << C << "\n\n";

    cout << "U = \n" << U.real() << "\n\n";
    cout << "sigma = \n" << sigma.real() << "\n\n";
    cout << "V = \n" << V.real() << "\n\n";

    cout << "U * sigma * VT = \n" << (U * sigma * V.transpose()).real() << "\n\n";
}

Running that gives this output. The final output of U*sigma*VT shows that putting the decomposed matrices back together gives you the original data.

While the Eigen library has interfaces to let you calculate eigenvalues and eigenvectors, it also has higher level functionality – including performing SVD itself! There are two ways to do SVD in Eigen… one is better for smaller matrices (JacobiSVD), the other is better for bigger matrices (BDCSVD). In this case, I use the one better for bigger matrices, because AFAIK it’s fine for smaller ones too, just a bit slower.

Matrix2d C;
C <<
	5.0f, 5.0f,
	-1.0f, 7.0f;

// Calculate SVD using the built in functionality
{
    BDCSVD<Matrix2d> svd(C, ComputeFullU | ComputeFullV);

    auto U = svd.matrixU();
    auto V = svd.matrixV();
    auto sigma = svd.singularValues().asDiagonal().toDenseMatrix();

    cout << "C = \n" << C << "\n\n";

    cout << "U = \n" << U << "\n\n";
    cout << "sigma = \n" << sigma << "\n\n";
    cout << "V = \n" << V << "\n\n";

    cout << "U * sigma * VT = \n" << U * sigma * V.transpose() << "\n\n";
}

Running that outputs this, which is the same as before, except the eigenvalues are in a different order, and one of the eigenvectors was multiplied by -1 (opposite sign, but still the same eigenvector).

Calculating PCA With SVD

To calculate PCA using SVD, all you need to do is run SVD and then multiply C with V. Here’s some code where we use the 3×3 Gaussian kernel with sigma 0.3 from the last blog post, and run PCA on it again.

// PCA!
{
    Matrix3d newC;
    newC <<
        0.002300, 0.043200, 0.002300,
        0.043200, 0.818000, 0.043200,
        0.002300, 0.043200, 0.002300;

    BDCSVD<Matrix3d> svd(newC, ComputeFullU | ComputeFullV);

    auto U = svd.matrixU();
    auto V = svd.matrixV();
    auto sigma = svd.singularValues().asDiagonal().toDenseMatrix();

    cout << "C = \n" << C << "\n\n";

    cout << "V = \n" << V << "\n\n";

    cout << "sigma = \n" << sigma << "\n\n";

    cout << "C * V = \n" << newC * V << "\n\n";

    cout << "Principle Direction = \n" << (newC * V).col(0).normalized() << "\n\n";
}

When we do that we get this. The sigma tells us that there is basically only one non zero component (the others are close enough to zero to be considered zero) and the principal direction matches what we found in the last post as well. Looks like it’s working!

SVD Of Image

Next up, we’ll take an image with size WxH, make a matrix that is Wx(3H) in size, and perform truncated SVD on it, with different numbers of singular values. The matrix is 3 times as tall because we are going to separate each row of RGB pixels into three rows of pixels; one row for each color channel. The code to do this is in the github repo, so go have a look and play around with it if you want to 🙂

First up is the old man from The Legend of Zelda 1 on NES. The original image is shown first, then 1 singular value, then 5%, 10%, 15%, 20%, 25% and 50% of the singular values.

Next up is a different image, but the same percentages of singular values in the same order.

Using Truncated SVD In Practice

In the code I set singular values to zero to truncate SVD, but in reality if you wanted to truncate to N dimensions, you would shrink the sigma matrix to be NxN, and you would shrink the U and V matrix both to have N columns. In Eigen, you can do this using the “conservativeResize(rows, cols)” function.

Whether you have truncated or not, the U matrix can be considered your encoded data, and you can multiply sigma by V transpose to make the decoding matrix. Multiplying U by the decoding matrix will give you back your data, or an approximation to it if you truncated.
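
Here is a sketch of what that might look like with Eigen (hypothetical code, not from the repo), truncating to N dimensions and building the decoding matrix. It assumes U, sigma and V are dense MatrixXd copies of the SVD results, with the singular values sorted from largest to smallest like Eigen returns them.

int N = 2; // number of dimensions to keep
U.conservativeResize(U.rows(), N);      // keep the first N columns of U
sigma.conservativeResize(N, N);         // keep the top left NxN block of sigma
V.conservativeResize(V.rows(), N);      // keep the first N columns of V

Eigen::MatrixXd decode = sigma * V.transpose();  // the NxD decoding matrix
Eigen::MatrixXd approx = U * decode;             // RxD approximation of the original data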

As a quick overview of what size of compressed data to expect, let’s say you have R rows of D dimensional data (an RxD matrix – R rows of data, and D columns of data).

When you SVD that, U will be RxR, Sigma will be RxR and V will be DxR. Multiplying sigma by V transpose to make the decoding matrix will be RxD. Multiplying U*Sigma*V transpose will be RxD.

Let’s say that you truncate the SVD data from being D dimensional to N dimensional. U will be RxN, Sigma will be NxN, and V will be DxN. Multiplying sigma by V transpose to make the decoding matrix will be NxD. Multiplying U*Sigma*V transpose will still be RxD.

When you have the U and sigma*V transpose matrices and you want to decode a single (scalar) value of data, you get the entire row from U that corresponds to your data’s R dimension, which will be N dimensional, and you get the entire column from the decoding matrix that corresponds to your data’s D dimension, which will also be N dimensional. You then dot product these vectors and have your decoded scalar value. This is useful if you want random access to your encoded data, such as in a shader on the GPU.
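
Continuing the sketch above, random access decoding of the value at row r, column d of the original data is a single dot product of two N dimensional vectors:

double value = U.row(r).dot(decode.col(d));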

Knowing where to truncate to can be done by looking at the sigma values and finding where they taper off enough to your liking. Here is a graph of some SVD’d data where I divided all the singular values by the sum of the singular values to normalize them, and then I graphed the sum of the singular values at or before each dimension. You can see that this data is well fit by 3 or 4 dimensions.

The last 2 columns in this data show “Explained Variance” and “Cumulative Explained Variance” which is another metric to use to see how good a fit is for a certain number of dimensions. You calculate explained variance by dividing sigma squared by the sum of all sigmas squared. For cumulative explained variance, you sum up the sigmas squared at or before the dimension you are looking at, divided by the sum of all sigmas squared.
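
Writing those last two out as formulas, with \sigma_i being the i’th singular value:

\text{explained variance}_i = \frac{\sigma_i^2}{\sum_j \sigma_j^2}

\text{cumulative explained variance}_k = \frac{\sum_{i \le k} \sigma_i^2}{\sum_j \sigma_j^2}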

An intuitive perspective to SVD for data compression is that if you have some data that is AxB in size, SVD will break it into three matrices: an AxA matrix, an AxA scaling matrix, and an AxB matrix. You can multiply the middle scaling matrix into either the left or the right matrix and then end up with only two matrices: one is AxA and the other is AxB. At this point, our data is larger than it used to be, but that scaling matrix told us how important each column of the first matrix / row of the second matrix is. We can remove those for any place that the scaling matrix is zero or close enough to zero to be considered zero. That gives us two matrices AxN and NxB. At this point, if (A*N + N*B) is smaller than (A*B) then we’ve compressed our data in a nearly lossless way.

We don’t have to stop here though, and can start trimming more rows and columns off to make the matrices be AxM and MxB, such that (A*M + M*B) is smaller than (A*B) by as much as we want – but we introduce error into the process and our compression is no longer lossless. The error we are adding is minimized in the “least squares” sense, so it will be many small errors instead of a few large errors, which is often desired in practice.

A matrix multiply is all that is needed to get your data back, but individual values can be read back easily as well (random access). You get the a’th row from the first matrix, and the b’th column from the second matrix. The vectors will both be of size M. You dot product them, and your value is “decompressed”.

If doing something like compressing animation, the first matrix might be indexed by time, and the second matrix might be indexed by bone id*3, having 3 entries for the position of a bone starting at that bone id*3 location in the matrix. Then it’s just a vec3 * mat3x3 multiply (or alternately 3 dot products) to get the position back.

Shorter Intuition & Practical Concerns (Added 10/2/22)

Here is some shorter intuition for using truncated SVD for data compression.

If you start with data that is W * H in size, this lets you make it into data that is (W + H) * N in size, where N is tuneable for quality vs size. The U matrix will be W * N in size, and the SigmaV matrix will be H * N in size.

To decompress a single value at (X,Y), you read an N sized vector from U, an N sized vector from SigmaV, and dot product them to get the value back.

The downside to this technique is the memory (texture) reads, where for each value you want to decompress you need to read 2*N values to dot product them together.

Closing And Links

The Eigen library is pretty neat. You can grab the latest zip from https://eigen.tuxfamily.org/ and after unzipping, the “Eigen” folder within the zip has the files you want to include in your program.

Compilation times are pretty long when including files from this library, which I found surprising. It’d probably be best to put the includes to Eigen within a static library that didn’t get changed often to avoid the compilation times. I also had to turn on /bigobj in MSVC after hitting an error regarding it.

Other than that, I’m a fan of Eigen and expect to use it when I need to do things like this, instead of writing from scratch, or calling into python scripts to do these numerical operations (hehe).

Here is a great use of SVD to compress PBR textures:
https://bartwronski.com/2020/05/21/dimensionality-reduction-for-image-and-texture-set-compression/

Here are some links I found useful when learning SVD:
https://jonathan-hui.medium.com/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491 
https://youtu.be/DG7YTlGnCEo
https://youtu.be/Jqg7JgCwQrk
https://towardsdatascience.com/simple-svd-algorithms-13291ad2eef2 
https://medium.com/swlh/truncated-singular-value-decomposition-svd-using-amazon-food-reviews-891d97af5d8d

Programming PCA From Scratch In C++

When running PCA (Principal Component Analysis) on a data set, it gives back a set of orthogonal axes for that data, as well as scalar values telling you the importance of each axis.

When sorting the axes by importance, the first axis is the axis that has the most variation of the data, so it is the best approximation of the data with a straight line. The second axis is perpendicular to the first and captures the most of the variation that the first axis missed. The third axis is perpendicular to those and captures the most of the remaining variation, and the same goes for the fourth, fifth and so on, up to the dimensionality of your data set.

While you can transform your data to use these axes and reconstruct the data perfectly, PCA is useful because you can decide to throw away some number of the least important axes, thus reducing the dimensionality of the data!

That can be useful for data analysis (reasoning about 2D data is a lot easier than 4D data!), but it can also be a form of data compression, if the error introduced by throwing away one or more dimensions is acceptable.

PCA itself is pretty easy to explain, but to implement it yourself, you need to be able to calculate the eigen values and eigen vectors of a matrix. Those operations are notoriously difficult to do in robust ways, so the common wisdom is to not implement them yourself, but to use software like python or libraries like Eigen (https://eigen.tuxfamily.org/) to do the hard bits for you.

To crack open the black box a bit though, I implemented it in plain C++. It’s about 800 lines including comments, generated all the data in this blog post, and you can check it out here: https://github.com/Atrix256/PCA.

Calculating PCA

To calculate PCA you…

  1. Center the data
  2. Calculate the covariance matrix
  3. Calculate the covariance eigen values and eigen vectors
  4. Transform the data
  5. Recover the data

Let’s go through this step by step.

1. Center the Data

For this you just calculate the average of your data and subtract the average from all data points.

This makes the data centered on the origin which is important for PCA since it is meant to find the axes of importance in the data – NOT fit a line to the data.

Later on if you want to recover your uncentered data from the PCA encoded data, you’ll need to keep this average around to add it back in. Something to keep in mind if thinking about using PCA for data compression.
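
Here is a sketch of this step (hypothetical helper code; the post’s repo has its own version) which centers a set of N dimensional data points and returns the average so it can be added back later:

#include <array>
#include <vector>

template <size_t N>
std::array<double, N> CenterData(std::vector<std::array<double, N>>& points)
{
    // calculate the average of the data
    std::array<double, N> average = {};
    for (const auto& point : points)
        for (size_t i = 0; i < N; ++i)
            average[i] += point[i] / double(points.size());

    // subtract the average from every data point
    for (auto& point : points)
        for (size_t i = 0; i < N; ++i)
            point[i] -= average[i];

    return average;
}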

2. Calculate the Covariance Matrix

If you have N dimensional data, the covariance matrix will be NxN.

Each cell of the covariance matrix i,j will contain the covariance value between all the data in dimension i and dimension j. Since the covariance of a variable with itself is the variance, the diagonal of the matrix will have the variance values of the dimensions.

As a recap, covariance is calculated as:

cov(X,Y) = \mathbb{E}(XY) - \mathbb{E}(X) * \mathbb{E}(Y)

Variance is just covariance of a value with itself, so is calculated as:

var(X) = \mathbb{E}(X^2) - \mathbb{E}(X)^2

If your data represents a sample of the data instead of the entire data, you’ll need to use sample variance instead, which divides the sums by (n-1) instead of dividing by n.
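
Continuing the sketch from the last step (again hypothetical helper code, using the same headers), here is the covariance matrix built from already centered data, dividing by n rather than n-1:

// Since the data is centered, E(X) and E(Y) are zero, so cov(i,j) is just E(X_i * X_j).
template <size_t N>
std::array<std::array<double, N>, N> CovarianceMatrix(const std::vector<std::array<double, N>>& centered)
{
    std::array<std::array<double, N>, N> cov = {};
    for (const auto& point : centered)
        for (size_t i = 0; i < N; ++i)
            for (size_t j = 0; j < N; ++j)
                cov[i][j] += point[i] * point[j] / double(centered.size());
    return cov;
}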

3. Calculate the Covariance Eigen Values and Eigen Vectors

This is the part that is less easy, where we calculate the eigen values and eigen vectors of the covariance matrix. The eigen vectors are the axes of importance we mentioned before, and the eigen values are how important each one is.

First we need to calculate the eigen values.

I calculate them using the “QR Algorithm” (https://en.wikipedia.org/wiki/QR_algorithm) which is an algorithm that gives you approximate eigen values. It works by repeatedly doing QR decomposition (https://en.wikipedia.org/wiki/QR_decomposition) on the matrix and setting the matrix to R*Q for the next iteration. You do this process until a certain number of iterations has occurred, or all non diagonal elements of the matrix are close enough to zero for your liking. When finished, the approximate eigen values are the diagonal values of the matrix.

I loop until the largest non diagonal value is less than 0.0001, but didn’t play around with this very much. You can see this code at https://github.com/Atrix256/PCA/blob/main/vecmath.h#L321.

As part of the QR decomposition, I use the Gram-Schmidt process (https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process) to make orthogonal vectors.

There are other ways to get eigen values, and other choices you can make in the QR algorithm. I didn’t implement shifting, which helps convergence, and I’ve heard that Householder reflections are much better quality, and not that much more difficult to implement.

In any case, this works well enough for me to get good enough eigen values in my test cases.

Now that we have the eigen values, we need to get the eigen vector associated with each one. To do that, I use the inverse iteration algorithm (https://en.wikipedia.org/wiki/Inverse_iteration) which gives you an eigenvector associated with an approximate eigen value. It’s a pretty simple algorithm that just inverts a matrix and then repeatedly puts a randomly initialized vector through it. I do that for 10,000 iterations. The code for it is at https://github.com/Atrix256/PCA/blob/main/vecmath.h#L398.

I also normalize the eigen vectors and re-calculate the eigen values from the newly found eigen vectors to make them match up better. I had to switch from floats to doubles at one point as well, but I may have just been doing something silly, like letting numbers get too small without clamping them to zero.

4. Transform the data

Sorting the eigen vectors from largest to smallest, you can then make them be columns of a matrix. This is the matrix you use to convert your data into “PCA space”. To reduce dimensionality you can either throw away columns from the right at this point, or you can wait until after transforming the data into the new space. It’s up to you to decide how many columns you can get away with throwing away for your usage case.

5. Recover the Data

To recover your data from “PCA space”, you make the eigenvectors for the dimensions that you didn’t throw out as rows in a matrix instead of columns. Multiply the PCA data by this matrix to get the data back minus the information contained in the lost columns.
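
As formulas, if X is the centered data with one data point per row, and W is the matrix with the sorted (and possibly truncated) eigenvectors as columns, then the transform and recovery described above are:

Y = X W

X \approx Y W^T

The approximation becomes exact if no eigenvectors were thrown away.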

Not too bad right? I want to re-iterate that while this code works for my handful of test cases on this blog post, it almost certainly is not production ready. It’s just a semi naive implementation for learning purposes.

Result: Gaussian Blur Kernel

First let’s PCA a 3×3 gaussian blur kernel with a sigma of 0.3.

The filter itself is:

[0.002300, 0.043200, 0.002300]
[0.043200, 0.818000, 0.043200]
[0.002300, 0.043200, 0.002300]

The eigen values and eigen vectors are:

[0.134147] [0.052641, 0.997225, 0.052641]
[0.000000] [0.000000, 0.000000, 0.000000]
[0.000000] [0.000000, 0.000000, 0.000000]

Interestingly, you can see that there’s only one non zero eigenvalue. That means that our data is in fact a rank 1 matrix in disguise! In other words, all the rows are just scalar multiples of each other. We could then replace each row with a single scalar value to multiply the eigen vector by, which would reduce our data from 3D to 1D and to 1/3 the storage size!

The two rows of zero eigen values are due to the fact that gaussian blurs are separable, so rows (or columns) really are just scalar multiples of each other. This test was inspired by a post from Bart Wronski on how to use SVD to separate non separable filters: https://bartwronski.com/2020/02/03/separate-your-filters-svd-and-low-rank-approximation-of-image-filters/

Result: 2D Data

Here’s the eigenvectors of some 2D data, then showing it reduced to 1 dimension along the larger eigenvector, then showing it including the smaller eigenvector, which makes it fit the data perfectly.

Here is the same with some different data points.

Result: 3D Data

I took some 3D data from a page that was real helpful when trying to understand PCA better (https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643):

[90.000000, 60.000000, 90.000000]
[90.000000, 90.000000, 30.000000]
[60.000000, 60.000000, 60.000000]
[60.000000, 60.000000, 90.000000]
[30.000000, 30.000000, 30.000000]

The eigen values and eigen vectors are:

[910.069953] [0.655802, 0.429198, 0.621058]
[629.110387] [0.385999, 0.516366, -0.764441]
[44.819660] [0.648790, -0.741050, -0.172964]

And here is the RMSE for the data when PCA’d to different numbers of dimensions:

[1] 25.960163
[2] 6.694749
[3] 0.000000

At 1 component the data is:

[88.540569, 74.751951, 81.346364]
[72.547174, 64.284878, 66.200303]
[63.419540, 58.311187, 57.556254]
[75.638271, 66.307884, 69.127633]
[29.854445, 36.344100, 25.769445]

At 2 components:

[83.264247, 67.693599, 91.795721]
[90.954764, 88.909466, 29.745464]
[62.525570, 57.115286, 59.326695]
[65.892097, 53.270028, 88.429194]
[27.363322, 33.011622, 30.702926]

At 3 components, it matches, since no dimensionality reduction occurred.

[90.000000, 60.000000, 90.000000]
[90.000000, 90.000000, 30.000000]
[60.000000, 60.000000, 60.000000]
[60.000000, 60.000000, 90.000000]
[30.000000, 30.000000, 30.000000]

Closing and Links

When you fit PCA to data, it minimizes the distance from the data point to the fit line. This is in contrast to least squares which only minimizes the distance on the y axis. One reason I wanted to dive deep into PCA was to see if it could be used to fit piecewise non linear curves like least squares can. After learning more, that seems to be swimming against the current for PCA, which is more aimed at things like making 3d vectors be storable as 2d vectors. I’ll bet what I was thinking of is possible, but I couldn’t really find info on it. I did find out that “Total Least Squares” exists which does the same kind of fitting though, so I think I should look more into that. Also “non linear dimensional reduction” is a topic as well.

It turns out too that a better way to do PCA than what I’ve done above is to use SVD to do PCA. I plan on doing a post on SVD soon and should be able to talk more about that then.

These two links were super helpful while writing this code:
https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643
https://towardsdatascience.com/principal-component-analysis-explained-d404c34d76e7

Since rolling your own PCA is usually frowned on due to the numerical complexity of things involved, here’s some info about how to do PCA in Python (also, in C++ you might want to use that Eigen library I mentioned before)
https://stats.stackexchange.com/questions/235882/pca-in-numpy-and-sklearn-produces-different-results

Check out this link for a neat usage case of PCA for texture compression on textures used in PBR (Physically Based Rendering):
https://bartwronski.com/2020/05/21/dimensionality-reduction-for-image-and-texture-set-compression/

Piecewise Least Squares Curve Fitting

This post will first talk about how to do equality constraints in least squares curve fitting before showing how to fit multiple piecewise curves to a single set of data. The equality constraints will be used to be able to make the curves C0 continuous, C1 continuous, or have higher continuity, as desired.

Point Constraints

The previous blog post (https://blog.demofox.org/2022/06/06/fitting-data-points-with-weighted-least-squares/) showed how to give weights to points that were fit with least squares. If you want a curve to pass through a specific point, you could add it to the list of points to fit to, but give it a very large weighting compared to the other points, such as a weight of 1000, when the others have a weight of 1.

There is a more direct way to force a curve to go through a point though. I’ll link to some resources giving the derivation details at the end of the post, but for now I’ll just show how to do it.

For ordinary least squares, when making the augmented matrix to solve the system of linear equations it looks like this:

\begin{bmatrix}A^TA|A^TY\end{bmatrix}

You can use Gauss Jordan Elimination to solve it for instance (more on that here: https://blog.demofox.org/2017/04/10/solving-n-equations-and-n-unknowns-the-fine-print-gauss-jordan-elimination/)

Doing that solves for c in the equation below, where c is the vector of coefficients of the polynomial that minimizes the sum of squared error for all of the points.

A^TA*c = A^TY

For weighted least squares it became this:

\begin{bmatrix}A^TWA|A^TWY\end{bmatrix}

Which again solves for c in the equation below:

A^TWA*c = A^TWY

If you want to add an equality constraint to this, you make a vector C and a scalar value Z. The vector C is the values you multiply the polynomial coefficients by, and Z is the value you want to get out when you do that multiplication. You can add multiple rows, so C can be a matrix, but it needs to have as many columns as the A^TA matrix does. Z then becomes a vector, which should have as many rows as the C matrix does. The matrix to solve then becomes:

\left[\begin{array}{rr|r} A^TWA & C^T & A^TWY \\ C & 0 & Z \end{array}\right]

When solving that and getting the coefficients out, you can ignore the extra values on the right where Z is; those are the values of the Lagrange multipliers (see the derivation for more details). The values in the position where A^TWY was are what become the polynomial coefficients.

Let’s see an example of a point constraint for a linear equation.

A linear equation is y=ax+b where a and b are the coefficients that we are solving for.

In our case, our equation is generalized to Z = aC_1 + bC_0, where our constraint specifies Z, C_0 and C_1.

To make a point constraint, we set C_0 to be 1 (aka x value to the 0th power), and we set C_1 to be the x value (aka x value to the 1st power). We are making the powers of x for our point just like we do when making the A matrix. Our Z value is the y value we want at that x location.

So, more specifically, if we wanted a linear fit to pass through the point (2, 4), our C vector would be [1, 2] and our Z value would be 4.

If we wanted to give the same point constraint to a quadratic function, the C vector would be [1, 2, 4] and the Z value would still be 4.

For a cubic, the C vector would be [1, 2, 4, 8] and the Z value would still be 4.

If you wanted to constrain a quadratic function to 2 specific points (2,4) and (3, 2), C would become a matrix and would be:

\begin{bmatrix}1 & 2 & 4 \\ 1 & 3 & 9\end{bmatrix}

Z would become a vector and would be \begin{bmatrix} 4 \\ 2\end{bmatrix}.

Hopefully that makes sense.

There are limitations on the constraints you can give, based on the hard limitations of mathematics. For instance, you can’t tell a linear fit that it has to pass through 3 non-collinear points.
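
To make this concrete, here is a small standalone C++ sketch (my own illustration, not the code from the repository linked at the end of the post) that fits a quadratic to a few made-up points while forcing it through the point (2, 4). It assumes all point weights are 1, so W is the identity and drops out, and it solves the augmented matrix with a small Gauss-Jordan routine.

#include <cstdio>
#include <cmath>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Solve an augmented matrix [M | b] in place with Gauss-Jordan elimination.
void GaussJordan(Matrix& m)
{
  int rows = (int)m.size();
  for (int col = 0; col < rows; ++col)
  {
    // find the largest pivot in this column and swap that row into place
    int pivot = col;
    for (int r = col + 1; r < rows; ++r)
      if (std::abs(m[r][col]) > std::abs(m[pivot][col]))
        pivot = r;
    std::swap(m[col], m[pivot]);

    // normalize the pivot row, then eliminate this column from the other rows
    double scale = m[col][col];
    for (double& v : m[col])
      v /= scale;
    for (int r = 0; r < rows; ++r)
    {
      if (r == col)
        continue;
      double mult = m[r][col];
      for (size_t c = 0; c < m[r].size(); ++c)
        m[r][c] -= mult * m[col][c];
    }
  }
}

int main()
{
  // data points to fit with a quadratic y = c0 + c1*x + c2*x^2
  const int numPoints = 3;
  const int numCoeffs = 3;
  double px[numPoints] = { 1.0, 3.0, 4.0 };
  double py[numPoints] = { 3.0, 2.0, 6.0 };

  // one constraint row: [1, x, x^2] at x = 2, dotted with the coefficients, must equal 4
  double C[numCoeffs] = { 1.0, 2.0, 4.0 };
  double Z = 4.0;

  // build A^TA (3x3) and A^TY (3x1) from the data points
  double ATA[numCoeffs][numCoeffs] = {};
  double ATY[numCoeffs] = {};
  for (int i = 0; i < numPoints; ++i)
  {
    double powers[numCoeffs] = { 1.0, px[i], px[i] * px[i] };
    for (int r = 0; r < numCoeffs; ++r)
    {
      ATY[r] += powers[r] * py[i];
      for (int c = 0; c < numCoeffs; ++c)
        ATA[r][c] += powers[r] * powers[c];
    }
  }

  // augmented matrix: [ A^TA  C^T | A^TY ]
  //                   [ C     0   | Z    ]
  Matrix m(numCoeffs + 1, std::vector<double>(numCoeffs + 2, 0.0));
  for (int r = 0; r < numCoeffs; ++r)
  {
    for (int c = 0; c < numCoeffs; ++c)
      m[r][c] = ATA[r][c];
    m[r][numCoeffs] = C[r];       // the C^T column
    m[r][numCoeffs + 1] = ATY[r]; // the right hand side
  }
  for (int c = 0; c < numCoeffs; ++c)
    m[numCoeffs][c] = C[c];
  m[numCoeffs][numCoeffs + 1] = Z;

  GaussJordan(m);

  // the first numCoeffs entries of the right hand side are the polynomial
  // coefficients; the last entry is the Lagrange multiplier and can be ignored
  printf("y = %f + %f*x + %f*x^2\n", m[0][numCoeffs + 1], m[1][numCoeffs + 1], m[2][numCoeffs + 1]);
  return 0;
}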

Point Constraints vs Weights

Let’s see how weights work versus constraints. First in a linear equation where we fit a line to two points, and have a third point that starts off with a weight of 0, then goes to 1, 2, 4, and then 100. Finally we’ll show what the fit looks like with an actual constraint, instead of just a weighted point.

Here is the same with a quadratic curve.

The constraints let you say “This must be true” and then it does a least squares error solve for the points being fit, without violating the specified constraint. As the weight of a point becomes larger, it approaches the same effect that a constraint gives.

Derivative Constraint

A nice thing about how constraints are specified is that they aren’t limited to point constraints. You are just giving values (Z) that you want when multiplying the unknown polynomial coefficients by known values (C).

If we take the derivative of a quadratic function y=ax^2+bx+c, we get y'=2ax+b.

We can use this to specify that we want a specific derivative value at a specific x location. We plug the derivative we want in as the Z value, and our C constraint vector becomes [0, 1, 2x]. So, if we wanted our quadratic curve to have a slope of 5 at x=1, then our Z value is 5, and our C constraint vector becomes [0, 1, 2]. Below is an image which has 3 points to fit with a quadratic curve, but also has this exact slope constraint:

You can check the equation to verify that when x is 1, it has a slope of 5. Plugging in the coefficients and the value of 1 for x, you get 9.384618 + 2 * -2.192309 = 5.

It made sure the slope at x=1 is 5, and then did a quadratic curve fit to the points without violating that constraint, to find the minimal sum of squared errors. Pretty cool!

Piecewise Curves Without Constraints

Next, let’s see how to solve multiple piecewise curves at a time.

The first thing you do is break the data you want to fit into multiple groups along the x axis. Each group gets fit by its own curve, as if you had done a separate curve fit for each group. We’ll put them all into one matrix though, so that later on we can use constraints to make the curves connect to each other.

Let’s say that we have 6 points (1,3), (2,4), (3,2), (4, 6), (5,7), (6,5), and that we want to fit them with two linear curves. The first curve will fit the data points where x <= 3, and the second curve will fit the data points where x > 3. If we did that as described, there’d be a gap between the 3rd and 4th point though, so we’ll duplicate the (3,2) point, so that it can be the end of the first curve, and the start of the second curve.

For handling multiple curves, we change what is in the A matrix. Normally for a linear fit of 7 points, this matrix would be 2 columns and 7 rows. The first column would be x^0 for each data point, and the second column would be x^1 for each data point. Since we want to fit two linear curves to these 7 data points, our A matrix is instead going to be 4 columns (2 times the number of curves we want to fit) by 7 rows (the number of data points). The first 3 points use the first two columns, and the last four points use the second two columns, like this:

\begin{bmatrix}x_0^0 & x_0^1 & 0 & 0 \\ x_1^0 & x_1^1 & 0 & 0 \\ x_2^0 & x_2^1 & 0 & 0 \\ 0 & 0 & x_3^0 & x_3^1 \\ 0 & 0 & x_4^0 & x_4^1 \\ 0 & 0 & x_5^0 & x_5^1 \\ 0 & 0 & x_6^0 & x_6^1 \end{bmatrix}

When we plug in the x values for our points we get:

A = \begin{bmatrix}1 & 1 & 0 & 0 \\ 1 & 2 & 0 & 0 \\ 1 & 3 & 0 & 0 \\ 0 & 0 & 1 & 3 \\ 0 & 0 & 1 & 4 \\ 0 & 0 & 1 & 5 \\ 0 & 0 & 1 & 6 \end{bmatrix}

Multiplying the transpose of that A matrix by the A matrix to get A^TA, we get:

A^TA = \begin{bmatrix} 3 & 6 & 0 & 0 \\ 6 & 14 & 0 & 0 \\ 0 & 0 & 4 & 18 \\ 0 & 0 & 18 & 86 \end{bmatrix}

Multiplying the A matrix transposed by the y values of our points, we get:

A^TY = \begin{bmatrix} 9 \\ 17 \\ 20 \\ 95 \end{bmatrix}

Making that into an augmented matrix of [A^TA | A^TY] and then solving, we get coefficients of:

\begin{bmatrix} 4 \\ -1/2 \\ 1/2 \\ 1 \end{bmatrix}.

The top two coefficients are for the points where x <= 3, and the bottom two coefficients are for the points where x > 3.

y=-1/2x+4 (when x <= 3)

y= x+1/2 (when x > 3)
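
As a sanity check, here is a little C++ sketch (my own, not the code from the repository linked at the end) that builds A^TA and A^TY for the two-segment linear fit above and prints them, so you can compare against the matrices shown. Each point only fills in the two columns belonging to its curve.

#include <cstdio>

int main()
{
  // the 7 points, with (3,2) duplicated, and which curve (0 or 1) each belongs to
  double px[7] = { 1, 2, 3, 3, 4, 5, 6 };
  double py[7] = { 3, 4, 2, 2, 6, 7, 5 };
  int curve[7] = { 0, 0, 0, 1, 1, 1, 1 };

  // 2 coefficients per curve * 2 curves = 4 columns in A
  double ATA[4][4] = {};
  double ATY[4] = {};

  for (int i = 0; i < 7; ++i)
  {
    // the two non zero entries of this point's row of A are x^0 and x^1,
    // placed in the columns belonging to this point's curve
    int base = curve[i] * 2;
    double row[2] = { 1.0, px[i] };
    for (int r = 0; r < 2; ++r)
    {
      ATY[base + r] += row[r] * py[i];
      for (int c = 0; c < 2; ++c)
        ATA[base + r][base + c] += row[r] * row[c];
    }
  }

  // print the augmented matrix [A^TA | A^TY]
  for (int r = 0; r < 4; ++r)
    printf("%5.1f %5.1f %5.1f %5.1f | %6.1f\n", ATA[r][0], ATA[r][1], ATA[r][2], ATA[r][3], ATY[r]);
  return 0;
}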

Piecewise Curves With Constraints

If we wanted those lines to touch at x=3 instead of being disjoint, we can do that with a constraint!

First remember we are solving for two sets of linear equation coefficients, which we’ll call c_a and c_b, each being a vector of the two coefficients needed by our linear equation. If we set the two linear equations equal to each other, we get this:

c_{a1} * x + c_{a0} = c_{b1} * x + c_{b0}

Now we plug in 3 for x, and we’ll add an explicit multiplication by 1 to the constant terms:

c_{a1} * 3 + c_{a0} * 1 = c_{b1} * 3 + c_{b0} * 1

Now we move the right side of the equation over to the left side:

c_{a1} * 3 + c_{a0} * 1 + c_{b1} * -3 + c_{b0} * -1 = 0

We can now turn this into a constraint vector C and a value Z! When making the C vector, keep in mind that the order is: constant term, then x^1 term for the first curve, then constant term, then x^1 term for the second curve. That gives us:

C = [1, 3, -1, -3]

Z = 0

This constraint says “I don’t care what the value of Y is when X is 3, but it should be the same for both linear curves we are fitting”.

Even though the contents of our A matrix changed, the formulation for including equality constraints remains the same:

\left[\begin{array}{rr|r} A^TWA & C^T & A^TWY \\ C & 0 & Z \end{array}\right]

If you plug this in and solve it, you get this:

We can take this further and make it so the derivative at x=3 also has to match, to make the fit C1 continuous. For a line with an equation of y=c_1*x+c_0, the derivative is just y' = c_1, so our constraint is:

c_{a1} = c_{b1}

c_{a1} - c_{b1}  = 0

As a constraint vector, this gives us:

C = [0, 1, 0, -1]

Z = 0

If we put both constraints in… that the y value must match at x=3 and also that the slope must match at x=3, we get this, which isn’t very interesting:

Since you can define a linear function as a point on the line and a slope, and we have a “must do this” constraint on both a specific point (the y values match when x=3) and a slope (the slopes of the lines need to match), it came up with the same line for both parts. We’ve jumped through a lot of extra hoops to fit a single line to our data points. (Note that it is possible to say the slopes need to match without saying anything about the lines needing to have a matching y value. You’d get two parallel lines that each fit their data points as well as they could, and they would not be touching.)

This gets more interesting though if we do a higher degree fit. Let’s first fit the data points with two quadratic functions, with no continuity constraints. It looks like they are already C0 continuous by random chance of the points I’ve picked, but they aren’t: equation 1 gives y=1.999998 at x=3, while equation 2 gives 2.000089. They are remarkably close, but they are not equal.

Changing it to be C0 continuous doesn’t really change it very much, so let’s skip straight to C1 continuity (which also includes the C0 constraint). Remembering that the derivative of a quadratic y=ax^2+bx+c is y'=2ax+b, we can set the derivatives equal to each other and plug in 3 for x.

2 * c_{a2} * x + 1 * c_{a1} = 2 * c_{b2} * x + 1 * c_{b1}

2 * c_{a2} * x + 1 * c_{a1} - 2 * c_{b2} * x - 1 * c_{b1} = 0

6 * c_{a2} + 1 * c_{a1} - 6 * c_{b2} - 1 * c_{b1} = 0

That makes our constraint be:

C = [0, 1, 6, 0, -1, -6]

Z = 0

Plugging that in and solving, we get this very tasty fit, where the quadratic curves are C1 continuous at x=3, but otherwise are optimized to minimize the sum of the squared error of the data points:
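
If you want to generate these continuity constraint rows programmatically instead of by hand, here is a hedged C++ sketch of one way to do it (the function name and layout are my own, not from the post’s code). For two polynomial segments of a given degree meeting at a break point, derivative order 0 produces the C0 constraint row and derivative order 1 produces the C1 constraint row; Z is 0 in both cases.

#include <cstdio>
#include <vector>

std::vector<double> ContinuityConstraintRow(int degree, double breakAt, int derivativeOrder)
{
  int numCoeffs = degree + 1;
  std::vector<double> row(numCoeffs * 2, 0.0);
  for (int power = 0; power < numCoeffs; ++power)
  {
    // derivative of x^power of the requested order, evaluated at the break point
    double value = 1.0;
    int p = power;
    for (int d = 0; d < derivativeOrder; ++d)
    {
      value *= p;
      p--;
    }
    if (p < 0)
      continue; // the term vanished when differentiating
    for (int d = 0; d < p; ++d)
      value *= breakAt;

    row[power] = value;              // first curve's coefficients
    row[numCoeffs + power] = -value; // second curve's coefficients, negated
  }
  return row;
}

int main()
{
  // quadratic segments meeting at x = 3: print the C0 row, then the C1 row
  for (int order = 0; order <= 1; ++order)
  {
    std::vector<double> row = ContinuityConstraintRow(2, 3.0, order);
    for (double v : row)
      printf("%g ", v);
    printf("\n");
  }
  return 0;
}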

An Oddity

I wanted to take OLS to a non-matrix-algebra form to see if I could get any better intuition for why it worked.

What I did was plug the symbols into the augmented matrix, did Gauss Jordan elimination with the symbols, and then looked at the resulting equations to see if they made any sense from that perspective.

Well, when doing a constant function fit like y=a, a ends up being the sum of the y values, divided by the sum of the x values raised to the 0th power.

a = \frac{\sum{y_i}}{\sum{x_i^0}}

This makes a lot of sense intuitively, since the sum of x values to the 0th power is just the number of data points. That makes a the average y value, or the expected y value. That checks out.

Moving onto a linear fit y=c_1 x + c_0, the c_1 term is calculated as:

c_1 = \frac{\sum{(x_i y_i)} \sum{x_i^0} - \sum{x_i} \sum{y_i}}{\sum{x_i^2} \sum{x_i^0} - \left(\sum{x_i}\right)^2}

What’s strange about this formula is that it ALMOST looks like the covariance of x and y divided by the covariance of x with itself (aka the variance of x). That kinda does look like a slope / derivative: rise over run. It isn’t exactly covariance though, since covariance divides by the number of points to get EXPECTED values instead of sums.

Then, the c_0 term is calculated as this:

c_0 = \mathbb{E}(y) - \mathbb{E}(x) * c_1

It again ALMOST makes sense to me, in that we have the expected value of y for the constant term, like we did in the constant function, but then we need to account for the c_1 term, so we multiply it by the expected x value and subtract it out.

This almost makes sense but not quite. I don’t think taking it up to quadratic would make it make more sense at this point, so I left it at that.
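
Here is a quick numerical check I added (it isn’t in the post’s code) that compares the sum formula for c_1 against covariance over variance on some made-up data. They come out the same here, which suggests the extra factors of N that turn sums into expected values cancel between the numerator and denominator.

#include <cstdio>

int main()
{
  double px[] = { 0.0, 1.0, 2.0, 3.0, 4.0 };
  double py[] = { 1.0, 3.0, 2.0, 5.0, 4.0 };
  const int N = 5;

  double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
  for (int i = 0; i < N; ++i)
  {
    sumX += px[i];
    sumY += py[i];
    sumXY += px[i] * py[i];
    sumXX += px[i] * px[i];
  }

  // c1 from the least squares sums (the sum of x^0 is just N)
  double c1Sums = (sumXY * N - sumX * sumY) / (sumXX * N - sumX * sumX);

  // c1 as covariance over variance, using expected values
  double ex = sumX / N, ey = sumY / N;
  double cov = sumXY / N - ex * ey;
  double var = sumXX / N - ex * ex;
  double c1Cov = cov / var;

  printf("from sums: %f, from cov/var: %f\n", c1Sums, c1Cov);
  return 0;
}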

If you have any insights or intuition, I’d love to hear them! (hit me up at https://twitter.com/Atrix256)

Closing

The simple standalone C++ code that goes with this post, which made all the data for the graphs, and the python script that graphed them, can be found at https://github.com/Atrix256/PiecewiseLeastSquares.

A nice, more formal description of constrained least squares can be found at http://www.seas.ucla.edu/~vandenbe/133A/lectures/cls.pdf.

Here is another nice one by https://twitter.com/sandwichmaker : https://demofox2.files.wordpress.com/2022/06/constrained_point_fitting.pdf

Here is a nice read about how to determine where to put the breaks of piecewise fits, and how many breaks to make: https://towardsdatascience.com/piecewise-linear-regression-model-what-is-it-and-when-can-we-use-it-93286cfee452

Thanks for reading!

Fitting Data Points With Weighted Least Squares

About 6 years ago I wrote about how to fit curves, surfaces, volumes, and hypervolumes to data points in an “online algorithm” version of least squares – that is, you could update the fit as new data points come in.

In this post, I’m going to show how to adjust the least squares equations to be able to give weights for how important data points are in the fit. We’ll be looking at curves only, and not focusing on the “online algorithm” aspect, but this will work for both surfaces/volumes as well as the online updating.

The simple standalone C++ code that goes with this post can be found at https://github.com/Atrix256/WeightedLeastSquares

If you google weighted least squares, or look for it on youtube, almost every example you are likely to find talks about it being used when the variance of data points are different in different parts of the graph, and that is used to make points contribute more or less to the final fit.

That is a pretty neat usage case, but if you are just trying to fit a curve to data points as a way to compress data or optimize calculations, you probably are not really concerned with the variance of the data points you are trying to fit. Luckily, weighting data points in least squares is pretty simple.

Here is the equation you solve in ordinary least squares (OLS):

A^TAx = A^Ty

where A is a matrix that contains all the powers of x of each data point, y is a vector that contains the y values of each data point, and x is the vector of polynomial coefficients that we are trying to solve for.

A pretty straightforward way to solve this equation for x is to make an augmented matrix where the left side is A^TA, and the right side is A^Ty.

\left[\begin{array}{r|r} A^TA & A^Ty \end{array}\right]

From there you can do Gauss-Jordan elimination to solve for x (https://blog.demofox.org/2017/04/10/solving-n-equations-and-n-unknowns-the-fine-print-gauss-jordan-elimination/). At the end, the left side matrix will be the identity matrix, and the right side vector will be the value for x.

\left[\begin{array}{r|r} I & x \end{array}\right]

Adjusting this formula for weighting involves introducing a weight matrix called W. W has a zero in every element, except along the main diagonal where the weights of each data point are. W_{ii} = \text{weight}_i.

There is something called “generalized least squares” (GLS), where the weight matrix comes from a covariance matrix (more precisely, its inverse), allowing more complex correlations to be expressed with non zero values off of the main diagonal. For weighted least squares, the main diagonal just gets the weights of the data points.

The equation to solve for weighted least squares (WLS) becomes this:

A^TWAx = A^TWy

You can make an augmented matrix like before:

\left[\begin{array}{r|r} A^TWA & A^TWy \end{array}\right]

And use Gauss-Jordan elimination again to solve for x:

\left[\begin{array}{r|r} I & x \end{array}\right]
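
Here is a minimal C++ sketch of building A^TWA and A^TWy for a linear fit (my own illustration, not the repository code). Because W is diagonal, you never need to form it explicitly: each data point’s contribution is just scaled by that point’s weight. The points and weights below match the example coming up, where the middle point gets a weight of 10.

#include <cstdio>

int main()
{
  double px[] = { 0.0, 1.0, 2.0 };
  double py[] = { 0.0, 10.0, 2.0 };
  double w[]  = { 1.0, 10.0, 1.0 }; // per point weights (the diagonal of W)
  const int N = 3;

  // linear fit y = c0 + c1*x, so 2 columns in A: x^0 and x^1
  double ATWA[2][2] = {};
  double ATWY[2] = {};
  for (int i = 0; i < N; ++i)
  {
    double row[2] = { 1.0, px[i] };
    for (int r = 0; r < 2; ++r)
    {
      ATWY[r] += w[i] * row[r] * py[i];
      for (int c = 0; c < 2; ++c)
        ATWA[r][c] += w[i] * row[r] * row[c];
    }
  }

  // print the augmented matrix [A^TWA | A^TWy]; solving it with Gauss-Jordan
  // elimination gives the polynomial coefficients
  for (int r = 0; r < 2; ++r)
    printf("%8.2f %8.2f | %8.2f\n", ATWA[r][0], ATWA[r][1], ATWY[r]);
  return 0;
}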

Examples

Here are a few runs of the program which again is at https://github.com/Atrix256/WeightedLeastSquares

Starting with 3 data points (0,0) (1,10), (2,2) all with the same weighting, we fit them with a line and get y = x+3. It’s interesting to see that data points 0 and 2 each have an error of +3 while data point 1 has an error of -6.

If we set the weighting of the 2nd point to 0.0001, we can see that it absorbs most of the error, because the weighting says that it is 10,000 times less important than the other data points.

Alternately, we can give it a weight of 10.0 to make it ten times as important as the other two data points. The error becomes a lot smaller for that data point due to it being more important.

If we switch to a quadratic curve, all three points can be well fit, with the same weights.

The example code works in any degree and also shows how to convert from these power basis polynomials to Bernstein basis polynomials, which are otherwise known as Bezier curve control points (more info on that here: https://blog.demofox.org/2016/12/08/evaluating-polynomials-with-the-gpu-texture-sampler/). The weighting you see in curves is not the same as the weighting here. When you convert the polynomial to a Bezier curve, it’s an unweighted curve. Weighted Bezier curves are made by dividing one Bezier curve by another and are called “Rational” curves, where unweighted curves are called “Integral” curves. Rational (weighted) curves are able to express more shapes than integral curves can, even when those integral curves are made by weighted least squares.

Closing

While we used weighted least squares to give different weightings to different data points, the main motivation you find on the internet seems to be about having different levels of confidence in the data points you get. This makes me wonder if we could use this in graphics, using 1/PDF(x) as the weighting in the least squares fitting like we do in importance sampled monte carlo integration. This seems especially useful if using this as an “online algorithm” that was improving a fit as it got more samples, either spatially or temporally.

Sampling Importance Resampling

The last post was on weighted reservoir sampling, one of the main techniques used in the 2020 paper “Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting” (https://research.nvidia.com/sites/default/files/pubs/2020-07_Spatiotemporal-reservoir-resampling/ReSTIR.pdf). This post is on the other main technique used, which is sampling importance resampling or SIR.

SIR allows you to have a list of items from probability distribution A, and draw items from it from probability distribution B.

The C++ code that goes with this post is at https://github.com/Atrix256/ResampledImportanceSampling

SIR Algorithm

The algorithm starts with a list of values drawn from probability distribution A, assigning each value a weight of 1.

Next, you divide the weight of each item by the PDF for that item in distribution A.

You then multiply the weight of each item by the PDF for that item in distribution B, the target distribution.

Then you normalize the weights of all items in the list so they sum to 1.0.

When you use that normalized weight as a probability for choosing each item, you are drawing values from probability distribution B, even though probability distribution A was used to generate the list.
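
Here is a small C++ sketch of those steps (my own illustration; the post’s actual code is in the linked repo). The source and target distributions are stand-ins I picked: the source is uniform over [-1, 1] and the target has a PDF proportional to x^2. As a rough check, the expected value of x^2 under the target is 0.6, versus about 0.333 under the uniform source.

#include <cstdio>
#include <random>
#include <vector>

int main()
{
  std::mt19937 rng(1234);

  // source distribution A: uniform over [-1, 1]
  std::uniform_real_distribution<double> distA(-1.0, 1.0);
  auto pdfA = [](double) { return 0.5; };

  // target distribution B: pdf 1.5*x^2 over [-1, 1]
  auto pdfB = [](double x) { return 1.5 * x * x; };

  // 1) draw values from distribution A, each starting with a weight of 1
  const int count = 100000;
  std::vector<double> values(count);
  std::vector<double> weights(count, 1.0);
  for (double& v : values)
    v = distA(rng);

  // 2) divide each weight by pdf A, then 3) multiply by pdf B
  for (int i = 0; i < count; ++i)
    weights[i] *= pdfB(values[i]) / pdfA(values[i]);

  // 4) draw from the list using the weights as probabilities
  // (std::discrete_distribution normalizes them internally)
  std::discrete_distribution<int> pick(weights.begin(), weights.end());
  double sumSquares = 0.0;
  const int draws = 10000;
  for (int i = 0; i < draws; ++i)
  {
    double x = values[pick(rng)];
    sumSquares += x * x;
  }
  printf("E[x^2] of resampled values: %f (should be near 0.6)\n", sumSquares / draws);
  return 0;
}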

Example and Intuition

An example really helps to understand this, I think.

We’ll start with a small list:

4, 5, 5

Choosing an item from the list is twice as likely to give you a 5 as a 4, because there are twice as many of them. When we do uniform sampling on that list, it’s like we are doing weighted sampling of the list where all the weights are 1. Let’s put those weights in the list.

4 (1), 5 (1), 5 (1)

The probability for choosing each value in that list is this:

4: 1/3

5: 2/3

You divide the weights by those probabilities to get:

4 (3), 5 (3/2), 5 (3/2)

Or

4 (3), 5 (1.5), 5 (1.5)

Which we can normalize to this:

4 (0.5), 5 (0.25), 5 (0.25)

If you do a weighted random selection from this list using those weights, you can see that you are now just as likely to get a 4, as you are a 5. So, dividing the weights by the probabilities gave us a uniform distribution.

Now, let’s say our target distribution is that choosing a 4 should be three times as likely as choosing a 5. Here are the probabilities for that:

4: 0.75

5: 0.25

We now multiply the weights by these probabilities and get this:

4 (0.375), 5 (0.0625), 5 (0.0625)

Normalizing those weights, we get this:

4 (0.75), 5 (0.125), 5 (0.125)

If you sample from this list using those weights, there is a 75% chance you’ll get a 4, and a 25% chance you’ll get a 5.

So, even though we made this list by sampling such that a 5 was twice as likely as a 4, we can now sample from it so that a 4 is three times more likely than a 5.

The intuition here is that by dividing the weights by the probabilities that generated the list, we made the list uniform. When we multiplied the weights by the probabilities of the new distribution, it made the list take on that distribution. Not too complicated right?

Results

Here we use this method to generate 100,000 uniform random float values between -1 and 1, and then use this technique to sample 10,000 sigma 0.1 Gaussian random numbers from them. Below is the starting histogram, and the histogram of the sampling importance resampled values.

Here we start with a sigma 1.0 Gaussian distributed list of values, and resample it into a y=3x^2 PDF.

It doesn’t always work out though. This method can only output values that are in the starting list, so if the starting list doesn’t cover the full range of the target PDF, you won’t get values in the full range of the target PDF. Here we generate uniform random values between -0.1 and 0.3, and then try to resample that into a sigma 0.1 Gaussian. We can only output numbers that exist between -0.1 and 0.3, which isn’t the full range of the Gaussian.

This technique can also hit problems if there aren’t enough unique values in the source list. Here we generate only 500 uniform random samples, instead of the 100,000 that we did for the previous tests, and try to turn it into a Gaussian.

Here we do the non failure uniform to gaussian test in reverse with the same number of start and end samples. It seems as though the rare events at the tail ends of the Gaussian really explode…

Other Stuff

When doing the SIR algorithm, we normalized the weights to be able to do a random selection from the list. The normalized weight of an item was its probability of being chosen.

This means that you have to do multiple passes through the list: Once to get the sum of the weights, then again to normalize them. Now you can select items from the list randomly.

The last blog post showed us how to do a fair random selection of a weighted list in one pass though, where the weights didn’t need to be normalized. So yep, if you use reservoir sampling to sample from the weighted list, you don’t need to normalize the weights, and you can do the random weighted selection without any preprocessing. You can even calculate the weight for the item the very first time you see it (like as the result of a ray trace?), as you (fairly) consider it for weighted random selection. How cool is that?
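
Here is a hedged sketch of that idea in C++ (my own, not from any of the linked repos): a single streaming pass where each item’s weight is computed the first time it is seen, with no normalization pass. The source and target PDFs are the same stand-ins as the earlier sketch, and the selection is repeated many times just to verify the resulting distribution.

#include <cstdio>
#include <random>

int main()
{
  std::mt19937 rng(42);
  std::uniform_real_distribution<double> distA(-1.0, 1.0); // source distribution A
  std::uniform_real_distribution<double> dist01(0.0, 1.0);

  auto pdfA = [](double) { return 0.5; };           // uniform pdf on [-1, 1]
  auto pdfB = [](double x) { return 1.5 * x * x; }; // target pdf, proportional to x^2

  // repeat the single pass selection many times to check the distribution
  double sumSquares = 0.0;
  const int runs = 1000;
  for (int run = 0; run < runs; ++run)
  {
    double weightSum = 0.0;
    double chosenValue = 0.0;

    // stream the samples: weight = pdfB / pdfA, reservoir select as we go,
    // no normalization pass needed
    for (int i = 0; i < 1000; ++i)
    {
      double value = distA(rng);
      double weight = pdfB(value) / pdfA(value);
      weightSum += weight;
      if (dist01(rng) < weight / weightSum)
        chosenValue = value;
    }
    sumSquares += chosenValue * chosenValue;
  }

  // E[x^2] is 0.6 under the target pdf, vs 1/3 under the uniform source
  printf("E[x^2] of chosen samples: %f\n", sumSquares / runs);
  return 0;
}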

Further, the list of samples you take before doing the importance resampling might be uniform, but it doesn’t need to be. Maybe you could do something like MIS, where you select items using one distribution, then from that selection, find the best item using a second distribution? Not sure if that makes sense or not.

Lastly, while talking about uniform and non uniform distributions, white noise is implied, but it need not be white noise. Gaussian blue noise is a thing, as is uniform red noise. There should also be ways to use different noise colors on the input and the output, or to do weighting (scoring) that promotes low discrepancy, or other scoring functions that have some sort of memory or awareness of other samples.

That got off onto a tangent a bit, but the point is, there is a lot of possibility here.

If you understood this post and the last, and want more, give the paper a read. You’ll be able to understand what it’s talking about, and it goes a bit deeper into some related topics and implementation details.

“Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting” https://research.nvidia.com/sites/default/files/pubs/2020-07_Spatiotemporal-reservoir-resampling/ReSTIR.pdf

Thanks for reading!

Video: https://www.youtube.com/watch?v=bZ2sCqj3ACU

Stack Exchange: https://stats.stackexchange.com/questions/229036/sampling-importance-resampling-why-resample/229086

Picking Fairly From a List of Unknown Size With Reservoir Sampling

There was an awesome graphics paper in 2020 that put together some fun and interesting statistics ideas to try and solve the “many lights” problem of ray tracing. The many lights problem is that when you are trying to shade a pixel on a surface, you technically need to shoot a ray at every light to see if it is shining onto the surface or not. This is relatively ok for a handful of lights, but becomes a big problem when you have 100,000 lights. The method in the paper has each pixel try a couple samples, share what was found with neighbors, and share with itself next frame to have some memory between frames so that it can improve where it samples over time.

The paper is a good read and you should check it out: “Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting” https://research.nvidia.com/sites/default/files/pubs/2020-07_Spatiotemporal-reservoir-resampling/ReSTIR.pdf

In this post, we are going to be talking about one of the methods the paper uses, which is reservoir sampling. This allows you to stream a list of data and fairly (uniform randomly) choose an item from the list as you go. This also works when you want to choose items in the list with different probabilities. You can also just choose a random subset of the list to look at, do the same process, and the results aren’t bad.

This is useful in the many lights problem because it allows a random subset of lights to be considered by each pixel, each frame, and over time they find the best lights to sample with limited ray counts.

There are lots of other uses for reservoir sampling though, within rendering and game development, and outside of it too.

The 237 line single file C++ program that made the data for this post can be found at: https://github.com/Atrix256/WeightedReservoirSampling

Uniform Reservoir Sampling

Choosing an item uniform randomly from a streaming list of unknown size is surprisingly easy. Starting at index 1, the chance of selecting an item is 1/index. This means that index 1 is chosen when seen, then there is a 50% chance (1/2) to select index 2 instead. After that, there is a 33% chance (1/3) to select index 3 instead, and a 25% chance (1/4) to select index 4, and so on.

The mathematics of why this works is illustrated well in this youtube video: https://www.youtube.com/watch?v=A1iwzSew5QY

int index = 0;
int chosenValue = 0;

while(MoreValuesAvailable())
{
  index++;
  int nextValue = GetNextValue();
  float chance = 1.0f / float(index);
  if (GetRandomFloat01() < chance)
    chosenValue = nextValue;
}

Weighted Reservoir Sampling

Adding non uniform sampling to the algorithm is not much more difficult. The chance to choose a specific item becomes the weight of that item, divided by the sum of the weights of the items seen so far. The weights don’t have to be a normalized PDF or PMF, you can use weights of any scale. If you use a weight of “1” for all the items, it becomes the same as uniform reservoir sampling.

float weightSum = 0.0f;
int chosenValue = 0;

while(MoreValuesAvailable())
{
  item nextValue = GetNextValue();
  weightSum += nextValue.weight;
  float chance = nextValue.weight / weightSum;
  if (GetRandomFloat01() < chance)
    chosenValue = nextValue.value;
}
PDF(x) =3x^2

Subset Uniform Reservoir Sampling

If you are trying to uniform randomly choose an item from a list that you know the size of, but is just too big, you can do uniform reservoir sampling on a random subset of the list. The larger the random subset, the better the results, but smaller subsets can work too – especially if you are amortizing a search of a larger subset over time.

int chosenValue = 0;

for (int index = 1; index <= sampleCount; ++index)
{
  int nextValue = GetRandomListValue();
  float chance = 1.0f / float(index);
  if (GetRandomFloat01() < chance)
    chosenValue = nextValue;
}
10% of the list was sampled in each test

Subset Weighted Reservoir Sampling

Adding weighting to the subset reservoir sampling is pretty easy too.

float weightSum = 0.0f;
int chosenValue = 0;

for (int index = 1; index <= sampleCount; ++index)
{
  item nextValue = GetRandomListValue();
  weightSum += nextValue.weight;
  float chance = nextValue.weight / weightSum;
  if (GetRandomFloat01() < chance)
    chosenValue = nextValue.value;
}
10% of the list was sampled in each test. PDF(x) =3x^2

Combining Reservoirs

If you have multiple reservoirs, they can be combined. The paper mentioned uses this to combine last frame reservoir samples with current frame reservoir samples.

void CombineReservoirs(int chosenValueA, float weightSumA, int chosenValueB, float weightSumB, int& newChosenValue, float& newWeightSum)
{
  newWeightSum = weightSumA + weightSumB;
  float chanceA = weightSumA  / newWeightSum;
  if (GetRandomFloat01() < chanceA)
    newChosenValue = chosenValueA;
  else
    newChosenValue = chosenValueB;
}

For uniform sampling in the above, weightSum is the number of items seen. It’s as if every item had a weight of 1.

Choosing Multiple Items

If you want to choose N items instead of just 1, you could run the single item selection code N times in parallel to choose those N items (note: the same item could be chosen multiple times). If you do that, you’ll notice that the weight sum / count will be the same for each of them, so you don’t actually need to store it separately for each one; it can be shared. That gives you something like this for uniform reservoir sampling.

int index = 0;
int chosenValues[N];

while(MoreValuesAvailable())
{
  index++;
  int nextValue = GetNextValue();
  float chance = 1.0f / float(index);
  for (int i = 0; i < N; ++i)
  {
    if (GetRandomFloat01() < chance)
      chosenValues[i] = nextValue;
  }
}

Image Sharpening Convolution Kernels

Many people know that when you blur an image, you are applying a low pass filter that removes high frequencies.

From this, it’d be reasonable to expect that applying a high pass filter would sharpen an image, right?

Well, it turns out that is not the case! High pass filtering an image gives you everything that a low pass filter would remove from the picture, and it gives you ONLY that. Because of this, high pass filtering an image looks quite a bit different than you’d expect.

Our original test image, from NIST https://www.nist.gov/image/03ts001peanutbsrmcsjpg
High pass filtering by subtracting a low pass filter result (gaussian blur) from the original image.

So when you sharpen an image in something like photoshop, what is it doing?

It does do a high pass filter and then adds those high-frequency details to the original image, thus boosting the high-frequency content. It’s doing an “Unsharp mask” https://en.wikipedia.org/wiki/Unsharp_masking. You may need to open the original image and the one below in separate browser tabs to flip back and forth and see the difference.

The high pass filtered result added to the original image.

The algorithm for sharpening an image is then:

  1. Blur an image using whatever blur you wish (e.g., Box, Tent, Gaussian)
  2. Subtract the blurred result from the original image to get high-frequency details.
  3. Add the high-frequency details to the original image.

Algebraically, this can be expressed as:

image + (image - blurred)

or:

2 * image - blurred

Blurring is most commonly done by convolving an image with a low frequency kernel that sums to 1. If we assume that path to blurring, we can actually build a sharpening kernel that encodes the equation we just derived. For “image”, we’ll just use the identity kernel for convolution, which is all zeros except for a 1 in the center. That gives us this:

2 * identity - blur

If we wanted to make a 3×3 box blur into a sharpening filter we would calculate it this way:

2 * \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} - \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} * \frac{1}{9} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 17 & -1 \\ -1 & -1 & -1 \end{bmatrix} * \frac{1}{9}

That makes this result:

You could also take a Gaussian blur kernel, like this one with sigma 0.3 (calculated from http://demofox.org/gauss.html; it’s already normalized so it adds up to 1.0), and calculate a sharpening filter from that:

2 * \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} - \begin{bmatrix} 0.0023 & 0.0432 & 0.0023 \\ 0.0432 & 0.8180 & 0.0432 \\ 0.0023 & 0.0432 & 0.0023 \end{bmatrix} = \begin{bmatrix} -0.0023 & -0.0432 & -0.0023 \\ -0.0432 & 1.182 & -0.0432 \\ -0.0023 & -0.0432 & -0.0023 \end{bmatrix}

That makes this result:

If you are wondering why a blur kernel has to add up to 1: it technically doesn’t, but whatever it adds up to becomes a brightness multiplier on the image it is applied to. You can even use this to your advantage if you want to adjust brightness while doing a blur. For instance, this kernel is a 3×3 box blur which also doubles the image brightness, because it adds up to 2.0.

\begin{bmatrix} 2/9 & 2/9 & 2/9 \\ 2/9 & 2/9 & 2/9 \\ 2/9 & 2/9 & 2/9 \end{bmatrix}

When using the formulation of 2 * identity - blur to calculate our sharpening filter, if blur sums to 1, and of course identity sums to 1, our equation becomes 2 * 1 - 1 = 1, so our sharpening filter also sums to 1, which means it doesn’t make the image brighter or darker. You could of course multiply the sharpening filter by a constant to make it brighten or darken the image at the same time, just like with a blur.

You might have noticed that the blur kernels only had values between 0 and 1, which meant that if we used them to filter values between 0 and 1, our results would also be between 0 and 1 (so long as the blur kernel summed to 1 and we weren’t adjusting brightness).

In contrast, our sharpening filters had values that were negative, AND had values that were greater than 1. This is a problem because now if we filter values between 0 and 1, our results could be less than 0, or greater than 1. We need to deal with that by clamping and/or remapping the range of output to valid values (tone mapping). In the examples shown here, I just clamped the results.
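
Here is a small standalone C++ sketch (my own, not the code linked below, and it needs C++17 for std::clamp) that builds the 3×3 box-blur-based sharpening kernel as 2 * identity - blur, convolves a tiny made-up grayscale image with it, and clamps the results back into [0, 1].

#include <algorithm>
#include <cstdio>

int main()
{
  // 3x3 box blur, then sharpen = 2 * identity - blur
  double blur[3][3], sharpen[3][3];
  for (int y = 0; y < 3; ++y)
    for (int x = 0; x < 3; ++x)
    {
      blur[y][x] = 1.0 / 9.0;
      double identity = (x == 1 && y == 1) ? 1.0 : 0.0;
      sharpen[y][x] = 2.0 * identity - blur[y][x];
    }

  // a tiny grayscale image with values in [0, 1]: a vertical edge
  const int W = 6, H = 6;
  double image[H][W];
  for (int y = 0; y < H; ++y)
    for (int x = 0; x < W; ++x)
      image[y][x] = (x < W / 2) ? 0.1 : 0.9;

  // convolve with the sharpen kernel, clamping reads at the border and
  // clamping the output back into [0, 1]
  double result[H][W];
  for (int y = 0; y < H; ++y)
    for (int x = 0; x < W; ++x)
    {
      double sum = 0.0;
      for (int ky = -1; ky <= 1; ++ky)
        for (int kx = -1; kx <= 1; ++kx)
        {
          int sy = std::clamp(y + ky, 0, H - 1);
          int sx = std::clamp(x + kx, 0, W - 1);
          sum += image[sy][sx] * sharpen[ky + 1][kx + 1];
        }
      result[y][x] = std::clamp(sum, 0.0, 1.0);
    }

  // print one row; the overshoot on either side of the edge gets clamped
  for (int x = 0; x < W; ++x)
    printf("%0.3f ", result[2][x]);
  printf("\n");
  return 0;
}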

This problem can come up in low pass filters too (like Lanczos, which is based on sinc), but it doesn’t with box or Gaussian.

You might be like me and think it’s weird that a low pass filter (blur kernel) sums to 1, while a high pass filter sums to 0. Some good intuition I got from this on twitter (thanks Bart!) is that the sum of the filter is the filter response for a constant signal. For a low pass filter, you’d pass the constant (0 hz aka DC) signal through, while for a high pass filter, you’d filter it out.

The 170 lines of C++ code that made the images in this post can be found at: https://github.com/Atrix256/ImageSharpening